Writing efficient numerical code in D
Introduction
This post gives a brief overview of some of mir-algorithm’s modules, which provide iteration tools for convenient and fast processing of multidimensional data. It also presents some examples of their usage, demonstrating how these modules can be easily utilized to significantly improve the performance of numerical code written in D.
This article assumes the reader has a cursory understanding of the ndslice package. If not, please consult the ndslice documentation before returning here.
What does mir-algorithm offer?
mir-algorithm is a suite packed with structures and algorithms for multidimensional processing, among which are some of the algorithms often seen in functional-style programming, such as map, reduce, etc. A good number of such algorithms are already implemented in Phobos, D’s standard library, and have been used with great success for some time now. What is special about mir-algorithm’s variants is that they integrate seamlessly with the rest of the ndslice package, which makes for elegantly flowing processing pipelines.
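As a taste of that style, here is a minimal sketch (an assumed toy example, not taken from any library) computing the sum of squares of a 3x3 index matrix in a single pipeline:

```d
// Minimal sketch: a lazy map feeding an eager reduce in one pipeline.
import mir.ndslice.algorithm : reduce;
import mir.ndslice.topology : iota, map;

auto squares = iota(3, 3).map!(x => x * x);                // lazy 3x3 view of squares
auto sumSq = reduce!((a, b) => a + b)(size_t(0), squares); // eager summation
assert(sumSq == 204); // 0 + 1 + 4 + 9 + ... + 64
```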
LLVM acceleration
One key component that makes code based on mir-algorithm blazingly fast is compiling it with LDC, the LLVM-based D compiler[1]. The iteration algorithms in ndslice have been specially tailored to help LDC auto-vectorize computation kernels written by the end user, and also to apply unsafe floating-point optimizations, which are turned on with the @fastmath attribute in LDC.
For more info on LDC’s optimization techniques, you can check out this great article by Johan Engelen.
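For reference, the attribute lives in LDC’s ldc.attributes module and is applied per function. A minimal sketch (the kernel and the non-LDC stand-in are assumptions for illustration):

```d
// @fastmath is LDC-specific; guard the import so other compilers still build.
version (LDC)
    import ldc.attributes : fastmath;
else
    enum fastmath = 0; // hypothetical no-op stand-in UDA for non-LDC compilers

// Under @fastmath, LDC may reassociate and vectorize these FP operations.
@fastmath float fma3(float acc, float x, float k) @safe @nogc nothrow pure
{
    return acc + x * k;
}
```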
Application
In the past few months, the Mir team has been actively refactoring the implementation details of DCV, a computer vision library written in D, by replacing critical processing parts written as plain loops with mir-algorithm equivalents. With minimal effort, we’ve managed to make the code slightly cleaner and, more importantly, a lot faster!
All measurements presented in this post were made using the following configuration:
Model Identifier | MacBookPro13,1 |
---|---|
CPU | Intel Core i5-6360U @ 2.0 GHz |
L3 Cache | 4 MB |
Memory | 8 GB 1867 MHz LPDDR3 |
OS | macOS Sierra 10.12.1 |
And here are the highlights from the benchmark comparison before and after these refactorings:
Algorithm | Previous Runtime [s] | Current Runtime [s] | Speedup [%] |
---|---|---|---|
harrisCorners (3x3) | 1.62469 | 0.278579 | 483 |
harrisCorners (5x5) | 4.40641 | 0.328159 | 1242 |
shiTomasiCorners (3x3) | 1.5839 | 0.223794 | 607 |
shiTomasiCorners (5x5) | 4.42253 | 0.297106 | 1388 |
extractCorners | 3.16413 | 0.355564 | 789 |
gray2rgb | 0.441354 | 0.008918 | 4849 |
hsv2rgb | 0.433122 | 0.051392 | 742 |
rgb2gray | 0.262186 | 0.031813 | 724 |
rgb2hsv | 0.365969 | 0.065572 | 458 |
convolution 1D (3) | 0.124888 | 0.067486 | 85 |
convolution 1D (5) | 0.159795 | 0.068881 | 131 |
convolution 1D (7) | 0.206059 | 0.075361 | 173 |
convolution 2D (3x3) | 0.767058 | 0.120216 | 538 |
convolution 2D (5x5) | 1.94106 | 0.360809 | 437 |
convolution 2D (7x7) | 3.71955 | 0.865524 | 329 |
convolution 3D (3x3) | 2.09103 | 0.374006 | 459 |
convolution 3D (5x5) | 5.54736 | 1.07421 | 416 |
bilateralFilter (3x3) | 6.11875 | 1.77848 | 244 |
bilateralFilter (5x5) | 16.7187 | 4.59703 | 263 |
calcGradients | 2.2448 | 0.506101 | 343 |
calcPartialDerivatives | 0.428318 | 0.14152 | 202 |
canny | 4.10899 | 0.75824 | 441 |
filterNonMaximum | 0.477543 | 0.038968 | 1125 |
nonMaximaSupression | 0.588455 | 0.084436 | 596 |
remap | 0.22578 | 0.062089 | 263 |
warp | 0.235169 | 0.063821 | 268 |
The speedups are massive: the average in this set is 676.7%, or, written as a multiplier, mean(previous / current) = 7.7x.
But as shown below, the changes made to the algorithm implementations were trivial. For the complete benchmarking results, please take a look at the pull request implementing these changes.
Disclaimer: Please keep in mind that the DCV project is still far too young to be compared against proven computer vision toolkits such as OpenCV. The optimizations done here show the power of mir-algorithm, but if you dive into the implementation of these algorithms, you’ll notice most of them are implemented naively, without extensive optimizations. A future post will focus on separable filtering, followed by cache locality improvements.
Examples
We’d like to show a few examples of mir-algorithm, but first let’s take a look at the basic principle of efficiently replacing loop-based code with pipelines. Later on, we’ll see how it can be used in somewhat more complex algorithms.
Basics
So, let’s first examine the basic principle of utilizing iteration algorithms. This principle is also the basis of the DCV refactoring we’ve mentioned. Say we have the following code, written plainly with C-style loops:
```d
@fastmath void someFunc(Slice!(Contiguous, [2], float*) image)
{
    for (size_t r; r < image.length!0; ++r)
    {
        for (size_t c; c < image.length!1; ++c)
        {
            // perform some processing on the image pixel at [r, c]
        }
    }
}
```
This code can be rewritten like so:
```d
import mir.ndslice.algorithm : each;

@fastmath void kernel(ref float e)
{
    // perform that processing from inside those loops
}

image.each!kernel;
```
So, instead of writing loops over the whole image, we utilize each to apply the given kernel function to each pixel. As the docs say, each iterates eagerly over the data. If the processing should instead be evaluated lazily, we can utilize map.
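As a minimal sketch of that lazy variant (the 2x2 data here is just assumed test input), nothing is computed until an element is actually read:

```d
// Lazy evaluation with map: the view costs nothing until it is accessed.
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : map;

auto image = [1.0f, 2, 3, 4].sliced(2, 2); // assumed 2x2 test data
auto doubled = image.map!(e => e * 2.0f);  // lazy view; no work done yet
assert(doubled[1, 1] == 8.0f);             // computed on element access
```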
Convolution
To make the example more concrete, let’s examine how we would implement classic image convolution with these algorithms. We’ll write a classic, C-style implementation and its analogue with mir-algorithm. We will annotate both variants with the @fastmath attribute, to be as fair as possible. Here is the most trivial C-style implementation:
```d
@fastmath
void convLoop
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) output,
    Slice!(Contiguous, [2], float*) kernel
)
{
    auto kr = kernel.length!0; // kernel row size
    auto kc = kernel.length!1; // kernel column size
    foreach (r; 0 .. output.length!0)
        foreach (c; 0 .. output.length!1)
        {
            // take a window into the input at the given pixel coordinate
            Slice!(Canonical, [2], float*) window = input[r .. r + kr, c .. c + kc];
            // calculate the result for the current pixel
            float v = 0.0f;
            foreach (cr; 0 .. kr)
                foreach (cc; 0 .. kc)
                    v += window[cr, cc] * kernel[cr, cc];
            output[r, c] = v;
        }
}
```
Now let’s examine how this would be implemented using mir-algorithm:
```d
static @fastmath float kapply(float v, float e, float k) @safe @nogc nothrow pure
{
    return v + (e * k);
}

void convAlgorithm
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) output,
    Slice!(Contiguous, [2], float*) kernel
)
{
    import mir.ndslice.algorithm : reduce;
    import mir.ndslice.topology : windows, map;

    auto mapping = input
        // look at each pixel through a kernel-sized window
        .windows(kernel.shape)
        // map each window to the resulting pixel using the convolution function
        .map!((window) {
            return reduce!(kapply)(0.0f, window, kernel);
        });

    // assign the mapped results to the output buffer
    output[] = mapping[];
}
```
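Before breaking the pipeline down, here is a hypothetical call site (names and sizes are ours, assumed for illustration), allocating the output buffer with the valid-convolution shape, i.e. input size minus kernel size plus one:

```d
// Hypothetical usage sketch for convAlgorithm.
import mir.ndslice.allocation : slice;
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : as, iota;

auto input  = iota(8, 8).as!float.slice;                  // 8x8 test image
auto kernel = [1f, 2, 1, 2, 4, 2, 1, 2, 1].sliced(3, 3);  // assumed 3x3 kernel
auto output = slice!float(6, 6);                          // 8 - 3 + 1 = 6
convAlgorithm(input, output, kernel);
```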
The pipeline version replaces the two double loops with a few magic calls:
- windows: a convenient selector that lets us look at each pixel through a kernel-sized window. It effectively replaces the first two loops of the C-style function, automatically giving us the window slice. See the sketch after this list.
- map: maps each element (here, each window) of a multidimensional slice with the given lambda.
- reduce: applies the reduction over each element of the window, multiplying it with the convolution kernel (mask) values. This replaces the third and fourth loops of the first function. It is also key to the performance improvement, since it is well suited to be auto-vectorized by LLVM.
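To make windows less magical, here is a tiny sketch (assumed sizes) of what it yields: a 4x4 slice viewed through 3x3 windows is a 2x2 slice whose elements are 3x3 sub-slices:

```d
// windows turns a slice into a slice of overlapping sub-slices.
import mir.ndslice.topology : iota, windows;

auto m = iota(4, 4);      // 4x4 matrix holding 0 .. 15
auto w = m.windows(3, 3); // 2x2 slice of 3x3 windows (4 - 3 + 1 = 2)
assert(w.length!0 == 2 && w.length!1 == 2);
assert(w[0, 0][2, 2] == 10); // bottom-right element of the top-left window
```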
Lazy evaluation?
As noted previously, map is evaluated lazily. At the end of the convAlgorithm function, we evaluate the mapping over the data and assign the resulting values to the output buffer. If we had instead needed a lazily evaluated convolution, we could have simply returned the mapping value from the function and evaluated it lazily afterwards.
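A minimal sketch of that lazy variant, reusing the kapply function from above (the name convLazy is ours):

```d
// Hypothetical lazy convolution: return the unevaluated mapping and let the
// caller decide when (and whether) to compute each pixel.
auto convLazy
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) kernel
)
{
    import mir.ndslice.algorithm : reduce;
    import mir.ndslice.topology : map, windows;

    return input
        .windows(kernel.shape)
        .map!(window => reduce!(kapply)(0.0f, window, kernel));
}
```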
Comparison
The complete benchmark source is located in the Mir benchmarks section. Let’s compile this program and make a comparison:
```
dub run --build=release-nobounds --compiler=ldmd2 --single convolution.d
```
Output is:
```
Running ./convolution
loops = 1 sec, 649 ms, 99 μs, and 6 hnsecs
mir.ndslice.algorithm = 159 ms, 392 μs, and 9 hnsecs
```
So, for as little effort as this, we get a ~10x speedup! And hopefully many would agree that the variant written with mir-algorithm is much cleaner and less error-prone!
Zipped tensors
D offers two ways to zip multiple ranges: zip and lockstep. Both of these functions provide easy-to-use syntax and are very useful in common situations. Unfortunately, they are not as rewarding performance-wise for multidimensional processing with ndslice. Instead, mir.ndslice.topology.zip should be used. This function zips two slices of the same structure (shape and strides). It offers the same dimension-wise range interface as Slice does (and is thereby compatible with the rest of mir-algorithm), and it can perform faster than the general-purpose Phobos utilities zip and lockstep.
To explain this concept further, let’s examine the following function:
```d
@fastmath
void binarizationLockstep
(
    Slice!(Contiguous, [2], float*) input,
    float threshold,
    Slice!(Contiguous, [2], float*) output
)
in
{
    assert(output.shape == input.shape);
}
body
{
    import mir.ndslice.topology : flattened;
    import std.range : lockstep;

    foreach (i, ref o; lockstep(input.flattened, output.flattened))
    {
        o = (i > threshold) ? float(1) : float(0);
    }
}
```
So, this is the most basic binarization, based on a given threshold. To zip these matrices using std.range.lockstep, we first had to flatten them with mir.ndslice.topology.flattened, constructing vectors, i.e. one-dimensional views, so they could be passed to lockstep as classic D ranges.
Let’s replace std.range.lockstep with mir.ndslice.topology.zip and see how that affects the implementation:
```d
@fastmath @nogc nothrow @safe
void binarizationZip
(
    Slice!(Contiguous, [2], float*) input,
    float threshold,
    Slice!(Contiguous, [2], float*) output
)
{
    import mir.ndslice.algorithm : each;
    import mir.ndslice.topology : zip;

    zip(input, output).each!((z) {
        z.b = z.a > threshold ? 1.0f : 0.0f;
    });
}
```
This one looks pretty much the same as the previous one, with the slight difference of utilizing mir.ndslice.topology.zip instead of std.range.lockstep. Also, there’s now no need to flatten the slices, since mir.ndslice.algorithm.each works on multidimensional data. But if you take one more look at the signatures of these two functions, you’ll notice a difference in the annotated attributes:
```d
@fastmath
void binarizationLockstep(...);

@fastmath @nogc nothrow @safe
void binarizationZip(...);
```
std.range.lockstep is not nothrow, not @nogc, and not @safe, but mir.ndslice.topology.zip is!
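The practical payoff is that binarizationZip can be called from code carrying those attributes. A hypothetical sketch (the caller pipeline is ours, assuming the same imports as above):

```d
// Hypothetical caller: compiles only because binarizationZip is
// @safe @nogc nothrow; the lockstep variant would be rejected here.
void pipeline
(
    Slice!(Contiguous, [2], float*) img,
    Slice!(Contiguous, [2], float*) mask
) @safe @nogc nothrow
{
    binarizationZip(img, 0.5f, mask); // OK
    // binarizationLockstep(img, 0.5f, mask); // error: not @safe/@nogc/nothrow
}
```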
Now, let’s compare the impact of these changes on performance by running the benchmark program. The results are as follows:
```
Running ./binarization
lockstep = 169 ms, 871 μs, and 4 hnsecs
zip = 39 ms, 105 μs, and 9 hnsecs
```
So, mir.ndslice.topology.zip gives us about a 4.5x speedup. It is also important to note that it provides an interface compatible with the ndslice iteration algorithms, and gives us the freedom to write nothrow @nogc @safe code!
Conclusion
In these two examples we’ve achieved some nice performance improvements with very little effort by using mir-algorithm suite. We
have also seen the improvement could be even better if mir-algorithm solutions are applied to more complex code you might encounter
in numerical library such as DCV. We would argue that every newcomer to D, having an interest in numerical computing, should take
a close look at ndslice
and its submodules. And, we hope this post will inspire people to give it a spin, so we would have some
more projects built on top of it, growing our young scientific ecosystem in D!
Acknowledgements
Thanks to Ilya Yaroshenko, Sebastian Wilzbach, Andrei Alexandrescu and Johan Engelen for helping out with truly informative reviews!
[1] Mir works with LDC version 1.1.0-beta5 and later.