Writing efficient numerical code in D
Introduction
This post gives a brief overview of some of mir-algorithm’s modules, which provide iteration tools for convenient and fast processing of multidimensional data. It also presents some examples of their usage, demonstrating how these modules can be easily utilized to significantly improve the performance of numerical code written in D.
This article assumes the reader has a cursory understanding of the ndslice package. If not, please consult the ndslice documentation before returning here.
What does mir-algorithm offer?
mir-algorithm is a suite packed with structures and algorithms for multidimensional processing, among which are some of the algorithms often seen in functional-style programming, such as map, reduce, etc. A good number of such algorithms are already implemented in Phobos, D’s standard library, and have been used with great success for some time now. What is special about mir-algorithm’s variants is that they integrate seamlessly with the rest of the ndslice package, which makes for elegantly flowing processing pipelines.
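As a taste of that style, here is a minimal sketch (an assumed toy example, not taken from any library) computing the sum of squares of a 3x3 index matrix in a single pipeline:

```d
// Minimal sketch: a lazy map feeding an eager reduce in one pipeline.
import mir.ndslice.algorithm : reduce;
import mir.ndslice.topology : iota, map;

auto squares = iota(3, 3).map!(x => x * x);                // lazy 3x3 view of squares
auto sumSq = reduce!((a, b) => a + b)(size_t(0), squares); // eager summation
assert(sumSq == 204); // 0 + 1 + 4 + 9 + ... + 64
```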
LLVM acceleration
One key component that makes code based on mir-algorithm blazingly fast is compiling it with LDC, the LLVM-based D compiler[1]. The iteration algorithms in ndslice have been specially tailored to help LDC auto-vectorize computation kernels written by the end user, and also to apply unsafe floating-point optimizations, which are turned on with the @fastmath attribute in LDC.
For more info on LDC’s optimization techniques, you can check out this great article by Johan Engelen.
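For reference, the attribute lives in LDC’s ldc.attributes module and is applied per function. A minimal sketch (the kernel and the non-LDC stand-in are assumptions for illustration):

```d
// @fastmath is LDC-specific; guard the import so other compilers still build.
version (LDC)
    import ldc.attributes : fastmath;
else
    enum fastmath = 0; // hypothetical no-op stand-in UDA for non-LDC compilers

// Under @fastmath, LDC may reassociate and vectorize these FP operations.
@fastmath float fma3(float acc, float x, float k) @safe @nogc nothrow pure
{
    return acc + x * k;
}
```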
Application
In the past few months, the Mir team has been actively refactoring the implementation details of DCV, a computer vision library written in D, by replacing critical processing parts written as plain loops with mir-algorithm equivalents. With minimal effort, we’ve managed to make the code slightly cleaner and, more importantly, a lot faster!
All measurements presented in this post were made using the following configuration:
Model Identifier | MacBookPro13,1 |
---|---|
CPU | Intel Core i5-6360U @ 2.0 GHz |
L3 Cache | 4 MB |
Memory | 8 GB 1867 MHz LPDDR3 |
OS | macOS Sierra 10.12.1 |
And here are the highlights from the benchmark comparison before and after these refactorings:
Algorithm | Previous Runtime [s] | Current Runtime [s] | Speedup [%] |
---|---|---|---|
harrisCorners (3x3) | 1.62469 | 0.278579 | 483 |
harrisCorners (5x5) | 4.40641 | 0.328159 | 1242 |
shiTomasiCorners (3x3) | 1.5839 | 0.223794 | 607 |
shiTomasiCorners (5x5) | 4.42253 | 0.297106 | 1388 |
extractCorners | 3.16413 | 0.355564 | 789 |
gray2rgb | 0.441354 | 0.008918 | 4849 |
hsv2rgb | 0.433122 | 0.051392 | 742 |
rgb2gray | 0.262186 | 0.031813 | 724 |
rgb2hsv | 0.365969 | 0.065572 | 458 |
convolution 1D (3) | 0.124888 | 0.067486 | 85 |
convolution 1D (5) | 0.159795 | 0.068881 | 131 |
convolution 1D (7) | 0.206059 | 0.075361 | 173 |
convolution 2D (3x3) | 0.767058 | 0.120216 | 538 |
convolution 2D (5x5) | 1.94106 | 0.360809 | 437 |
convolution 2D (7x7) | 3.71955 | 0.865524 | 329 |
convolution 3D (3x3) | 2.09103 | 0.374006 | 459 |
convolution 3D (5x5) | 5.54736 | 1.07421 | 416 |
bilateralFilter (3x3) | 6.11875 | 1.77848 | 244 |
bilateralFilter (5x5) | 16.7187 | 4.59703 | 263 |
calcGradients | 2.2448 | 0.506101 | 343 |
calcPartialDerivatives | 0.428318 | 0.14152 | 202 |
canny | 4.10899 | 0.75824 | 441 |
filterNonMaximum | 0.477543 | 0.038968 | 1125 |
nonMaximaSupression | 0.588455 | 0.084436 | 596 |
remap | 0.22578 | 0.062089 | 263 |
warp | 0.235169 | 0.063821 | 268 |
The speedups are massive: the average in this set is 676.7%, or, written as a multiplier, mean(previous / current) = 7.7x.
But as shown below, the changes made to the algorithm implementations were trivial. For the complete benchmarking results, please take a look at the pull request implementing these changes.
Disclaimer: Please keep in mind that the DCV project is still far too young to be compared against proven computer vision toolkits such as OpenCV. The optimizations done here show the power of mir-algorithm, but if you dive into the implementation of these algorithms, you’ll notice most of them are implemented naively, without extensive optimizations. A future post will focus on separable filtering, followed by cache locality improvements.
Examples
We’d like to show a few examples of mir-algorithm, but first let’s take a look at the basic principle of efficiently replacing loop-based code with pipelines. Later on, we’ll see how it can be used in somewhat more complex algorithms.
Basics
So, let’s first examine the basic principle of utilizing iteration algorithms. This principle is also the basis of the DCV refactoring we’ve mentioned. Say we have the following code, written plainly with C-style loops:
```d
@fastmath void someFunc(Slice!(Contiguous, [2], float*) image)
{
    for (size_t r; r < image.length!0; ++r)
    {
        for (size_t c; c < image.length!1; ++c)
        {
            // perform some processing on the image pixel at [r, c]
        }
    }
}
```
This code can be rewritten like so:
```d
import mir.ndslice.algorithm : each;

@fastmath void kernel(ref float e)
{
    // perform that processing from inside those loops
}

image.each!kernel;
```
So, instead of writing loops over the whole image, we utilize each to apply the given kernel function to each pixel. As the docs say, each iterates eagerly over the data. If the processing should instead be evaluated lazily, we can utilize map.
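As a minimal sketch of that lazy variant (the 2x2 data here is just assumed test input), nothing is computed until an element is actually read:

```d
// Lazy evaluation with map: the view costs nothing until it is accessed.
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : map;

auto image = [1.0f, 2, 3, 4].sliced(2, 2); // assumed 2x2 test data
auto doubled = image.map!(e => e * 2.0f);  // lazy view; no work done yet
assert(doubled[1, 1] == 8.0f);             // computed on element access
```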
Convolution
To make the example more concrete, let’s examine how we would implement classic image convolution with these algorithms. We’ll write a classic, C-style implementation and its analogue with mir-algorithm. We will annotate both variants with the @fastmath attribute, to be as fair as possible. Here is the most trivial C-style implementation:
```d
@fastmath
void convLoop
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) output,
    Slice!(Contiguous, [2], float*) kernel
)
{
    auto kr = kernel.length!0; // kernel row size
    auto kc = kernel.length!1; // kernel column size
    foreach (r; 0 .. output.length!0)
        foreach (c; 0 .. output.length!1)
        {
            // take a window into the input at the given pixel coordinate
            Slice!(Canonical, [2], float*) window = input[r .. r + kr, c .. c + kc];
            // calculate the result for the current pixel
            float v = 0.0f;
            foreach (cr; 0 .. kr)
                foreach (cc; 0 .. kc)
                    v += window[cr, cc] * kernel[cr, cc];
            output[r, c] = v;
        }
}
```
Now let’s examine how this would be implemented using mir-algorithm:
```d
static @fastmath float kapply(float v, float e, float k) @safe @nogc nothrow pure
{
    return v + (e * k);
}

void convAlgorithm
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) output,
    Slice!(Contiguous, [2], float*) kernel
)
{
    import mir.ndslice.algorithm : reduce;
    import mir.ndslice.topology : windows, map;

    auto mapping = input
        // look at each pixel through a kernel-sized window
        .windows(kernel.shape)
        // map each window to the resulting pixel using the convolution function
        .map!((window) {
            return reduce!(kapply)(0.0f, window, kernel);
        });

    // assign the mapped results to the output buffer
    output[] = mapping[];
}
```
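Before breaking the pipeline down, here is a hypothetical call site (names and sizes are ours, assumed for illustration), allocating the output buffer with the valid-convolution shape, i.e. input size minus kernel size plus one:

```d
// Hypothetical usage sketch for convAlgorithm.
import mir.ndslice.allocation : slice;
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : as, iota;

auto input  = iota(8, 8).as!float.slice;                  // 8x8 test image
auto kernel = [1f, 2, 1, 2, 4, 2, 1, 2, 1].sliced(3, 3);  // assumed 3x3 kernel
auto output = slice!float(6, 6);                          // 8 - 3 + 1 = 6
convAlgorithm(input, output, kernel);
```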
The pipeline version replaces the two double loops with a few magic calls:
- windows: a convenient selector that lets us look at each pixel through a kernel-sized window. It effectively replaces the first two loops of the C-style function, automatically giving us the window slice. See the sketch after this list.
- map: maps each element (here, each window) of a multidimensional slice with the given lambda.
- reduce: applies the reduction over each element of the window, multiplying it with the convolution kernel (mask) values. This replaces the third and fourth loops of the first function. It is also key to the performance improvement, since it is well suited to be auto-vectorized by LLVM.
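To make windows less magical, here is a tiny sketch (assumed sizes) of what it yields: a 4x4 slice viewed through 3x3 windows is a 2x2 slice whose elements are 3x3 sub-slices:

```d
// windows turns a slice into a slice of overlapping sub-slices.
import mir.ndslice.topology : iota, windows;

auto m = iota(4, 4);      // 4x4 matrix holding 0 .. 15
auto w = m.windows(3, 3); // 2x2 slice of 3x3 windows (4 - 3 + 1 = 2)
assert(w.length!0 == 2 && w.length!1 == 2);
assert(w[0, 0][2, 2] == 10); // bottom-right element of the top-left window
```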
Lazy evaluation?
As noted previously, map is evaluated lazily. At the end of the convAlgorithm function, we evaluate the mapping over the data and assign the resulting values to the output buffer. If we had instead needed a lazily evaluated convolution, we could have simply returned the mapping value from the function and evaluated it lazily afterwards.
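A minimal sketch of that lazy variant, reusing the kapply function from above (the name convLazy is ours):

```d
// Hypothetical lazy convolution: return the unevaluated mapping and let the
// caller decide when (and whether) to compute each pixel.
auto convLazy
(
    Slice!(Contiguous, [2], float*) input,
    Slice!(Contiguous, [2], float*) kernel
)
{
    import mir.ndslice.algorithm : reduce;
    import mir.ndslice.topology : map, windows;

    return input
        .windows(kernel.shape)
        .map!(window => reduce!(kapply)(0.0f, window, kernel));
}
```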
Comparison
The complete benchmark source is located in the Mir benchmarks section. Let’s compile this program and make a comparison:
```
dub run --build=release-nobounds --compiler=ldmd2 --single convolution.d
```
Output is:
```
Running ./convolution
loops = 1 sec, 649 ms, 99 μs, and 6 hnsecs
mir.ndslice.algorithm = 159 ms, 392 μs, and 9 hnsecs
```
So, for as little effort as this, we get a ~10x speedup! And hopefully many would agree that the variant written with mir-algorithm is much cleaner and less error-prone!
Zipped tensors
D offers two ways to zip multiple ranges: zip and lockstep. Both of these functions provide easy-to-use syntax and are very useful in common situations. Unfortunately, they are not as rewarding performance-wise for multidimensional processing with ndslice. Instead, mir.ndslice.topology.zip should be used. This function zips two slices of the same structure (shape and strides). It offers the same dimension-wise range interface as Slice does (and is thereby compatible with the rest of mir-algorithm), and it can perform faster than the general-purpose Phobos utilities zip and lockstep.
To explain this concept further, let’s examine the following function:
```d
@fastmath
void binarizationLockstep
(
    Slice!(Contiguous, [2], float*) input,
    float threshold,
    Slice!(Contiguous, [2], float*) output
)
in
{
    assert(output.shape == input.shape);
}
body
{
    import mir.ndslice.topology : flattened;
    import std.range : lockstep;

    foreach (i, ref o; lockstep(input.flattened, output.flattened))
    {
        o = (i > threshold) ? float(1) : float(0);
    }
}
```
So, this is the most basic binarization, based on a given threshold. To zip these matrices using std.range.lockstep, we first had to flatten them with mir.ndslice.topology.flattened, constructing vectors, i.e. one-dimensional views, so they could be passed to lockstep as classic D ranges.
Let’s replace std.range.lockstep with mir.ndslice.topology.zip and see how that affects the implementation:
```d
@fastmath @nogc nothrow @safe
void binarizationZip
(
    Slice!(Contiguous, [2], float*) input,
    float threshold,
    Slice!(Contiguous, [2], float*) output
)
{
    import mir.ndslice.algorithm : each;
    import mir.ndslice.topology : zip;

    zip(input, output).each!((z) {
        z.b = z.a > threshold ? 1.0f : 0.0f;
    });
}
```
This one looks pretty much the same as the previous one, with the slight difference of utilizing mir.ndslice.topology.zip instead of std.range.lockstep. Also, there’s now no need to flatten the slices, since mir.ndslice.algorithm.each works on multidimensional data. But if you take one more look at the signatures of these two functions, you’ll notice a difference in the annotated attributes:
```d
@fastmath
void binarizationLockstep(...);

@fastmath @nogc nothrow @safe
void binarizationZip(...);
```
std.range.lockstep is not nothrow, not @nogc, and not @safe, but mir.ndslice.topology.zip is!
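The practical payoff is that binarizationZip can be called from code carrying those attributes. A hypothetical sketch (the caller pipeline is ours, assuming the same imports as above):

```d
// Hypothetical caller: compiles only because binarizationZip is
// @safe @nogc nothrow; the lockstep variant would be rejected here.
void pipeline
(
    Slice!(Contiguous, [2], float*) img,
    Slice!(Contiguous, [2], float*) mask
) @safe @nogc nothrow
{
    binarizationZip(img, 0.5f, mask); // OK
    // binarizationLockstep(img, 0.5f, mask); // error: not @safe/@nogc/nothrow
}
```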
Now, let’s compare the impact of these changes on performance by running the benchmark program. The results are as follows:
```
Running ./binarization
lockstep = 169 ms, 871 μs, and 4 hnsecs
zip = 39 ms, 105 μs, and 9 hnsecs
```
So, mir.ndslice.topology.zip gives us about a 4.5x speedup. It is also important to note that it provides an interface compatible with the ndslice iteration algorithms, and gives us the freedom to write nothrow @nogc @safe code!
Conclusion
In these two examples we’ve achieved some nice performance improvements with very little effort by using mir-algorithm suite. We
have also seen the improvement could be even better if mir-algorithm solutions are applied to more complex code you might encounter
in numerical library such as DCV. We would argue that every newcomer to D, having an interest in numerical computing, should take
a close look at ndslice
and its submodules. And, we hope this post will inspire people to give it a spin, so we would have some
more projects built on top of it, growing our young scientific ecosystem in D!
Acknowledgements
Thanks to Ilya Yaroshenko, Sebastian Wilzbach, Andrei Alexandrescu and Johan Engelen for helping out with truly informative reviews!
[1] Mir works with LDC version 1.1.0-beta5 and later.