This post presents performance benchmarks for general matrix-matrix multiplication (GEMM), comparing Mir GLAS with OpenBLAS, Eigen, and two closed-source BLAS implementations: Intel MKL and Apple Accelerate.

OpenBLAS is the default BLAS implementation for most numeric and scientific projects, for example the Julia Programming Language and NumPy. The OpenBLAS Haswell computation kernels were written in assembly.

Mir is an LLVM-Accelerated Generic Numerical Library for Science and Machine Learning. It requires LDC (LLVM D Compiler) for compilation. Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks. In addition, Mir GLAS Level 3 kernels are not unrolled and produce tiny binary code, so they put less pressure on the instruction cache in large applications.

Mir GLAS is truly generic compared with C++'s Eigen. To add a new architecture or target, an engineer only needs to extend one small GLAS configuration file. As of October 2016, configurations are available for the X87, SSE2, AVX, and AVX2 instruction sets.

Machine and software

CPU: 2.2 GHz Intel Core i7 (i7-4770HQ)
L3 Cache: 6 MB
RAM: 16 GB of 1600 MHz DDR3L SDRAM
Model Identifier: MacBookPro11,2
OS: OS X 10.11.6
Mir GLAS: 0.18.0, single thread
OpenBLAS: 0.2.18, single thread
Eigen: 3.3-rc1, single thread (sequential configuration)
Intel MKL: 2017.0.098, single thread (sequential configuration)
Apple Accelerate: OS X 10.11.6, single thread (sequential configuration)

Source code

The benchmark source code can be found here. It contains a benchmark of Mir against a CBLAS implementation.

Mir GLAS has a native mir.ndslice interface. mir.ndslice is a development version of std.experimental.ndslice, which provides an N-dimensional equivalent of D’s built-in array slicing. GLAS uses Slice!(2, T*) for matrix representation. It is a plain structure composed of two lengths, two strides, and a pointer of type T*. The GLAS calling convention can easily be used from any programming language with C ABI support.

// Performs: c := alpha * a * b + beta * c
gemm(alpha, a, b, beta, c);

The CBLAS interface, on the other hand, is unwieldy:

void cblas_sgemm (
	const CBLAS_LAYOUT layout,
	const int M,
	const int N,
	const int K,
	const float alpha,
	const float *A,
	const int lda,
	const float *B,
	const int ldb,
	const float beta,
	float *C,
	const int ldc);

Environment variables to set a single thread for CBLAS


OpenBLAS and Intel MKL also provide sequential (single-threaded) configurations. A sequential configuration is preferred for benchmarks.
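For threaded builds, the following variables are commonly used to force a single thread. The exact variable names depend on the library version, so treat these as assumptions and check your BLAS documentation:

```shell
# OpenBLAS (threaded build)
export OPENBLAS_NUM_THREADS=1
# Intel MKL
export MKL_NUM_THREADS=1
# Apple Accelerate (vecLib)
export VECLIB_MAXIMUM_THREADS=1
```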

Building Eigen

Eigen should be built with the EIGEN_TEST_AVX and EIGEN_TEST_FMA flags:

mkdir build_dir
cd build_dir
cmake -DEIGEN_TEST_AVX=ON -DEIGEN_TEST_FMA=ON ..
make blas

Eigen 3.3-rc1 provides the Fortran BLAS interface. Netlib’s CBLAS library can be used in the benchmark to provide a CBLAS interface on top of Eigen.


Benchmarks

There are four benchmarks, with two charts per benchmark. The first chart shows absolute values; the second shows normalised values.

  • single precision numbers
  • double precision numbers
  • single precision complex numbers
  • double precision complex numbers

Higher is better.


Conclusion

Mir GLAS is significantly faster than OpenBLAS and Apple Accelerate in virtually all benchmarks and parameter ranges, and two times faster than Eigen and Apple Accelerate for complex matrix multiplication. Mir GLAS's average performance equals that of Intel MKL, the best-performing implementation for Intel CPUs. Due to its simple and generic architecture, it can easily be configured for new targets.


Acknowledgements

Thanks to Andrei Alexandrescu, Martin Nowak, Mike Parker, Johan Engelen, Ali Çehreli, and Joseph Rushton Wakeling.