OpenBLAS is the default BLAS implementation for most numeric and scientific projects, for example the Julia Programming Language and NumPy. The OpenBLAS Haswell computation kernels were written in assembler.
Mir is an LLVM-Accelerated Generic Numerical Library for Science and Machine Learning. It requires LDC (LLVM D Compiler) for compilation. Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks. In addition, Mir GLAS Level 3 kernels are not unrolled and produce tiny binary code, so they put less pressure on the instruction cache in large applications.
Mir GLAS is truly generic compared with C++ Eigen. To add a new architecture or target, an engineer just needs to extend one small GLAS configuration file. As of October 2016, configurations are available for the X87, SSE2, AVX, and AVX2 instruction sets.
Machine and software
|CPU||2.2 GHz Core i7 (I7-4770HQ)|
|L3 Cache||6 MB|
|RAM||16 GB of 1600 MHz DDR3L SDRAM|
|OS||OS X 10.11.6|
|Mir GLAS||0.18.0, single thread|
|OpenBLAS||0.2.18, single thread|
|Eigen||3.3-rc1, single thread (sequential configurations)|
|Intel MKL||2017.0.098, single thread (sequential configurations)|
|Apple Accelerate||OS X 10.11.6, single thread (sequential configurations)|
The benchmark source code can be found here. It benchmarks Mir GLAS against a CBLAS implementation.
Mir GLAS has a native mir.ndslice interface. mir.ndslice is a development version of std.experimental.ndslice, which provides an N-dimensional equivalent of D’s built-in array slicing. GLAS uses Slice!(2, T*) for matrix representation. It is a plain structure composed of two lengths, two strides, and a pointer of type T*. The GLAS calling convention can be easily used in any programming language with C ABI support.
// Performs: c := alpha a x b + beta c
gemm(alpha, a, b, beta, c);
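To make the "two lengths, two strides, and a pointer" layout concrete, here is a minimal C sketch of a structure with that shape, plus a naive triple-loop product with the same five-argument form. The names `slice2d`, `slice2d_at`, `slice2d_row_major`, `slice2d_transposed`, and `naive_gemm` are hypothetical, and the field order and types are assumptions made for illustration. This is not the Mir GLAS ABI or kernel.

```c
#include <stddef.h>

/* Hypothetical C mirror of a 2-D slice: two lengths, two strides, a pointer.
   Field names, order, and types are assumptions for illustration only. */
typedef struct {
    size_t    lengths[2];  /* number of rows, number of columns */
    ptrdiff_t strides[2];  /* element step between rows, between columns */
    double   *ptr;         /* address of element (0, 0) */
} slice2d;

/* Address of element (i, j), computed from the strides. */
static double *slice2d_at(slice2d s, size_t i, size_t j) {
    return s.ptr + (ptrdiff_t)i * s.strides[0] + (ptrdiff_t)j * s.strides[1];
}

/* Row-major view over a contiguous buffer of rows * cols elements. */
static slice2d slice2d_row_major(double *data, size_t rows, size_t cols) {
    slice2d s = { { rows, cols }, { (ptrdiff_t)cols, 1 }, data };
    return s;
}

/* Transposed view: swap lengths and strides; no data is copied. */
static slice2d slice2d_transposed(slice2d s) {
    slice2d t = { { s.lengths[1], s.lengths[0] },
                  { s.strides[1], s.strides[0] },
                  s.ptr };
    return t;
}

/* Naive reference: c := alpha * (a x b) + beta * c.
   Returns 0 on success, -1 if the shapes do not match. */
static int naive_gemm(double alpha, slice2d a, slice2d b, double beta, slice2d c) {
    size_t m = a.lengths[0], k = a.lengths[1], n = b.lengths[1];
    if (b.lengths[0] != k || c.lengths[0] != m || c.lengths[1] != n) {
        return -1;
    }
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += *slice2d_at(a, i, p) * *slice2d_at(b, p, j);
            }
            double *cij = slice2d_at(c, i, j);
            *cij = alpha * sum + beta * *cij;
        }
    }
    return 0;
}
```

Because shape and layout travel inside each view, a transposed operand is passed as `slice2d_transposed(a)` rather than through an extra flag argument.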
On the other hand, the CBLAS interface is unwieldy:
void cblas_sgemm (
    const CBLAS_LAYOUT layout,
    const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB,
    const int M,
    const int N,
    const int K,
    const float alpha,
    const float *A,
    const int lda,
    const float *B,
    const int ldb,
    const float beta,
    float *C,
    const int ldc)
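The role of the dimension and leading-dimension arguments is easier to see in a plain reference loop. The sketch below assumes the row-major layout with neither operand transposed, which is only one of the cases the real routine handles. `ref_sgemm_row_major` is a hypothetical name, not a CBLAS function.

```c
#include <stddef.h>

/* Hypothetical reference routine, not part of any BLAS library.
   Computes C := alpha * (A x B) + beta * C for row-major,
   non-transposed operands only.
   A is M x K, B is K x N, C is M x N.
   lda, ldb, ldc are the distances, in elements, between the starts of
   consecutive rows of A, B, and C. They may exceed the row length when
   a matrix is a window into a larger buffer. */
static void ref_sgemm_row_major(int M, int N, int K,
                                float alpha, const float *A, int lda,
                                const float *B, int ldb,
                                float beta, float *C, int ldc)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < K; ++p) {
                sum += A[(size_t)i * lda + p] * B[(size_t)p * ldb + j];
            }
            C[(size_t)i * ldc + j] = alpha * sum + beta * C[(size_t)i * ldc + j];
        }
    }
}
```

The caller has to keep M, N, K and the three leading dimensions consistent with the buffers by hand, since none of that information is attached to the pointers themselves.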
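Environment variables that limit a CBLAS implementation to a single thread: OPENBLAS_NUM_THREADS=1 for OpenBLAS, VECLIB_MAXIMUM_THREADS=1 for Apple Accelerate, and MKL_NUM_THREADS=1 for Intel MKL.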
OpenBLAS and Intel MKL have sequential configurations. Sequential configuration is preferred for benchmarks.
Eigen should be built with:
mkdir build_dir
cd build_dir
cmake -DCMAKE_BUILD_TYPE=Release -DEIGEN_TEST_AVX=ON -DEIGEN_TEST_FMA=ON ..
make blas
Eigen 3.3-rc1 provides the Fortran BLAS interface. Netlib’s CBLAS library can be used for the benchmark to provide a CBLAS interface on top of Eigen.
There are four benchmarks, two charts per benchmark. The first chart represents absolute values, the second chart represents normalised values.
- single precision numbers
- double precision numbers
- single precision complex numbers
- double precision complex numbers
Higher is better.
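For readers who want to turn their own timings into a "higher is better" figure, one common convention is throughput in GFLOPS. The sketch below assumes that convention: it counts 2·m·n·k floating-point operations for a real matrix product and 8·m·n·k for a complex one. Whether the charts here use exactly this metric is an assumption, and `gemm_flops` and `gemm_gflops` are hypothetical helper names.

```c
/* Conventional operation count for one (m x k) by (k x n) matrix product:
   each output element needs k multiply-adds, i.e. 2*k real operations.
   A complex multiply-add is counted as 8 real operations
   (4 multiplies and 4 additions). */
static double gemm_flops(double m, double n, double k, int is_complex) {
    return (is_complex ? 8.0 : 2.0) * m * n * k;
}

/* Throughput in GFLOPS for a product that took `seconds` to compute. */
static double gemm_gflops(double m, double n, double k,
                          int is_complex, double seconds) {
    return gemm_flops(m, n, k, is_complex) / seconds / 1e9;
}
```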
Mir GLAS is significantly faster than OpenBLAS and Apple Accelerate for virtually all benchmarks and parameters, and two times faster than Eigen and Apple Accelerate for complex matrix multiplication. The average performance of Mir GLAS is equal to that of Intel MKL, which is the best for Intel CPUs. Due to its simple and generic architecture, Mir GLAS can be easily configured for new targets.
Andrei Alexandrescu, Martin Nowak, Mike Parker, Johan Engelen, Ali Çehreli, Joseph Rushton Wakeling.