Transformed density rejection sampling

In the first post a short introduction to non-uniform sampling was given. This post explains the Tinflex algorithm, which is implemented in mir.random.flex.

Idea

The main idea of the Flex algorithm is to split the density distribution into multiple intervals and sample from these constructed intervals using the composition method. For each interval a hat and squeeze function can be constructed. Thus with the approximated hat and squeeze function of a selected interval a value $x$ can be sampled using the rejection with inversion method.

For example, with efficiency $\rho = 1.8$ the normal distribution is split into six intervals with different hat and squeeze functions. The efficiency of the sampling can iteratively be improved by splitting the composition into more intervals and thus yielding a better approximation to the density function.

\rho = 1.8

, 6 intervals constructed

\rho = 1.25

, 10 intervals constructed

\rho = 1.1

, 14 intervals constructed

\rho = 1.01

, 44 intervals constructed

In the following we will quickly dive into the essential basics behind the Tinflex algorithm. In case your last math lesson was long ago, feel free to jump directly to the examples and come back when you want to understand the inner workings of the Tinflex algorithm in more details.

Tinflex transformations

Remember that the Tinflex algorithm splits the density distribution into intervals and constructs a hat and squeeze function for each interval. For the approximation of a hat or squeeze function, the Tinflex algorithm constructs linear functions for concave, convex intervals (or both if the interval contains only one inflection point). However as most distributions aren’t concave, but log-concave, Tinflex uses a family of transformations for the probability density function (pdf) as it operates in logspace and can thus “force” most distributions to be concave. The transformations can be grouped in two classes: (1) $c \neq 0$ and (2) $c = 0$ , which will subsequently be analyzed.

$c$ -transformations with $c \neq 0$

The general transformation function is $T(x) = x^c$ , however as its inverse function is only defined in $\mathbb{C}$ , its needs to be restricted to positive numbers only:

\displaystyle T_c(x) = sgn(c) \cdot x^c

With the usual inversion rules and if $c > 0$ we get:

\displaystyle \begin{aligned} y &= x^c \\ log(y) &= log(x^c) \\ log(y) &= c \cdot log(x) \\ \frac{log(y)}{c} &= log(x) \\ exp\left(\frac{log(y)}{c}\right) &= x \end{aligned}

which is $y^{1 / c}$ and if $c < 0$ we obtain:

\displaystyle \begin{aligned} y &= -x^c \\ log(y) &= log(-x^c) \\ log(y) &= -c \cdot log(x) \\ \frac{log(y)}{c} &= -log(x) \\ exp\left(\frac{log(y)}{c}\right) &= -x \end{aligned}

which is $(-y)^{1 / c}$ .

Thus in general the inverse function is defined as:

\displaystyle T_c^{-1}(x) = (sgn(c) \cdot x)^{\frac{1}{c}}

Two examples of $T_c$ transformations can be seen below:

x^2

(blue) and it’s inverse

x^{1/2}

(green)

-x^{-0.5}

(blue) and it’s inverse

(-x)^{1/-0.5}

(green)

Please note that (1) in the Tinflex paper and Mir implementation this case is split into multiple common cases to obtain simpler formulas and thus reduce numerical errors. Furthermore, (2) only with $c = 0$ the transformation is two-sided, thus for $c < 0$ the inverse transformation $T^{-1}(x)$ in $\mathbb{R}$ is only defined for $x < 0$ .

$c$ -transformations with $c = 0$

The special case for $c = 0$ needs to be made as division by zero isn’t defined, however for this case the natural logarithm and it’s inverse the exponential function can be used:

\displaystyle T_c(x) = log(x) \quad T_c^{-1}(x) = exp(x)

Exponential function (green) and it’s inverse the natural log (blue)

Input of Tinflex

Remember that $f(x)$ is the pdf, and is its transformation $\tilde{f}(x)$ is:

\displaystyle \tilde{f}(x) = T_c(f(x))

The Tinflex algorithm expects the log-density, which means that for $c = 0$ , no transformation needs to be applied. However, for $c \neq 0$ the inverse is needed. This mean we first need to apply the inverse $T_c^{-1}(x) = exp(x)$ and then apply the other $T_c$ transformation:

\displaystyle \begin{aligned} \tilde{f}(x) &= T_{c \neq 0}(T_0^{-1}(x))) \\ &= T_{c \neq 0}(exp(x))) \\ &= sgn(c) \cdot exp(x)^c \\ &= sgn(c) \cdot exp(x \cdot c) \end{aligned}

Linear functions

Construction of linear functions

For every interval with function $\tilde{f}$ (black), we define the three linear functions: its left tangent (orange), its right tangent (red), and its secant (blue):

Concave interval

Convex interval

As it can be seen in the graphs of the two major case categories (concave, convex) at least one linear function is majorizing $\tilde{f}$ (e.g. both tangents in the concave case), and at least one linear function is majorized by $\tilde{f}$ (e.g. the secant in the concave case). Botts et. al. (2011) distinguish between eight cases and prove that this observation of finding at least one linear hat and squeeze function is valid as long as there is at most one inflection point in the interval.

Definition of the linear functions

The following definition for linear functions is used:

\displaystyle \tilde{f}(x) = \alpha + \beta \cdot (x - x_0)

Remember that the transformation is applied with $T_c(x)$ and that the hat function majorizes the transformed pdf, whereas the transformed pdf majorizes the squeeze function.

Both the left and right tangent can be defined as follows, where $x_0$ is either $l$ or $r$ :

\displaystyle tan_{x_0}(x) = \tilde{f}(x_0) + \tilde{f}'(x_0) \cdot (x - x_0)

Moreover for the secant we can pick either $\ell$ or $r$ as root point, and thus only the definition of the slope $\beta$ is different:

\displaystyle sec(x) = \tilde{f}(\ell) + \frac{\tilde{f}(r) - \tilde{f}(\ell)}{r - \ell} \cdot (x - \ell)

With these definition we can integrate the linear function to calculate the area below the hat and squeeze function. Furthermore, using the their integral the inverse function for the cumulative density function can be calculated, which is used to map a uniform point $u$ to a point $x$ which is proportional to the density of $h(x)$ . These details can be found in Tinflex paper and in the Mir implementation. For mir.random.flex additional supplementary material is provided which explains the used integrations, inversions and approximations. In the following a more practical side of the Tinflex algorithm will be given by showing two examples for the construction of a random distribution with the Tinflex method.

Example 1: Dagum distribution

Assume you would want to draw random values from a non-common random distribution like e.g. the Dagum distribution which has the following probability density function:

\displaystyle f(x) = \frac{ap}{x} \left( \frac{ (\frac{x}{b})^{ap}}{\left((\frac{x}{a})^a + 1\right)^{p+1}} \right)

For simplicity of this example we fix $a$ , $b$ and $p$ to $1$ and thus obtain:

\displaystyle f(x) = \frac{1}{(x + 1)^2}

For the Tinflex algorithm the pdf function and its first and second derivative need to be given in log-space:

\displaystyle \tilde{f}(x) = log \left( \frac{1}{(x + 1)^2} \right)

Its first and second derivative after $x$ are:

\displaystyle \tilde{f}'(x) = - \frac{2}{x + 1} \quad \tilde{f}''(x) = \frac{2}{(x + 1)^2}

In case you haven’t used the differentiation syntax on Wolfram Alpha, you should familiarize yourself with it. For example, the correctness of the first derivate and second derivative can be quickly checked. Last but not least, the Tinflex algorithm requires that each interval can have at most one inflection point. Hence we need to check the plot of $\tilde{f}(x)$ for inflection points. However as there none if $x > 0$ , we can freely pick any interval as starting intervals and can thus construct the random distribution:

import std.math : log, pow;

alias S = double;

// log-transformed pdf + first two derivatives
auto f0 = (S x) => cast(S) log(1 / pow(x + 1, 2));
auto f1 = (S x) => -2 / (1 + x);
auto f2 = (S x) => 2 / pow(1 + x, 2);

// which c-transformation function to use
S c = 0.5;

// the range of the interval that should be sampled from
// if there are multiple inflection points, more points need to be given
S[] points = [0, 10, S.max];

// target efficiency of hatArea / squeezeArea
S rho = 1.1;

flex(f0, f1, f2, c, points, rho);

Consecutive histogram

Cumulative histogram

We can also have a look under the hood and inspect hat / squeeze plot of the Dagum distribution (hat: red, squeeze: green). For a clearer visibility only the range $[0, 3]$ is shown:

\rho = 1.5

, 4 intervals

\rho = 1.1

, 8 intervals

As it can be seen $\rho$ has a huge impact on the resulting speed, but not on the statistical quality. Similarly the predefined intervals have only a slight impact on the algorithm itself as e.g. they limit the left and right start border. Picking a $c$ -transformation can influence the numerical errors, construction and sampling performance and even the number of intervals. Hence usually a pre-definded transformation like $0$ (the “pure” log transformation) or $1$ is used. However there are some restrictions, for example for unbounded domains $c > -1$ is required, but $c$ also needs to be sufficiently small, so usually $-0.5$ is used.

Example 2: Gompertz distribution

Assume you would want to sample from the Gompertz distribution which has the following probability density function:

\displaystyle f(x) = b \eta e^{bx} \ e^{\eta} \ exp(-\eta e^{bx})

For simplicity, we set $\eta = 0.005$ and $b = 1.5$ and thus get

\displaystyle f(x) = 0.00753759 e^{-0.005 e^{1.5 x} + 1.5 x}

And consequently in log-space:

\displaystyle \tilde{f}(x) = log \left( 0.00753759 e^{-0.005 e^{1.5 x} + 1.5 x} \right)

Its first and second derivative after $x$ are:

\displaystyle \tilde{f}'(x) = 1.5-0.0075 e^{1.5 x} \quad \tilde{f}''(x) = -0.01125 e^{1.5 x}

Note that for log-density also no inflection point exists and thus we can define our sampling procedure:

import std.math : log, pow;

alias S = double;

// log-transformed pdf + first two derivatives
auto f0 = (S x) => cast(S) log( S(0.00753759) * exp(S(-0.005) * exp(S(1.5) * x) + S(1.5) * x));
auto f1 = (S x) => S(1.5) - S(0.0075) * exp(1.5 * x);
auto f2 = (S x) => S(-0.01125) * exp(S(1.5) * x);

// which c-transformation function to use
S c = 1.5;

// the range of the interval that should be sampled from
S[] points = [0, 6, S.max];

// target efficiency of hatArea / squeezeArea
S rho = 1.1;

flex(f0, f1, f2, c, points, rho);

Consecutive histogram

Cumulative histogram

We can also have a look under the hood and inspect hat / squeeze plot of the Gompertz distribution (hat: red, squeeze: green). For a clearer visibility only the range $[0, 6]$ is shown:

\rho = 1.5

, 8 intervals

\rho = 1.1

, 14 intervals

Benchmarks

Constructing intervals

In the following the R Tinflex implementation and the Mir Flex implementation are compared. For the benchmark the time to construct the (Tin)flex intervals with efficiency $\rho$ for $n$ iterations has been measured:

$\rho$	n	Flex	Tinflex
1.1	10K	122 ms	79.473s
1.01	1K	46 ms	21.229s
1.001	1K	169 ms	79.047s

Due to the enormous time consumption of the R implementation, the number of iterations has been reduced from 10K to 1K for higher $\rho$ . With the Mir Flex implementation a performance boost for the construction phase of the factor 500-600 in comparison to the R Tinflex implementation can be observed.

Not only has the D implementation been engineered very carefully with performance in mind, but it also uses several tricks like a BinaryHeap to pick the best interval to split, so that the number of constructed intervals is also lower:

$\rho$	Flex	Tinflex
1.5	8	9
1.1	15	20
1.01	45	49
1.001	141	175
1.0001	452	496
1.00001	1361	1824

The source of this benchmark is available online.

Sampling

A benchmark between multiple existing normal distribution samplers of sampling 10M random values has been made. All existing libraries (dstats, hap, atmosphere) use the Box-Muller transformation to sample random values. Moreover intervals with mir.random.flex with three different efficiencies $\rho$ ( $1.3, 1.1, 1.001$ ) were constructed. The last method is based on the WIP implementation of the Ziggurat algorithm, which is a specialized algorithm for monotone decreasing distributions.

Method	Avg. time	Std.dev. time
boxMuller.naive	850 ms	82 ms
boxMuller.dstats	801 ms	2 ms
boxMuller.hap	3618 ms	4 ms
boxMuller.atmos	934 ms	2 ms
flexNormal.slow	1406 ms	4 ms
flexNormal.medium	1206 ms	1 ms
flexNormal.fast	1124 ms	0 ms
ziggurat	357 ms	5 ms
clang++	2216 ms	6 ms
g++	941 ms	11 ms

While one should look with care at performance benchmarks as statistical quality is usually a higher goal than speed, it still can be seen that (1) investing more time in constructing better approximated intervals is worth the investment if more values need to be drawn and (2) there are specialized algorithm for very common random distributions that of course should be preferred over generalized methods.

However, after all the goal of the Flex algorithm isn’t to create the fastest random method, but a general applicable method that can be used to create samplers for new distributions quickly.

The source of this benchmark is directly available within the Mir library.

Where to go from here

The Tinflex algorithm is available via the Mir library in mir.random.flex. Apart from the examples listed in this blog, more examples are provided as part of Mir and can be conveniently executed with D’s package manager dub. If you want to dive deeper, you can also read the Tinflex paper, browse the Mir implementation or see the supplementary material for the Mir implementation in which the used equations will be explained.

Literature references

Devroye, Luc. “Sample-based non-uniform random variate generation.” Proceedings of the 18th conference on Winter simulation. ACM, 1986.
Hörmann, Wolfgang. “A rejection technique for sampling from T-concave distributions.” ACM Transactions on Mathematical Software (TOMS) 21.2 (1995): 182-193.
Leydold, Josef, et al. “An automatic code generator for nonuniform random variate generation.” Mathematics and Computers in Simulation 62.3 (2003): 405-412.
Botts, Carsten, Wolfgang Hörmann, and Josef Leydold. “Transformed density rejection with inflection points.” Statistics and Computing 23.2 (2013): 251-260.
Hörmann, Wolfgang, Josef Leydold, and Gerhard Derflinger. Automatic nonuniform random variate generation. Springer Science & Business Media, 2013.