:

## Generating Samples from Custom Probability Distributions (I)

Maple 14

Maple's Statistics package contains many predefined probability distributions; well-known ones such as the normal distribution and lesser-known ones such as the Gumbel distribution. For these distributions, we ship efficient algorithms that can quickly generate a large number of sample points. To generate a sample of size 106 of both of these distributions, and print the time it took to do this (in seconds), you can run the following:

`with(Statistics):t0 := time(); Sample(Normal(0, 1), 10^6); time()-t0;t0 := time(); Sample(Gumbel(0, 1/2), 10^6); time()-t0;`

(On my development machine, the times printed are 0.050 and 0.190 seconds, respectively.) But sometimes you need to use a probability distribution that is not predefined, and then you can define a custom distribution, for example as follows:

`f := t -> piecewise(t < 0, 0, t <= 1, t^2*(4/3 - t)*36/7, 0);dist := Distribution(PDF = f);X := RandomVariable(dist);`

The density of this random variable looks as follows: Generating a sample from this distribution is somewhat slower than with the two predefined distributions referred to above (0.830 seconds for a million points on my machine), but it's still quite fast. This blog post (in several parts) will be about the algorithm Maple uses to generate sample values from X; the three later parts are Generating Samples from Custom Probability Distributions (II), Generating Samples from Custom Probability Distributions (III), and Generating Samples from Custom Probability Distributions (IV).

There are a few generally applicable strategies for generating samples that have been described in literature; the one that Maple uses is a form of the accept/reject strategy. I won't call it an algorithm, because an important step is unspecified and up to the implementor. This step is to find a (typically piecewise continuous) function g(t) defined on the support of X (the set of points where f(t) > 0) that satisfies three requirements; the first two are

1.  g(t) >= f(t) for t in the support of X, and
2.  int(g(t), t = support(X)) is finite.

(Note that int(g(t), t = support(X)) >= int(f(t), t = support(X)) = 1.) Let's define G := int(g(t), t = support(X)), so that g(t)/G is a probability distribution function. Let Y be a random variable with g(t)/G as its PDF; the final requirement on g is that

3.  Y is easy to sample.

In order to generate a sample from X, we first generate a sample from Y. For every value t that is in this sample, we mark it as accepted with probability f(t) / g(t) and rejected otherwise; together, the accepted values are to be returned as a sample of X. For the example above, we can take This leads to G = 2 so that Y is uniformly distributed over the interval [0, 1]. We can generate a sample from Y as follows:

`samplePoints := Sample(Uniform(0, 1), 10^4);`

The process of accepting some of these values and rejecting others lends itself very well to a nice visualization. We will represent each of the sample values by a point in 2D, where the x-coordinate is the sample value t, and the y-coordinate is a uniformly distributed random value between 0 and g(t). By construction, these points are uniformly distributed over the area under the curve of g(t). (In the current example, that is trivial, but it also holds if Y is not a uniform random variable.) Finally the points that are under the graph of f(t) are accepted, and the others are rejected. This looks as follows in Maple.

`yCoordinates := Sample(Uniform(0, 2), 10^4);accepted, rejected := selectremove(i -> yCoordinates[i] < f(samplePoints[i]) end proc, [seq(1 .. 10000)]);myPointPlot := proc (xvalues::{list, rtable}, yvalues::{list, rtable}, color::name, label::name, \$)  return Statistics:-PointPlot(yvalues, ':-xcoords' = convert(xvalues, list),     ':-color' = color, ':-symbolsize' = 1, ':-legend' = label) end proc;plots:-display(myPointPlot(samplePoints[accepted], yCoordinates[accepted], green, 'accepted'),   myPointPlot(samplePoints[rejected], yCoordinates[rejected], red, 'rejected'));`

which shows this graph for one run on my machine: If I run

`nops(accepted), nops(rejected);`

then the values returned are 4979 and 5021, so we see that about half of the points were rejected (red). This is to be expected, because the area under the curve g(t) is 2, whereas the area under f(t) is 1. The green points form the final sample of accepted points that is to be returned, and visually, it is plausible that this indeed gives the correct distribution. But can we do this more efficiently?

The answer is of course, yes we can. We will see several improvements in the following parts of this blog post!

The content of this first part of the post can also be viewed in this worksheet: random-generation-1.mw . ﻿