A few days ago on the SAS/IML Support Community, there was an interesting discussion about how to simulate data from a truncated Poisson distribution. The SAS/IML user wanted to generate values from a Poisson distribution, but discard any zeros that are generated. This kind of simulation is known as an acceptance-rejection method (also known as rejection sampling) because you simulate data from a known distribution and accept some variates while rejecting others. For example, the following SAS/IML function generates N random values from the Poisson(λ) distribution, but rejects any values that are zero:
proc iml; /* Generate N observations from Poisson(lambda). Accept only positive variates */ start RandTPoisson( N, lambda); x = j(N,1); call randgen ( x, "POISSON", lambda); idx = loc(x>=1); /* find all positive values */ if ncol(idx)>0 then return( x[idx] ); /* return positive values */ else return( idx ); /* return empty matrix */ finish;
When you execute an acceptance-rejection algorithm, you might generate 1,000 variates, but accept a much smaller number. For example, the following statement asks for 1,000 variates, but only 376 are returned by the RandTPoisson function:
call randseed(1); N = 1000; /* ask for 1,000 */ lambda = 0.5; x = RandTPoisson(1000, lambda); print (nrow(x))[label="Num Obs"]; /* fewer are accepted */
On the discussion forum, the user had written a SAS/IML function that iterates inside a DO-WHILE loop until enough variates are generated. At each iteration, the algorithm tries to generate the remaining number of variates:
/* A typical implementation of acceptance-rejection method that uses a DO-WHILE loop and concatention. NOT EFFICIENT! */ start AcceptRejectRandTPoisson(N, lambda); r = N; do while (r > 0); Y = Y // RandTPoisson(r, lambda); r = N - nrow(Y); /* remaining observations */ end; return(Y); finish; /* test the function */ x = AcceptRejectRandTPoisson(1000, 0.5); print (nrow(x))[label="Num Obs"];
I often see accept-reject algorithms like this one, and in fact I have written a few myself. The algorithm works, but for many distributions you can write a more efficient algorithm that avoids the DO-WHILE loop. (In general, it can be challenging to write an efficient acceptance rejection algorithm in a vector language such as SAS/IML, R, or MATLAB.) As written, there are two efficiency problems with the DO-WHILE loop:
- Like Zeno's paradox, there is the problem of generating all N observations. The algorithm asks the RandTPoisson for some number of observations, but almost always get fewer observations than is requested.
- The program builds up the array dynamically in a loop, which is very inefficient.
In my forthcoming book, Simulating Data with SAS, I describe a more efficient implementation. The better way is to use statistical theory to predict how many observations you need to generate in order to obtain N that are accepted.
Today I'll show a "quick-and-dirty" method and next week I'll post a more precise statistical result.
For many acceptance-rejection algorithms, there is a well-known "base distribution" such as the normal or Poisson distribution. Observations are generated from the base distribution, and some are rejected. It is often possible to compute the probability that an observation is accepted. If so, you can avoid the inefficiencies in the DO-WHILE loop by simply calling the RandTPoisson function and requesting that it generate more observations than are needed.
Let's see how this works for the current example. The truncated Poisson distribution accepts only positive Poisson variates. You can use the CDF function, which is one of the four essential functions for statistical programmers, to compute the probability that a zero appears. That is the probability of rejection, so the probability of acceptance is computed as follows:
/* What is probability of a successful draw? */ p = 1 - cdf("Poisson", 0, lambda); /* 1 - P(X=0) */
If p is the probability of acceptance, then L = N / p is the number of Poisson variates that we should generate in order to obtain approximately N accepted values:
/* If you generate L = N/p variates, you expect N to be accepted */ L = ceil(N / p); x = RandTPoisson(L, lambda); print p[format=6.4] L (nrow(x))[label="Num Obs"]; /* remove the extraneous values, if necessary */ if nrow(x)>N then x = x[1:N];
The quick-and-dirty method worked for this one random sample. However, there is no guarantee of success. You expect to get N acceptable values, but sometimes you will obtain fewer.
There are two ways to increase the odds that you will obtain at least N accepted values. The non-statistical (but simple!) way is to use a multiplication factors. Instead of generating L = N / p variate, ask for, say, 20% more: L = 1.2 * N / p. That will usually work. However, a better way is to use statistics to be extremely confident that you will get at least N accepted variates. Next week I'll show the statistical technique.
Conclusion: In a vector language, do not generate random variates with a DO-WHILE loop. Instead, it is more efficient to call a vectorized algorithm and to ask for more variates than are needed. Some of the variates will be rejected, but you can often use statistics to compute the number of variates that you should generate so that at least N are accepted.
Please comment on the article here: The DO Loop