Generate binary outcomes with varying probability

January 16, 2013

(This article was originally published at The DO Loop, and syndicated at StatsBlogs.)

A while ago I saw a blog post on how to simulate Bernoulli outcomes when the probability of generating a 1 (success) varies from observation to observation. I've done this often in SAS, both in the DATA step and in the SAS/IML language. For example, when simulating data that satisfied a logistic regression model, the probability of success is a function of a linear combination of the explanatory variables. Accordingly, the probability of the binary response varies from observation to observation.

In vector languages such as SAS/IML, the key to efficient simulation is often avoiding writing a loop. In SAS/IML, you can use a clever trick to ensure that your simulation is efficient: call the Base SAS RAND function instead of the SAS/IML RANDGEN routine! This may seem strange to experienced SAS/IML programmers. Why would anyone call the RAND function from within SAS/IML? Isn't it the case that the RAND function returns a single scalar value, whereas a single call to the more efficient RANDGEN call fills an entire matrix with random values?

Well, yes, and no. If you call the RAND function from a SAS/IML program, the function returns a scalar value for each parameter that you specify as an argument. (This is generally true when calling Base SAS functions from SAS/IML software.) This means that the RAND function returns a vector of random values when you pass in a vector of parameters.

Let's see how this works. Suppose I want to sample observations from the Bernoulli distribution with probability of success p. For the first 10 observations I want to use p=0.05. For the next 10, I want to use p=0.5. For the last 10, p=0.95. The following SAS/IML statements create a 10 x 3 matrix whose columns contain the specified probabilities. The matrix is passed to the RAND function, which returns—tah-dah!—a 10 x 3 matrix of values where each column is a sample from a Bernoulli distribution with a different parameter.

proc iml;
prob = {0.05 0.5 0.95};
p = repeat(prob, 10);   /* repeat row 10 times */
call streaminit(321);   /* or call randseed(321); */
x = rand("Bernoulli", p);
print x;

Notice that the first column of the resulting matrix is all zeros because the probability of getting a one is only 0.05. The middle column (p=0.5) is roughly half zeros and half ones, and the third column (p=0.95) is almost all ones. There are no loops required.

As a more contrived example, suppose that the probability of some event occurring has a cyclical nature. This might model, for example, the probability of buying a seasonal item such as a winter coat or a swimsuit. The following SAS/IML program create a vector of parameters that vary sinusoidally between ± 0.9. Binary outcomes are generated according to the parameter vector:

t = do(0, 2*constant("PI"), 0.1);
r = 0.5 + 0.4*sin(2*t);
/*  simulate each observation from a different distribution */
y = rand("Bernoulli", r);

To visualize the probability and the outcomes, you can create a graph that shows the parameter values and overlays a scatter plot that shows whether each binary outcome was 0 or 1:

create Bern var {t r y}; append; close Bern;
proc sgplot data=Bern noautolegend;
series  x=t y=r;
scatter x=t y=y / Y2AXIS;
refline 0.5 / axis=y;
refline 1.57 3.14 4.71 / axis=x; /* pi/2, pi, 3pi/2, 2pi */ 
yaxis  label="Probability" min=0 max=1;
y2axis label="Outcome" values=(0 1);
xaxis display=(nolabel) valueshint values=(0 1.57 2 3.14 4 4.71 6);

In summary, you can call the RAND function from a SAS/IML program and pass in a vector of parameter values. The result is a vector of random values, where the ith value is drawn from a distribution with the ith parameter values.

Of course, you can also generate data like these by using the DATA step. Furthermore, in SAS/IML 12.1 and beyond, the RANDGEN function supports vectors of parameters.

tags: Efficiency, Sampling and Simulation, Statistical Programming, Tips and Techniques

Please comment on the article here: The DO Loop

Tags: , , , ,