# Category: Probability and Statistics

## Testing Cliff RNG with DIEHARDER

My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria. I produced a file of 100 million 32-bit integers by multiplying the output values [1], which were […]

## Serious applications of a party trick

In a group of 30 people, it’s likely that two people have the same birthday. For a group of 23 the probability is about 1/2, and it goes up as the group gets larger. In a group of a few dozen people, it’s unlikely that anyone will have a particular birthday, but it’s likely that […]

## Per stirpes and random walks

If an inheritance is to be divided per stirpes, each descendant gets an equal share. If a descendant has died but has living descendants, his or her share is distributed by applying the rule recursively. Example For example, suppose a man had two children, Alice and Bob, and stipulates in his will that his estate […]

## Using one RNG to sample another

Suppose you have two pseudorandom bit generators. They’re both fast, but not suitable for cryptographic use. How might you combine them into one generator that is suitable for cryptography? Coppersmith et al [1] had a simple but effective approach which they call the shrinking generator. The idea is to use one bit stream to sample […]

## Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling. Given just a file name, shuf randomly permutes the lines of the file. With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. […]

## Comparing Truncation to Differential Privacy

Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year. Differential privacy obscures query values by injecting enough noise to keep from revealing information on an individual. Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy. Truncation First consider truncating birth date […]

## Random projection

Last night after dinner, the conversation turned to high-dimensional geometry. (I realize how odd that sentence sounds; I was with some unusual company.) Someone brought up the fact that two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. This is a surprising but well known fact. Next the conversation […]

## A truly horrible random number generator

I needed a bad random number generator for an illustration, and chose RANDU, possibly the worst random number generator that was ever widely deployed. Donald Knuth comments on RANDU in the second volume of his magnum opus. When this chapter was first written in the late 1960’s, a truly horrible random number generator called RANDU […]

## the joy of stats [book review]

David Spiegelhalter‘s latest book, The Art of Statistics: How to Learn from Data, has made it to Nature Book Review main entry this week. Under the title “the joy of stats”,  written by Evelyn Lamb, a freelance math and science writer from Salt Lake City, Utah. (I noticed that the book made it to Amazon […]

## Regression, modular arithmetic, and PQC

Linear regression Suppose you have a linear regression with a couple predictors and no intercept term: β1×1 + β2×2 = y + ε where the x‘s are inputs, the β are fixed but unknown, y is the output, and ε is random error. Given n observations (x1, x2, y + ε), linear regression estimates the parameters β1 […]

## Entropy extractor used in μRNG

Yesterday I mentioned μRNG, a true random number generator (TRNG) that takes physical sources of randomness as input. These sources are independent but non-uniform. This post will present the entropy extractor μRNG uses to take non-uniform bits as input and produce uniform bits as output. We will present Python code for playing with the entropy extractor. (μRNG […]

## Exploring the sum-product conjecture

Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A then considers the size of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we […]

## Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace! Normal and Laplace distributions A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, it pointed in the middle, like a pole […]

## Probabilisitic Identifiers in CCPA

The CCPA, the California Privacy Protection Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California. The law specifically mentions probabilistic identifiers. “Probabilistic identifier” means the identification of a consumer or a device to a […]

## Varsity versus junior varsity sports

Last night my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind. Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the […]

## Unstructured data is an oxymoron

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure. Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to […]

## Can I have the last four digits of your social?

Imagine this conversation. “Could you tell me your social security number?” “Absolutely not! That’s private.” “OK, how about just the last four digits?” “Oh, OK. That’s fine.” When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected […]