Suppose you want to test whether something you’re doing is having any effect. You take a few measurements and you compute the average. The average is different than what it would be if what you’re doing had no effect, but is the difference significant? That is, how likely is it that you might see the […]
Category: Probability and Statistics
Testing Rupert Miller’s suspicion
I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line: I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in […]
Estimating vocabulary size with Heaps’ law
Heaps’ law says that the number of unique words in a text of n words is approximated by V(n) = K nβ where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 […]
Estimating vocabulary size with Heaps’ law
Heaps’ law says that the number of unique words in a text of n words is approximated by V(n) = K nβ where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 […]
Testing Cliff RNG with DIEHARDER
My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria. I produced a file of 100 million 32-bit integers by multiplying the output values [1], which were […]
Serious applications of a party trick
In a group of 30 people, it’s likely that two people have the same birthday. For a group of 23 the probability is about 1/2, and it goes up as the group gets larger. In a group of a few dozen people, it’s unlikely that anyone will have a particular birthday, but it’s likely that […]
Per stirpes and random walks
If an inheritance is to be divided per stirpes, each descendant gets an equal share. If a descendant has died but has living descendants, his or her share is distributed by applying the rule recursively. Example For example, suppose a man had two children, Alice and Bob, and stipulates in his will that his estate […]
Using one RNG to sample another
Suppose you have two pseudorandom bit generators. They’re both fast, but not suitable for cryptographic use. How might you combine them into one generator that is suitable for cryptography? Coppersmith et al [1] had a simple but effective approach which they call the shrinking generator. The idea is to use one bit stream to sample […]
Random sampling from a file
I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling. Given just a file name, shuf randomly permutes the lines of the file. With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. […]
Comparing Truncation to Differential Privacy
Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year. Differential privacy obscures query values by injecting enough noise to keep from revealing information on an individual. Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy. Truncation First consider truncating birth date […]
Random projection
Last night after dinner, the conversation turned to high-dimensional geometry. (I realize how odd that sentence sounds; I was with some unusual company.) Someone brought up the fact that two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. This is a surprising but well known fact. Next the conversation […]
A truly horrible random number generator
I needed a bad random number generator for an illustration, and chose RANDU, possibly the worst random number generator that was ever widely deployed. Donald Knuth comments on RANDU in the second volume of his magnum opus. When this chapter was first written in the late 1960’s, a truly horrible random number generator called RANDU […]
the joy of stats [book review]
David Spiegelhalter‘s latest book, The Art of Statistics: How to Learn from Data, has made it to Nature Book Review main entry this week. Under the title “the joy of stats”, written by Evelyn Lamb, a freelance math and science writer from Salt Lake City, Utah. (I noticed that the book made it to Amazon […]
Regression, modular arithmetic, and PQC
Linear regression Suppose you have a linear regression with a couple predictors and no intercept term: β1×1 + β2×2 = y + ε where the x‘s are inputs, the β are fixed but unknown, y is the output, and ε is random error. Given n observations (x1, x2, y + ε), linear regression estimates the parameters β1 […]
Entropy extractor used in μRNG
Yesterday I mentioned μRNG, a true random number generator (TRNG) that takes physical sources of randomness as input. These sources are independent but non-uniform. This post will present the entropy extractor μRNG uses to take non-uniform bits as input and produce uniform bits as output. We will present Python code for playing with the entropy extractor. (μRNG […]
Exploring the sum-product conjecture
Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A then considers the size of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we […]
Normal approximation to Laplace distribution?
I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace! Normal and Laplace distributions A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, it pointed in the middle, like a pole […]
Probabilisitic Identifiers in CCPA
The CCPA, the California Privacy Protection Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California. The law specifically mentions probabilistic identifiers. “Probabilistic identifier” means the identification of a consumer or a device to a […]
Varsity versus junior varsity sports
Last night my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind. Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the […]
Unstructured data is an oxymoron
Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure. Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to […]