Simulating identification by zip code, sex, and birthdate

As mentioned in the previous post, Latanya Sweeney estimated that 87% of Americans can be identified by the combination of zip code, sex, and birth date. We’ll do a quick-and-dirty estimate and a simulation to show that this result is plausible. There’s no point being too realistic with a simulation because the actual data that Sweeney used is even more realistic. We’re just showing that her result is reasonable.

Quick estimate

Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.

Python simulation

The array demo below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of demographic values is only used once.

    from random import randrange
    from numpy import zeros
    zctasize = 9330
    demosize = 365*78*2
    demo = zeros(demosize)
    for _ in range(zctasize):
        d = randrange(demosize)
        demo[d] += 1
    unique = len(demo[demo == 1])

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.

Analytic solution

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are z inhabitants in our zip code and d demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

\frac{z}{d}\left(1 - \frac{1}{d}\right)^{z-1}

To find the expected number of demographic slots with exactly one entry we multiply by d, and to get the proportion of p of people this represents we divide by z.

\log p = (z-1)\log\left(1 - \frac{1}{d}\right) \approx - \frac{z}{d}

and so

p \approx \exp(-z/d)

which in our case is 84.8%, consistent with our simulation above.

Here we used the Taylor series approximation

\log(1 + x) = x + {\cal O}(x^2)

If z is of the same magnitude as d or smaller, then the error in the approximation above is O(1/d).

You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.

By the way, a similar calculation shows that the expected number of demographic slots containing two people is r exp(-r)/2 where rz/d. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.

Related posts

[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZTCA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.