Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here.

So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip codes cover more people than five-digit zip codes on average, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and recode the sparsely populated zip3s as 000, then you’re OK, right?
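
As a rough illustration of the recoding step, here is a minimal Python sketch. The 20,000-person cutoff is the threshold the Safe Harbor rule uses for three-digit zip codes. The function name is made up for the example, and so is the second population figure; only the 834 figure is the approximate one used in this post.

```python
# Safe Harbor: zip3s containing 20,000 or fewer people are recoded as 000.
SAFE_HARBOR_MIN_POP = 20_000


def generalize_zip(zip5: str, zip3_population: dict[str, int]) -> str:
    """Return the zip3 for a five-digit zip code, or '000' if the zip3 is too small."""
    zip3 = zip5[:3]
    # Treat an unknown zip3 as sparsely populated to stay on the safe side.
    if zip3_population.get(zip3, 0) <= SAFE_HARBOR_MIN_POP:
        return "000"
    return zip3


# Illustrative populations only; the "999" entry is invented.
populations = {"834": 200_000, "999": 1_500}

print(generalize_zip("83414", populations))  # 834
print(generalize_zip("99950", populations))  # 000
```

Defaulting unknown zip3s to 000 is a design choice for the sketch: it errs toward hiding too much rather than too little.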

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five-digit zip codes are each entirely contained within a state as far as I know. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of these are in Idaho, but about 500 live in zip code 83414, which is in Wyoming.

On average, a three-digit zip code covers about 350,000 people. And so most of the time, the combination of zip3 and state also covers about 350,000 people, because adding the state doesn't narrow anything down. But in the example above, the combination of zip3 and state might narrow down to 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [1]
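
One way to catch cases like this before releasing data is to tabulate how many people fall in each (zip3, state) cell rather than in each zip3 alone. The sketch below assumes you have a table of five-digit zip code populations with their states. The rows here are placeholders built from the rough numbers above, with all the Idaho zip codes in 834 lumped into a single stand-in entry, and reusing the Safe Harbor cutoff of 20,000 for these cells is my assumption, not something the rule requires.

```python
from collections import defaultdict

# (zip5, state, population). Placeholder rows: "834xx" stands in for all the
# Idaho zip codes in 834; the counts are the approximate figures quoted above.
rows = [
    ("834xx", "ID", 199_500),
    ("83414", "WY", 500),
]


def cell_sizes(rows):
    """Sum population over each (zip3, state) combination."""
    sizes = defaultdict(int)
    for zip5, state, pop in rows:
        sizes[(zip5[:3], state)] += pop
    return dict(sizes)


def small_cells(rows, threshold=20_000):
    """Return the (zip3, state) cells with population at or below the threshold."""
    return {cell: n for cell, n in cell_sizes(rows).items() if n <= threshold}


print(small_cells(rows))  # {('834', 'WY'): 500}
```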

[1] The 25% figure came from exp(-500/365). See this post for details.
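
The approximation in the footnote is easy to check numerically. Assuming birthdays are independent and uniform over 365 days, the probability that none of the other n - 1 people share a given person's birthday is (364/365)^(n - 1), which is close to exp(-n/365). The function names below are just for illustration.

```python
from math import exp


def unique_birthday_exact(n: int, days: int = 365) -> float:
    """Probability that nobody else in a group of n shares a given person's birthday."""
    return (1 - 1 / days) ** (n - 1)


def unique_birthday_approx(n: int, days: int = 365) -> float:
    """Exponential approximation used in the footnote."""
    return exp(-n / days)


n = 500
print(f"exact:  {unique_birthday_exact(n):.3f}")
print(f"approx: {unique_birthday_approx(n):.3f}")
```

Both values come out at roughly 0.25, so the exponential shortcut loses essentially nothing at this group size.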