Hashing names does not protect privacy

Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force, hashing 256-bit strings hoping to find one that matches your hash value. Even then, the attacker may have found a collision, another string of bits that happens to have the same hash value.

But you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are reversible. If I know that you’ve either hashed “yes” or “no,” then I can apply the hash function to both and see which one it was.

Hashing PII

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as name, phone number, etc. These inputs come from a small enough space that a brute force search is easy.

For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found this, a list of the 1000 most popular baby girl and boy names in California in 2017.

The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.

I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can’t re-identify every first name, but I could re-identify 77% of them, assuming my list of names is representative [1].

Now, for example, if I see the hash value

96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6

in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].

If you saw the hash value above and had no idea where it came from—it could be the hash of a JPEG image file for all you know—it would be hopeless to try to figure out what produced it. But if you know it’s the hash of a first name, it’s trivial to reverse.

Hashing SSNs

Hashing numbers is simpler but more computationally intense. I had to do a little research to find a list of names, but I know that social security numbers are simply 9-digit numbers. There are only a billion possible nine-digit numbers, so it’s feasible to hash them all to make a look-up table.

I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.

Related posts

[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that “the risk is very small that the information could be used … to identify an individual.”

[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.