Notes on computing hash functions

A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties:

  1. pre-image resistance
  2. collision resistance
  3. second pre-image resistance

Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it essentially requires a brute force attack, trying many inputs until something hashes to the given value.

Collision resistance means it’s extremely unlikely that two files would map to the same hash value, either by accident or by deliberate attack.

Second pre-image resistance is like collision resistance except one file is fixed. A second pre-image attack is harder than a collision attack because the attacker can only vary one file.
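To make the difference in difficulty concrete, here is a toy sketch, not an attack on a real hash. It assumes a deliberately weak hash, the hypothetical tiny_hash below, which keeps only the first 16 bits of SHA-256 so that brute force finishes quickly. With 16 bits, a collision is expected after a few hundred tries (the birthday bound), while a second pre-image should take on the order of 2^16 tries.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    """Toy hash (an assumption for illustration): first 16 bits of SHA-256."""
    return hashlib.sha256(data).digest()[:2]

def find_collision():
    """Try inputs "0", "1", "2", ... until any two share a tiny_hash value."""
    seen = {}
    for i in count():
        msg = str(i).encode("ascii")
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg

def find_second_preimage(fixed: bytes):
    """Try inputs until one other than `fixed` has the same tiny_hash value."""
    target = tiny_hash(fixed)
    for i in count():
        msg = str(i).encode("ascii")
        if msg != fixed and tiny_hash(msg) == target:
            return msg, i + 1

a, b, collision_tries = find_collision()
m, second_tries = find_second_preimage(b"hello world")
print(f"collision after {collision_tries} tries: {a!r} and {b!r}")
print(f"second pre-image after {second_tries} tries: {m!r}")
```

In the collision search the attacker is free to choose both inputs, so any repeated value counts. In the second pre-image search only one specific value counts.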

This post explains how to compute hash functions from the Linux command line, from Windows, from Python, and from Mathematica.

Files vs strings

Hash functions are often applied to files. If a web site makes a file available for download, and publishes a hash value, you can compute the hash value yourself after downloading the file to make sure they match. A checksum could let you know if a bit was accidentally flipped in transit, but it’s easy to deliberately tamper with files without changing the checksum. A secure hash function makes such tampering infeasible.
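As a rough sketch of that check in Python: the function names sha256_of_file and matches_published_hash are made up for this example, and SHA256 is assumed to be the algorithm the site published. The file is read in chunks so that large downloads need not fit in memory.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: hash the bytes exactly as stored
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_hash(path, published_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Opening the file in binary mode matters here. Text mode could translate line endings or decode characters, which would change the bytes being hashed.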

You can think of a file as a string or a string as a file, but the distinction between files and strings may matter in practice. When you save a string to a file, you might implicitly add a newline character to the end, causing the string and its corresponding file to have different hash values. The problem is easy to resolve if you’re aware of it.

Another gotcha is that text encoding matters. You cannot hash text per se; you hash the binary representation of that text. Different representations will lead to different hash values. In the examples below, only Python makes this explicit.
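Both gotchas are easy to see in a few lines of Python. This is only an illustration: the variable names are arbitrary, and UTF-16 is chosen simply as an example of a second encoding.

```python
import hashlib

text = "hello world"

# Same text, with and without a trailing newline
as_string = hashlib.sha256(text.encode("utf-8")).hexdigest()
as_file = hashlib.sha256((text + "\n").encode("utf-8")).hexdigest()

# Same text, different binary representation
as_utf16 = hashlib.sha256(text.encode("utf-16-le")).hexdigest()

print("no newline:  ", as_string)
print("with newline:", as_file)
print("UTF-16:      ", as_utf16)
```

All three digests differ, even though a person would read the same two words in each case.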

openssl digest

One way to compute hash values is using openssl. You can give it a file as an argument, or pipe a string to it.

Here’s an example creating a file f and computing its SHA256 hash.

    $ echo "hello world" > f
    $ openssl dgst -sha256 f
    SHA256(f)= a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

We get the same hash value if we pipe the string “hello world” to openssl.

    $ echo "hello world" | openssl dgst -sha256
    a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

However, echo silently added a newline at the end of our string. To get the hash of “hello world” without this newline, use the -n option.

    $ echo -n "hello world" | openssl dgst -sha256
    b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

To see the list of hash functions openssl supports, use list --digest-commands. Here’s what I got, though the output could vary with version.

    $ openssl list --digest-commands
    blake2b512 blake2s256 gost     md4
    md5        mdc2       rmd160   sha1
    sha224     sha256     sha3-224 sha3-256
    sha3-384   sha3-512   sha384   sha512
    sha512-224 sha512-256 shake128 shake256
    sm3

A la carte commands

If you’re interested in multiple hash functions, openssl has the advantage of handling various hashing algorithms uniformly. But if you’re interested in a particular hash function, it may have its own command line utility, such as sha256sum or md5sum. These utilities are not named consistently, however. For example, the utility to compute BLAKE2 hashes is b2sum.
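Python offers a similar uniform interface: hashlib.new takes the algorithm name as a string. The sketch below uses a hypothetical helper, digests, and an arbitrary choice of four algorithm names to hash the same bytes several ways.

```python
import hashlib

def digests(data: bytes, names=("sha256", "sha512", "sha3_256", "blake2b")):
    """Hash `data` with each named algorithm via the generic hashlib.new."""
    return {name: hashlib.new(name, data).hexdigest() for name in names}

for name, hexdigest in digests(b"hello world").items():
    print(f"{name:9} {hexdigest}")
```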

hashalot

The hashalot utility is designed for hashing passphrases. As you type in a string, the characters are not displayed, and the input is hashed without a trailing newline character.

Here’s what I get when I type “hello world” at the passphrase prompt below.

    $ hashalot -x sha256
    Enter passphrase:
    b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

The -x option tells hashalot to output hexadecimal rather than binary.

Note that this produces the same output as

    echo -n "hello world" | openssl dgst -sha256

According to the documentation, the supported values for HASHTYPE are

    ripemd160 rmd160 rmd160compat sha256 sha384 sha512

Python hashlib

Python’s hashlib library supports several hashing algorithms. And unlike the examples above, it makes the encoding of the input and output explicit.

    import hashlib
    print(hashlib.sha256("hello world".encode('utf-8')).hexdigest())

This produces b94d…cde9 as in the examples above.

hashlib has two attributes that let you know which algorithms are available. The algorithms_available attribute is the set of hashing algorithms available in your particular instance, and the algorithms_guaranteed attribute is the set of algorithms guaranteed to be available anywhere the library is installed.

Here’s what I got on my computer.

    >>> a = hashlib.algorithms_available
    >>> g = hashlib.algorithms_guaranteed
    >>> assert(a.intersection(g) == g)
    >>> g
    {'sha1', 'sha512', 'sha3_224', 'shake_256',
     'sha3_256', 'sha256', 'shake_128', 'sha224',
     'md5', 'sha384', 'blake2s', 'sha3_512',
     'blake2b', 'sha3_384'}
    >>> a.difference(g)
    {'md5-sha1', 'mdc2', 'sha3-384', 'ripemd160',
     'blake2s256', 'md4', 'sha3-224', 'whirlpool',
     'sha512-256', 'blake2b512', 'sha512-224', 'sm3',
     'shake128', 'shake256', 'sha3-512', 'sha3-256'}
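Because the non-guaranteed algorithms vary from one installation to another, code that uses them may want to fail gracefully. Here is one possible sketch, with a made-up helper name try_hash. It relies on hashlib.new raising ValueError for a name it cannot construct, and on the SHAKE functions needing an explicit output length; the 32-byte length is an arbitrary choice.

```python
import hashlib

def try_hash(name: str, data: bytes):
    """Return a hex digest, or None if this Python build cannot construct `name`."""
    try:
        h = hashlib.new(name, data)
    except ValueError:
        return None
    # SHAKE variants have variable-length output, so a length must be given.
    if name.lower().startswith("shake"):
        return h.hexdigest(32)
    return h.hexdigest()

for name in ("sha256", "ripemd160", "no-such-hash"):
    print(name, try_hash(name, b"hello world"))
```

A name can fail even if it appears in a list copied from another machine, so attempting the construction is more reliable than consulting such a list.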

Hashing on Windows

Windows has a utility fciv whose name stands for “file checksum integrity verifier”. It only supports the broken hashes MD5 and SHA1 [1].

PowerShell has a function Get-FileHash that uses SHA256 by default, but also supports SHA1, SHA384, SHA512, and MD5.

Hashing with Mathematica

Here’s our running example, this time in Mathematica.

    Hash["hello world", "SHA256", "HexString"]

This returns b94d…cde9 as above. Other hash algorithms supported by Mathematica: Adler32, CRC32, MD2, MD3, MD4, MD5, RIPEMD160, RIPEMD160SHA256, SHA, SHA256, SHA256SHA256, SHA384, SHA512, SHA3-224, SHA3-256, SHA3-384, SHA3-512, Keccak224, Keccak256, Keccak384, Keccak512, Expression.

Names above that concatenate two names are the composition of the two functions. RIPEMD160SHA256 is included because of its use in Bitcoin. Here “SHA” is SHA-1. “Expression” is a non-secure 64-bit hash used internally by Mathematica.
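For comparison, here is how a composed hash like SHA256SHA256 might look in Python. This is a sketch under an assumption I have not checked against Mathematica’s output: that the second function is applied to the raw digest bytes of the first, as Bitcoin does, rather than to the hexadecimal string. The function name sha256_sha256 is made up for this example.

```python
import hashlib

def sha256_sha256(data: bytes) -> str:
    """SHA256 applied twice, feeding the raw 32-byte digest into the second pass."""
    inner = hashlib.sha256(data).digest()  # raw bytes, not the hex string
    return hashlib.sha256(inner).hexdigest()

print(sha256_sha256(b"hello world"))
```

Hashing the hex string instead of the raw bytes would give a different result, which is another instance of the representation issue discussed earlier.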

Mathematica also supports several output formats besides hexadecimal: Integer, DecimalString, HexStringLittleEndian, Base36String, Base64Encoding, and ByteArray.

[1] It’s possible to produce MD5 collisions quickly. MD5 remains commonly used, and is fine as a checksum, though it cannot be considered a secure hash function any more.

Google researchers were able to produce SHA1 collisions, but it took over 6,000 CPU years distributed across many machines.