# Notes on computing hash functions

A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties:

1. pre-image resistance
2. collision resistance
3. second pre-image resistance

Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it essentially requires a brute force attack, trying many inputs until something hashes to the given value.

Collision resistance means it’s extremely unlikely that two files would map to the same hash value, either by accident or by deliberate attack.

Second pre-image resistance is like collision resistance except one file is fixed. A second pre-image attack is harder than a collision attack because the attacker can only vary one file.
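To make the difference between the two attacks concrete, here is a toy sketch in Python. It assumes a deliberately weak hash, the first 16 bits of SHA-256, chosen only so that brute force finishes quickly; `toy_hash`, `find_collision`, and `find_second_preimage` are illustrative names, not part of any library.

```python
import hashlib
from itertools import count

BITS = 16  # toy size (an assumption), chosen only so the searches finish quickly

def toy_hash(data: bytes) -> int:
    """First BITS bits of SHA-256 as an integer. Deliberately weak."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:BITS // 8], "big")

def find_collision():
    """Collision attack: the attacker is free to vary both inputs."""
    seen = {}
    for i in count():
        msg = str(i).encode("ascii")
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg

def find_second_preimage(fixed: bytes):
    """Second pre-image attack: one input is fixed in advance."""
    target = toy_hash(fixed)
    for i in count():
        msg = str(i).encode("ascii")
        if msg != fixed and toy_hash(msg) == target:
            return msg, i + 1

a, b, collision_tries = find_collision()
m, second_tries = find_second_preimage(b"hello world")
print(f"collision after {collision_tries} tries: {a!r} and {b!r}")
print(f"second pre-image after {second_tries} tries: {m!r}")
```

With a 16-bit hash you should expect a collision after a few hundred tries and a second pre-image after tens of thousands, though the exact counts depend on the inputs tried. With a full 256-bit hash neither search is feasible.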

This post explains how to compute hash functions from the Linux command line, from Windows, from Python, and from Mathematica.

## Files vs strings

Hash functions are often applied to files. If a web site makes a file available for download, and publishes a hash value, you can compute the hash value yourself after downloading the file to make sure they match. A checksum could let you know if a bit was accidentally flipped in transit, but it’s easy to deliberately tamper with a file without changing its checksum. A secure hash function makes such tampering infeasible.
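As a rough sketch of that verification step, the Python below hashes a file in chunks and compares the result with a published SHA256 value. The helper names `sha256_of_file` and `matches_published` are made up for illustration, and the demo uses a throwaway file with the same contents as the file `f` created later in this post.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: hash the bytes exactly as stored
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published(path, published_hex):
    """Compare a file's hash with a published value, ignoring case and whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()

# Demo on a temporary file containing "hello world" plus a newline.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
demo_path = tmp.name
published = "a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447"
print(matches_published(demo_path, published))  # True
os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large downloads. Normalizing case is a convenience, since some tools print hexadecimal in upper case.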

You can think of a file as a string or a string as a file, but the distinction between files and strings may matter in practice. When you save a string to a file, you might implicitly add a newline character to the end, causing the string and its corresponding file to have different hash values. The problem is easy to resolve if you’re aware of it.
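A small Python sketch shows the effect of a trailing newline. The two values printed here agree with the command line examples later in this post.

```python
import hashlib

text = "hello world"

without_newline = hashlib.sha256(text.encode("utf-8")).hexdigest()
with_newline = hashlib.sha256((text + "\n").encode("utf-8")).hexdigest()

print(without_newline)  # b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
print(with_newline)     # a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447
```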

Another gotcha is that text encoding matters. You cannot hash text per se; you hash the binary representation of that text. Different representations will lead to different hash values. In the examples below, only Python makes this explicit.
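Here is a minimal Python illustration of the encoding issue. The sample string is an arbitrary choice with non-ASCII characters; pure ASCII text such as “hello world” has the same bytes in UTF-8 and Latin-1, so the difference would not show up there.

```python
import hashlib

# Arbitrary sample text with non-ASCII characters, written with escapes
# so the source file's own encoding does not matter.
text = "na\u00efve caf\u00e9"

utf8_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
latin1_hash = hashlib.sha256(text.encode("latin-1")).hexdigest()
utf16_hash = hashlib.sha256(text.encode("utf-16-le")).hexdigest()

# Same text, three byte representations, three hash values.
print(len({utf8_hash, latin1_hash, utf16_hash}))  # 3

# ASCII-only text is identical in UTF-8 and Latin-1.
print("hello world".encode("utf-8") == "hello world".encode("latin-1"))  # True
```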

## openssl digest

One way to compute hash values is using `openssl`. You can give it a file as an argument, or pipe a string to it.

Here’s an example creating a file `f` and computing its SHA256 hash.

```
$ echo "hello world" > f
$ openssl dgst -sha256 f
SHA256(f)= a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447
```

We get the same hash value if we pipe the string “hello world” to `openssl`.

```
$ echo "hello world" | openssl dgst -sha256
a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447
```

However, `echo` silently added a newline at the end of our string. To get the hash of “hello world” without this newline, use the `-n` option.

```
$ echo -n "hello world" | openssl dgst -sha256
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
```

To see the list of hash functions `openssl` supports, use `list --digest-commands`. Here’s what I got, though the output could vary with version.

```
$ openssl list --digest-commands
blake2b512 blake2s256 gost     md4
md5        mdc2       rmd160   sha1
sha224     sha256     sha3-224 sha3-256
sha3-384   sha3-512   sha384   sha512
sha512-224 sha512-256 shake128 shake256
sm3
```

## A la carte commands

If you’re interested in multiple hash functions, `openssl` has the advantage of handling various hashing algorithms uniformly. But if you’re interested in a particular hash function, it may have its own command line utility, such as `sha256sum` or `md5sum`. These utilities are not named consistently, however. For example, the utility to compute BLAKE2 hashes is `b2sum`.

## hashalot

The `hashalot` utility is designed for hashing passphrases. As you type in a string, the characters are not displayed, and the input is hashed without a trailing newline character.

Here’s what I get when I type “hello world” at the passphrase prompt below.

```
$ hashalot -x sha256
Enter passphrase:
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
```

The `-x` option tells `hashalot` to output hexadecimal rather than binary.

Note that this produces the same output as

```
echo -n "hello world" | openssl dgst -sha256
```

According to the documentation,

```
Supported values for HASHTYPE:
ripemd160 rmd160 rmd160compat sha256 sha384 sha512
```
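For comparison, similar behavior can be sketched in Python with the standard `getpass` module, which reads a line without echoing it and returns it without the trailing newline. This is only an analogy, not a description of how `hashalot` works internally; the function name `hash_passphrase` and the choice of UTF-8 encoding are assumptions.

```python
import getpass
import hashlib
import sys

def hash_passphrase(read=getpass.getpass):
    """Prompt without echoing, then return the SHA-256 hex digest.

    `read` is injectable only so the function can be exercised
    without a terminal. UTF-8 encoding is an assumption.
    """
    passphrase = read("Enter passphrase: ")  # returned without a trailing newline
    return hashlib.sha256(passphrase.encode("utf-8")).hexdigest()

# Only prompt when attached to an interactive terminal.
if __name__ == "__main__" and sys.stdin.isatty():
    print(hash_passphrase())
```

Typing “hello world” at the prompt should give the same `b94d…cde9` value as above, since the bytes being hashed are the same.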

## Python hashlib

Python’s `hashlib` library supports several hashing algorithms. And unlike the examples above, it makes the encoding of the input and output explicit.

```
import hashlib
print(hashlib.sha256("hello world".encode('utf-8')).hexdigest())
```

This produces `b94d…cde9` as in the examples above.

`hashlib` has two attributes that let you know which algorithms are available. The `algorithms_available` attribute is the set of hashing algorithms available in your particular instance, and the `algorithms_guaranteed` attribute is the set of algorithms guaranteed to be available anywhere the library is installed.

Here’s what I got on my computer.

```
>>> a = hashlib.algorithms_available
>>> g = hashlib.algorithms_guaranteed
>>> assert(a.intersection(g) == g)
>>> g
{'sha1', 'sha512', 'sha3_224', 'shake_256',
'sha3_256', 'sha256', 'shake_128', 'sha224',
'md5', 'sha384', 'blake2s', 'sha3_512',
'blake2b', 'sha3_384'}
>>> a.difference(g)
{'md5-sha1', 'mdc2', 'sha3-384', 'ripemd160',
'blake2s256', 'md4', 'sha3-224', 'whirlpool',
'sha512-256', 'blake2b512', 'sha512-224', 'sm3',
'shake128', 'shake256', 'sha3-512', 'sha3-256'}
```
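Much as `openssl` handles many algorithms through one command, `hashlib.new` takes the algorithm name as a string. The sketch below loops over the guaranteed algorithms. The 32-byte output length for the SHAKE functions is an arbitrary choice, and the `try`/`except` is there because some restricted builds may refuse particular algorithms.

```python
import hashlib

data = b"hello world"
results = {}

for name in sorted(hashlib.algorithms_guaranteed):
    try:
        h = hashlib.new(name, data)
    except ValueError:
        # Some restricted builds may refuse certain algorithms.
        continue
    if name.startswith("shake_"):
        # SHAKE output is variable length, so a length in bytes is required.
        results[name] = h.hexdigest(32)
    else:
        results[name] = h.hexdigest()

for name, hexdigest in results.items():
    print(f"{name:10} {hexdigest[:16]}...")
```

The set of names, and so the lines printed, can vary with the Python version.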

## Hashing on Windows

Windows has a utility `fciv` whose name stands for “file checksum integrity verifier”. It only supports the broken hashes MD5 and SHA1 [1].

PowerShell has a function `Get-FileHash` that uses SHA256 by default, but also supports SHA1, SHA384, SHA512, and MD5.

## Hashing with Mathematica

Here’s our running example, this time in Mathematica.

`Hash["hello world", "SHA256", "HexString"]`

This returns `b94d…cde9` as above. Other hash algorithms supported by Mathematica: Adler32, CRC32, MD2, MD3, MD4, MD5, RIPEMD160, RIPEMD160SHA256, SHA, SHA256, SHA256SHA256, SHA384, SHA512, SHA3-224, SHA3-256, SHA3-384, SHA3-512, Keccak224, Keccak256, Keccak384, Keccak512, Expression.

Names above that concatenate two names are the composition of the two functions. RIPEMD160SHA256 is included because of its use in Bitcoin. Here “SHA” is SHA-1. “Expression” is a non-secure 64-bit hash used internally by Mathematica.
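As an illustration of composition, here is a Python sketch of double SHA-256. It assumes the outer function is applied to the raw bytes of the inner digest, as Bitcoin does, rather than to its hexadecimal string; I have not checked this against Mathematica’s output. RIPEMD160 is left out because it is not among the algorithms `hashlib` guarantees.

```python
import hashlib

def sha256_sha256(data: bytes) -> str:
    """SHA-256 applied to the raw 32-byte SHA-256 digest of `data`."""
    inner = hashlib.sha256(data).digest()  # raw bytes, not the hex string
    return hashlib.sha256(inner).hexdigest()

def sha256_of_hex(data: bytes) -> str:
    """For contrast: hashing the inner digest's hex text gives a different value."""
    inner_hex = hashlib.sha256(data).hexdigest().encode("ascii")
    return hashlib.sha256(inner_hex).hexdigest()

print(sha256_sha256(b"hello world"))
print(sha256_of_hex(b"hello world"))
```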

Mathematica also supports several output formats besides hexadecimal: Integer, DecimalString, HexStringLittleEndian, Base36String, Base64Encoding, and ByteArray.
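Python has rough analogues of several of these formats. The sketch below converts one digest to hexadecimal, an integer, a decimal string, and Base64. The variable names only loosely mirror Mathematica’s format names, and I have not verified that each matches Mathematica’s conventions exactly.

```python
import base64
import hashlib

digest = hashlib.sha256(b"hello world").digest()  # 32 raw bytes

as_hex = digest.hex()
as_int = int.from_bytes(digest, "big")  # big-endian byte order is an assumption
as_decimal_string = str(as_int)
as_base64 = base64.b64encode(digest).decode("ascii")

print(as_hex)
print(as_decimal_string)
print(as_base64)
```

All of these are different renderings of the same 32 bytes, so any of them can be converted back to the others.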

[1] It’s possible to produce MD5 collisions quickly. MD5 remains commonly used, and is fine as a checksum, though it cannot be considered a secure hash function any more.

Google researchers were able to produce SHA1 collisions, but it took over 6,000 CPU years distributed across many machines.