Hashes
Contents
Hashes.
What are they good for? 🎵Absolute nothing Actually quite a lot!
From optimising access to data with hash tables, to hashbrowns™, to validating data - something we’ll look at now!
Cryptographic Hashes
A cryptographic hash function (also known as a digest, or fingerprint) takes an input, and produces a fixed size output from the input.
Some hashes operate on a byte-by-byte basis, whilst others operate on a block of data at a time (better)
With a grain of salt, cryptographic hash functions have the output properties
- One way - ‘Impossible’ to reverse
- Unique - Two messages will ‘never’ have the same hash
- Avalanche effect - A small change in input changes at least half the output bits
I say to take it which a grain of salt, because you can never be too certain that something is secure.
Though highly unlikely or perhaps unfeasible - there is a possibility that a hashing function can be reversed. Just because we haven’t found a way, doesn’t mean it’s impossible.
We say our computers are secure, but then lo and behold the NSA publicly announces an exploit to a vulnerability discovered a lifetime ago…
Message Authentication Codes
MACs ensure that a message has not been altered or modified, and usually use some sort of a cryptographic hash implementation.
Weaknesses
The structure of a MAC is important, as it could lead to weaknesses.
For example:
- hash(key + message) - susceptible to a length extension attack
- hash(message + key ) - susceptible to a collision attack
There are several attack variants on hashing functions that can be performed:
- Preimage attack - Given
h(m)
, we are able to findm
- Second preimage attack - Given
m
, we can findm'
whereh(m) = h(m')
- Collision attack - Find
m
andm'
whereh(m) = h(m')
- Length extension attack - Reconstructing internal hashing state to append extra data
Length Extension Attacks
Ripping straight off Wikipedia…
“The vulnerable hashing functions work by taking the input message, and using it to transform an internal state. After all of the input has been processed, the hash digest is generated by outputting the internal state of the function. It is possible to reconstruct the internal state from the hash digest, which can then be used to process the new data. In this way, one may extend the message and compute the hash that is a valid signature for the new message.”
Length extension attacks allow attackers to append additional data to a message, whilst also recomputing the hash.
If you know the hash function / algorithm being used, you would be able to continue the hashing process from the original hash result
Example
Consider the secret key z5206677
, the message Hello World!
, and some arbitrary byte-based hashing function myHash
that I’ll make up.
|
|
|
|
Let’s say we make the message “Hello World! yeet
”.
With knowledge of the key, myHash("Hello World! yeet", "z5206677") == "fpSa_YD"
So if we didn’t know the key, how could we get this hash with a length extension attack?
Analysis time! Let’s look at what myHash
does:
- The hash output is 7 characters long
- The hash structure is
h(key + message)
- For each character in
key + message
, add the ASCII letter value into the result array - Modulo each value by 58 and add 65 (‘A’)
- We can also see that this is a bad cryptographic hash function, as it doesn’t exhibit the avalanche property.
To attack this hashing function, let’s first consider what we have.
- We do not have the value of the key
- We do have the length of the key
- If we know the algorithm, we could possibly solve for the key length
- We may have the value of the message
- We do have the length of the message
We need to set the internal states (variable values) of the function. In our case, it would be knowing the result
values, and the index/cursor position for the next letter.
To get the result
values, we convert each character in the hash to its ASCII value, and subtract 65. We can ignore the modulo operation, as for any n
and A
, (n + A) % A == n
.
We then set our index to where the end of the original message would have been, and perform the hashing operation as if we’re continuing.
|
|
1
| New hash: fpSa_YD |
So given a message that has been hashed with keymyHash("Hello World!", "z5206677") == "aEba_Y^"
We can perform a length extension attack on the hashattack_myHash("aEba_Y^", " yeet", 8, 12) == "fpSa_YD"
Which produces the same hash as that message properly hashedmyHash("Hello World! yeet", "z5206677") == "fpSa_YD"
When it comes to block-based hash functions, the attacker would have to additionally pad their input so that the data blocks align.
Solution? Hash it again
By hashing the hash of a message, the possibility for a length extension attack is mitigated. In addition, it becomes much harder (and time consuming) to crack the hash.
We call one particular implementation of this idea a HMAC - or “hash-based message authentication code”. It operates along the functionality of HMAC = hash(key + hash(key + message))
What if you hash the hash of a message hash?
Or hash the hash of the hash of the hash of a message?
Or hash the hash that was hashed from the hash of the hash of a hash?
We call these hashing rounds - that is, how many times the hash was hashed.
Hashing functions (usually) take considerably less time to perform than a cracking function.
i.e.
if h(X) => hash
takes 1 second, h(h(h(X))) would take 3 seconds.
if H(x) => plain
takes 5 seconds, H(H(H(x))) would take 15 (or 125) seconds
Salts
Salts are arbitrary strings of text that are either appended or prepended to a message before it is all hashed. This means that given two different salts, the hash of the same message and a different salt would produce two hashes that are not the same!!
md5(AAApassword) = D5581DAEF8238853D0EDFBDA8361562F
md5(BBBpassword) = 2DA397558A2C230E0B38FB75B0BC519A
Salts (ie AAA
, and BBB
) do not need to be stored privately and could just be left as plain-text inside a database (although that’s always a nice idea)
Rainbow Tables
Rainbow tables are precomputed hashes of many many many different password combinations. They essentially save us the effort of having to compute the hashes ourselves - for the tradeoff of really massive rainbow table files (in multitudes of tens to thousands of gigabytes)
They however become somewhat useless if a password was hashed with a salt.
A new rainbow table would need to be generated with that salt in mind - quickly making cracking of salted hashes very unfeasible.
Making cracking harder with modulos
So I don’t know about you, but the modulo operation looks really useful to add uncertainty into a reversing operation.
Given n % 5 == 0
, we don’t know if n
is 0
, or 5
, or 10
, 15
, etc.
If n
was dependent on another modulo result (ie n = m % 29
), then we would need to consider the possible values for m over the different n values.
Code
|
|
Output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | n % 5 == 0 when n = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100] m % 29 == 5 when m = [5, 34, 63, 92] m % 29 == 10 when m = [10, 39, 68, 97] m % 29 == 15 when m = [15, 44, 73] m % 29 == 20 when m = [20, 49, 78] m % 29 == 25 when m = [25, 54, 83] m % 29 == 30 when m = [] m % 29 == 35 when m = [] m % 29 == 40 when m = [] m % 29 == 45 when m = [] m % 29 == 50 when m = [] m % 29 == 55 when m = [] m % 29 == 60 when m = [] m % 29 == 65 when m = [] m % 29 == 70 when m = [] m % 29 == 75 when m = [] m % 29 == 80 when m = [] m % 29 == 85 when m = [] m % 29 == 90 when m = [] m % 29 == 95 when m = [] m % 29 == 100 when m = [] Got 17 m values [5, 10, 15, 20, 25, 34, 39, 44, 49, 54, 63, 68, 73, 78, 83, 92, 97] Got 0 collisions [] |
As you see here, for m % 29 % 5 == 0
for 0 < m <= 100
, there are 17 possible values of n
which you would have to try.
Sure 17 in 100 isn’t a lot of trials,
and even for 0 < m <= 1000
, 172 in 1000 isn’t a lot.
But with a better choice of modulo numbers, m % 91 % 2 == 0
gives us 495 in 1000 values… That’s much much better!
A modulo operation could also be dangerous though
If you are using the result of a modulo operation - for example x = A % 2
- then x
only has two possible values 0
and 1
. By applying a modulo operation, the output space has been reduced to only two values, which could be bad.