Hashes.
What are they good for? 🎵
Absolute nothing Actually quite a lot!

From optimising access to data with hash tables, to hashbrowns™, to validating data - something we’ll look at now!

Cryptographic Hashes

A cryptographic hash function (also known as a digest, or fingerprint) takes an input, and produces a fixed size output from the input.

Some hashes operate on a byte-by-byte basis, whilst others operate on a block of data at a time (better)

With a grain of salt, cryptographic hash functions have the output properties

  • One way - ‘Impossible’ to reverse
  • Unique - Two messages will ‘never’ have the same hash
  • Avalanche effect - A small change in input changes at least half the output bits

I say to take it which a grain of salt, because you can never be too certain that something is secure.
Though highly unlikely or perhaps unfeasible - there is a possibility that a hashing function can be reversed. Just because we haven’t found a way, doesn’t mean it’s impossible.

We say our computers are secure, but then lo and behold the NSA publicly announces an exploit to a vulnerability discovered a lifetime ago…

Message Authentication Codes

MACs ensure that a message has not been altered or modified, and usually use some sort of a cryptographic hash implementation.

Weaknesses

The structure of a MAC is important, as it could lead to weaknesses.

For example:

  • hash(key + message) - susceptible to a length extension attack
  • hash(message + key ) - susceptible to a collision attack

There are several attack variants on hashing functions that can be performed:

  • Preimage attack - Given h(m), we are able to find m
  • Second preimage attack - Given m, we can find m' where h(m) = h(m')
  • Collision attack - Find m and m' where h(m) = h(m')
  • Length extension attack - Reconstructing internal hashing state to append extra data

Length Extension Attacks

Ripping straight off Wikipedia

“The vulnerable hashing functions work by taking the input message, and using it to transform an internal state. After all of the input has been processed, the hash digest is generated by outputting the internal state of the function. It is possible to reconstruct the internal state from the hash digest, which can then be used to process the new data. In this way, one may extend the message and compute the hash that is a valid signature for the new message.”

Length extension attacks allow attackers to append additional data to a message, whilst also recomputing the hash.

If you know the hash function / algorithm being used, you would be able to continue the hashing process from the original hash result

Example

Consider the secret key z5206677, the message Hello World!, and some arbitrary byte-based hashing function myHash that I’ll make up.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
key = "z5206677"
message = "Hello World!"

def myHash(message, key):
  # Hash length
  hashLength = 7
  result = [0] * hashLength
  
  # Hash structure
  string = key + message
  
  # Some arbitrary hashing function
  for c in range(len(string)):
    result[c % hashLength] += ord(string[c])
  result = "".join([chr(a % 58 + 65) for a in result])
  
  return result

print("Hash:", myHash(message, key))
1
Hash: aEba_Y^

Let’s say we make the message “Hello World! yeet”.
With knowledge of the key, myHash("Hello World! yeet", "z5206677") == "fpSa_YD"

So if we didn’t know the key, how could we get this hash with a length extension attack?

Analysis time! Let’s look at what myHash does:

  • The hash output is 7 characters long
  • The hash structure is h(key + message)
  • For each character in key + message, add the ASCII letter value into the result array
  • Modulo each value by 58 and add 65 (‘A’)
  • We can also see that this is a bad cryptographic hash function, as it doesn’t exhibit the avalanche property.

To attack this hashing function, let’s first consider what we have.

  • We do not have the value of the key
  • We do have the length of the key
    • If we know the algorithm, we could possibly solve for the key length
  • We may have the value of the message
  • We do have the length of the message

We need to set the internal states (variable values) of the function. In our case, it would be knowing the result values, and the index/cursor position for the next letter.

To get the result values, we convert each character in the hash to its ASCII value, and subtract 65. We can ignore the modulo operation, as for any n and A, (n + A) % A == n.
We then set our index to where the end of the original message would have been, and perform the hashing operation as if we’re continuing.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
def attack_myHash(hash, extraMessage, keyLength, messageLength):
  result = list(hash)                     # Turn our string into an array
  result = [ord(a) - 65 for a in result]  # Remove the `+ 65` offset
                                          # We can ignore the `mod 58`

  # Reuse the hashing logic from `myHash`
  for c in range(len(extraMessage)):
    result[(c + keyLength + messageLength) % 7] += ord(extraMessage[c])
  result = "".join([chr(a % 58 + 65) for a in result])

print("New hash:", attack_myHash("aEba_Y^", " yeet", 8, 12))
1
New hash: fpSa_YD

So given a message that has been hashed with key
myHash("Hello World!", "z5206677") == "aEba_Y^"

We can perform a length extension attack on the hash
attack_myHash("aEba_Y^", " yeet", 8, 12) == "fpSa_YD"

Which produces the same hash as that message properly hashed
myHash("Hello World! yeet", "z5206677") == "fpSa_YD"


When it comes to block-based hash functions, the attacker would have to additionally pad their input so that the data blocks align.

Solution? Hash it again

By hashing the hash of a message, the possibility for a length extension attack is mitigated. In addition, it becomes much harder (and time consuming) to crack the hash.

We call one particular implementation of this idea a HMAC - or “hash-based message authentication code”. It operates along the functionality of HMAC = hash(key + hash(key + message))

What if you hash the hash of a message hash?
Or hash the hash of the hash of the hash of a message?
Or hash the hash that was hashed from the hash of the hash of a hash?

We call these hashing rounds - that is, how many times the hash was hashed.

Hashing functions (usually) take considerably less time to perform than a cracking function.

i.e.
if h(X) => hash takes 1 second, h(h(h(X))) would take 3 seconds.
if H(x) => plain takes 5 seconds, H(H(H(x))) would take 15 (or 125) seconds

Salts

Salts are arbitrary strings of text that are either appended or prepended to a message before it is all hashed. This means that given two different salts, the hash of the same message and a different salt would produce two hashes that are not the same!!

md5(AAApassword) = D5581DAEF8238853D0EDFBDA8361562F
md5(BBBpassword) = 2DA397558A2C230E0B38FB75B0BC519A

Salts (ie AAA, and BBB) do not need to be stored privately and could just be left as plain-text inside a database (although that’s always a nice idea)

Rainbow Tables

Rainbow tables are precomputed hashes of many many many different password combinations. They essentially save us the effort of having to compute the hashes ourselves - for the tradeoff of really massive rainbow table files (in multitudes of tens to thousands of gigabytes)

They however become somewhat useless if a password was hashed with a salt.
A new rainbow table would need to be generated with that salt in mind - quickly making cracking of salted hashes very unfeasible.

Making cracking harder with modulos

So I don’t know about you, but the modulo operation looks really useful to add uncertainty into a reversing operation.
Given n % 5 == 0, we don’t know if n is 0, or 5, or 10, 15, etc.

If n was dependent on another modulo result (ie n = m % 29), then we would need to consider the possible values for m over the different n values.

Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
def solveModulo(modulo, remainder):
  """
  Solve for `i` where i % modulo == remainder
  """
  # limit to values <= 100
  for i in range(1, 100+1):
    if i % modulo == remainder:
      yield i

n_mod_5_is_zero = list(solveModulo(5, 0))
print(f"n % 5 == 0 when n = {n_mod_5_is_zero}")

print()

possible_values_of_m = set()
mod_collisions = []

for n in n_mod_5_is_zero:
  m_mod_29_is_n = list(solveModulo(29, n))
  print(f"m % 29 == {n} when m = {m_mod_29_is_n}")
  
  for n_value in m_mod_29_is_n:
    if n_value in possible_values_of_m:
      mod_collisions.append(n_value)
    else:
      possible_values_of_m.add(n_value)
    
print()
print(f"Got {len(possible_values_of_m)} m values", sorted(possible_values_of_m)))
print(f"Got {len(mod_collisions)} collisions", sorted(mod_collisions))

Output

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
n % 5 == 0 when n = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

m % 29 == 5 when m = [5, 34, 63, 92]
m % 29 == 10 when m = [10, 39, 68, 97]
m % 29 == 15 when m = [15, 44, 73]
m % 29 == 20 when m = [20, 49, 78]
m % 29 == 25 when m = [25, 54, 83]
m % 29 == 30 when m = []
m % 29 == 35 when m = []
m % 29 == 40 when m = []
m % 29 == 45 when m = []
m % 29 == 50 when m = []
m % 29 == 55 when m = []
m % 29 == 60 when m = []
m % 29 == 65 when m = []
m % 29 == 70 when m = []
m % 29 == 75 when m = []
m % 29 == 80 when m = []
m % 29 == 85 when m = []
m % 29 == 90 when m = []
m % 29 == 95 when m = []
m % 29 == 100 when m = []

Got 17 m values [5, 10, 15, 20, 25, 34, 39, 44, 49, 54, 63, 68, 73, 78, 83, 92, 97]
Got 0 collisions []

As you see here, for m % 29 % 5 == 0 for 0 < m <= 100, there are 17 possible values of n which you would have to try.

Sure 17 in 100 isn’t a lot of trials,
and even for 0 < m <= 1000, 172 in 1000 isn’t a lot.

But with a better choice of modulo numbers, m % 91 % 2 == 0 gives us 495 in 1000 values… That’s much much better!

A modulo operation could also be dangerous though
If you are using the result of a modulo operation - for example x = A % 2 - then x only has two possible values 0 and 1. By applying a modulo operation, the output space has been reduced to only two values, which could be bad.