Researchers have discovered a flaw in the MD5 algorithm, which is used to provide a unique signature for data. Xiaoyun Wang, a Chinese expert, and three colleagues found the flaw in the hash function algorithm, which is used in applications such as EMC's Centera content-addressable file store. The flaw was revealed at the Crypto 2004 conference.

A duplicated hash value is called a collision, and the four researchers' paper can be downloaded here. A hash function such as MD5 is not uncrackable; it relies for its effectiveness on the great amount of time required to break it. Until the Chinese team's work, several million hours of compute time would have been needed. They showed that it could be done within a few hours on a standard PC.

If MD5 is flawed then data uniqueness cannot be guaranteed. Thus, for example, Centera's ability to guarantee data integrity would be compromised and compliance regimes based on it could no longer be trusted. MD5 is also used by the Apache web server to guarantee integrity of downloadable source code data on mirror sites. Sun also uses it in its Solaris fingerprint database to assure the integrity of downloadable binary files.
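The kind of download check described above amounts to recomputing a digest and comparing it with the published one. A minimal Python sketch follows, using the standard hashlib module. The function names `md5_hex` and `matches_published_digest` are illustrative assumptions and are not taken from Apache's or Sun's tooling.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """Compare a recomputed digest with the one published by the origin site.

    Hypothetical helper for illustration only.
    """
    # Published digests vary in case and surrounding whitespace, so normalise.
    expected = published_hex.strip().lower()
    return md5_hex(data) == expected


if __name__ == "__main__":
    payload = b"hello world"
    published = "5eb63bbbe01eeed093cb22bb8f5acdc3"  # MD5 of b"hello world"
    print(matches_published_digest(payload, published))         # True
    print(matches_published_digest(payload + b"!", published))  # False
```

A match only shows that the file has the same MD5 value as the original. If two different files can be made to share a digest, the check passes for both of them.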

Code designed to reveal the flaw can be found here.

Another hash function algorithm, SHA-1, which is also used in data integrity applications, was shown to be vulnerable at the same conference. However, the SHA-1 vulnerability is not as severe as that shown with MD5.

What is the real effect of this?
The MD5 flaw could be used by a malicious hacker to get corrupted code onto unsuspecting users' machines by means of a forged hash code that deceives the affected server into treating the corrupted code as safe.

As an example, Byte and Switch reports that Val Bercovici, Network Appliance’s chief technical architect of ILM data protection and compliance solutions, thinks there is now a problem with content-addressed storage, or single-instance storage as he puts it. The MD5 flaw provides hackers with a shortcut method to crack the algorithm.

What might happen is that a hacker could generate a script to create a binary file with the same content address as an existing file. This cloned file could be sent to the hacker as an e-mail attachment, which gets stored in an MD5-based system. Then the hacker mails out the original file, which happens to contain sensitive or secret data. Because its hash value is the same as that of the cloned attachment, the MD5-based system doesn't store it. There is then no record of the secret data being sent out.
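The single-instance behaviour this scenario depends on can be sketched in a few lines of Python. This is a toy model, not Centera's or any vendor's design: `ContentStore`, `weak_address` and `make_clone` are hypothetical names, and the clone is found by brute force only because the demonstration address is cut down to 8 bits. It is not the researchers' technique.

```python
import hashlib
import itertools
from typing import Callable, Dict, Tuple


def md5_address(data: bytes) -> str:
    """Content address: the full MD5 digest in hex."""
    return hashlib.md5(data).hexdigest()


def weak_address(data: bytes) -> str:
    """Deliberately weak stand-in: only the first byte of the MD5 digest."""
    return hashlib.md5(data).hexdigest()[:2]


class ContentStore:
    """Toy single-instance store keyed by content address."""

    def __init__(self, address_fn: Callable[[bytes], str] = md5_address):
        self._address_fn = address_fn
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> Tuple[str, bool]:
        """Store data unless its address is already taken.

        Returns (address, stored); stored is False when the write was skipped.
        """
        address = self._address_fn(data)
        if address in self._objects:
            return address, False  # treated as a duplicate, nothing recorded
        self._objects[address] = data
        return address, True

    def get(self, address: str) -> bytes:
        return self._objects[address]


def make_clone(target: bytes, address_fn: Callable[[bytes], str]) -> bytes:
    """Brute-force different content with the same address as target.

    Feasible here only because weak_address has just 256 possible values.
    """
    wanted = address_fn(target)
    for i in itertools.count():
        candidate = b"clone-%d" % i
        if candidate != target and address_fn(candidate) == wanted:
            return candidate


if __name__ == "__main__":
    store = ContentStore(weak_address)
    secret = b"sensitive or secret data"
    clone = make_clone(secret, weak_address)
    print(store.put(clone))   # stored: the harmless clone arrives first
    print(store.put(secret))  # skipped: same address, so no record is kept
```

With the full MD5 address, the same store skips only exact repeats unless someone can construct colliding content, which is the capability under dispute.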

While this may be theoretically possible in the future, it is not what the four Chinese researchers actually showed, according to Roy Sanford, EMC’s VP of CAS, who is quoted in the same report. They showed that random files could have duplicate addresses; they did not generate a file that specifically had the same address as a target file. He also points out that Centera uses MD5 plus another EMC algorithm which has not been shown to be vulnerable. Centera files have not been demonstrated to be compromised.
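The difference between finding any two files that collide and matching one chosen target file can be illustrated at toy scale. The Python sketch below truncates MD5 to 16 bits so that plain brute force finishes quickly. The truncation, the function names and the message formats are all assumptions made for illustration, and brute force is a stand-in for the researchers' analytic method, not a reproduction of it.

```python
import hashlib
import itertools
from typing import Dict, Tuple

BITS_KEPT = 16  # toy digest size; a real MD5 digest is 128 bits


def toy_digest(data: bytes) -> bytes:
    """First 16 bits of the MD5 digest, small enough to brute-force."""
    return hashlib.md5(data).digest()[: BITS_KEPT // 8]


def find_any_collision() -> Tuple[bytes, bytes, int]:
    """Find any two different inputs sharing a toy digest.

    Returns (first_input, second_input, attempts).
    """
    seen: Dict[bytes, bytes] = {}
    for attempts in itertools.count(1):
        candidate = b"msg-%d" % attempts
        digest = toy_digest(candidate)
        if digest in seen:
            return seen[digest], candidate, attempts
        seen[digest] = candidate


def find_match_for_target(target: bytes) -> Tuple[bytes, int]:
    """Find a different input whose toy digest equals that of one given target.

    Returns (forged_input, attempts).
    """
    wanted = toy_digest(target)
    for attempts in itertools.count(1):
        candidate = b"forged-%d" % attempts
        if candidate != target and toy_digest(candidate) == wanted:
            return candidate, attempts


if __name__ == "__main__":
    a, b, n_any = find_any_collision()
    forged, n_target = find_match_for_target(b"original file")
    print("any collision after", n_any, "attempts")
    print("match for chosen target after", n_target, "attempts")
```

On a 16-bit digest the first search typically needs a few hundred attempts, while the second typically needs tens of thousands. The gap widens enormously at 128 bits, which is why a demonstrated collision does not by itself mean a chosen file can be cloned.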

Crypto 2004's chairman, Jim Hughes, is reported elsewhere as commenting that MD5 is now compromised and that data integrity methods using it should move on to better algorithms.