Of Hashes and Clashes
There’s a lot in the news lately about the virtues of various hash algorithms, such as SHA1, SHA2, SHA3, and the venerable MD5. Essentially these are wicked complex cryptographic algorithms (too sophisticated for most of us to understand) that distill the contents of a file or any stream of data of any size, perhaps megabytes or more, into a short, fixed-length value called a hash or digest, usually written as a string of hexadecimal characters. A hash is small, fitting easily on one line of text. That string is absolute gobbledygook and utterly meaningless on its own, but it has one very useful purpose: you can use it to tell whether the content of a file or stream of data is the same as it was the last time you looked at it.
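As a rough sketch of that "is it still the same?" check, here is how one might compute a file's digest in Python and compare it with one recorded earlier. The helper names `file_digest` and `is_unchanged` are made up for illustration, and SHA-256 (a member of the SHA2 family) is just one reasonable choice of algorithm.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks so that
    even very large files never need to fit in memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_unchanged(path, recorded_digest, algorithm="sha256"):
    """True if the file's current digest matches the one recorded earlier."""
    return file_digest(path, algorithm) == recorded_digest
```

In use, you would record `file_digest(path)` once, then call `is_unchanged(path, recorded)` whenever you want to check that the file has not been altered or corrupted since.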
The hope with hashing algorithms is that they’ll never “fail,” that is, that any change to file contents, even the slightest one, will result in a completely different hash. When two files with different content have the same hash, that’s called a collision.
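A small Python sketch can illustrate the "slightest change" behavior: it hashes two sentences that differ by a single letter and counts how many bits of the two digests differ. The sample sentences and the helper name `differing_bits` are chosen only for illustration.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"The quick brown fox jumps over the lazy dog"
tweaked = b"The quick brown fox jumps over the lazy cog"  # one letter changed

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(tweaked).digest()

print(d1.hex())
print(d2.hex())
print(differing_bits(d1, d2), "of", len(d1) * 8, "bits differ")
```

For a well-behaved hash, roughly half of the output bits are expected to flip, so the two printed digests should look unrelated.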
It is generally accepted that naturally occurring collisions have an insanely low probability. However, hashes are in the news lately because some smart folks at Google have shown that Bad Guys with a ton of CPU power can artificially manufacture a “good” file and a “bad” file that share the same SHA1 hash, in theory allowing Bad Guys to substitute bad content for good.
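To put a rough number on “insanely low,” the standard birthday-bound approximation estimates the chance that any two of n randomly chosen inputs share a b-bit hash as about 1 - exp(-n² / 2^(b+1)). The sketch below assumes the hash behaves like a perfectly random function, which is an idealization; the function name and the figure of one trillion files are illustrative.

```python
import math

def accidental_collision_probability(num_files: int, hash_bits: int) -> float:
    """Birthday-bound estimate: p is roughly 1 - exp(-n^2 / 2^(bits + 1))."""
    exponent = num_files * num_files / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA-256", 256)]:
    p = accidental_collision_probability(10**12, bits)
    print(f"{name}: about {p:.3g} for a trillion files")
```

This estimate covers accidents only. A deliberately manufactured collision exploits weaknesses in the algorithm itself, so it is not bounded by these odds.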
That has caused a bit of concern in the version management world, because version control systems like Git and Subversion rely on hashes to verify that the contents of an entire repository are “known good stuff.” The ability of a Bad Guy to manufacture collisions at will would give them, in theory, the ability to sneak a corrupt repository with contents of their choosing in place of a good one, by injecting garbage data to make the hash match. Most experts consider it a bit of a stretch that such a replacement could occur undetected. But regardless, developers of Git and Subversion are taking the threat seriously and working to defend against possible attacks. They are considering, for example, moving from SHA1 to stronger hash algorithms, and detecting “collision attacks” so that the offending content can be rejected. Admins of these systems would need to upgrade to the latest version (once it is available) to be safer.
Though unlikely, the risk is that a Bad Guy with regular user (non-admin) access could submit a bogus file that results in a collision.
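For a concrete sense of what a submitted file turns into, Git identifies a file's contents (a “blob”) by the SHA1 of a short header followed by the raw bytes. The sketch below reproduces that calculation in Python. The function name `git_blob_id` is made up for illustration, and it covers only SHA1-based repositories.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with these contents.

    Git hashes the header "blob <size>" plus a NUL byte, followed by the data.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))  # the ID Git uses for an empty file
```

The result should match what `git hash-object` reports for the same bytes. Two different files that produced the same ID would be indistinguishable to Git by ID alone, which is why a collision matters there.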
Perforce uses hashes in a very different way from Git and Subversion, and is far less vulnerable to the risk of hash collisions. Unlike Git and Subversion, which use a hash to represent an entire repository of files at any point in time, Perforce uses hashes sparingly, only to verify the contents of individual files. Submitting bogus collision files would do nothing more than add individual junk files to the server: lots of pain for the Bad Guys, and no gain in terms of causing harm. Without direct admin access to the master server machine, the ability to generate hash collisions wouldn’t benefit an attacker against a Perforce server. That’s part of why Perforce still uses the venerable MD5 algorithm, yet is no less safe for it. Where hash algorithms are core to the design and integrity of the entire repository in Git and Subversion, Perforce relies on hashes only to guard against disk rot or network hiccups during the transfer of individual files.
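The difference between the two designs can be sketched in a few lines of Python. This is a toy model under loud assumptions: neither function reflects the actual storage format or internals of Perforce or Git, and the names `per_file_digests` and `repository_digest` are invented here. It only contrasts "one independent digest per file" with "one digest that stands for everything."

```python
import hashlib

def per_file_digests(files: dict[str, bytes]) -> dict[str, str]:
    """One independent MD5 digest per file; each file is checked on its own."""
    return {path: hashlib.md5(data).hexdigest() for path, data in files.items()}

def repository_digest(files: dict[str, bytes]) -> str:
    """A single SHA1 digest covering every path and every file's own digest."""
    root = hashlib.sha1()
    for path in sorted(files):
        file_id = hashlib.sha1(files[path]).hexdigest()
        root.update(f"{path}\0{file_id}\n".encode("utf-8"))
    return root.hexdigest()

before = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
after = dict(before, README=b"hello!\n")  # edit one file only

old, new = per_file_digests(before), per_file_digests(after)
changed = [path for path in before if old[path] != new[path]]
print("files whose own digest changed:", changed)
print("repository digest changed:", repository_digest(before) != repository_digest(after))
```

In the per-file model, a colliding file affects only its own entry. In the whole-repository model, the single top-level digest is what vouches for everything beneath it, so the trust placed in the hash algorithm is correspondingly greater.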
With any of the version control systems mentioned here (Perforce, Git, and Subversion), a successful attack would require far more than the ability to generate collisions in whatever hashing algorithm is used. Though the potential to damage repos has in fact been demonstrated, improvements in those systems will make future attacks even harder, and hopefully not worth the effort. That said, the best defense is to always have a few Good Guys who wear Black Hats, think like the Bad Guys, and help keep us all safer.