There have been some exciting advances in hashing technology over the past few years and it’s time for md5deep to catch up. It’s also time to remove some of the original design constraints. The program, rapidly approaching its eighth birthday, was first written to support the First Responder's Evidence Disk (FRED). At the time there were no recursive MD5 programs, so I wrote my own. But the requirements of FRED, such as being small enough to fit on a floppy disk, are no longer relevant.
The largest user-visible change to the program will be speeding up the file matching modes. The current version reads each input file in its entirety, hashes it, and then compares the hash to the entire set of known hashes. While this works, it requires the program to read and hash every byte of every input file.
Research conducted over the past few years has shown there are faster ways to match known files without sacrificing accuracy. Because the program reads the directory entry for each file before processing it, it can get the size of each input file ahead of time. If the size of an input file does not match the size of any known file, the input cannot possibly match any known file, and is skipped in its entirety. (Yes, it is theoretically possible that two files of different sizes could have the same cryptographic hash. I will buy a beer for the first person who can show me such a construction, and then alter the program to handle that situation.)
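The size check amounts to set membership on data that comes free with each directory entry. A rough sketch in Python (the set of known sizes here is hypothetical; md5deep itself is written in C):

```python
import os

# Hypothetical sizes drawn from a set of known hashes.
KNOWN_SIZES = {37, 63}

def candidates(paths):
    """Yield only input files whose size matches some known file.

    A file whose size appears nowhere in the known set cannot match
    any known hash, so it is skipped without being read at all.
    """
    for path in paths:
        if os.path.getsize(path) in KNOWN_SIZES:
            yield path
```

The size comes from the directory entry, so files that cannot match cost one `stat` call instead of a full read.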
In the next step of a comparison operation, the program can read a small portion of the input file and hash it. If this partial hash of the input file does not match the partial hash of the known file, then the files cannot possibly match, and reading and hashing the remainder of the input can be skipped. This approach goes by many names, including partial hashing, Fibonacci Hashing, and header hashing. There are too many references for me to cite here, but let the record reflect that these ideas were developed by other people.
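The idea can be sketched in a few lines of Python. The 8,192-byte window matches the example format below, but the exact length is an open design choice:

```python
import hashlib

PARTIAL_SIZE = 8192  # bytes hashed for the partial ("header") hash

def partial_md5(path, length=PARTIAL_SIZE):
    """Return the MD5 of the first `length` bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(length)).hexdigest()

def could_match(path, known_partial_hashes):
    """Cheap pre-filter: hash only the file header and test membership.

    Only files that survive this test need to be read and hashed
    in full; everything else is rejected after one short read.
    """
    return partial_md5(path) in known_partial_hashes
```

For files shorter than the window, the partial hash simply equals the full hash, so the filter stays correct.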
Supporting these innovations will require adding two fields to sets of known hashes, namely the file size and a partial hash. I plan to update the hashdeep file format. The change will be minimal: merely adding another valid column name, specifically for the partial hash. Here’s an example, where the partial hash is called “md5_8192”, perhaps an MD5 of the first 8,192 bytes of each file:
%%%% HASHDEEP-1.1
%%%% size,md5_8192,md5,filename
## Invoked from: /Users/jessek
## $ md5deep foo.txt bar.txt
37,58500e774618a18997cecb1ff050aabd,170efaa2ccadbe4bd65842fe3d18ede2,foo.txt
63,4dd9616dd87e06f6e0a0ea602b1340f4,fff1898ff22415cc22f66c55bd4faeb1,bar.txt
While md5deep version four will be able to generate hashes in either this new style or the traditional format, which should be the default? If the old style is the default, users may not be able to take advantage of the faster matching operations. If the new style is the default, the program may become unwieldy for users trying to get hashes of just a few files. Yet another option would be to remove the matching operations from md5deep altogether and instead encourage users to use hashdeep for all matching operations. That way md5deep retains its purity, but users must adapt to having a second tool.
What do you think?