Log in

Designing md5deep v4’s Default Output - A Geek Raised by Wolves [entries|archive|friends|userinfo]

[ website | My Website ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

[Links:| Browse by Tag LiveJournal Portal Update Journal Logout ]

Designing md5deep v4’s Default Output [Jul. 14th, 2010|10:45 am]

I would like to get some input on the upcoming md5deep version four. My biggest question is whether the program should default to producing the current plain form of hashes, or move to a new format for more efficient matching operations? It will be possible to get output in both formats, of course, but which should be the default?

There have been some exciting advances in hashing technology over the past few years and it’s time for md5deep to catch up. It’s also time to remove some of the original design constraints. The program, rapidly approaching its eighth birthday, was first written to support the First Responder's Evidence Disk (FRED). At the time there were no recursive MD5 programs, so I wrote my own. But the requirements of FRED, such as being as small as possible to fit on a floppy disc, are no longer relevant.

The largest user visible change to the program will be speeding up the file matching modes. The current version reads each input file in its entirety, hashes it, and then compares the hash to the entire set of known hashes. While this works, it requires the program read and hash every byte of every input file.

Research conducted over the past few years and shown there are faster ways to match known files without sacrificing accuracy. Because the program reads the directory entry for each file before processing it, the program can get the size of each input file ahead of time. If the size of an input file does not match the size of any known file, the input cannot possibly match any known file, and is skipped in its entirety. (Yes, it is theoretically possible that two files, of different sizes, could have the same cryptographic hash. I will buy a beer to the first person who can show me such a construction, and then alter the program to handle that situation.)

In the next step of a comparison operation, the program can read a small portion of the input file and hash it. If this partial hash of the input file does not match the partial hash of the known file, then the overall files cannot possibly match and the remainder of the file read and hash can be skipped. This approach goes by many names, including partial hashing, Fibonacci Hashing, and header hashing. There are too many references for me to cite here, but let the record reflect these ideas were developed by other people.

Supporting these innovations will require the program to add two additional fields to sets of known hashes, namely the file size and a partial hash. I plan to update the hashdeep file format. The change will be minimal, merely to add another valid column name, specifically for the partial hash. Here’s an example, where the partial hash is called “md5_8192”, perhaps an MD5 of the first 8,192 bytes of each file:

%%%% HASHDEEP-1.1
%%%% size,md5_8192,md5,filename
## Invoked from: /Users/jessek
## $ md5deep foo.txt bar.txt

While md5deep version four will be able to generate hashes in either this new style or the traditional format, which should be the default? If the old style is the default, users may not be able to take advantage of the faster matching operations. If the new style is the default, the program may become unwieldy for users trying to get hashes of just a few files. Yet another option would be to remove the matching operations from md5deep altogether and instead encourage users to use hashdeep for all matching operations. That way md5deep retains its purity, but users must adapt to having a second tool.

What do you think?

From: (Anonymous)
2010-07-14 11:29 pm (UTC)

Vote for new style output

I think most users of these tools would be able to adapt to the new output, so I vote for the new output to be default.
(Reply) (Thread)
[User Picture]From: capnbuckle
2010-07-15 01:59 am (UTC)
Hi Jesse! While I agree with Anonymous that most users of these tools would be able to adapt, any scripts which were written to both run md5deep and then consume the output will not be able to adapt...at least not by themselves.

So my opinion as a system administrator is that the older format be the default. This way scripts do not immediately start falling over if and when md5deep is updated on a system. I understand such a conservative viewpoint can be frustrating for a developer, because obviously you _want_ your users to be able to easily take advantage of the new functionality or you wouldn't have put it in there in the first place! But I think it's better to make use of the new functionality a conscious choice rather than present the maintainer of a script or process with an unpleasant surprise.

(Reply) (Thread)
From: (Anonymous)
2010-07-20 04:54 pm (UTC)

New format

Keep moving forward. Use the new format, especially if scripts expecting the old format and receiving the new format are likely to abend in a manner that is obviously in need of fixing. If that is not the case (I don't know how difficult that would be to determine), make 'em use hashdeep for the new style and keep md5deep hash matching as it currently is.
(Reply) (Thread)
From: (Anonymous)
2010-08-18 07:37 pm (UTC)

put fixed-length fields first

If you change the format - which is OK by me - I'd just suggest putting the fixed-length MD5 fields before the size and file name, so they be easily cut rather than parsed on field separators. fwiw.
(Reply) (Thread)