A Geek Raised by Wolves
jessekornblum


Fuzzy Hashing for Source Code Comparison [Oct. 5th, 2006|08:05 am]

Back in August I published a new tool for "fuzzy hashing" (slides, paper, ssdeep program) that can identify documents that have homologies but are not necessarily identical. That is, they have a lot of the same bits in the same order, but are not the same.
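To give a rough feel for how that works, here's a toy sketch in Python. It is not the actual ssdeep algorithm, just an illustration of the general context-triggered piecewise hashing idea: a rolling value computed over a small window of recent bytes decides where to cut the input into pieces, each piece is hashed, and the concatenated piece hashes form the signature. Because the cut points depend on local content rather than absolute file offsets, inserting or deleting a few bytes only disturbs the pieces near the change.

import hashlib

def fuzzy_signature(data: bytes, window: int = 7, trigger: int = 64) -> str:
    # Toy context-triggered piecewise hash -- NOT the real ssdeep algorithm.
    # A rolling sum over the last `window` bytes picks chunk boundaries;
    # each chunk contributes one character to the signature.
    sig, chunk = [], bytearray()
    for i, b in enumerate(data):
        chunk.append(b)
        rolling = sum(data[max(0, i - window + 1):i + 1])
        if rolling % trigger == trigger - 1:  # context-triggered cut point
            sig.append(hashlib.md5(bytes(chunk)).hexdigest()[0])
            chunk.clear()
    if chunk:
        sig.append(hashlib.md5(bytes(chunk)).hexdigest()[0])
    return "".join(sig)

# Mostly-identical inputs yield mostly-identical signatures, so an
# edit-distance comparison of two signatures scores their similarity.
print(fuzzy_signature(b"The quick brown fox jumps over the lazy dog" * 20))
print(fuzzy_signature(b"The quick brown fox jumped over the lazy dog" * 20))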

You can use fuzzy hashing to find source code reuse. For example, let's say you suspect I reused source code from one of my earlier projects, md5deep, when writing the fuzzy hashing program ssdeep. (Computer scientist types just love it when you use a program to analyze itself. Somewhere there are 6.001 geeks dancing with metasyntactic glee as they read this.)

Let's say we have two folders, ssdeep-1.1 and md5deep-1.12. First we record the fuzzy hashes recursively (the -r switch), with relative filenames (the -l switch), to a file:
C:\> ssdeep -lr md5deep-1.12 > hashes.txt

Then we compare those saved hashes with the other directory:
C:\> ssdeep -lrm hashes.txt ssdeep-1.1
ssdeep-1.1\cycles.c matches md5deep-1.12\cycles.c (94)
ssdeep-1.1\dig.c matches md5deep-1.12\dig.c (35)
ssdeep-1.1\helpers.c matches md5deep-1.12\helpers.c (57)
Aha! Those matches indicate source code reuse! A manual examination of the files in question is required to tell exactly what kind of copying occurred, but we've saved ourselves a lot of work!
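If you want to post-process those results, the match lines are easy to parse. Here's a small sketch (Python; the regular expression is written against the output format shown above, and the 50-point threshold is an arbitrary choice) that runs the same comparison and keeps only the stronger matches:

import re
import subprocess

THRESHOLD = 50  # arbitrary cutoff; tune to taste

# Assumes ssdeep is on the PATH and hashes.txt was created as shown above.
out = subprocess.run(["ssdeep", "-lrm", "hashes.txt", "ssdeep-1.1"],
                     capture_output=True, text=True).stdout

# Lines look like: ssdeep-1.1\helpers.c matches md5deep-1.12\helpers.c (57)
pattern = re.compile(r"^(.+) matches (.+) \((\d+)\)$")
for line in out.splitlines():
    m = pattern.match(line)
    if m and int(m.group(3)) >= THRESHOLD:
        print(f"{m.group(1)} <-> {m.group(2)}: score {m.group(3)}")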

For the really geeky, there's a way to do this with a single command, but it will also include all of the matches internal to each directory. Like this:
C:\> ssdeep -lrd md5deep-1.12 ssdeep-1.1
md5deep-1.12\md5.h matches md5deep-1.12\cycles.c (27)
md5deep-1.12\sha1.h matches md5deep-1.12\cycles.c (25)
md5deep-1.12\sha1.h matches md5deep-1.12\md5.h (58)
md5deep-1.12\sha256.h matches md5deep-1.12\cycles.c (25)
md5deep-1.12\sha256.h matches md5deep-1.12\md5.h (61)
md5deep-1.12\sha256.h matches md5deep-1.12\sha1.h (57)
md5deep-1.12\tiger.h matches md5deep-1.12\cycles.c (29)
md5deep-1.12\tiger.h matches md5deep-1.12\md5.h (65)
md5deep-1.12\tiger.h matches md5deep-1.12\sha1.h (63)
md5deep-1.12\tiger.h matches md5deep-1.12\sha256.h (61)
ssdeep-1.1\cycles.c matches md5deep-1.12\cycles.c (94)
ssdeep-1.1\dig.c matches md5deep-1.12\dig.c (35)
ssdeep-1.1\helpers.c matches md5deep-1.12\helpers.c (57)

If you'd like to see the matches in both directions (i.e. for two files A and B that match, see that A matches B and B matches A), use the -p flag instead of -d.

Comments:
From: dancinglights
2006-10-05 12:37 pm (UTC)

Depending on the compiler, I think you can do this with the compiled code to match reused code even after comment and variable name changes. An odd application, perhaps, but related to what my university automatically runs against student project submissions to catch cheating in low-level CS classes, where you have a thousand students and several different human graders.
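A sketch of that approach, assuming gcc and ssdeep are on the PATH and using two hypothetical submissions, alice.c and bob.c; compiling first strips out the comments and identifier names that a copier is most likely to change:

import subprocess

# Compile each submission; comments and variable names don't survive
# compilation, so superficial disguises disappear in the object files.
for src in ("alice.c", "bob.c"):
    subprocess.run(["gcc", "-c", src], check=True)  # emits alice.o / bob.o

# Compare the object files directly with ssdeep's -d matching mode.
subprocess.run(["ssdeep", "-d", "alice.o", "bob.o"])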
From: jessekornblum
2006-10-05 01:47 pm (UTC)

Yes! You are absolutely right. Can you try this out? Or are you not a grader?

Additionally, fuzzy hashing can be used in lieu of MD5 for applications like Tripwire when prelinking is in use. The prelinking process, which speeds up program loading, also changes the MD5 hash of each program. But the fuzzy hashes of the new executables still match the originals!
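A minimal sketch of that idea (Python, assuming ssdeep is on the PATH and using the same -l/-r/-m switches as above; the paths are hypothetical, and the path strings ssdeep prints must line up with what os.walk produces):

import os
import subprocess

BASELINE, TARGET = "baseline.txt", "/usr/bin"

# Step 1, run once while the system is known-good:
# equivalent to: ssdeep -lr /usr/bin > baseline.txt
with open(BASELINE, "w") as f:
    subprocess.run(["ssdeep", "-lr", TARGET], stdout=f)

# Step 2, run later (e.g. after prelinking): prelinked binaries still
# match their baseline fuzzy hashes even though their MD5s changed, so
# anything absent from the match output deserves a closer look.
out = subprocess.run(["ssdeep", "-lrm", BASELINE, TARGET],
                     capture_output=True, text=True).stdout
matched = {line.split(" matches ")[0]
           for line in out.splitlines() if " matches " in line}

for root, _, names in os.walk(TARGET):
    for name in names:
        path = os.path.join(root, name)
        if path not in matched:
            print("no fuzzy match for", path)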
From: dancinglights
2006-10-05 01:52 pm (UTC)

I was a grader for several years, but it's a student position and I've long since graduated. You would be surprised (or maybe not) how many people we caught with a similar process because they thought changing things the compiler strips out anyway would make their copying undetectable. I know how *parts* of the tools we used worked, but not the whole system, so I'm still interested in things like this as a chance to retroactively figure it out. Also, I have no idea how other schools do things, so it could be a fairly good suggestion for other places.
From: illix
2006-10-05 05:18 pm (UTC)

That's pretty cool. It might make an interesting tool for tracking down open source fraud (like when proprietary software includes GPL'ed code and doesn't acknowledge it).

Awrk, 6.001... those were the days (the days of having 3h of sleep and yet acing the exam). How I miss you. Except... not.
From: jessekornblum
2006-10-06 02:49 am (UTC)

It might make an interesting tool for tracking down open source fraud (like when proprietary software includes GPL'ed code and doesn't acknowledge it).

Absolutely! I've done anecdotal testing to show that object files match, but have had trouble with whole executables.
From: thewronghands
2006-10-06 03:28 am (UTC)

Have you had any problems with false positives: matches that look like code reuse but were actually developed independently, just in the same way? (Obviously your code =/= legal case and all, though it certainly could produce some helpful evidence.) I'm just curious as to how often (if ever) that sort of thing happens.
From: jessekornblum
2006-10-06 03:44 am (UTC)

All of the false positives I've seen have come from stub files. They're short, contain very similar headers and footers, and have no real substance. I think there may be a feature in the next version to exclude files below a certain size.
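Until then, one workaround is to keep the small files out of the comparison yourself. A quick sketch (the 512-byte cutoff is an arbitrary choice):

import os
import subprocess

MIN_SIZE = 512  # arbitrary "stub file" cutoff, in bytes

files = [os.path.join(root, name)
         for root, _, names in os.walk("md5deep-1.12")
         for name in names
         if os.path.getsize(os.path.join(root, name)) >= MIN_SIZE]

# Hash only the files that made the cut, as in the example in the post.
with open("hashes.txt", "w") as f:
    subprocess.run(["ssdeep", "-l"] + files, stdout=f)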
From: snarl817
2006-10-06 04:18 pm (UTC)

That's really cool.

As far as image hashing goes, I had (YEARS ago, when I could still write code) considered using densitometry on an image to generate a hash in a similar manner. Basically, generate densitometry maps for strips of the image (using 10% increments, both horizontal and vertical) and hash that data (somehow), then store it. I have no idea how accurate it would be, but combining that methodology with your algorithm might yield some interesting results. And by generating hashes based on 10% strips of the image, the data should survive something like a resize, and even a crop. Plus, it could even match images that have had text overlaid on them.
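That idea can be sketched in a few lines. Here's an illustration (Python with the Pillow imaging library; the mean luminance of each strip stands in for "densitometry", and the comparison is a simple per-strip difference rather than any particular fuzzy hashing scheme):

from PIL import Image, ImageStat

def strip_signature(path, strips=10):
    # Mean luminance of each 10% strip, horizontal strips then vertical.
    img = Image.open(path).convert("L")
    w, h = img.size
    sig = []
    for i in range(strips):  # horizontal strips
        box = (0, h * i // strips, w, h * (i + 1) // strips)
        sig.append(ImageStat.Stat(img.crop(box)).mean[0])
    for i in range(strips):  # vertical strips
        box = (w * i // strips, 0, w * (i + 1) // strips, h)
        sig.append(ImageStat.Stat(img.crop(box)).mean[0])
    return sig

def similarity(a, b):
    # Crude 0-100 score: average per-strip closeness in luminance.
    diffs = [abs(x - y) / 255.0 for x, y in zip(a, b)]
    return 100.0 * (1.0 - sum(diffs) / len(diffs))

# Hypothetical files: a resized or lightly edited copy should still score high.
# print(similarity(strip_signature("original.jpg"), strip_signature("resized.jpg")))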