The Fuzzy Hashing Patent [May. 15th, 2008|06:40 am]
It appears that somebody has patented fuzzy hashing. Specifically, US Patent 7,272,602, System and method for unorchestrated determination of data sequences using "sticky byte" factoring to determine breakpoints in digital sequences, was issued to Gregory Hagan Moulton of the EMC Corporation on 18 Sep 2007.

When I published my fuzzy hashing paper, I expected (and hoped) other researchers would improve the algorithm. A search this morning revealed two papers that appear to do so: "An Efficient Piecewise Hashing Method for Computer Forensics" and "Improving Disk Sector Integrity Using 3-dimension Hashing Scheme". Excellent! I can't wait to read these.

But the same search also revealed the patent. Although the application was submitted in 2004, the patent examiner apparently cited my fuzzy hashing paper, which was published in 2006. Please don't ever let anybody tell you I "invented" fuzzy hashing. I had the idea of using the existing (bad) spam detector, spamsum, for computer forensics. I combined the existing spamsum engine with the md5deep interface to create ssdeep and wrote the paper to explain it.

What does the existence of this patent mean? Should I no longer be working on fuzzy hashing? Do I need to pay a license? Does the existence of spamsum and what it was based on (rsync) count as prior art? Does this patent cover what I think it covers? More? Less?

Is nilsimsa mentioned as prior art in this patent? (Unless you mean something else by "fuzzy hashing", I think the idea is much older than 2004.)
I don't think it was mentioned in the patent, but it was certainly something we considered when we first got into similarity matching!
Ah, so you do know of it. What do you think of the algorithm? I seem to recall it was included in Vipul's Razor for a while, but removed for some reason.

Regardless, it might prove to be valuable should there be a patent challenge. Certainly the idea of fuzzy hashing is demonstrably older than that patent, so they would at best be left with a specific class of fuzzy hashing techniques, and at worst be left with only their algorithm.

I should read the patent before trying to answer the questions in your last paragraph. Most of my work with fuzzy hashing has to do with signal acquisition issues and is covered by trade secrets rather than patents, but there is some overlap.
I think you are ok

The patent cited spamsum as prior art. It's probably a different usage of fuzzy hashing. I only skimmed the patent. You might want to read it in depth to see what it does to avoid the pitfalls.
