Log in

Practical "Looks Like" Similarity - A Geek Raised by Wolves [entries|archive|friends|userinfo]

[ website | My Website ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

[Links:| Browse by Tag LiveJournal Portal Update Journal Logout ]

Practical "Looks Like" Similarity [Oct. 4th, 2012|04:13 pm]
[Tags|, , , , , ]

Despite all of the research being done on similarity, there are still no practical tools for identifying similar pictures—'this picture looks like that one'— suitable for computer forensics. (I would be happy to be proven wrong.) The following is an explanation of how such a tool would work and provides a proof-of-concept.

The technical names for this problem are Perceptual Hashing or Content Based Image Retrieval (CBIR). A CBIR tool would take a picture as a search query and, from a set of other pictures, identify any which are look like the query.

There are several companies trying to bring such products to market [1]. So far I haven't found any products in the list which would meet our needs, but again, I would be happy to be proven wrong. Several people have pointed me to pHash, but I haven't been able to get it to compile [2].

Forensic examiners are unique in that they need the CBIR engine and dataset to reside on their own computer. The pictures in question may contain child pornography or other evidence which can't be released to anybody else. (No, we can't upload our data to a web-based provider, sorry.)

The definition of "looks like" for pictures is, of course, highly subjective. Let's start with an easy example. You could have the "same" image, but saved in different formats (e.g. JPEG and PNG). The word "same" is in quotes because the compression processes guarantees that these images are not the same when compressed, and almost certainly not the same even when not compressed. Although almost indistinguishable by the human eye, the patterns of ones and zeros is radically different. Such differences render traditional fuzzy hashing ineffective.

For example, here are two pictures of my cat Luke. The first is a JPEG, the second is a PNG. They look identical, but have different traditional and fuzzy hashes.

$ md5deep -b kitty1-small.* 
1dcfeb21e006836bb4a999e07ad54571  kitty1-small.jpg
12de826dce2567c85af22750644bc9ed  kitty1-small.png

$ ssdeep -bad kitty1-small.* 
kitty1-small.png matches kitty1-small.jpg (0)

But there is another way!

We can make a "similarity score" for pictures by reducing the information contained in the picture and then comparing those reduced data. For example, Dr. Neil Krawetz described such a method on his blog [3]. The technique reduces any image to an 8x8 image, converts it to grayscale, throws away some bits from each pixel, and computes the average value of any pixel in this image. A signature is then constructed based on whether each pixel is above or below the average value. The result is a single 64-bit number which represents the image.

I've modified an existing Python script by David J. Oftedal to compute this algorithm [4]. The script can store these 'hashes' and compare them to new files. The result is a system which works a little like fuzzy hashing to generate matches between similar looking pictures. The script is called samecat.py, as it was written to identify similar looking CP. (Cat Pictures. What were you thinking?)

Here a sample of my script in action. First, we generate a signature on one of the pictures of my cat from above:
$ python samecat.py kitty1-small.jpg

Note the 64-bit number followed by a comma and the filename. We can save this signature to a file and then use it for matching:
$ python samecat.py kitty1-small.jpg > known.txt

$ python samecat.py -m known.txt *
kitty1-small.jpg matches kitty1-small.png (100)

We've found a match! The JPEG picture matches the PNG picture.

This kind of approach works for other kinds of similar pictures too. Pictures where one has been slightly resized, cropped, or color enhanced, for example, could be similar. Computer scientists love these kinds of pictures because they can be automatically created. In the sample data set, below, I've included different sizes of the kitty pictures. The script can readily find matches among these pictures.

In my testing, I experimented with the corpus of JPEGs available from digitalcorpora.org [5]. In directory '000' of this corpus, for example, the program found four pairs of images where one was a resized version of the other.

This technique for picture similarity also applies when the pictures are slightly different, but are still, to the naked eye, almost identical. Here are two more pictures of Luke, but this time the images are in fact slightly different. In the second image, notice that the position of his head is different. There is also a patch of brown in the upper left.

If we loosen the similarity threshold a little, we can generate a match between these two pictures too. The threshold is controlled with the -t flag. First we create a signature, save it to a file, and then search for matches.
$ python samecat.py kitty1-small.jpg 

$ python samecat.py kitty1-small.jpg > known.txt 

$ python samecat.py -m known.txt -t 90 kitty2-small.jpg 
kitty1-small.jpg matches kitty2-small.jpg (93)

Again, a match! The number in parenthesis is the match score, out of 100.

The samecat.py script is a proof-of-concept only. It's slow and there is almost certainly a better algorithm out there for this task. On the other hand, let's compare it to the other tools out there for this work. Oh wait! There are none!

Admittedly, I hope to be proved wrong. I hope that one of you can let me know about some existing tool which I've overlooked which does precisely what I want. But for now, enjoy!

Questions? Comments? LiveJournal users can reply below, and anybody can reach me on Twitter at @jessekornblum. You can also send an email to research@jessekornblum.com.

Download Links
  • samecat.py - Python script to identify similar looking pictures
  • kitty.zip - Some cat pictures to get you started

The samecat.py script requires the Python Imaging Library (PIL), http://www.pythonware.com/products/pil/. Windows users can get a binary installer from http://effbot.org/downloads/#pil.

You can see a complete list of command line options by using the -h flag.

The script is designed to work like the *deep programs. There is a -r flag for recursive operation. The -t flag sets the threshold for matching, which defaults to 100 (out of 100). Setting a lower threshold will result in more matches. The lowest threshold is zero.

By default the program hides any matches where the files have the same name (e.g. "foo/bar.txt matches foo/bar.txt"). Using the -s flag causes these self matches to be displayed.

[1] http://en.wikipedia.org/wiki/List_of_CBIR_engines

[2] http://www.phash.org/

[3] Dr. Neil Krawetz, The Hacker Factor Blog, 26 May 2011, http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html.

[4] David J. Oftedal, AverageHash, http://folk.uio.no/davidjo/programming.php

[5] http://digitalcorpora.org/archives/250