Visualizing Differences with Colorize and Filecompare - A Geek Raised by Wolves
jessekornblum

Visualizing Differences with Colorize and Filecompare [Mar. 16th, 2013|11:45 pm]
jessekornblum

I am pleased to announce two new tools for visualizing data and comparing files. These are certainly not the only tools for these tasks, but I'm hoping you will find them useful. Here are the download links, with details below:

Windows: http://jessekornblum.com/tools/colorize/colorize-1.0.zip
Source code: https://github.com/jessek/colorize/archive/1.0.zip
Development: https://github.com/jessek/colorize/

Background

A few weeks ago somebody asked me how to compare two disk images to each other. The two images were taken from the same drive at different times. The person wanted to know what had changed on the drive in between. In addition to the files in the logical file system, they wanted to know about anything written anywhere on the drive. What kind of tool could do that?

Fuzzy hashing (aka ssdeep) is not suited to the task, and as far as I know, sdhash isn't a good idea either. The former would most likely not generate a match between the drive images. The latter would identify lots of matches, and perhaps a percentage score between the drives. But what does it mean that the drives are 82% identical? My questioner didn't know, and was looking for another option.

Colorize

The first tool I've written, colorize, takes an input file and produces a BMP (bitmap image) to represent it. This is nothing special and has been done many times before. You can begin to recognize some file formats based on their distinctive patterns. Here are the required portraits of different kinds of files "colorized" in this manner:

Windows EXE: http://jessekornblum.com/tools/colorize/img/colorize.exe.bmp (900KB)

Mach-O Executable: http://jessekornblum.com/tools/colorize/img/colorize.bmp (45KB)

(Yes, these pictures are the colorized program run on Win32 and OS X versions of itself. Computer scientists just love when you run programs on themselves.)

JPEG: http://jessekornblum.com/tools/colorize/img/sample.jpg.bmp (268KB)

Microsoft Word DOCX file: http://jessekornblum.com/tools/colorize/img/sample.docx.bmp (552KB)

In all of these images, the data starts at the top of the image, running left to right and then top to bottom. Both directions can be reversed. The images default to a vertical orientation 100 pixels wide, but these settings can also be configured.
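The layout described above can be sketched as a mapping from a byte's offset in the input to a pixel position. This is only an illustration, not colorize's actual code: the function name `pixel_position` and its parameters are hypothetical, and it assumes one pixel per input byte.

```python
def pixel_position(offset, total_bytes, width=100,
                   right_to_left=False, bottom_to_top=False):
    """Return (x, y) for the byte at `offset`, with (0, 0) at the top left.

    Hypothetical helper: assumes one pixel per byte and a fixed image width.
    """
    rows = -(-total_bytes // width)      # ceiling division: rows in the image
    row, col = divmod(offset, width)     # default: left to right, then down
    if right_to_left:
        col = width - 1 - col            # mirror horizontally
    if bottom_to_top:
        row = rows - 1 - row             # mirror vertically
    return col, row


if __name__ == "__main__":
    # Byte 100 of a 250-byte file starts the second row in the default layout.
    print(pixel_position(100, 250))
```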

Filecompare

These are pretty pictures, but don't get us any closer to comparing hard drive images. Generating BMPs to represent entire hard drives would be impractical at best. As such, I wrote another tool, filecompare, which is used to do the comparison and create input for colorize.

As the name suggests, filecompare compares two files in blocks of a user-specified size and produces one output byte per block, denoting whether the two blocks are identical. An optional mode indicates the degree of difference between the blocks.

In the default mode, the filecompare program uses blocks of 512 bytes and indicates identical blocks with the 0x80 character. Here's the program comparing two identical files:

$ filecompare lorem.txt lorem.txt | xxd
0000000: 8080 8080 8080 8080 8080 8080 8080 8080  ................
0000010: 8080 8080 8080 8080 8080 8080 8080 8080  ................
0000020: 8080 8080 8080 8080 8080 8080 8080 8080  ................
0000030: 8080 8080 8080 8080 80                   .........
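
For readers who want to see the idea in code, here is a minimal sketch of a block-by-block comparison. It is not filecompare's source. The function name `compare_streams` is hypothetical, the 0xfe marker for differing blocks is the value that appears in the output later in this post, and the handling of a short final block is an assumption.

```python
import io

SAME = 0x80       # marker the post shows for identical blocks
DIFFERENT = 0xFE  # marker the post shows for differing blocks


def compare_streams(a, b, block_size=512):
    """Return one marker byte per block read from two binary streams."""
    out = bytearray()
    while True:
        block_a = a.read(block_size)
        block_b = b.read(block_size)
        if not block_a and not block_b:
            break  # both streams are exhausted
        # Assumption: a short or missing final block simply compares unequal.
        out.append(SAME if block_a == block_b else DIFFERENT)
    return bytes(out)


if __name__ == "__main__":
    original = b"A" * 1024 + b"B" * 512
    edited = b"A" * 1024 + b"X" + b"B" * 511   # one byte replaced in block 2
    print(compare_streams(io.BytesIO(original), io.BytesIO(edited)).hex())
```

With real files, the same function can be called on two objects returned by open(path, "rb").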


By itself this is not terribly useful, but it can be used as input to the colorize program. When we do so, the result is a solid green bar:

$ filecompare lorem.txt lorem.txt > same512.dat
$ colorize same512.dat 
$ open same512.dat.bmp




We can get a bigger picture by changing the block size from the 512 byte default to one-byte blocks:

$ filecompare -b 1 lorem.txt lorem.txt > same1.dat
$ colorize same1.dat
$ open same1.dat.bmp




Now let's try making a copy of this text file and making some changes in it. These changes are *in place*. That is, we neither insert nor delete any characters, only replace them. Inserting or deleting characters would change the alignment of all subsequent data and render these utilities useless.

$ cp lorem.txt lorem-edit.txt
$ vi lorem-edit.txt

[edits happen, highlighted in red:]

$ diff lorem.txt lorem-edit.txt
5c5
< Morbi sit amet lacus lectus, vitae auctor sapien. Praesent dictum fringilla sollicitudin. Maecenas quis vestibulum tortor. Sed fringilla, mi porttitor venenatis imperdiet, metus justo pharetra est, et condimentum metus dolor at quam. Integer placerat, mi sit amet luctus ornare, mi lorem egestas turpis, in sodales velit eros id leo. Cras in est et metus interdum auctor et at leo. Sed ac urna ante, vel semper lacus. Fusce ac urna ac mi tincidunt bibendum. Vestibulum scelerisque lacus sit amet justo congue rhoncus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Cras at semper purus.
---
> I am the lizard king! You s, vitae auctor sapien. Praesent dictum fringilla sollicitudin. Maecenas quis vestibulum tortor. Sed fringilla, mi porttitor venenatis imperdiet, metus justo pharetra est, et condimentum metus dolor at quam. Integer placerat, mi sit amet luctus ornare, mi lorem egestas turpis, in sodales velit eros id leo. Cras in est et metus interdum auctor et at leo. Sed ac urna ante, vel semper lacus. Fusce ac urna ac mi tincidunt bibendum. Vestibulum scelerisque lacus sit amet justo congue rhoncus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Cras at semper purus.
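
The reason the edit has to be in place can be shown with a small experiment. The sketch below is an illustration only: `block_markers` is a hypothetical stand-in for filecompare, and the test data is synthetic. Replacing one byte disturbs a single block, while inserting one byte shifts everything after it.

```python
def block_markers(a, b, block_size):
    """Hypothetical stand-in for filecompare: 0x80 = same block, 0xfe = different."""
    count = -(-max(len(a), len(b)) // block_size)   # ceiling division
    return bytes(
        0x80 if a[i * block_size:(i + 1) * block_size]
        == b[i * block_size:(i + 1) * block_size] else 0xFE
        for i in range(count)
    )


original = bytes(range(256)) * 8                       # 2048 bytes of test data
replaced = original[:600] + b"\xff" + original[601:]   # in-place edit
inserted = original[:600] + b"\xff" + original[600:]   # shifts the rest by one

print(block_markers(original, replaced, 512).hex())    # only one block differs
print(block_markers(original, inserted, 512).hex())    # every later block differs
```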


Let's look at the filecompare output again with the 512 byte blocks:

$ filecompare lorem.txt lorem-edit.txt | xxd
0000000: 8080 fe80 8080 8080 8080 8080 8080 8080  ................
0000010: 8080 8080 8080 8080 8080 8080 8080 8080  ................
0000020: 8080 8080 8080 8080 8080 8080 8080 8080  ................
0000030: 8080 8080 8080 8080 80                   .........


Note the single 0xfe byte at offset 0x2 in the output, which corresponds to the third 512-byte block of the files. Filecompare uses this value to indicate data which doesn't match. When we run this data through colorize, we will see a red pixel:

$ filecompare lorem.txt lorem-edit.txt > edit512.dat
$ colorize edit512.dat
$ open edit512.dat.bmp




We can see the single \xfe byte--the single different block--as a red pixel in the image.
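
Since each output byte stands for one block, a marker's position can be turned back into a byte range in the original files. The helper below is a sketch of that arithmetic; `differing_ranges` is a hypothetical name and not part of either tool.

```python
def differing_ranges(markers, block_size=512, different=0xFE):
    """Return (start, end) byte ranges, end exclusive, for each differing block."""
    return [
        (index * block_size, (index + 1) * block_size)
        for index, marker in enumerate(markers)
        if marker == different
    ]


if __name__ == "__main__":
    # First four bytes of the 512-byte-block output shown above.
    print(differing_ranges(bytes.fromhex("8080fe80")))
```

For the output above, the 0xfe is the third byte (index 2), so the change lies somewhere in bytes 1024 through 1535 of the files.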

When we decrease the block size again to one byte, we can see even more detail:

$ filecompare -b 1 lorem.txt lorem-edit.txt > edit1.dat
$ colorize edit1.dat
$ open edit1.dat.bmp




Here the change we made is a small line in the image.

Yet another way to view these changes is to show not just whether blocks differ, but by how much. Filecompare has one more mode, which uses "transitional" colors to show differences. In this mode, the average of the absolute differences between corresponding bytes is computed and returned as the pixel value for that block. Blocks which are more similar, on average, will appear darker, closer to black. Blocks which are more different will appear brighter, closer to white. Here's the one-byte filecompare output again, but this time with transitional colors. Note that the -a flag in xxd collapses runs of all-zero lines into a single "*":

$ filecompare -b 1 -t lorem.txt lorem-edit.txt | xxd -a
0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
00004b0: 0000 0000 0000 0004 4f11 0b49 540b 0454  ........O..IT..T
00004c0: 4c08 0d04 0244 4c0a 0607 0c01 4c0c 0c01  L....DL.....L...
00004d0: 5500 0000 0000 0000 0000 0000 0000 0000  U...............
00004e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0007170: 0000 0000                                ....
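
The transitional calculation can be sketched in a few lines. This is not filecompare's code: the function name `transitional` is hypothetical, and rounding the average down to an integer is an assumption that only matters for block sizes larger than one byte.

```python
def transitional(a, b, block_size=1):
    """Return the average absolute byte difference for each block of a and b."""
    out = bytearray()
    for start in range(0, min(len(a), len(b)), block_size):
        block_a = a[start:start + block_size]
        block_b = b[start:start + block_size]
        diffs = [abs(x - y) for x, y in zip(block_a, block_b)]
        # Assumption: the average is truncated to an integer in 0..255.
        out.append(sum(diffs) // len(diffs))
    return bytes(out)


if __name__ == "__main__":
    # The first six changed characters from the diff shown earlier.
    print(transitional(b"Morbi ", b"I am t").hex())
```

Run on the start of the edited text, this reproduces the first changed bytes in the dump above (04 4f 11 0b 49 54).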


Note that some of these blocks--in this case, bytes--are quite different, but in others, the values are more similar. The result is immediately visible in the image:

$ filecompare -b 1 -t lorem.txt lorem-edit.txt > transition1.dat
$ colorize transition1.dat
$ open transition1.dat.bmp




Downloads

Try it out for yourself:

Windows: http://jessekornblum.com/tools/colorize/colorize-1.0.zip
Source code: https://github.com/jessek/colorize/archive/1.0.zip
Development: https://github.com/jessek/colorize/

The Fine Print

Although the filecompare program claims to support large block sizes, the maximum supported size is 512 MB blocks.

Filecompare and colorize are licensed under the GNU General Public License version 3 (GPLv3).

The original JPEG image was from Flickr user s13n1 and used under a Creative Commons license, http://www.flickr.com/photos/s13n1/6216195937/. The sample text in the Word document was generated by http://www.lipsum.com/.

You can contact me by commenting on this post, emailing research@jessekornblum.com, or tweeting me http://twitter.com/jessekornblum.

No animals were harmed during the writing of this software, which is a bit of a change. Normally there are a few goats or cats involved somehow.