Parsing web server logs for search queries - A Geek Raised by Wolves
jessekornblum


Parsing web server logs for search queries [Feb. 24th, 2007|10:39 am]

I've been looking at my web server logs recently because, well, I'm a geek and my wife is out of town. Using some homebrew scripts I've been able to get a peek at what people have been searching for when they reach my web site.

Every time you view a web page your computer sends a bunch of information to the web server sending you the page. Not only does your computer ask for a particular page, but it also says which page you are coming from. For example, if you're viewing http://www.whitehouse.gov/ and click on the link for http://svr.gov.ru/ [1], the svr.gov.ru web server is told that you came from www.whitehouse.gov. The site that sent you to a new web site is called the referring site and the specific page you were on is called the referring URL [2].

The referring URL can be very informative when dealing with search engines. When you run a search on Google, for example, your search terms appear in the URL. Searching for "Happy Puppies" on Google sends me to the page http://www.google.com/search?hl=en&q=%22Happy+Puppies%22&btnG=Google+Search. See the words "Happy Puppies" in there? If the user then clicks on a link from that page, the Google URL, including the search terms, is sent to the web site hosting the search result.
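To make the encoding concrete, here is a minimal sketch of pulling the search terms back out of a URL like that one. The helper name search_terms is made up for illustration, and the sketch assumes the terms live in a parameter named q, as in the Google URL above; other search engines may use a different parameter name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the decoded value of the q= parameter,
# or nothing at all when the URL has no q= parameter.
sub search_terms {
    my ($url) = @_;

    # The query string is everything after the first '?'
    my (undef, $query) = split /\?/, $url, 2;
    return unless defined $query;

    for my $pair (split /&/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $value && $name eq 'q';

        $value =~ tr/+/ /;                               # '+' stands for a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # %22 becomes a double quote
        return $value;
    }
    return;
}

# Prints the terms from the example URL, quotes included
print search_terms('http://www.google.com/search?hl=en&q=%22Happy+Puppies%22&btnG=Google+Search'), "\n";
```

The plus signs are turned into spaces before the percent escapes are decoded, so that an escaped plus sign (%2B) in the original query survives as a literal plus.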

For example, let's say that somebody searches for the phrase "jesse kornblum" on Google and then follows a link to my site. My computer will record something like this in the log:

18.72.0.3 - - [01/Jan/1904:15:34:11 -0500] "GET /porn/goats/hotgoat07.html HTTP/1.0" 200 8417 "http://www.google.com/search?hl=en&q=jesse+kornblum" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

In this record, we see somebody from the IP address 18.72.0.3 used Google to search for "jesse kornblum" at 3:34pm on 1 Jan 1904. The logging format, although complete, can be a little tedious to read by hand. Thus I wrote a quick and dirty script to parse out the search terms.
#!/usr/bin/perl
use strict;
use warnings;

while (<>)
{
    # Make sure it's a search record from a person, not Google's crawler
    next unless /google/i && /search/i && /q=/;
    next if /Googlebot/;

    chomp;

    # Keep only the value of the q= parameter, minus any leading %22 (quote).
    # Skip the line if the pattern doesn't fit.
    s/^.+[&?]q=(?:%22)*([%|,().A-Za-z0-9+_-]+)[&"%].*$/$1/ or next;

    # Fix spaces
    s/\+/ /g;

    # Translate escaped values back to normal
    s/%([a-fA-F0-9]{2})/chr(hex($1))/eg;

    print "$_\n";
}
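If you want more than the search terms, the log record shown earlier can be split into named fields first. This is only a sketch: it assumes every line looks like the sample record (the layout Apache calls the combined log format), and the helper parse_log_line and its field names are made up for illustration. Lines with escaped quotes inside the request or user agent would need a more careful pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: split one log line into named fields.
# Returns an empty list when the line does not look like the sample record.
sub parse_log_line {
    my ($line) = @_;

    my @fields = $line =~ m{
        ^(\S+) \s+ \S+ \s+ \S+ \s+    # client IP, then two fields logged as -
        \[([^\]]+)\] \s+              # timestamp in square brackets
        "([^"]*)" \s+                 # request line
        (\d{3}) \s+ (\S+) \s+         # status code and response size
        "([^"]*)" \s+                 # referring URL
        "([^"]*)"                     # user agent string
    }x;
    return unless @fields;

    my %record;
    @record{qw(ip time request status bytes referrer agent)} = @fields;
    return %record;
}

my %hit = parse_log_line(
    '18.72.0.3 - - [01/Jan/1904:15:34:11 -0500] '
  . '"GET /porn/goats/hotgoat07.html HTTP/1.0" 200 8417 '
  . '"http://www.google.com/search?hl=en&q=jesse+kornblum" '
  . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
);
print "$hit{ip} came from $hit{referrer}\n" if %hit;
```

With the referring URL isolated in its own field, the search-term extraction only has to look at that one string instead of the whole line.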


Most of the searches involved my name, tools I had written, presentations I had given, or something else that made sense. But along with those have been some really interesting queries:

Zoey the dog
zoey naked
naked zoey
luke for zoey
WHY DOGS HIDE UNDER THE BED
caninity
a bigger dog
Zoey ER
tennis zoey
play bows
zoey 18
7508736 BIOS
3efea3144abee232fda1719d2c1a4066
under the clothes
18 inches
goat farm
cow and goat
your a goat
farm cow
koala gif
out of time the horse
horse run
goat milks
horse cow
in peril


[1] This is the web site for the Служба внешней разведки (Sluzhba Vneshney Razvedki), or the Russian Foreign Intelligence Service.

[2] My proxy server, Privoxy, helps me by always sending the referring URL as belonging to the new site I'm visiting. For example, if I'm visiting http://www.whitehouse.gov/pages/context04.html and click on the link for http://svr.gov.ru/, my computer sends the referring URL as http://svr.gov.ru/.

Comments:
From: illix
2007-02-24 06:42 pm (UTC)
Your goat-seeker just had to come from 18.0.0.0/8, hmm? :)
From: illix
2007-02-24 06:45 pm (UTC)
18.72.0.3...that's the W20 Athena cluster, no?
From: jessekornblum
2007-02-24 11:08 pm (UTC)
$ whois mit.edu

[...]

Name Servers:
STRAWB.MIT.EDU 18.71.0.151
W20NS.MIT.EDU 18.70.0.160
BITSY.MIT.EDU 18.72.0.3