COMPUTERS CAN COMPUTE, BUT THAT'S NOT WHAT PEOPLE USE THEM FOR, MOSTLY. Mostly, computers store and retrieve information. Retrieve implies find, and in the time since the advent of the Web, search has become a dominant application for people using computers.
As data volumes continue to grow—both absolutely, and relative to the number of people or computers or anything, really—search becomes an increasingly large part of the life of the programmer as well. A few applications lack the need to locate the right morsel in some information store, but very few.
The subject of search is one of the largest in computer science, and thus I won’t try to survey all of it or discuss the mechanics; in fact, I’ll only consider one simple search technique in depth. Instead, I’ll focus on the trade-offs that go into selecting search techniques, which can be subtle.
You really can’t talk about search without talking about time. There are two different flavors of time that apply to problems of search. The first is the time it takes the search to run, which is experienced by the user who may well be staring at a message saying something like “Loading...”. The second is the time invested by the programmer who builds the search function, and by the programmer’s management and customers waiting to use the program.
Problem: Weblog Data
Let’s look at a sample problem to get a feel for how a search works in real life. I have a directory containing logfiles from my weblog (http://www.tbray.org/ongoing) from early 2003 to late 2006; as of the writing of this chapter, they recorded 140,070,104 transactions and occupied 28,489,788,532 bytes (uncompressed). All these statistics, properly searched, can answer lots of questions about my traffic and readership.
Let’s look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.
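The three steps can be sketched in a few lines of Python (a language chosen here purely for illustration). This is a minimal sketch, not the chapter’s eventual implementation: it assumes Apache-style logfile lines in which the requested path appears inside the quoted request string, and it hard-codes a pattern for article paths of the shape described later in this chapter.

```python
import re
from collections import Counter

# Assumed pattern for an article fetch: the path starts with /ongoing/When/,
# followed by decade (e.g. 200x), year, month, day, and a title with no dots.
ARTICLE = re.compile(r'"GET (/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+) ')

def top_articles(lines, n=10):
    counts = Counter()
    for line in lines:               # step 1: search through the logfile lines
        m = ARTICLE.search(line)     # step 2: find the article name, if any
        if m:
            counts[m.group(1)] += 1  # step 3: keep a per-article tally
    return counts.most_common(n)     # the n most-fetched articles
```

Fed the lines of the logfiles one at a time, this returns (article, count) pairs sorted by popularity; a hash table keyed by article name does almost all the work.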
Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:
Somebody from an organization named chello in Sweden, who provided neither a username nor a password, contacted my weblog early in the morning of October 8, 2006 (my server’s time zone is seven hours off Greenwich), and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons using the HTTP 1.1 protocol; the request was successful and returned 5,945 bytes; the visitor had been referred from my blog’s home page, and was using Internet Explorer 6 running on Windows XP.
This is an example of the kind of line I want: one that records the actual fetch of an article. There are lots of other lines that record fetching stylesheets, scripts, pictures, and so on, and attacks by malicious users. You can spot the kind of line I want by the fact that the article’s name starts with /ongoing/When/ and continues with elements for the decade, year, month, and day.
Our first step, then, should be to find lines that contain something like:
Whatever language you’re programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.