Finding Things
(Page 1 of 5 )
Search, whether it's searching the web or the contents of your computer, presents the developer with a major challenge. This article, the first of two parts, provides an overview of several search techniques, and the trade-offs that go with them. It is excerpted from chapter four of
Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.
COMPUTERS CAN COMPUTE, BUT THAT'S NOT WHAT PEOPLE USE THEM FOR, MOSTLY. Mostly, computers store and retrieve information. Retrieve implies find, and in the time since the advent of the Web, search has become a dominant application for people using computers.
As data volumes continue to grow—both absolutely, and relative to the number of people or computers or anything, really—search becomes an increasingly large part of the life of the programmer as well. A few applications lack the need to locate the right morsel in some information store, but very few.
The subject of search is one of the largest in computer science, and thus I won’t try to survey all of it or discuss the mechanics; in fact, I’ll only consider one simple search technique in depth. Instead, I’ll focus on the trade-offs that go into selecting search techniques, which can be subtle.
On Time You really can’t talk about search without talking about time. There are two different flavors of time that apply to problems of search. The first is the time it takes the search to run, which is experienced by the user who may well be staring at a message saying something like “Loading...”. The second is the time invested by the programmer who builds the search function, and by the programmer’s management and customers waiting to use the program.
Problem: Weblog Data Let’s look at a sample problem to get a feel for how a search works in real life. I have a directory containing logfiles from my weblog (http://www.tbray.org/ongoing) from early 2003 to late 2006; as of the writing of this chapter, they recorded 140,070,104 transactions and occupied 28,489,788,532 bytes (uncompressed). All these statistics, properly searched, can answer lots of questions about my traffic and readership.
Let’s look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.
Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:
c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] "GET /ongoing/When/
200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 http://www.tbray.org/ongoing/
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Reading from left to right, this tells us that:
Somebody from an organization namedchelloin Sweden,
who provided neither a username nor a password,
contacted my weblog early in the morning of October 8, 2006 (my server’s time zone
is seven hours off Greenwich),
and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons
using the HTTP 1.1 protocol;
the request was successful and returned 5,945 bytes;
the visitor had been referred from my blog’s home page,
and was using Internet Explorer 6 running on Windows XP.
This is an example of the kind of line I want: one that records the actual fetch of an article. There are lots of other lines that record fetching stylesheets, scripts, pictures, and so on, and attacks by malicious users. You can spot the kind of line I want by the fact that the article’s name starts with/ongoing/When/and continues with elements for the decade, year, month, and day.
Our first step, then, should be to find lines that contain something like:
/ongoing/When/200x/2006/10/08/
Whatever language you’re programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.
Next: Regular Expressions >>
More Practices Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Check it out today at your favorite bookstore. Buy this book now.
|
|