Some look at my binary-search algorithm and ask why the loop always runs to the end without checking whether it’s found the target. In fact, this is the correct behavior; the math is beyond the scope of this chapter, but with a little work, you should be able to get an intuitive feeling for it—and this is the kind of intuition I’ve observed in some of the great programmers I’ve worked with.
Let’s think about the progress of the loop. Suppose you have n elements in the array, where n is some really large number. The chance of finding the target the first time through is 1/n, a really small number. The next iteration (after you divide the search set in half) is 1/(n/2)—still small—and so on. In fact, the chance of hitting the target becomes significant only when you’re down to 10 or 20 elements, which is to say maybe the last four times through the loop. And in the case where the search fails (which is common in many applications), those extra tests are pure overhead.
You could do the math to figure out when the probability of hitting the target approaches 50 percent, but qualitatively, ask yourself: does it make sense to add extra complexity to each step of anO(log2 N)algorithm when the chances are it will save only a small number of steps at the end?
The take-away lesson is that binary search, done properly, is a two-step process. First, write an efficient loop that positions yourlowandhighbounds properly, then add a simple check to see whether you hit or missed.
Search in the Large
When most people think of search they think of web search, as offered by Yahoo!, Google, and their competitors. While ubiquitous web search is a new thing, the discipline of full-text search upon which it is based is not. Most of the seminal papers were written by Gerald Salton at Cornell as far back as the early 1960s. The basic techniques for indexing and searching large volumes of text have not changed dramatically since then. What has changed is how result ranking is done.*
Searching with Postings
The standard approach to full-text search is based on the notion of a posting, which is a small, fixed-size record. To build an index, you read all the documents and, for each word, create a posting that says word x appears in document y at position z. Then you sort all the words together, so that for each unique word you have a list of postings, each a pair of numbers consisting of a document ID and the text’s offset in that document.
Because postings are small and fixed in size, and because you tend to have a huge number of them, a natural approach is to use binary search. I have no idea of the details of how Google or Yahoo! do things, but I’d be really unsurprised to hear that those tens of thousands of computers spend a whole lot of their time binary-searching big arrays of postings.
People who are knowledgeable about search shared a collective snicker a few years ago when the number of documents Google advertised as searching, after having been stuck at two billion and change for some years, suddenly became much larger and then kept growing. Presumably they had switched the document ID in all those postings from 32-bit to 64-bit numbers.
Ranking Results
Given a word, searching a list of postings to figure out which documents contain it is not rocket science. A little thought shows that combining the lists to do AND and OR queries and phrase search is also simple, conceptually at least. What’s hard is sorting the result list so that the good results show up near the top. Computer science has a subdiscipline called Information Retrieval (IR for short) that focuses almost entirely on this problem. Historically, the results had been very poor, up until recently.