Google and its competitors have been able to produce good results in the face of unimaginably huge data sets and populations of users. When I say “good,” I mean that high-quality results appear near the top of the result list, and that the result list appears quickly.
The promotion of high-quality results is a result of many factors, the most notable of which is what Google calls PageRank, based largely on link counting: pages with lots of hyperlinks pointing at them are deemed to be more popular and thus, by popular vote, winners.
In practice, this seems to work well. A couple of interesting observations follow. First, until the rise of PageRank, the leaders in the search-engine space were offerings such as Yahoo! and DMoz, which worked by categorizing results; so, the evidence seems to suggest that it’s more useful to know how popular something is than to know what it’s about.
Second, PageRank is applicable only to document collections that are richly populated with links back and forth between the documents. At the moment, two document collections qualify: the World Wide Web and the corpus of peer-reviewed academic publications (which have applied PageRank-like methods for decades).
The ability of large search engines to scale up with the size of data and number of users has been impressive. It is based on the massive application of parallelism: attacking big problems with large numbers of small computers, rather than a few big ones. One of the nice things about postings is that each posting is independent of all the others, so they naturally lend themselves to parallel approaches.
For example, an index based on doing binary search in arrays of postings is fairly straightforward to partition. In an index containing only English words, you could easily create 26 partitions (the term used in the industry is shards), one for words beginning with each letter. Then you can make as many copies as you need of each shard. Then, a huge volume of word-search queries can be farmed out across an arbitrarily large collection of cooperating search nodes.
This leaves the problem of combining search results for multiword or phrase searches, and this requires some real innovation, but it’s easy to see how the basic word-search function could be parallelized.
This discussion is a little unfair in that it glosses over a huge number of important issues, notably including fighting the Internet miscreants who continually try to outsmart search-engine algorithms for commercial gain.
It is hard to imagine any computer application that does not involve storing data and finding it based on its content. The world’s single most popular computer application, web search, is a notable example.
This chapter has considered some of the issues, notably bypassing the traditional “database” domain and the world of search strategies that involve external storage. Whether operating at the level of a single line of text or billions of web documents, search is central. From the programmer’s point of view, it also needs to be said that implementing searches of one kind or another is, among other things, fun.
* People who have used regular expressions know that a period is a placeholder for “any character,” but it’s harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.
* This discussion of binary search borrows heavily from my 2003 piece, “On the Goodness of Binary Search,” available online at http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary.
* This discussion of full-text search borrows heavily from my 2003 series, On Search, available online at http://www.tbray.org/ongoing/When/200x/2003/ 07/30/OnSearchTOC. The series covers the topic of search quite broadly, including issues of user experience, quality control, natural language processing, intelligence, internationalization, and so on.