Home arrow Practices arrow Page 4 - More Techniques for Finding Things

Escaping the Loop - Practices

In this second part of a two-part series that provides an overview of search techniques for the developer, you'll learn more about the challenges and trade-offs of various approaches. It is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

TABLE OF CONTENTS:
  1. More Techniques for Finding Things
  2. Binary Search
  3. Binary Search Trade-offs
  4. Escaping the Loop
  5. Searching the Web
By: O'Reilly Media
Rating: starstarstarstarstar / 4
July 17, 2008

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement

Some look at my binary-search algorithm and ask why the loop always runs to the end without checking whether itís found the target. In fact, this is the correct behavior; the math is beyond the scope of this chapter, but with a little work, you should be able to get an intuitive feeling for itóand this is the kind of intuition Iíve observed in some of the great programmers Iíve worked with.

Letís think about the progress of the loop. Suppose you have n elements in the array, where n is some really large number. The chance of finding the target the first time through is 1/n, a really small number. The next iteration (after you divide the search set in half) is 1/(n/2)óstill smallóand so on. In fact, the chance of hitting the target becomes significant only when youíre down to 10 or 20 elements, which is to say maybe the last four times through the loop. And in the case where the search fails (which is common in many applications), those extra tests are pure overhead.

You could do the math to figure out when the probability of hitting the target approaches 50 percent, but qualitatively, ask yourself: does it make sense to add extra complexity to each step of anO(log2  N)algorithm when the chances are it will save only a small number of steps at the end?

The take-away lesson is that binary search, done properly, is a two-step process. First, write an efficient loop that positions yourlowandhighbounds properly, then add a simple check to see whether you hit or missed.

Search in the Large

When most people think of search they think of web search, as offered by Yahoo!, Google, and their competitors. While ubiquitous web search is a new thing, the discipline of full-text search upon which it is based is not. Most of the seminal papers were written by Gerald Salton at Cornell as far back as the early 1960s. The basic techniques for indexing and searching large volumes of text have not changed dramatically since then. What has changed is how result ranking is done.*

Searching with Postings

The standard approach to full-text search is based on the notion of a posting, which is a small, fixed-size record. To build an index, you read all the documents and, for each word, create a posting that says word x appears in document y at position z. Then you sort all the words together, so that for each unique word you have a list of postings, each a pair of numbers consisting of a document ID and the textís offset in that document.

Because postings are small and fixed in size, and because you tend to have a huge number of them, a natural approach is to use binary search. I have no idea of the details of how Google or Yahoo! do things, but Iíd be really unsurprised to hear that those tens of thousands of computers spend a whole lot of their time binary-searching big arrays of postings.

People who are knowledgeable about search shared a collective snicker a few years ago when the number of documents Google advertised as searching, after having been stuck at two billion and change for some years, suddenly became much larger and then kept growing. Presumably they had switched the document ID in all those postings from 32-bit to 64-bit numbers.

Ranking Results

Given a word, searching a list of postings to figure out which documents contain it is not rocket science. A little thought shows that combining the lists to do AND and OR queries and phrase search is also simple, conceptually at least. Whatís hard is sorting the result list so that the good results show up near the top. Computer science has a subdiscipline called Information Retrieval (IR for short) that focuses almost entirely on this problem. Historically, the results had been very poor, up until recently.



 
 
>>> More Practices Articles          >>> More By O'Reilly Media
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

PRACTICES ARTICLES

- Calculating Development Project Costs
- More Techniques for Finding Things
- Finding Things
- Finishing the System`s Outlines
- The System in So Many Words
- Basic Data Types and Calculations
- What`s the Address? Pointers
- Design with ArgoUML
- Pragmatic Guidelines: Diagrams That Work
- Five-Step UML: OOAD for Short Attention Span...
- Five-Step UML: OOAD for Short Attention Span...
- Introducing UML: Object-Oriented Analysis an...
- Class and Object Diagrams
- Class Relationships
- Classes

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: