More Techniques for Finding Things
By: O'Reilly Media
    2008-07-17


    Table of Contents:
  • More Techniques for Finding Things
  • Binary Search
  • Binary Search Trade-offs
  • Escaping the Loop
  • Searching the Web



    More Techniques for Finding Things - Searching the Web
    ( Page 5 of 5 )

    Google and its competitors have been able to produce good results in the face of unimaginably huge data sets and populations of users. When I say “good,” I mean that high-quality results appear near the top of the result list, and that the result list appears quickly.

    High-quality results rise to the top because of many factors, the most notable of which is what Google calls PageRank, based largely on link counting: pages with lots of hyperlinks pointing at them are deemed to be more popular and thus, by popular vote, winners.
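The link-counting idea can be sketched in a few lines of Python. This is only an illustration of counting inbound links, not Google's actual PageRank computation; the function names and the shape of the `links` dictionary are assumptions made for the example.

```python
from collections import Counter


def count_inbound_links(links):
    """Count how many distinct pages link to each page.

    `links` maps a page name to the pages it links to
    (a hypothetical in-memory stand-in for a crawled link graph).
    """
    inbound = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # a source votes at most once per target
            if target != source:      # self-links do not count as votes
                inbound[target] += 1
    return inbound


def rank_by_popularity(links):
    """Return pages ordered by inbound-link count, ties broken by name."""
    inbound = count_inbound_links(links)
    return sorted(inbound, key=lambda page: (-inbound[page], page))
```

A page that many other pages point at sorts to the front of the list; a page nobody links to sorts to the back.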

    In practice, this seems to work well. A couple of interesting observations follow. First, until the rise of PageRank, the leaders in the search-engine space were offerings such as Yahoo! and DMoz, which worked by categorizing results; so, the evidence seems to suggest that it’s more useful to know how popular something is than to know what it’s about.

    Second, PageRank is applicable only to document collections that are richly populated with links back and forth between the documents. At the moment, two document collections qualify: the World Wide Web and the corpus of peer-reviewed academic publications (which have applied PageRank-like methods for decades).

    The ability of large search engines to scale up with the size of data and number of users has been impressive. It is based on the massive application of parallelism: attacking big problems with large numbers of small computers, rather than a few big ones. One of the nice things about postings is that each posting is independent of all the others, so they naturally lend themselves to parallel approaches.

    For example, an index based on doing binary search in arrays of postings is fairly straightforward to partition. In an index containing only English words, you could easily create 26 partitions (the term used in the industry is shards), one for words beginning with each letter, and then make as many copies of each shard as you need. A huge volume of word-search queries can then be farmed out across an arbitrarily large collection of cooperating search nodes.
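A minimal sketch of the 26-shard layout, assuming plain ASCII English words and an in-memory index. The `Shard` class, `build_index`, and `search` are hypothetical names invented for this example; replication of shards across machines is not modeled, and each shard is simply an object that could live on its own node.

```python
import bisect
import string


class Shard:
    """One partition: a sorted array of words with parallel postings lists."""

    def __init__(self):
        self.words = []      # sorted, so lookups can use binary search
        self.postings = []   # postings[i] holds sorted document ids for words[i]

    def add(self, word, doc_id):
        i = bisect.bisect_left(self.words, word)
        if i == len(self.words) or self.words[i] != word:
            self.words.insert(i, word)
            self.postings.insert(i, [])
        docs = self.postings[i]
        j = bisect.bisect_left(docs, doc_id)
        if j == len(docs) or docs[j] != doc_id:
            docs.insert(j, doc_id)

    def lookup(self, word):
        # Binary search within this shard only.
        i = bisect.bisect_left(self.words, word)
        if i < len(self.words) and self.words[i] == word:
            return list(self.postings[i])
        return []


def build_index(documents):
    """Route every word to the shard named after its first letter."""
    shards = {letter: Shard() for letter in string.ascii_lowercase}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word[0] in shards:   # skip tokens that do not start with a-z
                shards[word[0]].add(word, doc_id)
    return shards


def search(shards, word):
    """Send a one-word query to the single shard that can answer it."""
    word = word.lower()
    shard = shards.get(word[:1])
    return shard.lookup(word) if shard else []
```

Because a query touches exactly one shard, two queries for words with different first letters never contend for the same data, which is what makes farming them out to separate nodes straightforward.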

    This leaves the problem of combining search results for multiword or phrase searches, and this requires some real innovation, but it’s easy to see how the basic word-search function could be parallelized.
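The simplest form of combining, an all-words query, can be sketched as an intersection of sorted postings lists. This is an assumption-laden illustration: it treats a posting as a bare document id, it does not handle phrase order or ranking, and `lookup` is a hypothetical callable standing in for whatever returns one word's postings.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def search_all_words(lookup, words):
    """Return ids of documents that contain every word in `words`."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((lookup(word) for word in words), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect_sorted(result, postings)
        if not result:
            break   # no document can match once the intersection is empty
    return result
```

Each per-word lookup can still run on its own node; only the final merge needs to see more than one word's results.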

    This discussion is a little unfair in that it glosses over a huge number of important issues, notably including fighting the Internet miscreants who continually try to outsmart search-engine algorithms for commercial gain.

    Conclusion

    It is hard to imagine any computer application that does not involve storing data and finding it based on its content. The world’s single most popular computer application, web search, is a notable example.

    This chapter has considered some of the issues, while deliberately bypassing the traditional “database” domain and the whole world of search strategies that involve external storage. Whether operating at the level of a single line of text or billions of web documents, search is central. From the programmer’s point of view, it also needs to be said that implementing searches of one kind or another is, among other things, fun.


    * People who have used regular expressions know that a period is a placeholder for “any character,” but it’s harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.
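The difference is easy to check with Python's `re` module. The helper name `matches` is made up for this illustration.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# An unbracketed period stands for any single character.
any_char = matches(r"a.c", "abc")

# Inside square brackets the period is literal, so only "a.c" matches.
literal_only = matches(r"a[.]c", "abc")
literal_dot = matches(r"a[.]c", "a.c")
```

Here `any_char` and `literal_dot` are true, while `literal_only` is false.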

    * This discussion of binary search borrows heavily from my 2003 piece, “On the Goodness of Binary Search,” available online at http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary.

    * This discussion of full-text search borrows heavily from my 2003 series, On Search, available online at http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC. The series covers the topic of search quite broadly, including issues of user experience, quality control, natural language processing, intelligence, internationalization, and so on.



     
     