Practices
  Home arrow Practices arrow Page 5 - More Techniques for Finding Things
Dev Shed Forums 
Administration  
AJAX  
Apache  
BrainDump  
DHTML  
Flash  
Java  
JavaScript  
Multimedia  
MySQL  
Oracle  
Perl  
PHP  
Practices  
Python  
Reviews  
Security  
Style-Sheets  
Web Services  
XML  
Zend  
Zope  
Forums Sitemap 
IBM® developerWorks 
Sun Developer Network 
E-Commerce Hosting 
Linux Web Hosting 
Managed Hosting 
Small Business Hosting 
Mobile Linux 
App Generation ROI 
VPS Hosting 
Weekly Newsletter

 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
PRACTICES

More Techniques for Finding Things
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 3
    2008-07-17

    Table of Contents:
  • More Techniques for Finding Things
  • Binary Search
  • Binary Search Trade-offs
  • Escaping the Loop
  • Searching the Web

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    More Techniques for Finding Things - Searching the Web


    (Page 5 of 5 )

    Google and its competitors have been able to produce good results in the face of unimaginably huge data sets and populations of users. When I say “good,” I mean that high-quality results appear near the top of the result list, and that the result list appears quickly.

    The promotion of high-quality results is a result of many factors, the most notable of which is what Google calls PageRank, based largely on link counting: pages with lots of hyperlinks pointing at them are deemed to be more popular and thus, by popular vote, winners.

    In practice, this seems to work well. A couple of interesting observations follow. First, until the rise of PageRank, the leaders in the search-engine space were offerings such as Yahoo! and DMoz, which worked by categorizing results; so, the evidence seems to suggest that it’s more useful to know how popular something is than to know what it’s about.

    Second, PageRank is applicable only to document collections that are richly populated with links back and forth between the documents. At the moment, two document collections qualify: the World Wide Web and the corpus of peer-reviewed academic publications (which have applied PageRank-like methods for decades).

    The ability of large search engines to scale up with the size of data and number of users has been impressive. It is based on the massive application of parallelism: attacking big problems with large numbers of small computers, rather than a few big ones. One of the nice things about postings is that each posting is independent of all the others, so they naturally lend themselves to parallel approaches.

    For example, an index based on doing binary search in arrays of postings is fairly straightforward to partition. In an index containing only English words, you could easily create 26 partitions (the term used in the industry is shards), one for words beginning with each letter. Then you can make as many copies as you need of each shard. Then, a huge volume of word-search queries can be farmed out across an arbitrarily large collection of cooperating search nodes.

    This leaves the problem of combining search results for multiword or phrase searches, and this requires some real innovation, but it’s easy to see how the basic word-search function could be parallelized.

    This discussion is a little unfair in that it glosses over a huge number of important issues, notably including fighting the Internet miscreants who continually try to outsmart search-engine algorithms for commercial gain.

    Conclusion

    It is hard to imagine any computer application that does not involve storing data and finding it based on its content. The world’s single most popular computer application, web search, is a notable example.

    This chapter has considered some of the issues, notably bypassing the traditional “database” domain and the world of search strategies that involve external storage. Whether operating at the level of a single line of text or billions of web documents, search is central. From the programmer’s point of view, it also needs to be said that implementing searches of one kind or another is, among other things, fun.


    * People who have used regular expressions know that a period is a placeholder for “any character,” but it’s harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.

    * This discussion of binary search borrows heavily from my 2003 piece, “On the Goodness of Binary Search,” available online at http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary.

    * This discussion of full-text search borrows heavily from my 2003 series, On Search, available online at http://www.tbray.org/ongoing/When/200x/2003/ 07/30/OnSearchTOC. The series covers the topic of search quite broadly, including issues of user experience, quality control, natural language processing, intelligence, internationalization, and so on.  


    DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.

       · This article is an excerpt from the book "Beautiful Code: Leading Programmers...
     

    Buy this book now. This article is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Check it out today at your favorite bookstore. Buy this book now.

       

    PRACTICES ARTICLES

    - More Techniques for Finding Things
    - Finding Things
    - Finishing the System`s Outlines
    - The System in So Many Words
    - Basic Data Types and Calculations
    - What`s the Address? Pointers
    - Design with ArgoUML
    - Pragmatic Guidelines: Diagrams That Work
    - Five-Step UML: OOAD for Short Attention Span...
    - Five-Step UML: OOAD for Short Attention Span...
    - Introducing UML: Object-Oriented Analysis an...
    - Class and Object Diagrams
    - Class Relationships
    - Classes
    - Basic Ideas

     
    Application Delivery: Everything You Wanted to Know, but Didn`t Know You Needed to Ask
    A comprehensive guide to examining the topics of Wide-area Data Services and app....

     
    Best Practices: Safe and Secure Hardware Asset Recovery
    Companies increasingly must meet EPA and local requirements for the disposal of ....

     
    Managing SSL Security in Multi-Server Environments
    Read this white paper to learn how to simplify management of your organization's....

     
    Open Source Security Myths
    Open Source Software (OSS) is computer software whose source code is available t....

     
    Power and Cooling Capacity Management for Data Centers
    This paper describes the principles for achieving power and cooling capacity man....

     




    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 1 hosted by Hostway
    Stay green...Green IT