Finding Things
By: O'Reilly Media
    2008-07-10


    Table of Contents:
  • Finding Things
  • Regular Expressions
  • Putting Regular Expressions to Work
  • Content-Addressable Storage
  • Time to Optimize?


    Search, whether of the web or of the contents of your own computer, presents the developer with a major challenge. This article, the first of two parts, provides an overview of several search techniques and the trade-offs that go with them. It is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, edited by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

    Computers can compute, but that's not what people use them for, mostly. Mostly, computers store and retrieve information. Retrieve implies find, and in the time since the advent of the Web, search has become a dominant application for people using computers.

    As data volumes continue to grow—both absolutely, and relative to the number of people or computers or anything, really—search becomes an increasingly large part of the life of the programmer as well. A few applications lack the need to locate the right morsel in some information store, but very few.

    The subject of search is one of the largest in computer science, and thus I won’t try to survey all of it or discuss the mechanics; in fact, I’ll only consider one simple search technique in depth. Instead, I’ll focus on the trade-offs that go into selecting search techniques, which can be subtle.

    On Time

    You really can’t talk about search without talking about time. There are two different flavors of time that apply to problems of search. The first is the time it takes the search to run, which is experienced by the user who may well be staring at a message saying something like “Loading...”. The second is the time invested by the programmer who builds the search function, and by the programmer’s management and customers waiting to use the program.

    Problem: Weblog Data

    Let’s look at a sample problem to get a feel for how a search works in real life. I have a directory containing logfiles from my weblog (http://www.tbray.org/ongoing) from early 2003 to late 2006; as of the writing of this chapter, they recorded 140,070,104 transactions and occupied 28,489,788,532 bytes (uncompressed). All these statistics, properly searched, can answer lots of questions about my traffic and readership.

    Let’s look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.
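    The three steps above can be sketched in a few lines of Python. This is only an illustration, not the code the chapter goes on to develop: the function name `count_article_fetches` is hypothetical, the `/ongoing/When/` prefix is taken from the sample log line shown in this article, and the prefix test is deliberately crude (it would also count anything else stored under that path).

```python
from collections import Counter

# Assumption: article paths start with this prefix, as in the sample log line.
ARTICLE_PREFIX = "/ongoing/When/"


def count_article_fetches(lines):
    """Return a Counter mapping each article path to its number of fetches."""
    counts = Counter()
    for line in lines:
        # Step 1: find lines that record a GET request.
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field on this line
        request = parts[1].split()  # e.g. ['GET', '/ongoing/...', 'HTTP/1.1']
        if len(request) < 2 or request[0] != "GET":
            continue
        # Step 2: find the name of the resource that was fetched.
        path = request[1]
        if not path.startswith(ARTICLE_PREFIX):
            continue
        # Step 3: keep track, for each article, of how often it was fetched.
        counts[path] += 1
    return counts
```

    Calling `count_article_fetches(open(name))` on a logfile and then `.most_common(10)` on the result would list the ten most-fetched paths. Note that the search in steps 1 and 2 is done here with plain string operations, which is exactly the part that becomes tedious as the pattern gets more specific.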

    Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:

      c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] "GET /ongoing/When/
      200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 http://www.tbray.org/ongoing/
      Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

    Reading from left to right, this tells us that:

      Somebody from an organization named chello in Sweden,
      who provided neither a username nor a password,
      contacted my weblog early in the morning of October 8, 2006 (my server’s time zone
      is seven hours off Greenwich),
      and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons
      using the HTTP 1.1 protocol;
      the request was successful and returned 5,945 bytes;
      the visitor had been referred from my blog’s home page,
      and was using Internet Explorer 6 running on Windows XP.
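    The left-to-right reading above can be reproduced mechanically. The following is a minimal sketch that splits the sample line into named fields; the function name `parse_log_line` and the dictionary keys are hypothetical, and the code assumes the field order shown in the sample line (it tolerates the referrer and user agent appearing with or without surrounding double quotes).

```python
from datetime import datetime


def parse_log_line(line):
    """Split one access-log line into a dict of named fields."""
    # The request is the first double-quoted field on the line.
    before, request, after = line.split('"', 2)

    # Before the request: host, identity, username, and [timestamp].
    host, ident, user, stamp = before.split(" ", 3)
    when = datetime.strptime(stamp.strip().strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

    # The request itself: method, resource, and protocol.
    method, path, protocol = request.split()

    # After the request: status, byte count, referrer, and user agent.
    status, size, referrer, agent = after.split(None, 3)

    return {
        "host": host,
        "user": None if user == "-" else user,  # "-" means none was supplied
        "time": when,                            # timezone-aware datetime
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(status),
        "bytes": int(size),
        "referrer": referrer.strip('"'),
        "agent": agent.strip().strip('"'),
    }
```

    Applied to the sample line, the `time` field carries a UTC offset of minus seven hours, which is the "seven hours off Greenwich" mentioned above, and `path` holds the resource name `/ongoing/When/200x/2006/10/08/Grief-Lessons`.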

    This is an example of the kind of line I want: one that records the actual fetch of an article. There are lots of other lines that record fetching stylesheets, scripts, pictures, and so on, and attacks by malicious users. You can spot the kind of line I want by the fact that the article’s name starts with /ongoing/When/ and continues with elements for the decade, year, month, and day.

    Our first step, then, should be to find lines that contain something like:

      /ongoing/When/200x/2006/10/08/

    Whatever language you’re programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.
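    To give a feel for the second option, here is a minimal sketch of such a pattern in Python. The expression is an assumption written for this illustration, based only on the description above (a decade such as 200x, then a four-digit year, a two-digit month, and a two-digit day); it is not necessarily the expression the author uses, and the names `ARTICLE_DIR` and `looks_like_article_fetch` are hypothetical.

```python
import re

# Decade ("200x"), four-digit year, two-digit month, two-digit day.
ARTICLE_DIR = re.compile(r"/ongoing/When/\d{3}x/\d{4}/\d{2}/\d{2}/")


def looks_like_article_fetch(line):
    """Return True if the line contains a dated article path."""
    return ARTICLE_DIR.search(line) is not None
```

    One line of pattern replaces what would otherwise be a loop checking each character position by hand. As written, the sketch searches the whole line, so a dated path appearing in the referrer field would also match; anchoring the pattern to the request would tighten it.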



     
     
    © 2003-2009 by Developer Shed. All rights reserved.