PRACTICES

More Techniques for Finding Things
By: O'Reilly Media
    2008-07-17


    Table of Contents:
  • More Techniques for Finding Things
  • Binary Search
  • Binary Search Trade-offs
  • Escaping the Loop
  • Searching the Web



    More Techniques for Finding Things
    ( Page 1 of 5 )

    In this second part of a two-part series that provides an overview of search techniques for the developer, you'll learn more about the challenges and trade-offs of various approaches. It is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, edited by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

    Problem: Who Fetched What, When? 

    Running a couple of quick scripts over the logfile data reveals that there are 12,600,064 instances of an article fetch coming from 2,345,571 different hosts. Suppose we are interested in who was fetching what, and when. An auditor, a police officer, or a marketing professional might want to know.

    So, here’s the problem: given a hostname, report what articles were fetched from that host, and when. The result is a list; if the list is empty, no articles were fetched.

    We’ve already seen that a language’s built-in hash or equivalent data structure gives the programmer a quick and easy way to store and look up key/value pairs. So, you might ask, why not use it?

    That’s an excellent question, and we should give the idea a try. There are reasons to worry that it might not work very well, so in the back of our minds, we should be thinking of a Plan B. As you may recall if you’ve ever studied hash tables, in order to go fast, they need to have a small load factor; in other words, they need to be mostly empty. However, a hash table that holds 2.35 million entries and is still mostly empty is going to require the use of a whole lot of memory.

    To simplify things, I wrote a program that ran over all the logfiles and pulled out all the article fetches into a simple file; each line has the hostname, the time of the transaction, and the article name. Here are the first few lines:

      crawl-66-249-72-77.googlebot.com 1166406026 2003/04/08/Riffs
      egspd42470.ask.com 1166406027 2006/05/03/MARS-T-Shirt
      84.7.249.205 1166406040 2003/03/27/Scanner

    (The second field, the 10-digit number, is the standard Unix/Linux representation of time as the number of seconds since the beginning of 1970.)
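As a minimal sketch of how one such record breaks down, the snippet below splits a line into its three fields and converts the timestamp with Ruby's `Time.at`. The `parse_fetch` helper is a name assumed for illustration; it is not part of the original program.

```ruby
# Hypothetical helper: split one log record into host, time, and article.
def parse_fetch(line)
  host, seconds, article = line.split
  # Time.at treats an integer as seconds since 1970-01-01 00:00:00 UTC.
  [host, Time.at(Integer(seconds)).utc, article]
end

host, time, article = parse_fetch("84.7.249.205 1166406040 2003/03/27/Scanner")
puts "#{host} fetched #{article} at #{time}"
```

For the sample line above, the timestamp falls on 18 December 2006 (UTC).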

    Then I wrote a simple program to read this file and load a great big hash. Example 4-5 shows the program.

    EXAMPLE 4-5. Loading a big hash

    1 class BigHash
    2
    3   def initialize(file)
    4     @hash = {}
    5     lines = 0
    6     File.open(file).each_line do |line|
    7       s = line.split
    8       article = s[2].intern
    9       if @hash[s[0]]
    10        @hash[s[0]] << [ s[1], article ]
    11      else
    12        @hash[s[0]] = [ [ s[1], article ] ]  # first fetch: start a list of [time, article] pairs
    13      end
    14      lines += 1
    15      STDERR.puts "Line: #{lines}" if (lines % 100000) == 0
    16    end
    17  end
    18
    19  def find(key)
    20    @hash[key]
    21  end
    22
    23 end

    The program should be fairly self-explanatory, but line 15 is worth a note. When you’re running a big program that’s going to take a lot of time, it’s very disturbing when it works away silently, maybe for hours. What if something’s wrong? What if it’s going incredibly slow and will never finish? So, line 15 prints out a progress report after every 100,000 lines of input, which is reassuring.
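The same idea can be sketched a little more compactly. The version below is an illustrative variant, not the author's code: the `FetchIndex` class name, the `progress_every` and `log` parameters, and the third sample record are assumptions made up for the example. It reads from any IO object so that it can be tried on a few lines of text, and it returns an empty list for an unknown host.

```ruby
require "stringio"

# Hypothetical variant of the loader: not the author's code.
class FetchIndex
  def initialize(io, progress_every: 100_000, log: $stderr)
    # The default block creates an empty list the first time a host is seen,
    # so every value is a list of [time, article] pairs.
    @fetches = Hash.new { |hash, host| hash[host] = [] }
    io.each_line.with_index(1) do |line, count|
      host, time, article = line.split
      next unless article # skip blank or malformed lines
      @fetches[host] << [time, article.to_sym]
      log.puts "Line: #{count}" if (count % progress_every).zero?
    end
  end

  # Returns the list of fetches for a host, or an empty list if there are none.
  # Hash#fetch bypasses the default block, so misses do not add keys.
  def find(host)
    @fetches.fetch(host, [])
  end
end

# The first two records come from the sample above; the third is invented.
sample = StringIO.new(<<~LOG)
  egspd42470.ask.com 1166406027 2006/05/03/MARS-T-Shirt
  84.7.249.205 1166406040 2003/03/27/Scanner
  84.7.249.205 1166406100 2006/05/03/MARS-T-Shirt
LOG

index = FetchIndex.new(sample)
p index.find("84.7.249.205")
p index.find("no.such.host")
```

Looking up misses with `fetch` rather than `[]` is a deliberate choice here: with a default block installed, `[]` would quietly add an empty entry for every host that was asked about but never seen.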

    Running this program was interesting. It took about 55 minutes of CPU time to load up the hash, and the program grew to occupy 1.56 GB of memory. A little calculation suggests that it costs around 680 bytes to store the information for each host, or, slicing the data another way, about 126 bytes per fetch. This is a little scary, but probably reasonable for a hash table.
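The per-host and per-fetch figures can be roughly reproduced from the counts given earlier. The sketch below assumes nothing beyond those counts and the 1.56 GB total; because the text does not say whether "GB" means 10^9 or 2^30 bytes, it computes both. The `memory_cost` helper is a name made up for the example.

```ruby
HOSTS   = 2_345_571
FETCHES = 12_600_064

# Hypothetical helper: bytes per host and bytes per fetch for a given total.
def memory_cost(total_bytes)
  [total_bytes / HOSTS.to_f, total_bytes / FETCHES.to_f]
end

# "1.56 GB" is ambiguous, so try the decimal and the binary reading.
readings = { "decimal GB" => 1.56 * 10**9, "binary GB" => 1.56 * 2**30 }

readings.each do |unit, bytes|
  per_host, per_fetch = memory_cost(bytes)
  puts format("%-10s %.0f bytes per host, %.0f bytes per fetch", unit, per_host, per_fetch)
end
```

The two readings give roughly 665 to 714 bytes per host and 124 to 133 bytes per fetch, which brackets the approximate figures quoted above.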

    Retrieval performance was excellent. I ran 2,000 queries, half of which were randomly selected hosts from the log and thus succeeded, while the other half were those same hostnames reversed, none of which succeeded. The 2,000 queries completed in an average of about .02 seconds, so Ruby’s hash implementation can look up records in a hash containing 12 million or so records thousands of times per second.
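A test of this shape might be sketched as follows. This is not the author's benchmark: the table is a small synthetic stand-in with made-up hostnames, so the timing it prints says nothing about the 12-million-record case. It only illustrates the method of mixing known hosts with their reversed names and timing the lookups with the standard `Benchmark` module.

```ruby
require "benchmark"

# Synthetic stand-in for the real table: hypothetical hostnames, dummy values.
table = {}
100_000.times { |i| table["host-#{i}.example.com"] = [["1166406026", :article]] }

hits    = table.keys.sample(1_000)  # hosts known to be present
misses  = hits.map(&:reverse)       # the same names reversed, which are absent
queries = (hits + misses).shuffle

found = 0
elapsed = Benchmark.realtime do
  queries.each { |host| found += 1 if table[host] }
end

puts format("%d queries, %d found, %.4f seconds", queries.size, found, elapsed)
```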

    Those 55 minutes to load up the data are troubling, but there are some tricks to address that. You could, for example, load it up once, then serialize the hash out and read it back in. And I didn’t try particularly hard to optimize the program.
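One way to try the serialize-and-reload idea in Ruby is the built-in `Marshal` module. The sketch below is an assumption about how that could look, not something the author reports doing, and it makes no claim about how much load time it would actually save. It uses a one-entry hash and a temporary file as placeholders.

```ruby
require "tempfile"

# A tiny placeholder for the big hash: host => list of [time, article] pairs.
fetches = {
  "84.7.249.205" => [["1166406040", :"2003/03/27/Scanner"]]
}

restored = nil
Tempfile.create("fetches") do |file|
  file.binmode
  Marshal.dump(fetches, file)   # write the whole hash in Ruby's binary format
  file.rewind
  restored = Marshal.load(file) # rebuild it without reparsing the text log
end

p restored["84.7.249.205"]
```

Two caveats apply: `Marshal.load` should only be used on files you wrote yourself, and a hash created with a default block cannot be dumped, so the table has to be a plain hash.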

    The program was easy and quick to write, and it runs fast once it’s initialized, so its performance is good both in terms of waiting-for-the-program time and waiting-for-the-programmer time. Still, I’m unsatisfied. I have the feeling that there ought to be a way to get this kind of performance while burning less memory, less startup time, or both. It involves writing our own search code, though.



     
     