Home arrow Practices arrow Page 4 - Finding Things

Content-Addressable Storage - Practices

Search, whether it's searching the web or the contents of your computer, presents the developer with a major challenge. This article, the first of two parts, provides an overview of several search techniques, and the trade-offs that go with them. It is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

TABLE OF CONTENTS:
  1. Finding Things
  2. Regular Expressions
  3. Putting Regular Expressions to Work
  4. Content-Addressable Storage
  5. Time to Optimize?
By: O'Reilly Media
Rating: starstarstarstarstar / 4
July 10, 2008

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement

Now we’re approaching the core of our problem, computing the popularity of articles. We’ll have to pull the article name out of each line, look it up to see how many times it’s been fetched, add one to that number, and then store it away again.

This may be the most basic of search patterns: we start with a key (what we’re using to search—in this case, an article name), and we’re looking for a value (what we want to find—in this case, the number of times the article has been fetched). Here are some other examples:

Key Value
Word List of web pages containing the word
Employee number Employee’s personnel record
Passport number “true” or “false,” indicating whether the person with that passport should be subject to extra scrutiny

What programmers really want in this situation is a very old idea in computer science: content-addressable memory, also known as an associative store and various other permutations of those words. The idea is to put the key in and get the value out. There actually exists hardware which does just that; it mostly lives deep in the bowels of microprocessors, providing rapid access to page tables and memory caches.

The good news is that you, the programmer, using any modern computer language, have access to excellent software implementations of associative memory. Different languages call these implementations different things. Often they are implemented as hash tables; in Java, Perl, and Ruby, which use this technique, they are called Hashes, HashMaps, or something similar. In Python, they are called dictionaries, and in the computer algebra language Maple, simply tables.

Now if you’re an eager search-algorithm fan just itching to write your own super-efficient search, this may sound like bad news, not good news. But think about those flavors of time; if you use the built-in associative store, the amount of programmer time and management invested in writing search algorithms goes to nearly zero.

By writing your own search, you might be able to save a little computer (and thus end-user) time, compared to the built-in version, but on the other hand, you might not; the people who write these things tend to be pretty clever. Andrew Kuchling has written Chapter 18 of this book on one such effort.

Associative stores are so important that dynamically typed languages such as Ruby and Python have not only built-in support, but special syntax for defining and using them. Let’s use Ruby’s hashes to count article popularity in Example 4-3.

EXAMPLE 4-3. Counting article fetches

1 counts = {}
2 counts.default = 0
3
4 ARGF.each_line do |line|
5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
6     counts[$1] += 1
7   end
8 end

This program isn’t that much different from the version in Example 4-2. Line 1 creates an empty Hash calledcounts. Line 2 gives the array a “default value” of zero; hold on for an explanation of that.

Then, in line 6, instead of printing out the article name, the name serves as the key to look up the number of fetches of this article seen so far incounts, add one to it, and store the value.

Now, consider what happens when the program sees some article name stored in$1for the first time. I could write code along the lines of “if there is acounts[$1], then add one to it; otherwise, setcounts[$1]to one.” The designers of Ruby hate that kind of awkwardness; this is why they provided the notion of a “default value” for a Hash. If you look up a key the Hash doesn’t know about, it says “OK, zero,” allowing you to writecounts[$1] += 1and have it always just work.

I originally stated the problem as “Which of my articles have been read the most?” That’s kind of fuzzy; let’s interpret it to mean “Print out the top 10 most popular articles.” The resulting program is shown in Example 4-4.

EXAMPLE 4-4. Reporting the most popular articles

1 counts = {}
2 counts.default = 0
3
4 ARGF.each_line do |line|
5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
6     counts[$1] += 1
7   end
8 end
9
10 keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
11 keys_by_count[0 .. 9].each do |key|
12   puts "#{counts[key]}: #{key}"
13 end

Line 10 looks a little less beautiful to me than most Ruby code, but it’s easy enough to understand. Thekeysmethod ofcountsreturns an array containing all of the Hash’s keys. Because of the hash implementation, the keys are stored in no predictable order, and are also returned by thekeysmethod in random order. So, I have to sort them and store them back in a new array.

In Ruby,sortis accompanied by a code block, here enclosed in curly braces. (In Ruby, you can delimit a block either withdoandend or with{and}.) The sort works its way back and forth through the array being sorted, passing pairs of elements to the block, which has to return a negative number, 0, or a positive number depending on whether the first element is less than, equal to, or greater than the second.

In this case, we want to get the data out of the hash in an order defined by the values (the counts themselves) rather than by the filenames (the keys), so we have to sort the keys by their values. Have a close look at the code, and you’ll see how it works. Because this is something that people do all the time, I’m surprised that Ruby’s Hash doesn’t come withsort_by_value.

We use a decreasing order for the sort so that, no matter how many articles we’ve found, we know the first 10 items inkeys_by_countrepresent the top 10 articles in popularity.

Now that we have an array of keys (article names) sorted in descending order of how many times they’ve been fetched, we can accomplish our assignment by printing out the first 10. Line 11 is simple, but a word is in order about thateachmethod. In Ruby, you almost never see aforstatement because anything whose elements you might want to loop through has aneach method that does it for you.

Line 12 may be a little hard to read for the non-Rubyist because of the#{}syntax, but it’s pretty straightforward.

So, let’s declare victory on our first assignment. It took us only 13 lines of easy-to-read code. A seasoned Rubyist would have squeezed the last three lines into one.

Let’s run this thing and see what it reports. Instead of running it over the whole 28 GB, let’s just use it on a week’s data: a mere 1.2 million records comprising 245 MB.

  ~/dev/bc/ 548> zcat ~/ongoing/logs/2006-12-17.log.gz | \
                 time ruby code/report-counts.rb
 
4765: 2006/12/11/Mac-Crash
  3138: 2006/01/31/Data-Protection
  1865: 2006/12/10/EMail
  1650: 2006/03/30/Teacup
  1645: 2006/12/11/Java
  1100: 2006/07/28/Open-Data
  900: 2006/11/27/Choose-Relax
  705: 2003/09/18/NXML
  692: 2006/07/03/July-1-Fireworks
  673: 2006/12/13/Blog-PR
         13.54 real    7.49 user    0.73 sys

This run took place on my 1.67 GHz Apple PowerBook. The results are unsurprising, but the program does seem kind of slow. Should we worry about performance?



 
 
>>> More Practices Articles          >>> More By O'Reilly Media
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

PRACTICES ARTICLES

- Calculating Development Project Costs
- More Techniques for Finding Things
- Finding Things
- Finishing the System`s Outlines
- The System in So Many Words
- Basic Data Types and Calculations
- What`s the Address? Pointers
- Design with ArgoUML
- Pragmatic Guidelines: Diagrams That Work
- Five-Step UML: OOAD for Short Attention Span...
- Five-Step UML: OOAD for Short Attention Span...
- Introducing UML: Object-Oriented Analysis an...
- Class and Object Diagrams
- Class Relationships
- Classes

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: