Running a couple of quick scripts over the logfile data reveals that there are 12,600,064 instances of an article fetch coming from 2,345,571 different hosts. Suppose we are interested in who was fetching what, and when. An auditor, a police officer, or a marketing professional might want to know.
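In case you want to do that kind of counting yourself, a minimal sketch of such a script might look like the following; it assumes each article fetch has already been reduced to one line whose first whitespace-separated field is the hostname, and the filename fetches.txt is made up for illustration:

    # A sketch only, not the actual script used to produce the numbers above.
    # Assumes one article fetch per line, with the hostname as the first field.
    fetches = 0
    hosts = {}
    File.open("fetches.txt").each_line do |line|  # "fetches.txt" is a hypothetical name
      host = line.split[0]
      next unless host                            # skip blank lines
      fetches += 1
      hosts[host] = true                          # a hash doubles as a set of distinct hosts
    end
    puts "#{fetches} fetches from #{hosts.size} different hosts"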
So, here’s the problem: given a hostname, report what articles were fetched from that host, and when. The result is a list; if the list is empty, no articles were fetched.
We’ve already seen that a language’s built-in hash or equivalent data structure gives the programmer a quick and easy way to store and look up key/value pairs. So, you might ask, why not use it?
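For instance, in Ruby (the language used for the examples here), storing and retrieving a key/value pair is a one-liner each; the host and article names below are invented for illustration:

    fetches = {}                                     # an empty hash
    fetches["some.host.example.com"] = "an-article"  # store a value under a key
    puts fetches["some.host.example.com"]            # look it up: prints "an-article"
    p fetches["no.such.host"]                        # a missing key comes back as nil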
That’s an excellent question, and we should give the idea a try. There are reasons to worry that it might not work very well, so in the back of our minds, we should be thinking of a Plan B. As you may recall if you’ve ever studied hash tables, in order to go fast, they need to have a small load factor; in other words, they need to be mostly empty. However, a hash table that holds 2.35 million entries and is still mostly empty is going to require the use of a whole lot of memory.
To simplify things, I wrote a program that ran over all the logfiles and pulled out all the article fetches into a simple file; each line has the hostname, the time of the transaction, and the article name. The short Ruby class below loads that file into a hash, keyed by hostname:
 1 class BigHash
 2
 3   def initialize(file)
 4     @hash = {}
 5     lines = 0
 6     File.open(file).each_line do |line|
 7       s = line.split
 8       article = s[2].intern
 9       if @hash[s[0]]
10         @hash[s[0]] << [ s[1], article ]
11       else
12         @hash[s[0]] = [ [ s[1], article ] ]
13       end
14       lines += 1
15       STDERR.puts "Line: #{lines}" if (lines % 100000) == 0
16     end
17   end
18
19   def find(key)
20     @hash[key]
21   end
22
23 end
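A quick usage sketch, assuming the simplified fetch file is named fetches.txt (a made-up name): find returns the host's list of [time, article] pairs, or nil if the host never fetched anything.

    big = BigHash.new("fetches.txt")                    # "fetches.txt" is a hypothetical filename
    fetches = big.find("some.host.example.com") || []   # treat a missing host as an empty list
    fetches.each do |time, article|
      puts "#{time} #{article}"
    end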
The program should be fairly self-explanatory, but line 15 is worth a note. When you’re running a big program that’s going to take a lot of time, it’s very disturbing when it works away silently, maybe for hours. What if something’s wrong? What if it’s going incredibly slow and will never finish? So, line 15 prints out a progress report after every 100,000 lines of input, which is reassuring.
Running this program was interesting. It took about 55 minutes of CPU time to load up the hash, and the program grew to occupy 1.56 GB of memory. A little calculation suggests that it costs around 680 bytes to store the information for each host, or slicing the data another way, about 126 bytes per fetch. This is a little scary, but probably reasonable for a hash table.
Retrieval performance was excellent. I ran 2,000 queries: half were hosts randomly selected from the log, and thus succeeded, while the other half were those same hostnames reversed, none of which succeeded. The 2,000 queries completed in a total of about .02 seconds, so Ruby's hash implementation, even loaded with 12 million or so fetch records, can do lookups many thousands of times per second.
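If you want to reproduce that sort of measurement, a sketch along these lines would do; the sample size and file name are assumptions, not the exact harness used here:

    require 'benchmark'

    big = BigHash.new("fetches.txt")                         # hypothetical filename
    hosts = File.readlines("fetches.txt").map { |l| l.split[0] }.uniq
    hits = hosts.sample(1000)                                # hosts that are in the hash
    misses = hits.map { |h| h.reverse }                      # reversed names, which won't be found
    queries = (hits + misses).shuffle

    elapsed = Benchmark.realtime do
      queries.each { |q| big.find(q) }
    end
    puts "#{queries.size} queries in #{elapsed} seconds"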
Those 55 minutes to load up the data are troubling, but there are some tricks to address that. You could, for example, load it up once, then serialize the hash out and read it back in. And I didn’t try particularly hard to optimize the program.
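For example, Ruby's Marshal module can snapshot the loaded object to disk, so later runs can skip the parsing entirely. This is a sketch only; the file names, and the assumption that the BigHash class lives in big_hash.rb, are mine:

    require_relative 'big_hash'   # assumes the BigHash class shown earlier is in big_hash.rb

    # First run: pay the parsing cost once, then write a binary snapshot.
    big = BigHash.new("fetches.txt")                               # hypothetical filename
    File.open("big_hash.dump", "wb") { |f| Marshal.dump(big, f) }

    # Later runs: read the snapshot back instead of re-parsing everything.
    big = File.open("big_hash.dump", "rb") { |f| Marshal.load(f) }

Whether reading the snapshot back is dramatically faster than re-parsing is something you'd have to measure; Marshal still has to rebuild every string and array it wrote out.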
The program was easy and quick to write, and it runs fast once it’s initialized, so its performance is good both in terms of waiting-for-the-program time and waiting-for-the-programmer time. Still, I’m unsatisfied. I have the feeling that there ought to be a way to get this kind of performance while burning less memory, less startup time, or both. It involves writing our own search code, though.