Nobody gets a Computer Science degree without studying a wide variety of search algorithms: trees, heaps, hashes, lists, and more. My favorite among all these is binary search. Let’s try it on the who-fetched-what-when problem and then look at what makes it beautiful.
My first attempt at putting binary search to use was quite disappointing; while the data took 10 minutes less to load, it required almost 100 MB more memory than with the hash. Clearly, there are some surprising things about the Ruby array implementation. The search also ran several times slower (but still in the range of thousands per second), but this is not surprising at all because the algorithm is running in Ruby code rather than with the underlying hardcoded hash implementation.
The problem is that in Ruby everything is an object, and arrays are fairly abstracted things with lots of built-in magic. So, let’s reimplement the program in Java, in which integers are just integers, and arrays come with very few extras.*
Nothing could be simpler, conceptually, than binary search. You divide your search space in two and see whether you should be looking in the top or bottom half; then you repeat the exercise until done. Instructively, there are a great many ways to code this algorithm incorrectly, and several widely published versions contain bugs. The implementation mentioned in “On the Goodness of Binary Search,” and shown in Java in Example 4-6, is based on one I learned from Gaston Gonnet, the lead developer of the Maple language for symbolic mathematics and currently Professor of Computer Science at ETH in Zürich.
EXAMPLE 4-6. Binary search
1 package binary; 2 3 public class Finder { 4 public static int find(String[] keys, String target) { 5 int high = keys.length; 6 int low = -1; 7 while (high - low > 1) { 8 int probe = (low + high) >>> 1; 9 if (keys[probe].compareTo(target) > 0) 10 high = probe; 11 else 12 low = probe; 13 } 14 if (low == -1 || keys[low].compareTo(target) != 0) 15 return -1; 16 else 17 return low; 18 } 19 }
Key aspects of this program are as follows:
In lines 5–6, note that thehighandlowbounds are set one off the ends of the array, so neither are initially valid indices. This eliminates all sorts of corner cases.
The loop that starts in line 7 runs until the high and low bounds are adjacent; there is no testing to see whether the target has been found. Think for a minute whether you agree with this choice; we’ll return to the question later.
The loop has two invariants.lowis either –1 or points to something less than or equal to the target value.high is either one off the top of the array or points to something strictly greater than the target value.
Line 8 is particularly interesting. In an earlier version it read:
probe = (high + low) / 2;
but in June 2006, Java guru Josh Bloch showed how, in certain obscure circumstances, that code could lead to integer overflow (see http://googleresearch.blogspot.com/2006/06/ extra-extra-read-all-about-it-nearly.html). It is sobering indeed that, many decades into the lifetime of computer science, we are still finding bugs in our core algorithms. (The issue is also discussed by Alberto Savoia in Chapter7.)
At this point, Rubyists will point out that modern dynamic languages such as Ruby and Python take care of integer overflow for you, and thus don’t have this bug.
Because of the loop invariant, once I’m done with the loop, I just need to checklow(lines 14–17). If it’s not –1, either it points to something that matches the target, or the target isn’t there.
The Java version took only six and a half minutes to load, and it ran successfully, using less than 1 GB of heap. Also, while it’s harder to measure CPU time in Java than in Ruby, there was no perceptible delay in running the same 2,000 searches.