Home arrow Practices arrow Page 3 - Finding Things

Putting Regular Expressions to Work - Practices

Search, whether it's searching the web or the contents of your computer, presents the developer with a major challenge. This article, the first of two parts, provides an overview of several search techniques, and the trade-offs that go with them. It is excerpted from chapter four of Beautiful Code: Leading Programmers Explain How They Think, written by Andy Oram and Greg Wilson (O'Reilly, 2007; ISBN: 0596510047). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

TABLE OF CONTENTS:
  1. Finding Things
  2. Regular Expressions
  3. Putting Regular Expressions to Work
  4. Content-Addressable Storage
  5. Time to Optimize?
By: O'Reilly Media
Rating: starstarstarstarstar / 4
July 10, 2008

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement

A regular expression standing by itself, as shown above, can be used on the command line to search files. But it turns out that most modern computer languages allow you to use them directly in program code. Let’s do that, and write a program that prints out only the lines that match the expression, which is to say a program that records all the times someone fetched an article from the weblog.

This example (and most other examples in this chapter) is in the Ruby programming language because I believe it to be, while far from perfect, the most readable of languages.

If you don’t know Ruby, learning it will probably make you a better programmer. In Chapter 29, the creator of Ruby, Yukihiro Matsumoto (generally known as “Matz”), discusses some of the design choices that have attracted me and so many other programmers to the language.

Example 4-1 shows our first Ruby program, with added line numbers on the left side. (All the examples in this chapter are available from the O’Reilly web site for this book.)

EXAMPLE 4-1. Printing article-fetch lines

1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }
3     puts line
4   end
5 end

Running this program prints out a bunch of logfile lines that look like the first example. Let’s have a line-by-line look at it:

Line 1

We want to read all the lines of the input, and we don’t care whether they’re from files named on the command line or are being piped in from another program on the standard input. The designers of Ruby believe strongly that programmers shouldn’t have to write ugly code to deal with common situations, and this is a common situation. So,ARGFis a special variable that represents all the input sources. If the command line includes arguments, ARGF assumes they’re names of files and opens them one by one; if there aren’t any, it uses the standard input.

each_lineis a method that you can call on pretty well any file-like object, such asARGF. It reads the lines of input and passes them, one at a time, to a “block” of following code.

The followingdosays that the block getting the input stretches from there to the correspondingend, and the|line|asks that theeach_linemethod load each line into the variablelinebefore giving it to the block.

This kind of loop may surprise the eyes of a new convert to Ruby, but it’s concise, powerful, and very easy to follow after just a bit of practice.

Line 2

This is a pretty straightforwardifstatement. The only magic is the=~, which means “matches” and expects to be followed by regular expression. You can tell Ruby that something is a regular expression by putting slashes before and after it—for example,/this-is-a-regex/. But the particular regular expression we want to use is full of slashes. So to use the slash syntax, you’d have to “escape” them by turning each/into\/, which would be ugly. In this case, therefore, the%rtrick produces more beautiful code.

Line 3

We’re inside theifblock now. So, if the currentlinematches the regexp, the program executesputs line, which prints out the line and a line feed.

Lines 4 and 5

That’s about all there is to it. The firstendterminates theif, and the second terminates thedo. They look kind of silly dangling off the bottom of the code, and the designers of Python have figured out a way to leave them out, which leads to some Python code being more beautiful than the corresponding Ruby.

So far, we’ve shown how regular expressions can be used to find the lines in the logfile that we’re interested in. But what we’re really interested in is counting the fetches for each article. The first step is to identify the article names. Example 4-2 is a slight variation on the previous program.

EXAMPLE 4-2. Printing article names

1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
3     puts $1
4   end
5 end

The differences are subtle. In line 2, I’ve added a pair of parentheses (in boldface) around the interesting part of the article name in the regular expression. In line 3, instead of printing out the whole value ofline, I print out$1, which in Ruby (and several other regular-expression-friendly languages) means “the first place in the regular expression marked with parentheses.” You can mark lots of different pieces of the expression, and thus use$2,$3, and so on.

The first few lines of output produced by running this program over some logfile data look like this:

  2003/10/10/FooCampMacs
  2006/11/13/Rough-Mix
  2003/05/22/StudentLookup
  2003/11/13/FlyToYokohama
  2003/07/31/PerlAngst
  2003/05/21/RDFNet
  2003/02/23/Democracy
  2005/12/30/Spolsky-Recursion
  2004/05/08/Torture
  2004/04/27/RSSticker

Before we go to work determining the popularity of different articles, I’d like to argue that in some important ways, this code is beautiful. Take a moment and think of the code you’d have to write to look at an arbitrary chunk of text and do the same matching and selection work done by the parenthesized regexp. It would be quite a few lines of code, and it would be easy to get wrong. Furthermore, if the format of the logfile changed, fixing the pattern matcher would be error-prone and irritating.

Under the covers, the way that regular expressions work is also among the more wonderful things in computer science. It turns out that they can conveniently be translated into finite automata. These automata are mathematically elegant, and there are astoundingly efficient algorithms for matching them against the text you’re searching. The great thing is that when you’re running an automaton, you have to look only once at each character in the text you’re trying to match. The effect is that a well-built regular expression engine can do pattern matching and selection faster than almost any custom code, even if it were written in hand-optimized assembly language. That’s beautiful.

I think that the Ruby code is pretty attractive, too. Nearly every character of the program is doing useful work. Note that there are no semicolons on the ends of the lines, nor parentheses around the conditional block, and that you can writeputs lineinstead ofputs(line). Also, variables aren’t declared—they’re just used. This kind of stripped-down language design makes for programs that are shorter and easier to write, as well as (more important) easier to read and easier to understand.

Thinking in terms of time, regular expressions are a win/win. It takes the programmer way less time to write them than the equivalent code, it takes less time to deliver the program to the people waiting for it, it uses the computer really efficiently, and the program’s user spends less time sitting there bored.



 
 
>>> More Practices Articles          >>> More By O'Reilly Media
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

PRACTICES ARTICLES

- Calculating Development Project Costs
- More Techniques for Finding Things
- Finding Things
- Finishing the System`s Outlines
- The System in So Many Words
- Basic Data Types and Calculations
- What`s the Address? Pointers
- Design with ArgoUML
- Pragmatic Guidelines: Diagrams That Work
- Five-Step UML: OOAD for Short Attention Span...
- Five-Step UML: OOAD for Short Attention Span...
- Introducing UML: Object-Oriented Analysis an...
- Class and Object Diagrams
- Class Relationships
- Classes

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: