"Mathematics is the language of
nature. Everything around us can be represented and understood through
numbers. If you graph these numbers, patterns emerge. Therefore: There are
patterns everywhere in nature."
- Max Cohen in Pi, 1998
Whether or
not you agree that Max's assumptions give rise to his conclusion is your own
opinion, but his case is much easier to follow in the field of computers - there
are certainly patterns everywhere in programming.
Regular expressions allow us to look for patterns in
our data. So far we've been limited to checking a single value against that of a
scalar variable or the contents of an array or hash. By using the rules outlined
in this chapter, we can use that one single value (or pattern) to describe what
we're looking for in more general terms: we can check that every sentence in a
file begins with a capital letter and ends with a full stop, find out how many
times James Bond's name is mentioned in 'Goldfinger', or learn if there are any
repeated sequences of digits longer than five in the decimal representation of π.
However, regular expressions are a very big area -
they're one of the most powerful features of Perl. We're going to break our
treatment of them up into six sections:
Basic patterns
Special characters to use
Quantifiers, anchors and memorizing patterns
Matching, substituting, and transforming text using patterns
Backtracking
A quick look at some simple pitfalls
Generally speaking, if you want to ask perl
something about a piece of text, regular expressions are going to be your first
port of call - however, there's probably one simple question burning in your
head...
What Are They?
The term "Regular Expression" (now commonly abbreviated to "RegExp"
or even "RE") simply refers to a pattern that follows the rules of syntax
outlined in the rest of this chapter. Regular expressions are not limited to
Perl - Unix utilities such as sed and egrep use much the same notation for finding patterns in text. So
why aren't they just called 'search patterns' or something less
obscure?
Well, the actual phrase itself originates from the mid-fifties,
when a mathematician called Stephen Kleene developed a notation for manipulating
'regular sets'. Perl's regular expressions have grown well beyond that
original system, but some of Kleene's notation remains, and the name has stuck.
Patterns
History lessons aside, it's
all about identifying patterns in text. So what constitutes a pattern? And how
do you compare it against something?
The simplest pattern is a word - a
simple sequence of characters - and we may, for example, want to ask perl
whether a certain string contains that word. Now, we can do this with the
techniques we have already seen: We want to split the string into separate
words, and then test to see if each word is the one we're looking for. Here's
how we might do that:
#!/usr/bin/perl
# match1.plx
use warnings;
use strict;

my $found = 0;
$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

my $sought = "people";
foreach my $word (split) {
    if ($word eq $sought) {
        $found = 1;
        last;
    }
}
if ($found) {
    print "Hooray! Found the word 'people'\n";
}
Sure enough the program returns
success:
>perl match1.plx
Hooray! Found the word 'people'
>
But that's messy! It's complicated, and it's
slow to boot! Worse still, the split function (which
breaks each of our lines up into a list of 'words' - we'll see more of this
later on in the chapter) actually keeps all
the punctuation - the string 'you' wouldn't be found in
the above, whereas 'you...' would. This looks like a
hard problem, but it should be easy. Perl was designed to make easy tasks easy
and hard things possible, so there should be a better way to do this. This is
how it looks using a regular expression:
#!/usr/bin/perl
# match1.plx
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

if ($_ =~ /people/) {
    print "Hooray! Found the word 'people'\n";
}
This
is much, much easier and yields the same result. We place the text we want to
find between forward slashes - that's the regular expression part - that's our
pattern, what we're trying to match. We also need to tell perl which particular
string we want to search for that pattern. We do this with the =~ operator. This returns 1 if the pattern match was
successful (in our case, whether the character sequence 'people' was found in
the string) and a false value (the empty string) if it wasn't.
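To make that return value concrete, here is a minimal sketch that stores the result of =~ in a scalar and inspects it. The file name and the variables $hit and $miss are ours, chosen only for illustration, and the behavior shown assumes the match is evaluated in scalar context:

```perl
#!/usr/bin/perl
# matchvalue.plx - hypothetical file name, for illustration only
use warnings;
use strict;

my $text = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

# Assigning to a scalar puts each match in scalar context.
my $hit  = ($text =~ /people/);   # the pattern is present
my $miss = ($text =~ /fish/);     # the pattern is absent

print "Successful match returned '$hit'\n";
print "Failed match returned a defined but false value\n"
    if defined $miss and not $miss;
```

In practice we rarely store the result like this; it is usually enough to put the match straight into an if condition.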
Before we go on to
more complicated patterns, let's just have a quick look at that syntax. As we
noted previously, a lot of Perl's operations take $_ as
a default argument, and regular expressions are one such operation. Since we
have the text we want to test in $_, we don't need to
use the =~ operator to 'bind' the pattern to another
string. We could write the above even more simply:
$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

if (/people/) {
    print "Hooray! Found the word 'people'\n";
}
Alternatively, we might want to
test for the pattern not matching - the word not being found. Obviously, we
could say unless (/people/), but if the text we're
looking at isn't in $_, we may also use the negative
form of that =~ operator, which is !~. For example:
#!/usr/bin/perl
# nomatch.plx
use warnings;
use strict;

my $gibson = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

if ($gibson !~ /fish/) {
    print "There are no fish in William Gibson.\n";
}
True to form, for cyberpunk books that don't regularly involve
fish, we get the result.
>perl nomatch.plx
There are no fish in William Gibson.
>
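The relationship between the two binding operators can be sketched as follows. This is an illustrative snippet rather than part of nomatch.plx, and the variable names $negated and $inverted are assumptions of ours. It checks that !~ gives the same answer as negating an ordinary =~ match:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $gibson = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

# Two ways of asking "is the pattern absent?"
my $negated  = ($gibson !~ /fish/)  ? "yes" : "no";
my $inverted = !($gibson =~ /fish/) ? "yes" : "no";

print "No fish according to !~:    $negated\n";
print "No fish according to !(=~): $inverted\n";

# The same test also reads naturally with unless.
unless ($gibson =~ /fish/) {
    print "unless agrees: no fish.\n";
}
```

Which form to use is largely a matter of taste; !~ tends to read best when the string being tested is held in a named variable.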
Literal text is the simplest
regular expression of all to look for, but we needn't look for just the one word
- we could look for any particular phrase. However, we need to make sure that we
exactly match all the characters: words (with correct capitalization), numbers,
punctuation, and even whitespace:
#!/usr/bin/perl
# match2.plx
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

if (/I do/) {
    print "'I do' is in that string.\n";
}
if (/sometimes Case/) {
    print "'sometimes Case' matched.\n";
}
Let's run this program
and see what happens:
>perl match2.plx
'I do' is in that string.
>
The other string didn't match, even
though those two words are there. This is because every character in a regular
expression has to match the string, in order and with nothing in between: first "sometimes",
then a space, then "Case". In $_, there was a comma
before the space, so it didn't match exactly. Similarly, spaces inside the
pattern are significant:
#!/usr/bin/perl
# match3.plx
use warnings;
use strict;

my $test1 = "The dog is in the kennel";
my $test2 = "The sheepdog is in the field";

if ($test1 =~ / dog/) {
    print "This dog's at home.\n";
}
if ($test2 =~ / dog/) {
    print "This dog's at work.\n";
}
This will
only find the first dog, as perl was looking for a space followed by the three
letters, 'dog':
>perl match3.plx
This dog's at home.
>
So, for the moment, it looks like we shall have
to specify our patterns with absolute precision. As another example, look at
this:
#!/usr/bin/perl
# match4.plx
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

if (/case/) {
    print "I guess it's just the way I'm made.\n";
} else {
    print "Case? Where are you, Case?\n";
}
>perl match4.plx
Case? Where are you, Case?
>
Hmm, no
match. Why not? Because we asked for a small 'c' when we had a big 'C' - regexps
are (if you'll pardon the pun) case-sensitive. We can get around this by asking
perl to compare insensitively, and we do this by putting an 'i' (for
'insensitive') after the closing slash. If we alter the code above as
follows:
if (/case/i) {
    print "I guess it's just the way I'm made.\n";
} else {
    print "Case? Where are you, Case?\n";
}
then we find him:
>perl match4.plx
I guess it's just the way I'm made.
>
This
'i' is one of several modifiers that we can add to the end of the regular
expression to change its behavior slightly. We'll see more of them later
on.
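As a small sketch of the modifier at work, the snippet below runs three patterns over the same string, with and without the 'i'. The variables $sensitive, $insensitive, and $shouted are our own scaffolding and not part of match4.plx:

```perl
#!/usr/bin/perl
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

# Without the modifier, the lowercase pattern cannot match 'Case'.
my $sensitive   = /case/  ? "matched" : "did not match";

# With it, the case of the letters in the pattern no longer matters.
my $insensitive = /case/i ? "matched" : "did not match";
my $shouted     = /CASE/i ? "matched" : "did not match";

print "/case/  $sensitive\n";
print "/case/i $insensitive\n";
print "/CASE/i $shouted\n";
```

Only the first of the three fails, since the string contains 'Case' with a capital 'C' and no all-lowercase 'case'.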
Interpolation
Regular expressions work a little like double-quoted strings;
variables and metacharacters are interpolated. This allows us to store patterns
in variables and determine what we are matching when we run the program - we
don't need to have them hard-coded in.
Try it out - Pattern Tester
This program will ask the user
for a pattern and then test to see if it matches our string. We can use this
throughout the chapter to help us test the various different styles of pattern
we'll be looking at:
#!/usr/bin/perl
# matchtest.plx
use warnings;
use strict;

$_ = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
# Tolkien, Lord of the Rings

print "Enter some text to find: ";
my $pattern = <STDIN>;
chomp($pattern);

if (/$pattern/) {
    print "The text matches the pattern '$pattern'.\n";
} else {
    print "'$pattern' was not found.\n";
}
Now we can test out a few
things:
> perl matchtest.plx
Enter some text to find: wonder
The text matches the pattern 'wonder'.

> perl matchtest.plx
Enter some text to find: entish
'entish' was not found.

> perl matchtest.plx
Enter some text to find: hough
The text matches the pattern 'hough'.

> perl matchtest.plx
Enter some text to find: and 'no',
The text matches the pattern 'and 'no','.
Pretty straightforward, and I'm sure you could all spot
those not in $_ as well.
How It Works
matchtest.plx has its basis in the
three lines:
my $pattern = <STDIN>;
chomp($pattern);
if (/$pattern/) {
We're
taking a line of text from the user. Then, since it will end in a newline, and
we don't necessarily want to find a newline in our pattern, we chomp it away. Now we do our test.
Since we're not
using the =~ operator, the test will be looking at the
variable $_. The regular expression is /$pattern/, and just like the double-quoted string "$pattern", the variable $pattern is
interpolated. Hence, the regular expression is purely and simply whatever the
user typed in, once we've got rid of the newline.
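For readers who would rather not retype input on every run, the same test can be sketched non-interactively. The version below is a variant of matchtest.plx built on our own assumptions, not a replacement for it: the @patterns list and the %result hash are names we've invented, and the patterns are the four tried in the session above.

```perl
#!/usr/bin/perl
use warnings;
use strict;

$_ = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# The four inputs from the session above, stored as ordinary strings.
my @patterns = ("wonder", "entish", "hough", "and 'no',");
my %result;

foreach my $pattern (@patterns) {
    # $pattern is interpolated into the regular expression,
    # much as it would be in a double-quoted string.
    if (/$pattern/) {
        $result{$pattern} = 1;
        print "The text matches the pattern '$pattern'.\n";
    } else {
        $result{$pattern} = 0;
        print "'$pattern' was not found.\n";
    }
}
```

Because the loop uses its own variable, $pattern, the match is still made against $_, just as in the interactive program.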