We've now moved from matching a specific
character to a more general type of character - when we don't know (or don't
care) exactly what the character will be. Now we're going to see what happens
when we want to talk about a more general quantity of characters: more than
three digits in a row; two to four capital letters, and so on. The
metacharacters that we use to deal with a number of characters in a row are
called quantifiers.

Indefinite Repetition

The easiest of these is the question mark. It should suggest uncertainty -
something may be there, or it may not. That's exactly what it does: it states
that the immediately preceding character (or metacharacter, or group) may appear
once, or not at all. It's a good way of saying that a particular character or
group is optional. To match either 'he' or 'she', you can put:

    > perl matchtest.plx
    Enter some text to find: \bs?he\b
    The text matches the pattern '\bs?he\b'.
    >

To make a series of characters (or
metacharacters) optional, group them in parentheses as before. Did he say 'what
the Entish is' or 'what the Entish word is'? Either will do:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word )?is
    The text matches the pattern 'what the Entish (word )?is'.
    >

Notice that we had to put the space inside the
group: otherwise we end up with two spaces between 'Entish' and 'is', whereas
our text only has one:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word)? is
    'what the Entish (word)? is' was not found.
    >

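The same checks can be run outside the pattern tester. What follows is only a minimal sketch: the sample strings are made up for illustration and are not the text that matchtest.plx searches.

```perl
use strict;
use warnings;

# Made-up sample strings, chosen only to exercise the optional 's'
my @samples = (
    'he said so',
    'she said so',
    'the others said so',
);

for my $text (@samples) {
    # \bs?he\b : an optional 's', then 'he', standing alone as a word
    my $result = $text =~ /\bs?he\b/ ? 'matches' : 'does not match';
    print "'$text' $result\n";
}

# An optional group: note that the space lives inside the parentheses
my $optional_word = qr/what the Entish (word )?is/;
my $with          = 'what the Entish word is';
my $without       = 'what the Entish is';

print "both forms match\n"
    if $with =~ $optional_word && $without =~ $optional_word;
```

The third sample is there to show the word boundaries at work: the 'he' inside 'the' and 'others' is not a word on its own, so it is not picked up.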
As well as matching
something one or zero times, you can match something one or more times. We do
this with the plus sign - to match an entire word without specifying how long it
should be, you can say:

    > perl matchtest.plx
    Enter some text to find: \b\w+\b
    The text matches the pattern '\b\w+\b'.
    >

In this case, we match the first
available word - I.
If, on the other hand, you have something which may
be there any number of times but might not be there at all - zero or one or many
- you need what's called 'Kleene's star': the *
quantifier. So, to find a capital letter after any - but possibly no - spaces at
the start of the string, what would you do? The start of the string, then any
number of whitespace characters, then a capital:

    > perl matchtest.plx
    Enter some text to find: ^\s*[A-Z]
    '^\s*[A-Z]' was not found.
    >

Of course, our test
string begins with a quote, so the above pattern won't match, but, sure enough,
if you take away that first quote, the pattern will match fine.

Let's review the three quantifiers:

/bea?t/    Matches either 'beat' or 'bet'
/bea+t/    Matches 'beat', 'beaat', 'beaaat'...
/bea*t/    Matches 'bet', 'beat', 'beaat'...

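As a quick check of that review, here is a small sketch that runs each pattern over three candidate words. The ^ and $ anchors are an addition for this example, so that each pattern has to account for the whole word; the hash name %quantified is likewise just an illustrative choice.

```perl
use strict;
use warnings;

my @words = qw(bet beat beaat);

# Anchored versions of the three patterns, keyed by how they are written
my %quantified = (
    'bea?t' => qr/^bea?t$/,   # zero or one 'a'
    'bea+t' => qr/^bea+t$/,   # one or more 'a'
    'bea*t' => qr/^bea*t$/,   # zero or more 'a'
);

for my $name (sort keys %quantified) {
    my @hits = grep { $_ =~ $quantified{$name} } @words;
    print "/$name/ matches: @hits\n";
}
```

Only the star accepts all three words; the question mark rejects 'beaat', and the plus rejects 'bet'.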
Novice Perl programmers tend to go to town on combinations of
dot and star, and the results often surprise them, particularly when it comes to
searching-and-replacing. We'll explain the rules of the regular expression
matcher shortly, but bear the following in mind:
A regular expression should hardly ever start or
finish with a starred character.
You should also consider the fact that .*
and .+ in the middle of a regular expression will match
as much of your string as they possibly can. We'll look more at this 'greedy'
behavior later on.
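To see why these two warnings matter, consider the following sketch. Both strings are invented for the example, and the substitutions are deliberately naive.

```perl
use strict;
use warnings;

my $text = 'say "yes" or "no" now';   # made-up sample

# .* in the middle: it runs from the first quote to the LAST quote,
# not to the nearest one
(my $greedy = $text) =~ s/".*"/[quoted]/;
print "$greedy\n";

# A starred character at the start: x* is happy to match no characters
# at all, so the substitution succeeds right at the front of the string
my $line = 'abc';
(my $odd = $line) =~ s/x*/-/;
print "$odd\n";
```

The first substitution swallows 'yes" or "no' in one go and leaves 'say [quoted] now'. The second finds zero x's at the very beginning and turns 'abc' into '-abc', which is rarely what was intended.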

Well-Defined Repetition

If you want to be more precise about how many times a character or group of
characters might be repeated, you can specify the minimum and maximum number of
repeats in curly brackets. '2 or 3 spaces' can be written as follows:

    > perl matchtest.plx
    Enter some text to find: \s{2,3}
    '\s{2,3}' was not found.
    >

So we have no doubled or trebled spaces in our
string. Notice how we construct that - the minimum, a comma, and the maximum,
all inside braces. Omitting the maximum signifies 'or more': {2,} denotes
'2 or more'. Omitting the minimum signifies 'or fewer', so {,3} is '3 or
fewer', but that form is only accepted as a quantifier by Perl 5.34 and later;
on older versions, write {0,3} instead. In these cases, the same warnings apply
as for the star operator.
Finally,
you can specify exactly how many things are to be in a row by simply putting
that number inside the curly brackets. Here's the five-letter-word example
tidied up a little:

    > perl matchtest.plx
    Enter some text to find: \b\w{5}\b
    '\b\w{5}\b' was not found.
    >

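Here is a short sketch of the counted forms at work. The sample string is made up for this example (note the doubled space after 'Regular'), so unlike our test string it does produce some matches.

```perl
use strict;
use warnings;

# Made-up sample; there are two spaces after 'Regular'
my $text = 'Regular  expressions reward careful study';

# {2,3}: between two and three whitespace characters in a row
print "found a run of 2 or 3 whitespace characters\n"
    if $text =~ /\s{2,3}/;

# {5}: exactly five word characters between word boundaries
my @five = $text =~ /\b(\w{5})\b/g;
print "five-letter words: @five\n";

# {7,}: seven or more word characters
my @long = $text =~ /\b(\w{7,})\b/g;
print "words of seven or more letters: @long\n";
```

The /g modifier and the list assignment are used here only to collect every match rather than just the first one.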

Summary Table

To refresh your memory, here are the various metacharacters we've seen so far:

Metacharacter   Meaning
[abc]           any one of the characters a, b, or c.
[^abc]          any one character other than a, b, or c.
[a-z]           any one ASCII character between a and z.
\d \D           a digit; a non-digit.
\w \W           a 'word' character; a non-'word' character.
\s \S           a whitespace character; a non-whitespace character.
\b              the boundary between a \w character and a \W character.
.               any character (apart from a new line).
(abc)           the phrase 'abc' as a group.
?               preceding character or group may be present 0 or 1 times.
+               preceding character or group is present 1 or more times.
*               preceding character or group may be present 0 or more times.
{x,y}           preceding character or group is present between x and y times.
{,y}            preceding character or group is present at most y times
                (Perl 5.34 and later; otherwise write {0,y}).
{x,}            preceding character or group is present at least x times.
{x}             preceding character or group is present x times.

Backreferences

What if we want to know what a certain regular expression matched? It was easy
when we were matching literal strings: we knew that 'Case' was going to match
those four letters and nothing else. But now, what matches? If we have /\w{3}/,
which three word characters are getting matched?

Perl has a series of special variables in which it stores
anything that's matched with a group in parentheses. Each time it sees a set of
parentheses, it copies the matched text inside into a numbered variable - the
first matched group goes in $1, the second group in
$2, and so on. By looking at these variables, which we
call the backreference variables, we can see
what triggered various parts of our match, and we can also extract portions of
the data for later use.
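As a small illustration of extracting data this way, here is a sketch that pulls two pieces out of a string. The string and the variable names $number and $place are invented for the example.

```perl
use strict;
use warnings;

my $text = 'Order 495 shipped to Bree';   # made-up sample

my ($number, $place);
if ($text =~ /(\d+) shipped to (\w+)/) {
    # Copy the groups straight away: the next successful match
    # will overwrite $1 and $2
    ($number, $place) = ($1, $2);
    print "number: $number\n";
    print "place:  $place\n";
}
```

Copying $1 and $2 into named variables immediately is a cautious habit rather than a requirement; it simply keeps the values safe from any later match.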
First, though, let's rewrite our test program so
that we can see what's in those variables:

Try it out: A Second Pattern Tester

    #!/usr/bin/perl
    # matchtest2.plx
    use warnings;
    use strict;

    # The text we will be matching against
    $_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

    print "Enter a regular expression: ";
    my $pattern = <STDIN>;
    chomp($pattern);

    if (/$pattern/) {
        print "The text matches the pattern '$pattern'.\n";
        # Report each numbered group that took part in the match
        print "\$1 is '$1'\n" if defined $1;
        print "\$2 is '$2'\n" if defined $2;
        print "\$3 is '$3'\n" if defined $3;
        print "\$4 is '$4'\n" if defined $4;
        print "\$5 is '$5'\n" if defined $5;
    } else {
        print "'$pattern' was not found.\n";
    }

Note that we use a backslash to escape the first
'dollar' symbol in each print statement, thus displaying
the actual symbol, while leaving the second in each to display the contents of
the appropriate variable. We've got our special
variables in place, and we've got a new sentence to do our matching on. Let's
see what's been happening:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)
    The text matches the pattern '([a-z]+)'.
    $1 is 'silly'

    > perl matchtest2.plx
    Enter a regular expression: (\w+)
    The text matches the pattern '(\w+)'.
    $1 is '1'

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'

    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'

How It Works

By printing out what's in each of the groups,
we can see exactly what caused perl to start and stop matching, and when. If we
look carefully at these results, we'll find that they can tell us a great deal
about how perl handles regular expressions.

How the Engine Works

We've now seen most of the syntax behind regular expression matching and plenty
of examples of it in action. The code that does all the matching is called
perl's 'regular expression engine'. You might now be wondering about the exact
rules applied by this engine when determining whether or not a piece of text
matches, and how much of it matches what. From what our examples have shown us,
let us make some deductions about the engine's operation.

Our first expression, ([a-z]+), plucked out a set of one-or-more lower-case
letters. The first such set that perl came across was 'silly'. The next
character after 'y' was a space, and so no longer matched the expression.

Rule one: Once the engine starts matching, it will keep matching a character at
a time for as long as it can. Once it sees something that doesn't match,
however, it has to stop. In this example, it can never get beyond a character
that is not a lower-case letter. It has to stop as soon as it encounters one.

Next, we looked for a series of word characters, using (\w+). The engine started
looking at the beginning of the string and found one, '1'. The next character
was not a word character (it was a colon), and so the engine had to stop.

Rule two: Unlike me, the engine is eager. It's eager to start work and eager to
finish, and it starts matching as soon as possible in the string; if the first
character doesn't match, it tries to start matching from the second. Then it
takes every opportunity to finish as quickly as possible.
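
The first two rules can be watched in action by asking where each match began. This is a minimal sketch using the test sentence from matchtest2.plx; show_match is a helper name invented for the example, and $-[0] is Perl's built-in record of the offset at which the last successful match started.

```perl
use strict;
use warnings;

# The test sentence from matchtest2.plx
my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# Report what the first group of a pattern matched and where the
# match began. (show_match is a hypothetical helper for this sketch.)
sub show_match {
    my ($pattern) = @_;
    return unless $sentence =~ /$pattern/;
    my ($text, $start) = ($1, $-[0]);   # $-[0]: offset where the match started
    print "/$pattern/ matched '$text' at offset $start\n";
    return ($text, $start);
}

show_match('(\w+)');      # can start at the very first character
show_match('([a-z]+)');   # has to skip ahead to the first lower-case letter
```

The \w+ match begins at offset 0 and stops at the colon, while [a-z]+ cannot begin until it reaches the 's' of 'silly' and stops at the following space.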

Then we tried this: ([a-z]+)(.*)([a-z]+). The result we got with this was a
little strange. Let's look at it again:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'
    >

Our
first group was the same as what matched before - nothing new there. When we
could no longer match lower case letters, we switched to matching anything we
could. Now, this could take up the rest of the string, but that wouldn't allow a
match for the third group. We have to leave at least one lower-case
letter.
So, the engine started to reverse back along the string, giving
characters up one by one. It gave up the closing bracket, the 3, then the
opening bracket, and so on, until we got to the first thing that would satisfy
all the groups and let the match go ahead - namely a lower-case letter: the 'l'
at the end of 'useful'.
From this, we can draw up the third
rule:
Rule three: Like me, in this case, the engine is greedy. If you use the + or *
operators, they will try and steal as much of the string as possible. If the
rest of the expression does not match, it grudgingly gives up a character at a
time and tries to match again, in order to find the fullest possible match.

We can turn a greedy match into a non-greedy match by putting the ? operator
after either the plus or star. For instance, let's turn this example into a
non-greedy version: ([a-z]+)(.*?)([a-z]+). This gives us an entirely different
result:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
    The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
    $1 is 'silly'
    $2 is ' '
    $3 is 'sentence'
    >

Now that we've shut off rule three, rule two
takes over. The smallest possible match for the second group was a single space.
First, it tried to get nothing at all, but then the third group would be faced
with a space. This wouldn't match. So, we grudgingly accept the space and try
and finish again. This time the third group has some lower case letters, and
that can match as well.
What if we turn off greediness in all three
groups, and say this: ([a-z]+?)(.*?)([a-z]+?)

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
    The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
    $1 is 's'
    $2 is ''
    $3 is 'i'
    >

What about this? Well, the
smallest possible match for the first group is the 's' of silly. We asked it to
find one character or more, and so the smallest it could find was one. The
second group actually matched no characters at all. This left the third group
facing an 'i', which it took to complete the match.
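
To compare the greedy and non-greedy versions side by side without retyping them into the tester, something like the following sketch will do. It reuses the test sentence from matchtest2.plx; the %groups hash is just a convenient place to keep the results for this example.

```perl
use strict;
use warnings;

# The test sentence from matchtest2.plx
my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

my @patterns = (
    '([a-z]+)(.*)([a-z]+)',      # greedy middle group
    '([a-z]+)(.*?)([a-z]+)',     # non-greedy middle group
    '([a-z]+?)(.*?)([a-z]+?)',   # every group non-greedy
);

my %groups;   # pattern => [ $1, $2, $3 ]
for my $pattern (@patterns) {
    next unless $sentence =~ /$pattern/;
    $groups{$pattern} = [ $1, $2, $3 ];
    my ($first, $second, $third) = @{ $groups{$pattern} };
    print "$pattern\n";
    print "  \$1 is '$first', \$2 is '$second', \$3 is '$third'\n";
}
```

All three patterns start matching at the same place, the 's' of 'silly'; only the amount each group is allowed to take changes.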
Our last example
included an alternation:

    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'
    >

The engine took the first branch of
the alternation and matched a single character, even though the second branch
would actually satisfy greed. This leads us onto the fourth rule:
Rule four: Again like me, the regular expression engine hates decisions. If
there are two branches, it will always choose the first one that works, even
though the second one might allow it to gain a longer match.

To summarize:

The regular expression engine starts as soon as it can, grabs as much as it can,
then tries to finish as soon as it can, while taking the first decision
available to it.
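
Rule four is easy to confirm by swapping the branches around. The sketch below again uses the test sentence from matchtest2.plx; the variable names are chosen just for this example.

```perl
use strict;
use warnings;

# The test sentence from matchtest2.plx
my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# Same two branches, listed in a different order
my ($short_first) = $sentence =~ /e(\w|n\w+)/;   # single-character branch first
my ($long_first)  = $sentence =~ /e(n\w+|\w)/;   # longer branch first

print "short branch first: '$short_first'\n";
print "long branch first:  '$long_first'\n";
```

With the single-character branch first, the group holds just 'n'; with the longer branch first, it holds 'ntence'. The engine is not looking for the longest alternative, only the first one that lets the match succeed.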