Beginning Perl - More Advanced Topics
We haven't
actually plumbed the depths of the regular expression language syntax - Perl has
a habit of adding wilder and more bizarre features to it on a regular basis. All
of the more off-the-wall extensions begin with a question mark in a group - this
is supposed to make you stop and ask yourself: 'Do I really want to do this?'
Some of these are experimental and may change from one version of perl to the
next (and may even disappear altogether), but others aren't so
tricky, and some are extremely useful, so let's dive
in!
Inline Comments
We've already seen how we can use the /x
modifier to add comments and whitespace to our regular expressions. We can also
do this with the (?#) pattern:
/^Today's (?# This is ignored, by the way)date:/
Unfortunately, there's no way
to have parentheses inside these comments, since perl closes the comment as soon
as it sees a closing bracket. If you want to have longer or more detailed
comments, you should consider using the /x modifier
instead.
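As a rough comparison of the two styles, the sketch below matches the same sample line once with a (?#) comment and once under /x. The sample string and the variable names are made up for illustration; note that under /x the literal space has to be escaped, because unescaped whitespace in the pattern is ignored.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample line, chosen only for this illustration.
my $line = "Today's date: 1 April";

# Inline comment: everything from (?# up to the first ) is ignored.
my $inline = $line =~ /^Today's (?# the label comes next)date:/ ? 1 : 0;

# The same pattern under the x modifier. The literal space is escaped,
# and the comments are free to contain parentheses (like these).
my $extended = $line =~ /
    ^Today's \  # fixed prefix (followed by an escaped space)
    date:       # the label we are looking for
/x ? 1 : 0;

print "inline: $inline, extended: $extended\n";   # inline: 1, extended: 1
```

One thing to watch with /x comments: they must not contain the pattern delimiter (here a forward slash), since perl finds the end of the pattern before it looks at the comments.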
Inline Modifiers
If you are reading patterns from a file or constructing them from
inside your code, you have no way of adding a modifier to the end of the regular
expression operator. For example:
#!/usr/bin/perl
# inline.plx
use warnings;
use strict;
my $string = "There's more than One Way to do it!";
print "Enter a test expression: ";
my $pat = <STDIN>;
chomp($pat);
if ($string =~ /$pat/) {
    print "Congratulations! '$pat' matches the sample string.\n";
} else {
    print "Sorry. No match found for '$pat'";
}
If we run this and momentarily forget how our sample
string has been capitalized, we might get this:
>perl inline.plx
Enter a test expression: one way to do it!
Sorry. No match found for 'one way to do it!'
>
So how can we make this case-insensitive? The solution is to use an inline
modifier, the syntax for which is (?i). This makes the rest of the enclosing
group (or the rest of the whole pattern, if it isn't inside a group) match
case-insensitively. Therefore we have:
>perl inline.plx
Enter a test expression: (?i)one way to do it!
Congratulations! '(?i)one way to do it!' matches the sample string.
>
If, conversely, you have a
modifier in place that you temporarily want to get rid of, you can say, for
example, (?-i) to turn it off. If we have
this:
/There's More Than ((?-i)One Way) To Do It!/i;
the words 'One Way' alone are
matched case-sensitively.
Note that you can also inline the /m, /s, and /x modifiers in the same way.
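To see the scope of these modifiers in action, here is a small sketch run against the same sample string. The variable names are our own; the patterns are only meant to show where (?i) and (?-i) start and stop applying.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $string = "There's more than One Way to do it!";

# A pattern held in a string, as if it had been read from a file or STDIN.
my $pat = '(?i)one way';
my $whole = $string =~ /$pat/ ? 1 : 0;

# Inside a group, (?i) lasts only until that group closes, so " way"
# is still compared case-sensitively and the match fails.
my $scoped = $string =~ /((?i)one) way/ ? 1 : 0;

# (?-i) switches case-insensitivity off for its group only.
my $strict_ok   = $string =~ /MORE THAN ((?-i)One Way) TO DO IT/i ? 1 : 0;
my $strict_fail = $string =~ /more than ((?-i)one way)/i ? 1 : 0;

print "whole: $whole, scoped: $scoped, ",
      "strict_ok: $strict_ok, strict_fail: $strict_fail\n";
# whole: 1, scoped: 0, strict_ok: 1, strict_fail: 0
```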
Grouping without Backreferences
Parentheses perform the function of grouping and populating the
backreference variables. If you have a portion of your match in parentheses, it
will, if successful, be placed in one of the numbered variables. However, there
may be times when you only want to use brackets for grouping. For example,
you're expecting the first backreference to contain something important, but
there may be some preceding text in the way. You could have something like
this:
/(X-)?Topic: (\w+)/;
You can't be certain whether your first defined backreference
is going to end up in $1 or $2 -
it depends on whether the 'X-' part is present or not.
For example, if we tried to match the string "Topic: the weather", we'd find
that $1 was left undefined. If we'd tried to do
something with its contents, we'd get the warning:
Use of uninitialized value in concatenation
Now that's not necessarily a problem here. After
all, we'll find our word in $2 whether or not there's
anything preceding "Topic: ". Surely we can just be
careful not to use $1?
But what if there's more
than one optional field? Say we had an expression that left all but the 2nd and
6th groups optional. We then have to look in $2 for our first word and $6 for
our second, while $1, $3, $4, and $5 are left undefined. This really isn't good
programming style and is asking for trouble! We really shouldn't backreference
fields if we don't need to. We can resolve this problem very simply, by
adding the characters ?: like this:
/(?:X-)?Topic: (\w+)/;
This ensures that the first set
of brackets will now group only and not fill a backreference variable. Our word
will always be put into $1 .
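The difference is easy to check side by side. The following sketch runs both versions of the pattern over two made-up header lines and records what ends up in the numbered variables; the sample data and array names are assumptions for the sake of the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample headers, one with and one without the 'X-' prefix.
my @headers = ("Topic: weather", "X-Topic: perl");

my @capturing;       # what $1 and $2 hold with ordinary parentheses
my @grouping_only;   # what $1 holds when the first group is (?:...)

for my $header (@headers) {
    if ($header =~ /(X-)?Topic: (\w+)/) {
        # $1 is undefined whenever the optional prefix is absent.
        push @capturing, (defined $1 ? $1 : "undef") . "|$2";
    }
    if ($header =~ /(?:X-)?Topic: (\w+)/) {
        # The word is always in $1, whether or not 'X-' was there.
        push @grouping_only, $1;
    }
}

print "capturing:     @capturing\n";       # undef|weather X-|perl
print "grouping only: @grouping_only\n";   # weather perl
```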
Lookaheads and Lookbehinds
Sometimes
you may want to say something along the lines of 'substitute the word "fish"
with "cream", but only if the next word is "cake".' You can do this very simply
by saying:
s/fish cake/cream cake/
What does this do? The regular
expression engine scans the string, looking for a match on "fish cake".
On finding one, it substitutes the text "cream cake". Not too bad - it does the
job. In this case it's not too big a deal that it has to replace the five
characters of " cake" in each match with five identical
characters from the substitution string, but it's not hard to see how this sort of
inefficiency could really start to bog a program down if we used substitutions
excessively.
What we want is a way of putting an assertion into the match
- a 'match the text only if the next word is
"cake"' clause - without actually matching the assertion itself. Having matched
"fish", we really just want to look ahead, to
see if it says " cake" (and give the match a thumbs-up if it does), then forget
about "cake" altogether.
In life, that's not so easy. Fortunately in Perl
we have an operator for just this sort of thing:
/fish(?= cake)/
will match exactly what we want - it looks for "fish", does a
positive lookahead on " cake", and matches "fish"
only if that succeeds. For example:
#!/usr/bin/perl
# look1.plx
use warnings;
use strict;
$_ = "fish cake and fish pie";
print "Our original order was ", $_, "\n";
s/fish(?= cake)/cream/;
print "Actually, make that ", $_, " instead.\n";
will return
>perl look1.plx
Our original order was fish cake and fish pie
Actually, make that cream cake and fish pie instead.
>
We can also look ahead
negatively, by using an exclamation mark instead of the equals sign:
/fish(?! cake)/
which will match "fish" only if the following word is not " cake". If we adapt look1.plx like so:
#!/usr/bin/perl
# look2.plx
use warnings;
use strict;
$_ = "fish cake and fish pie";
print "Our original order was ", $_, "\n";
s/fish(?! cake)/cream/;
print "Actually, make that ", $_, " instead.\n";
then sure enough, it's "fish
pie" that gets matched this time and not "fish cake".
>perl look2.plx
Our original order was fish cake and fish pie
Actually, make that fish cake and cream pie instead.
>
Lookaheads are very powerful
as you'll soon discover if you experiment a little, particularly when you start
to use less specific expressions (using metacharacters) with
them.
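As one example of a lookahead built from metacharacters rather than fixed text, the sketch below puts thousands separators into a number. It is only an illustration with a made-up sentence: a digit gets a comma after it when what follows is one or more complete groups of three digits and then something that isn't a digit. The lookahead checks those digits without consuming them, so the next attempt can still start among them.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample text for the illustration.
my $text = "The file is 1234567 bytes long, in 42 pieces.";

# Copy the string, then edit the copy.
# (\d)          a digit, kept in $1
# (?= ... )     followed by, but not including:
#   (?:\d{3})+    one or more groups of exactly three digits
#   (?!\d)        and then no further digit
(my $with_commas = $text) =~ s/(\d)(?=(?:\d{3})+(?!\d))/$1,/g;

print "$with_commas\n";
# The file is 1,234,567 bytes long, in 42 pieces.
```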
However, we may also wish to look at the text preceding a matched
pattern. We therefore have a similar pair of lookbehind operators. We now use the < sign to point 'behind' the match, matching "cake" only if
"fish" precedes it. So to find all those
boring old fish cakes, we use:
/(?<=fish )cake/
but to find all the cream cakes and chocolate cakes, do
this:
/(?<!fish )cake/
Let's have fish and chips instead of our fish cakes and cream
slices instead of cream cakes:
#!/usr/bin/perl
# look3.plx
use warnings;
use strict;
$_ = "fish cake and cream cake";
print "Our original order was ", $_, "\n";
s/(?<=fish )cake/and chips/;
print "No, wait. I'll have ", $_, " instead\n";
s/(?<!fish )cake/slices/;
print "Actually, make that ", $_, ", will you?\n";
>perl look3.plx
Our original order was fish cake and cream cake
No, wait. I'll have fish and chips and cream cake instead
Actually, make that fish and chips and cream slices, will you?
>
One very important thing to note about lookbehind assertions is
that they can only handle fixed-width expressions. So while you can use most of
the metacharacters, indeterminate quantifiers like +,
?, and * aren't
allowed.
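Fixed-width still leaves room for metacharacters, as long as each one stands for exactly one character. The sketch below, which uses an invented menu string, pulls out the numbers that follow a dollar sign and then the numbers that don't; the character class in the second pattern is one character wide, so it is acceptable inside a lookbehind.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample text; single quotes keep the dollar signs literal.
my $menu = 'tea $2, cake $4, table 12';

# Positive lookbehind: digits that come straight after a dollar sign.
my @prices = $menu =~ /(?<=\$)\d+/g;

# Negative lookbehind: digits preceded by neither a dollar sign nor
# another digit (the latter stops us matching the tail end of a price).
my @others = $menu =~ /(?<![\$\d])\d+/g;

print "prices: @prices\n";   # prices: 2 4
print "others: @others\n";   # others: 12
```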
Backreferences (again)
Finally, in our tour of regular
expressions, let's look again at backreferences. Suppose you want to find any
repeated words in a string. How would you do it? You might think about doing
this:
if (/\b(\w+) $1\b/) { print "Repeated word: $1\n";}
Except, this doesn't work,
because $1 is only set when the match is complete. In
fact, if you have warnings turned on, you'll be alerted to the fact that $1 is undefined every time. In order to match while still
inside the regular expression, you need to use the following
syntax:
if (/\b(\w+) \1\b/) { print "Repeated word: $1\n";}
However, when you're replacing,
you'll get a warning if you try to use the \<number> syntax on the wrong side, that is,
in the replacement part of a substitution. It'll work, but
you'll be told "\1 better written as $1".
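Putting the two forms together, a minimal sketch might use \1 inside the pattern to find doubled words and $1 in the replacement to keep just one copy. The sample sentence and variable names are invented for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample text containing two doubled words.
my $text = "This is is a test of the the backreference syntax";

# Inside the pattern, \1 refers to whatever the first group just matched.
my @repeated;
while ($text =~ /\b(\w+) \1\b/g) {
    push @repeated, $1;    # after the match, $1 holds the same text
}

# In the replacement part we are outside the pattern, so we use $1.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "Repeated words: @repeated\n";   # Repeated words: is the
print "$fixed\n";   # This is a test of the backreference syntax
```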
Summary
Regular expressions are quite possibly the most powerful means at
your disposal of looking for patterns in text, extracting sub-patterns and
replacing portions of text. They're the basis of any text shuffling you do in
Perl, and they should be your first port of call when you need to do some string
manipulation.
In this chapter, we've seen how to match simple text,
different classes of text, and then different amounts of text. We've also seen
how to provide alternative matches, how to refer back to portions of the match,
and how to substitute and transliterate text.
The key to learning and
understanding regular expressions is to be able to break them down into their
component parts and unravel the language, translating it piecewise into English.
Once you can fluently read out the intention of a complex regular expression,
you're well on your way to creating powerful matches of your own.
You can find a summary of regular expression syntax in Appendix A. Section 6 of the
Perl FAQ (at www.perl.com) contains a good
selection of regexp hints and tricks.
Exercises
Write out English descriptions of the following regular
expressions, and describe what the operations actually do:
$var =~ /(\w+)$/
$code !~ /^#/
s/#{2,}/#/g
Using the
contents of the gettysburg.txt file (provided in the download for Chapter 6),
use regular expressions to do the following, and print out the result. (Tip: use
a here-document to store the text in your file):
Count the number of occurrences of the word 'we'.
Reformat the text, so that each sentence is displayed as a separate paragraph.
Check that there are no multiple spaces in the text, replacing any with single
spaces.
When we use groups, the // operator
returns a list of all the text strings that have been matched. Modify our
example program matchtest2.plx, so that it produces its output from this list,
rather than using special variables.
If we want to sort a list of words into
alphabetical order, one simple and quite effective way is to write a program
that performs a 'bubble sort': working through the whole list, it compares each
pair of consecutive words; if it finds them in the wrong order, it swaps them
over. On reaching the end of the list it repeats the process - unless the
previous scan didn't yield any swaps, in which case the list is already properly
ordered. Use regular expressions along with the other techniques you've seen so
far, and write this program so that it will work with a list of words separated
by newline characters. One small hint - the pos() function may come in useful
here. You can use this to adjust the position of the \G boundary, for example:
pos($var) = 10 will set it just after the tenth character in $var. A subsequent
global search will therefore start from this point.
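Without giving the exercise away, here is a small sketch of what that hint means in practice. The sample string is made up; the point is only that a /g match in scalar context starts searching from pos(), and that \G pins the match to exactly that position.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample string: ten digits, a space, then some words.
my $var = "0123456789 apples and pears";

# Start the next global search just after the tenth character.
pos($var) = 10;
my ($word, $end);
if ($var =~ /(\w+)/g) {
    # The search skipped the digits and found the first word after them.
    ($word, $end) = ($1, pos($var));
}
print "Found '$word'; the search position is now $end\n";
# Found 'apples'; the search position is now 17

# With \G the match must begin exactly at pos(). The character at
# offset 10 is a space, so this attempt fails.
pos($var) = 10;
my $anchored = $var =~ /\G\w+/g ? 1 : 0;
print "Anchored match at offset 10: ", ($anchored ? "yes" : "no"), "\n";
# Anchored match at offset 10: no
```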