Beginning Perl - Working with RegExps (Page 5 of 6)
Now that we've
matched a string, what do we do with it? Well, sometimes it's just useful to
know whether a string contains a given pattern or not. However, a lot of the
time we're going to be doing search-and-replace operations on text. We'll
explain how to do that here. We'll also cover some of the more advanced areas of
dealing with regular expressions.
Substitution

Now that we know all about matching text, substitution is very easy. Why? Because all of the clever things are in the 'search' part, rather than the 'replace': all the character classes, quantifiers and so on only make sense when matching. You can't substitute, say, a word with any number of digits. So, all we need to do is take the 'old' text, our match, and tell perl what we want to replace it with. This we do with the s/// operator.
The s is
for 'substitute' - between the first two slashes, we put our regular expression
as before. Before the final slash, we put our text replacement. Just as with
matching, we can use the =~ operator to apply it to a
certain string. If this is not given, it applies to the default variable $_ :
#!/usr/bin/perl
# subst1.plx
use warnings;
use strict;

$_ = "Awake! Awake! Fear, Fire, Foes! Awake! Fire, Foes! Awake!";
# Tolkien, Lord of the Rings

s/Foes/Flee/;
print $_,"\n";
Here we have substituted the first occurrence of 'Foes' with the word 'Flee'. Had we wanted to change every occurrence, we would have needed to use another modifier. Just as the /i modifier makes a match case-insensitive, the /g modifier makes a substitution act globally:
#!/usr/bin/perl
# subst1.plx
use warnings;
use strict;

$_ = "Awake! Awake! Fear, Fire, Foes! Awake! Fire, Foes! Awake!";
# Tolkien, Lord of the Rings

s/Foes/Flee/g;
print $_,"\n";

> perl subst1.plx
Awake! Awake! Fear, Fire, Flee! Awake! Fire, Flee! Awake!
>

Like the left-hand side of the substitution, the right-hand side also works like a double-quoted string and is thus subject to variable interpolation. One useful thing, though, is that we can use the backreference variables we collected during the match on the right-hand side. So, for instance, to swap the first two words in a string, we would say something like this:
#!/usr/bin/perl
# subst2.plx
use warnings;
use strict;

$_ = "there are two major products that come out of Berkeley: LSD and UNIX";
# Jeremy Anderson

s/(\w+)\s+(\w+)/$2 $1/;
print $_, "?\n";

>perl subst2.plx
are there two major products that come out of Berkeley: LSD and UNIX?
>
What would happen if we tried doing that globally?
Well, let's do it and see:
#!/usr/bin/perl
# subst2.plx
use warnings;
use strict;

$_ = "there are two major products that come out of Berkeley: LSD and UNIX";
# Jeremy Anderson

s/(\w+)\s+(\w+)/$2 $1/g;
print $_, "?\n";

>perl subst2.plx
are there major two that products out come Berkeley of: and LSD UNIX?
>
Here, every word in a pair is swapped with its
neighbor. When processing a global match, perl always starts where the previous
match left off.
Changing Delimiters

You may have noticed that // and s/// look like q// and qq//. Well, just like q// and qq//, we can change the delimiters when matching and substituting to increase the readability of our regular expressions. The same rules apply: any non-word, non-whitespace character can be the delimiter, and paired delimiters such as <>, (), {}, and [] may be used - with two provisos.
First, if you
change the delimiters on // , you must put an m in front of it. (m for 'match'). This is so that perl can
still recognize it as a regular expression, rather than a block or comment or
anything else. Second, if you use paired delimiters with substitution, you
must use two pairs:
s/old text/new text/g;
becomes:
s{old text}{new text}g;
You may, however, leave spaces
or new lines between the pairs for the sake of clarity:
s{old text} {new text}g;
The prime example of when you
would want to do this is when you are dealing with file paths, which contain a
lot of slashes. If you are, for instance, moving files on your Unix system from
/usr/local/share/ to /usr/share/
, you may want to munge the file names like this:
s/\/usr\/local\/share\//\/usr\/share\//g;
However, it's far easier and
far less ugly to change the delimiters in this case:
s#/usr/local/share/#/usr/share/#g;
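As a quick check that these forms really do the same job, here is a minimal sketch. The path in $path is made up for illustration; the three substitutions differ only in their delimiters.

```perl
#!/usr/bin/perl
# delim.plx - illustrative sketch; the path below is invented
use warnings;
use strict;

my $path  = "/usr/local/share/doc/README";
my $slash = $path;
my $hash  = $path;
my $brace = $path;

# The same substitution written with three different delimiters
$slash =~ s/\/usr\/local\/share\//\/usr\/share\//g;
$hash  =~ s#/usr/local/share/#/usr/share/#g;
$brace =~ s{/usr/local/share/}
           {/usr/share/}g;

print "$slash\n$hash\n$brace\n";

# Matching with a non-default delimiter needs the leading m
print "Moved to /usr/share\n" if $hash =~ m#^/usr/share/#;
```

All three variables should end up holding the same rewritten path; only the readability of the source differs.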
Modifiers

We've already seen the /i modifier used to indicate that a match should be case insensitive. We've also seen the /g modifier used to apply a substitution globally. What other modifiers are there?
/m - treat the string as multiple lines. Normally, ^ and $ match the very start and very end of the string. If the /m modifier is in play, then they will match the starts and ends of individual lines (separated by \n). For example, given the string "one\ntwo", the pattern /^two$/ will not match, but /^two$/m will.

/s - treat the string as a single line. Normally, . does not match a new line character; when /s is given, then it will.

/g - as well as globally replacing in a substitution, allows us to match multiple times. When using this modifier, placing the \G anchor at the beginning of the regexp will anchor it to the end point of the last match.

/x - allow the use of whitespace and comments inside a match.

Regular expressions can get quite fiendish to read at times. The /x modifier is one way to stop them becoming so. For instance, if you're matching a string in a log file that contains a time, followed by a computer name in square brackets, then a message, the expression you'll create to extract the information may easily end up looking like this:
# Time in $1, machine name in $2, text in $3
/^([0-2]\d:[0-5]\d:[0-5]\d)\s+\[([^\]]+)\]\s+(.*)$/
However, if you use the /x
modifier, you can stretch it out as follows:
/^
  (                  # First group: time
    [0-2]\d : [0-5]\d : [0-5]\d
  )
  \s+
  \[                 # Square bracket
  (                  # Second group: machine name
    [^\]]+           # Anything that isn't a square bracket
  )
  \]                 # End square bracket
  \s+
  (                  # Third group: everything else
    .*
  )
$/x
Another way to tidy this up is
to put each of the groups into variables and interpolate them:
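A minimal sketch of that approach might look like the following. The variable names ($time_re, $host_re, $text_re) and the sample log line are assumptions made for illustration, and qr// is used here to build each piece; plain single-quoted strings would also work.

```perl
#!/usr/bin/perl
# logmatch.plx - illustrative sketch; names and log line are invented
use warnings;
use strict;

# Each group lives in its own variable
my $time_re = qr/([0-2]\d:[0-5]\d:[0-5]\d)/;
my $host_re = qr/\[([^\]]+)\]/;
my $text_re = qr/(.*)/;

my $line = "12:34:56 [gandalf] Connection established";

my ($time, $host, $text);
# The pieces are interpolated into one pattern; groups still number $1..$3
if ($line =~ /^$time_re\s+$host_re\s+$text_re$/) {
    ($time, $host, $text) = ($1, $2, $3);
    print "Time    : $time\n";
    print "Machine : $host\n";
    print "Text    : $text\n";
}
```

The final pattern reads almost like the description of the log format, and each piece can be tested or reused on its own.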
Split

We briefly saw split earlier on in the chapter, where we used it to break up a string into a list of words. In fact, we only saw it in a very simple form. Strictly speaking, it was a bit of a cheat to use it at all. We didn't see it then, but split was actually using a regular expression to do its stuff!

Using split on its own is very nearly equivalent to saying:

split /\s+/, $_;

which breaks the default string $_ into a list of substrings, using whitespace as a delimiter. (The one difference is that a bare split also skips any leading whitespace, whereas split /\s+/ would return an empty first field for a string that begins with whitespace.) However, we can also specify our own regular expression: perl goes through the string, breaking it whenever the regexp matches. The delimiter itself is thrown away.
For instance, on the
UNIX operating system, configuration files are sometimes a list of fields
separated by colons. A sample line from the password file looks like
this:
kake:x:10018:10020::/home/kake:/bin/bash
To get at each field, we can
split when we see a colon:
#!/usr/bin/perl
# split.plx
use warnings;
use strict;

my $passwd = "kake:x:10018:10020::/home/kake:/bin/bash";
my @fields = split /:/, $passwd;
print "Login name : $fields[0]\n";
print "User ID : $fields[2]\n";
print "Home directory : $fields[5]\n";

>perl split.plx
Login name : kake
User ID : 10018
Home directory : /home/kake
>
Note that the fifth field has been left empty.
Perl will recognize this as an empty field, and the numbering used for the
following entries takes account of this. So $fields[5]
returns /home/kake , as we'd otherwise expect. Be
careful though - if the line you are splitting contains empty fields at the end,
they will get dropped.
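To see trailing fields vanish, and one way to keep them, consider this small sketch. The sample record is invented; passing a negative limit as the third argument to split is the standard way to preserve trailing empty fields.

```perl
#!/usr/bin/perl
# split2.plx - illustrative sketch; the record is invented
use warnings;
use strict;

my $record = "kake:x:10018:::";

my @dropped = split /:/, $record;       # trailing empty fields are removed
my @kept    = split /:/, $record, -1;   # a negative limit keeps them

print "Without limit: ", scalar(@dropped), " fields\n";
print "With limit -1: ", scalar(@kept), " fields\n";
```

Here the plain split yields three fields, while the version with the negative limit yields six, the last three of them empty strings.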
Join

To do the exact opposite, we can use the join operator. This takes a specified delimiter and interposes it between the elements of a specified array. For example:
#!/usr/bin/perl
# join.plx
use warnings;
use strict;

my $passwd = "kake:x:10018:10020::/home/kake:/bin/bash";
my @fields = split /:/, $passwd;
print "Login name : $fields[0]\n";
print "User ID : $fields[2]\n";
print "Home directory : $fields[5]\n";

my $passwd2 = join "#", @fields;
print "Original password : $passwd\n";
print "New password : $passwd2\n";

>perl join.plx
Login name : kake
User ID : 10018
Home directory : /home/kake
Original password : kake:x:10018:10020::/home/kake:/bin/bash
New password : kake#x#10018#10020##/home/kake#/bin/bash
>
Transliteration

While we're looking at regular expressions, we should briefly consider another operator. While it's not directly associated with regexps, the transliteration operator has a lot in common with them and adds a very useful facility to the matching and substitution techniques we've already seen.
What this does is to
correlate the characters in its two arguments, one by one, and use these
pairings to substitute individual characters in the referenced string. It uses
the syntax tr/one/two/ and (as with the matching and
substitution operators) references the special variable $_ unless otherwise specified with =~
or !~ . In this case, it replaces all the 'o's in the
referenced string with 't's, all the 'n's with 'w's, and all the 'e's with
'o's.
Let's say you wanted to replace, for some reason, all the numbers
in a string with letters. You might say something like this:
$string =~ tr/0123456789/abcdefghij/;
This would turn, say, "2011064" into "cabbage". You can use ranges in transliteration, but not character classes such as \d. We could write the above as:
$string =~ tr/0-9/a-j/;
The return value of this
operator is, by default, the number of characters matched with those in the
first argument. You can therefore use the transliteration operator to count the
number of occurrences of certain characters. For example, to count the number of
vowels in a string, you can use:
my $vowels = $string =~ tr/aeiou//;
Note that this will not
actually substitute any of the vowels in the variable $string. As the second argument is blank, there is no
correlation, so no substitution occurs. However, the transliteration operator
can take the /d modifier, which will delete occurrences
on the left that do not have a correlating character on the right. So, to get
rid of all spaces in a string quickly, you could use this line:
$string =~ tr/ //d;
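The three uses of tr described above (mapping, counting and deleting) can be tried together in a short sketch. The sample sentence is invented, and each result is copied into a fresh variable so the originals stay untouched.

```perl
#!/usr/bin/perl
# tr.plx - illustrative sketch; the sample strings are invented
use warnings;
use strict;

my $digits = "2011064";
(my $letters = $digits) =~ tr/0-9/a-j/;   # map each digit to a letter

my $string = "the quick brown fox";
my $vowels = ($string =~ tr/aeiou//);     # count only; $string is unchanged

(my $squashed = $string) =~ tr/ //d;      # delete every space

print "$digits -> $letters\n";
print "Vowels in '$string': $vowels\n";
print "Without spaces: $squashed\n";
```

The (my $copy = $original) =~ tr/.../.../ form assigns first and then transliterates the copy, which is a common way to avoid modifying the source string.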
Common Blunders

There are a few common mistakes people tend to make when writing regexps. We've already seen that /a*b*c*/ will happily match any string at all, since it can match each letter zero times. What else can go wrong?

Forgetting To Group
/Bam{2}/ will match 'Bamm', while /(Bam){2}/ will match 'BamBam', so be careful when choosing which one to use. The same goes for alternation: /Simple|on/ will match 'Simple' and 'on', while /Sim(ple|on)/ will match both 'Simple' and 'Simon'. Group each option separately.

Getting The Anchors Wrong
^ goes at the beginning, $ goes at the end. A dollar anywhere else in the string makes perl try and interpolate a variable.

Forgetting To Escape Special Characters
Do you want them to have a special meaning? These are the characters to be careful of: . * ? + [ ] ( ) { } ^ $ | and of course \ itself.

Not Counting from Zero
The first entry in an array is given the index zero.

Counting from Zero
I know, I know! All along I've been telling you that computers start counting from zero. Nevertheless, there's always the odd exception - the first backreference is $1. Don't blame Perl though - it took this behavior from a language called awk, which used $1 as the first reference variable.
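The grouping blunder is easy to check for yourself. In the sketch below the test words are invented, and the ^ and $ anchors are an addition (not part of the patterns discussed above) so that each pattern has to match the whole word.

```perl
#!/usr/bin/perl
# group.plx - illustrative sketch; the test words are invented
use warnings;
use strict;

my @results;
for my $word ("Bamm", "BamBam", "Simon") {
    my $quant   = $word =~ /^Bam{2}$/      ? "yes" : "no";   # only the m is repeated
    my $grouped = $word =~ /^(Bam){2}$/    ? "yes" : "no";   # the whole of Bam is repeated
    my $alt     = $word =~ /^Sim(ple|on)$/ ? "yes" : "no";   # Simple or Simon
    push @results, "$word: $quant $grouped $alt";
}
print "$_\n" for @results;
```

Each word matches exactly one of the three patterns, which shows how much the parentheses change what the quantifier and the alternation apply to.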