Beginning Perl - Escaping Special Characters
Of course,
regular expressions can be more than just words and spaces. The rest of this
chapter is going to be about the various ways we can specify more advanced
matches - where portions of the match are allowed to be one of a number of
characters, or where the match must occur at a certain position in the string.
To do this, we'll be describing the special meanings given to certain characters
- called metacharacters - and look at what
these meanings are and what sort of things we can express with them. At this
stage, we might not want to use their special meanings - we may want to
literally match the characters themselves. As you've already seen with
double-quoted strings, we can use a backslash to escape these characters'
special meanings. Hence, if you want to match '... ' in
the above text, you need your pattern to say '\.\.\. '.
For example:
> perl matchtest.plx
Enter some text to find: Ent+
The text matches the pattern 'Ent+'.
> perl matchtest.plx
Enter some text to find: Ent\+
'Ent\+' was not found.
We'll see later why the first one matched - due to the
special meaning of + .
These are the characters that are given special
meaning within a regular expression, which you will need to backslash if you
want to use them literally:
. * ? + [ ] ( ) { } ^ $ | \
Any other characters automatically assume their literal meanings.
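As a minimal sketch of the difference a backslash makes, the snippet below runs both patterns from the session above against the chapter's sample sentence. The variable names ($text, $unescaped, $escaped) are illustrative assumptions and are not taken from matchtest.plx.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# The sample sentence used with matchtest.plx in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Unescaped: + keeps its special meaning, and the pattern matches.
my $unescaped = ($text =~ /Ent+/) ? 1 : 0;

# Escaped: \+ stands for a literal plus sign, which the sentence lacks.
my $escaped = ($text =~ /Ent\+/) ? 1 : 0;

print "Ent+  : ", ($unescaped ? "matches" : "not found"), "\n";
print "Ent\\+ : ", ($escaped   ? "matches" : "not found"), "\n";
```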
You can also
turn off the special meanings using the escape sequence \Q. After perl sees \Q, the 14
special characters above (and, in fact, any other non-word character) are
treated as ordinary, literal characters. This remains the case until perl sees
either \E or the end of the pattern.
For instance, if we
wanted to adapt our matchtest program just to look for
literal strings, instead of regular expressions, we could change it to look like
this:
if (/\Q$pattern\E/) {
Now the meaning of + is turned off:
> perl matchtest.plx
Enter some text to find: Ent+
'Ent+' was not found.
>
Note that all \Q does is
turn off the regular expression magic of those 14 characters above - it doesn't
stop, for example, variable interpolation.
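The following sketch illustrates that point: $pattern is still interpolated inside \Q...\E, and only the characters it contains are quoted. The built-in quotemeta() function is shown alongside as the function form of \Q; the variable names are assumptions made for this example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text    = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
my $pattern = 'Ent+';    # stands in for what the user typed

# Interpolated as a regular expression: + is special.
my $as_regex = ($text =~ /$pattern/) ? 1 : 0;

# Interpolated as a literal string: \Q quotes the + after interpolation.
my $as_literal = ($text =~ /\Q$pattern\E/) ? 1 : 0;

# quotemeta() returns the escaped string that \Q would produce.
my $quoted = quotemeta($pattern);

print "as regex:   $as_regex\n";
print "as literal: $as_literal\n";
print "quotemeta:  $quoted\n";
```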
Don't forget to change this back again: We'll be using matchtest.plx throughout the chapter, to demonstrate the
regular expressions we look at. We'll need that magic fully
functional!
Anchors
So far, our patterns have all tried to find a match anywhere in the
string. The first way we'll extend our regular expressions is by dictating to
perl where the match must occur. We can say 'these characters must match the
beginning of the string' or 'this text must be at the end of the string'. We do
this by anchoring the match to either end. The two anchors we have are ^, which
appears at the beginning of the pattern and anchors a match to the beginning of
the string, and $, which appears at the end of the pattern and anchors it
to the end of the string. So, to see if our quotation ends in a full stop - and
remember that the full stop is a special character - we say something like
this:
> perl matchtest.plx
Enter some text to find: \.$
The text matches the pattern '\.$'.
That's a full
stop (which we've escaped to prevent it being treated as a special character)
and a dollar sign at the end of our pattern - to show that this must be the end
of the string.
Try, if you can, to get into the habit of reading out
regular expressions in English. Break them into pieces and say what each piece
does. Also remember to say that each piece must immediately follow the other in
the string in order to match. For instance, the above could be read 'match a
full stop immediately followed by the end of the string'.
If you can get
into this habit, you'll find that reading and understanding regular expressions
becomes a lot easier, and you'll be able to 'translate' back into Perl more
naturally as well.
Here's another example: do we have a capital I at the
beginning of the string?
> perl matchtest.plx
Enter some text to find: ^I
'^I' was not found.
>
We use ^ to mean 'beginning
of the string', followed by an I. In our case, though, the character at the
beginning of the string is a ", so our pattern does not
match. If you know that what you're looking for can only occur at the beginning
or the end of the string, it's very efficient to use anchors. Instead of
searching through the whole string to see whether the match succeeded, perl only
needs to look at a small portion and can give up immediately if even the first
character does not match.
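A short sketch of both anchors against the sample sentence may help here. The variable names are illustrative assumptions.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# ^ anchors to the start: the sentence begins with a quotation mark, not an I.
my $starts_with_I     = ($text =~ /^I/) ? 1 : 0;
my $starts_with_quote = ($text =~ /^"/) ? 1 : 0;

# $ anchors to the end: the escaped full stop must be the last character.
my $ends_with_stop = ($text =~ /\.$/) ? 1 : 0;

print "starts with I:     $starts_with_I\n";
print "starts with quote: $starts_with_quote\n";
print "ends with stop:    $ends_with_stop\n";
```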
Let's see one more example of this, where we'll
combine looking for matches with looking through the lines in a
file:
Try it out: Rhyming Dictionary
Imagine yourself as a poor poet. In
fact, not just poor, but downright bad - so bad, you can't even think of a rhyme
for 'pink'. So, what do you do? You do what every sensible poet does in this
situation, and you write the following Perl program:
#!/usr/bin/perl
# rhyming.plx
use warnings;
use strict;

my $syllable = "ink";
while (<>) {
    print if /$syllable$/;
}
We can now feed it a file of words, and find those that
end in 'ink':
For a really thorough result, you'll need to use a file
containing every word in the dictionary - be prepared to wait though if you do!
For the sake of the example however, any text-based file will do (though it'll
help if it's in English). A bobolink, in case you're wondering, is a migratory
American songbird, otherwise known as a ricebird or
reedbird.
How It Works
With the loops and tests we learned in the last chapter, this
program is really very easy:
while (<>) {
    print if /$syllable$/;
}
We've not looked at file access
yet, so you may not be familiar with the while(<>){...}
construction used here. In this example it opens a file that's been
specified on the command line, and loops through it, one line at a time, feeding
each one into the special variable $_ - this is what
we'll be matching.
Once each line of the file has been fed into $_, we test to see if it matches
the pattern, which is our syllable, 'ink', anchored to the end of the line
(with $). If so, we print it out.
The important thing to note here is that perl
treats the 'ink' as the last thing on the line, even though there is a newline
at the end of $_. The $ anchor matches either at the very end of the string or
just before a newline that ends it - we'll look at this behavior in more detail
later.
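To make the newline behavior concrete, here is a small sketch that applies the same test to an in-memory list instead of a file. The word list is invented for illustration, and the \z assertion (which matches only at the very end of the string) is included purely as a contrast with $.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $syllable = "ink";

# Hypothetical input lines; all but the last end in a newline, as lines
# read from a file normally would.
my @lines = ("pink\n", "inkwell\n", "bobolink\n", "drink");

# $ matches at the end of the string or just before a final newline.
my @rhymes = grep { /$syllable$/ } @lines;
chomp @rhymes;

# \z insists on the absolute end, so the trailing newline gets in the way.
my $strict = ("pink\n" =~ /ink\z/) ? 1 : 0;

print "$_\n" for @rhymes;
print "strict match on 'pink\\n': $strict\n";
```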
Shortcuts and Options
This is all very well if we know exactly what it is we're
trying to find, but finding patterns means more than just locating exact pieces
of text. We may want to find a three-digit number, the first word on the line,
four or more letters all in capitals, and so on.
We can begin to do this
using character classes - these aren't just
single characters, but something that signifies that any one of a set of
characters is acceptable. To specify this, we put the characters we consider
acceptable inside square brackets. Let's go back to our matchtest program, using the same test string:
$_ = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
> perl matchtest.plx
Enter some text to find: w[aoi]nder
The text matches the pattern 'w[aoi]nder'.
>
What have we done? We've tested whether the string contains a 'w', followed by either
an 'a', an 'o', or an 'i', followed by 'nder'; in effect, we're looking for
any of 'wander', 'wonder', or 'winder'. Since the string contains 'wonder',
the pattern is matched.
Conversely, we can say that everything is
acceptable except a given set of characters - we can 'negate the character
class'. To do this, the character class should start with a ^, like so:
> perl matchtest.plx
Enter some text to find: th[^eo]
'th[^eo]' was not found.
>
So, we're looking for 'th' followed by something
that is neither an 'e' nor an 'o'. But all we have is 'the' and 'thought', so
this pattern does not match.
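One detail worth seeing in code is that a negated class must still match a character: 'th' at the very end of a string does not satisfy th[^eo]. The sketch below uses the sample sentence plus two extra words chosen for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Only 'the' and 'thought' contain 'th', and both continue with e or o.
my $in_text = ($text =~ /th[^eo]/) ? 1 : 0;

# 'this' qualifies: the 'i' after 'th' is neither e nor o.
my $in_this = ("this" =~ /th[^eo]/) ? 1 : 0;

# 'with' does not: nothing follows 'th', and [^eo] needs a character.
my $in_with = ("with" =~ /th[^eo]/) ? 1 : 0;

print "sentence: $in_text\n";
print "this:     $in_this\n";
print "with:     $in_with\n";
```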
If the characters you wish to match form a
sequence in the character set you're using - ASCII or Unicode, depending on your
perl version - you can use a hyphen to specify a range of characters, rather
than spelling out the entire range. For instance, the numerals can be
represented by the character class [0-9]. A lower case
letter can be matched with [a-z]. Are there any numbers
in our quote?
> perl matchtest.plx
Enter some text to find: [0-9]
'[0-9]' was not found.
>
You can
use one or more of these ranges alongside other characters in a character class,
so long as they stay inside the brackets. If you wanted to match a digit and
then a letter from 'A' to 'F', you would say [0-9][A-F].
However, to match a single hexadecimal digit, you would write [0-9A-F], or
[0-9A-Fa-f] if you wished to include lower-case letters.
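The difference between two classes in a row and one combined class can be sketched as follows. The test strings are made up for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Two classes in sequence: a digit, then a letter from A to F.
my $digit_then_letter = ("3F" =~ /[0-9][A-F]/) ? 1 : 0;
my $letter_then_digit = ("F3" =~ /[0-9][A-F]/) ? 1 : 0;

# One class with two ranges: a single upper-case hexadecimal digit.
my $upper_only = ("c" =~ /[0-9A-F]/) ? 1 : 0;

# Adding a-f accepts lower-case hexadecimal digits too, but not 'g'.
my $either_case = ("c" =~ /[0-9A-Fa-f]/) ? 1 : 0;
my $not_hex     = ("g" =~ /[0-9A-Fa-f]/) ? 1 : 0;

print "3F vs [0-9][A-F]:  $digit_then_letter\n";
print "F3 vs [0-9][A-F]:  $letter_then_digit\n";
print "c  vs [0-9A-F]:    $upper_only\n";
print "c  vs [0-9A-Fa-f]: $either_case\n";
print "g  vs [0-9A-Fa-f]: $not_hex\n";
```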
Some character classes are going to come
up again and again: the digits, the letters, and the various types of
whitespace. Perl provides us with some neat shortcuts for these. Here are the
most common ones, and what they represent:
Shortcut   Expansion        Description
\d         [0-9]            Digits 0 to 9.
\w         [0-9A-Za-z_]     A 'word' character allowable in a Perl variable name.
\s         [ \t\n\r]        A whitespace character; that is, a space, a tab, a newline or a return.

Also, the negative forms of the above:

Shortcut   Expansion        Description
\D         [^0-9]           Any non-digit.
\W         [^0-9A-Za-z_]    A non-'word' character.
\S         [^ \t\n\r]       A non-blank character.
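A brief sketch of the shortcuts in use follows. The sample strings are invented for the example, and each result is stored as 1 or 0 so it can be printed.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $sample = "Room 101\tis free";    # hypothetical text containing a tab

# Three digits in a row.
my $three_digits = ($sample =~ /\d\d\d/) ? 1 : 0;

# \s matches the tab between '101' and 'is'.
my $tab_is_space = ($sample =~ /101\sis/) ? 1 : 0;

# The underscore counts as a word character; the hyphen does not.
my $underscore_nonword = ("a_b" =~ /\W/) ? 1 : 0;
my $hyphen_nonword     = ("a-b" =~ /\W/) ? 1 : 0;

# A string of digits contains no non-digit.
my $has_nondigit = ("12345" =~ /\D/) ? 1 : 0;

print "three digits:       $three_digits\n";
print "tab is whitespace:  $tab_is_space\n";
print "underscore is \\W:   $underscore_nonword\n";
print "hyphen is \\W:       $hyphen_nonword\n";
print "12345 has \\D:       $has_nondigit\n";
```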
So, if we wanted to see if there was a five-letter word in the sentence, you might
think we could do this:
> perl matchtest.plx
Enter some text to find: \w\w\w\w\w
The text matches the pattern '\w\w\w\w\w'.
>
But that's not right - there are no
five-letter words in the sentence! The problem is, we've only asked for five
letters in a row, and any word with at least five letters contains five in a row
and so will match that pattern. We actually matched 'wonde', which was the first
possible series of five letters in a row. To actually get a five-letter word, we
might consider deciding that the word must appear in the middle of the sentence,
that is, between two spaces:
> perl matchtest.plx
Enter some text to find: \s\w\w\w\w\w\s
'\s\w\w\w\w\w\s' was not found.
>
Word Boundaries
The problem with that is, when we're
looking at text, words aren't always between two spaces. They can be followed by
or preceded by punctuation, or appear at the beginning or end of a string, or
otherwise next to non-word characters. To help us properly search for words in
these cases, Perl provides the special \b metacharacter.
The interesting thing about \b is that it doesn't
actually match any character in particular. Rather, it matches the point between
something that isn't a word character (either \W or one
of the ends of the string) and something that is (a word character), hence \b
for boundary. So, for example, to look for one-letter words:
> perl matchtest.plx
Enter some text to find: \s\w\s
'\s\w\s' was not found.
> perl matchtest.plx
Enter some text to find: \b\w\b
The text matches the pattern '\b\w\b'.
As the I was
preceded by a quotation mark, a space wouldn't match it - but a word boundary
does the job. Later, we'll learn how to tell perl how many repetitions of a
character or group of characters we want to match without spelling it out
directly.
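Returning to the five-letter-word question, the sketch below compares the three approaches: five word characters in a row, the same between whitespace, and the same between word boundaries. The second sentence is a made-up example containing a quoted five-letter word.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text  = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
my $other = q(He said 'maybe' twice.);    # hypothetical second sentence

# Five word characters in a row: matches inside longer words too.
my $five_in_a_row = ($text =~ /\w\w\w\w\w/) ? 1 : 0;

# Between word boundaries: the sample sentence has no five-letter word.
my $five_letter_word = ($text =~ /\b\w\w\w\w\w\b/) ? 1 : 0;

# 'maybe' sits between quotation marks, so whitespace fails on both sides...
my $between_spaces = ($other =~ /\s\w\w\w\w\w\s/) ? 1 : 0;

# ...but word boundaries find it.
my $between_bounds = ($other =~ /\b\w\w\w\w\w\b/) ? 1 : 0;

print "five in a row (sample):     $five_in_a_row\n";
print "five-letter word (sample):  $five_letter_word\n";
print "between spaces (other):     $between_spaces\n";
print "between boundaries (other): $between_bounds\n";
```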
What, then, if we wanted to match anything at all? You might
consider something like [\w\W] or [\s\S], for instance. Actually, this is quite a common
operation, so Perl provides an easy way of specifying it - a full stop, which by
default matches any single character except a newline. What
about an 'r' followed by two characters - any two characters - and then an
'h'?
> perl matchtest.plx
Enter some text to find: r..h
The text matches the pattern 'r..h'.
>
Is there anything after the full stop?
> perl matchtest.plx
Enter some text to find: \..
'\..' was not found.
>
What's that? One backslashed full stop to mean a full
stop, then a plain one to mean 'anything at all'.
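The following sketch repeats both tests in code and adds one caveat: without any modifiers, the full stop does not match a newline, whereas a class such as [\s\S] does. The two-line string is an assumption made for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# 'r', any two characters, then 'h': found in 'wonder what'.
my $r_any_any_h = ($text =~ /r..h/) ? 1 : 0;

# A literal full stop followed by any character: the stop is the last one.
my $after_stop = ($text =~ /\../) ? 1 : 0;

# By default . will not match the newline between 'a' and 'b'.
my $dot_newline   = ("a\nb" =~ /a.b/)      ? 1 : 0;
my $class_newline = ("a\nb" =~ /a[\s\S]b/) ? 1 : 0;

print "r..h:            $r_any_any_h\n";
print "\\..:             $after_stop\n";
print "a.b on a\\nb:      $dot_newline\n";
print "a[\\s\\S]b on a\\nb: $class_newline\n";
```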
POSIX and Unicode Classes
Perl 5.6.0
introduced a few more character classes into the mix - first, those defined by
the POSIX (Portable Operating System Interface) standard, which are therefore
present in a number of other applications. The more common character classes
here are:

Shortcut      Expansion                                  Description
[[:alpha:]]   [a-zA-Z]                                   An alphabetic character.
[[:alnum:]]   [0-9A-Za-z]                                An alphabetic or numeric character.
[[:digit:]]   \d                                         A digit, 0-9.
[[:lower:]]   [a-z]                                      A lower case letter.
[[:upper:]]   [A-Z]                                      An upper case letter.
[[:punct:]]   [!"#$%&'()*+,-./:;<=>?@\[\\\]^_`{|}~]      A punctuation character - note the escaped characters [, \, and ].
The Unicode standard also defines 'properties', which apply to
some characters. For instance, the 'IsUpper' property
can be used to match any upper-case character, in whichever language or
alphabet. If you know the property you are trying to match, you can use the
syntax \p{} to match it; for instance, the upper-case
character is \p{IsUpper}.
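As a rough sketch of how these classes behave, the snippet below tests a Greek capital delta (written as the escape \x{0394} so the script itself stays plain ASCII) against an ASCII range and against the Unicode property, and tries two of the POSIX classes on made-up strings.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $delta = "\x{0394}";    # GREEK CAPITAL LETTER DELTA

# The ASCII range knows nothing about Greek letters.
my $ascii_range = ($delta =~ /[A-Z]/) ? 1 : 0;

# The Unicode property matches upper-case letters in any alphabet.
my $unicode_upper = ($delta =~ /\p{IsUpper}/) ? 1 : 0;

# POSIX classes are written inside an ordinary character class.
my $has_punct = ("wait!" =~ /[[:punct:]]/) ? 1 : 0;
my $has_alpha = ("1234"  =~ /[[:alpha:]]/) ? 1 : 0;

print "[A-Z] matches delta:        $ascii_range\n";
print "\\p{IsUpper} matches delta:  $unicode_upper\n";
print "'wait!' has punctuation:    $has_punct\n";
print "'1234' has a letter:        $has_alpha\n";
```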
Alternatives
Instead of giving a
series of acceptable characters, you may want to say 'match either this or
that'. The 'either-or' operator in a regular expression uses the same symbol as
the bitwise 'or' operator, |. So, to match either 'yes' or
'maybe' in our example, we could say this:
> perl matchtest.plx
Enter some text to find: yes|maybe
The text matches the pattern 'yes|maybe'.
>
That's either 'yes' or 'maybe'. But
what if we wanted either 'yes' or 'yet'? To get alternatives on part of an
expression, we need to group the options. In a regular expression, grouping is
always done with parentheses:
> perl matchtest.plx
Enter some text to find: ye(s|t)
The text matches the pattern 'ye(s|t)'.
>
If we had forgotten the parentheses,
we would have tried to match either 'yes' or 't'. In this case, we'd still get a
positive match, but it wouldn't be doing what we want - we'd get a match for any
string with a 't' in it, whether the words 'yes' or 'yet' were there or
not.
You can match either 'this' or 'that' or 'the other' by adding more
alternatives:
> perl matchtest.plx
Enter some text to find: (this)|(that)|(the other)
'(this)|(that)|(the other)' was not found.
>
However, in this case, it's more efficient to
separate out the common elements:
> perl matchtest.plx
Enter some text to find: th(is|at|e other)
'th(is|at|e other)' was not found.
You can also nest alternatives. Say you
want to match one of these patterns:
'the' followed by whitespace or a letter
'or'
You might put something like this:
> perl matchtest.plx
Enter some text to find: (the(\s|[a-z]))|or
The text matches the pattern '(the(\s|[a-z]))|or'.
>
It looks fearsome, but break it down into its components. Our two alternatives
are:
the(\s|[a-z])
or
The second part is easy, while the first contains 'the'
followed by two alternatives: \s and [a-z]. Hence 'either "the" followed by
either a whitespace or a lower case letter, or "or"'. We can, in fact, tidy this
up a little, by replacing (\s|[a-z]) with the less cluttered [\sa-z].
> perl matchtest.plx
Enter some text to find: (the[\sa-z])|or
The text matches the pattern '(the[\sa-z])|or'.
>
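The effect of grouping on alternation can be sketched as follows. The string 'not at all' is a made-up example that contains a 't' but neither 'yes' nor 'yet'.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
my $none = "not at all";    # hypothetical string with a 't' but no 'yes'/'yet'

# Grouped: 'ye' followed by 's' or 't'.
my $grouped_sample = ($text =~ /ye(s|t)/) ? 1 : 0;
my $grouped_none   = ($none =~ /ye(s|t)/) ? 1 : 0;

# Ungrouped: 'yes' or a bare 't', so any string containing a 't' matches.
my $ungrouped_none = ($none =~ /yes|t/) ? 1 : 0;

# Factored-out prefix and the tidied nested alternative from the text.
my $factored = ($text =~ /th(is|at|e other)/) ? 1 : 0;
my $tidied   = ($text =~ /(the[\sa-z])|or/)   ? 1 : 0;

print "ye(s|t) on sample:      $grouped_sample\n";
print "ye(s|t) on '$none': $grouped_none\n";
print "yes|t on '$none':   $ungrouped_none\n";
print "th(is|at|e other):      $factored\n";
print "(the[\\sa-z])|or:        $tidied\n";
```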