
Repetition - Perl

This excerpt is from Wrox's book Beginning Perl. It covers the entirety of Chapter 5 - Regular Expressions.

By: Dev Shed
July 14, 2000

We've now moved from matching a specific character to a more general type of character - when we don't know (or don't care) exactly what the character will be. Now we're going to see what happens when we want to talk about a more general quantity of characters: more than three digits in a row, two to four capital letters, and so on. The metacharacters that we use to deal with a number of characters in a row are called quantifiers.

Indefinite Repetition
The easiest of these is the question mark. It should suggest uncertainty - something may be there, or it may not. That's exactly what it does: it states that the immediately preceding character, metacharacter, or group may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either of the words 'he' or 'she', you can put:

> perl matchtest.plx
Enter some text to find: \bs?he\b
The text matches the pattern '\bs?he\b'.
>

To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:

> perl matchtest.plx
Enter some text to find: what the Entish (word )?is
The text matches the pattern 'what the Entish (word )?is'.
>

Notice that we had to put the space inside the group: otherwise we end up with two spaces between 'Entish' and 'is', whereas our text only has one:

> perl matchtest.plx
Enter some text to find: what the Entish (word)? is
'what the Entish (word)? is' was not found.
>
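
The same two ideas can be tried outside the interactive tester. The following is only a minimal sketch: the sample strings and variable names are made up for illustration and are not part of the book's test program.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ('he said', 'she said', 'the shed');

my @hits;
for my $text (@samples) {
    # s? makes the 's' optional; \b on each side restricts us to whole words,
    # so 'the' and 'shed' do not count.
    push @hits, $text if $text =~ /\bs?he\b/;
}
print "Matched: $_\n" for @hits;

# An optional group: note that the space lives inside the parentheses.
my $with    = 'what the Entish word is';
my $without = 'what the Entish is';
my $both_match = ($with    =~ /what the Entish (word )?is/
               && $without =~ /what the Entish (word )?is/) ? 1 : 0;
print "Both phrasings match: $both_match\n";

# With the space outside the group, the shorter phrasing no longer matches.
my $space_outside = $without =~ /what the Entish (word)? is/ ? 1 : 0;
print "Space outside the group matches: $space_outside\n";
```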

As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign. To match an entire word without specifying how long it should be, you can say:

> perl matchtest.plx
Enter some text to find: \b\w+\b
The text matches the pattern '\b\w+\b'.
>

In this case, we match the first available word - I.

If, on the other hand, you have something which may be there any number of times but might not be there at all - zero or one or many - you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any - but possibly no - spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:

> perl matchtest.plx
Enter some text to find: ^\s*[A-Z]
'^\s*[A-Z]' was not found.
>

Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.
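
To see the effect of that leading quote without retyping patterns, here is a small sketch. The two strings are assumptions made up for this illustration; they are not the tester's actual text.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical strings: one starts with a quote, one with spaces.
my $quoted   = q("I wonder," he thought.);
my $unquoted = q(   I wonder, he thought.);

# ^\s*[A-Z]: start of string, zero or more whitespace characters, a capital.
my $quoted_ok   = $quoted   =~ /^\s*[A-Z]/ ? 1 : 0;   # the quote gets in the way
my $unquoted_ok = $unquoted =~ /^\s*[A-Z]/ ? 1 : 0;   # \s* soaks up the spaces

# \b\w+\b: one or more word characters making up a whole word.
my $has_word = $quoted =~ /\b\w+\b/ ? 1 : 0;

print "quoted: $quoted_ok, unquoted: $unquoted_ok, has a word: $has_word\n";
```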
Let's review the three quantifiers:

/bea?t/    Matches either 'beat' or 'bet'
/bea+t/    Matches 'beat', 'beaat', 'beaaat', and so on
/bea*t/    Matches 'bet', 'beat', 'beaat', and so on

Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:

A regular expression should hardly ever start or finish with a starred character.

You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
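
Both warnings are easy to reproduce. The sketch below uses invented strings and variable names; it simply shows a starred character matching nothing at all, and a .* in the middle of a pattern swallowing more than a newcomer might expect during a substitution.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# A pattern that is nothing but a starred character matches anywhere,
# because 'zero times' is always possible.
my $matches_nothing = 'abc' =~ /x*/ ? 1 : 0;

# In a substitution, that empty match at the very start gets replaced.
(my $starred = 'abc') =~ s/x*/-/;

# .* between two quotes runs from the first quote to the LAST one.
my $line = 'say "one" and "two"';
(my $replaced = $line) =~ s/".*"/X/;

print "x* matched 'abc': $matches_nothing\n";
print "After s/x*/-/:    $starred\n";
print "After s/\".*\"/X/:  $replaced\n";
```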

Well-Defined Repetition
If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the minimum and maximum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:

> perl matchtest.plx
Enter some text to find: \s{2,3}
'\s{2,3}' was not found.
>

So we have no doubled or trebled spaces in our string. Notice how we construct that - the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more', and omitting the minimum signifies 'or fewer'. For example, {2,} denotes '2 or more', while {,3} is '3 or fewer'. Be aware that perl only accepts the {,3} form as a quantifier from version 5.34 onwards; older versions treat it as literal text, so write {0,3} there instead. In these cases, the same warnings apply as for the star operator.

Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:

> perl matchtest.plx
Enter some text to find: \b\w{5}\b
'\b\w{5}\b' was not found.
>
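
As a quick check of the brace forms, here is a minimal sketch using made-up strings. It writes the 'or fewer' case as {0,3}, which behaves the same on every version of perl.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Each test is [description, result]; all strings are hypothetical.
my @checks = (
    ['two spaces, \s{2,3}',        'a  b'        =~ /\s{2,3}/     ? 1 : 0],
    ['one space, \s{2,3}',         'a b'         =~ /\s{2,3}/     ? 1 : 0],
    ['four a\'s, ^a{2,}$',         'aaaa'        =~ /^a{2,}$/     ? 1 : 0],
    ['four a\'s, ^a{0,3}$',        'aaaa'        =~ /^a{0,3}$/    ? 1 : 0],
    ['five-letter word present',   'hello there' =~ /\b\w{5}\b/   ? 1 : 0],
    ['six-letter word only',       'wonder'      =~ /\b\w{5}\b/   ? 1 : 0],
);

# Collect just the results so they are easy to inspect.
my @results = map { $_->[1] } @checks;
printf "%-28s %s\n", $_->[0], ($_->[1] ? 'matches' : 'no match') for @checks;
```

The last line shows why the word boundaries matter: without them, \w{5} would happily match the first five letters of 'wonder'.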

Summary Table
To refresh your memory, here are the various metacharacters we've seen so far:

Metacharacter   Meaning
[abc]           any one of the characters a, b, or c.
[^abc]          any one character other than a, b, or c.
[a-z]           any one ASCII character between a and z.
\d \D           a digit; a non-digit.
\w \W           a 'word' character; a non-'word' character.
\s \S           a whitespace character; a non-whitespace character.
\b              the boundary between a \w character and a \W character.
.               any character (apart from a newline).
(abc)           the phrase 'abc' as a group.
?               preceding character or group may be present 0 or 1 times.
+               preceding character or group is present 1 or more times.
*               preceding character or group may be present 0 or more times.
{x,y}           preceding character or group is present between x and y times.
{,y}            preceding character or group is present at most y times (perl 5.34 onwards; write {0,y} on older versions).
{x,}            preceding character or group is present at least x times.
{x}             preceding character or group is present x times.

Backreferences
What if we want to know what a certain regular expression matched? It was easy when we were matching literal strings: we knew that 'Case' was going to match those four letters and nothing else. But now, what matches? If we have /\w{3}/, which three word characters are getting matched?

Perl has a series of special variables in which it stores anything that's matched with a group in parentheses. Each time it sees a set of parentheses, it copies the matched text inside into a numbered variable - the first matched group goes in $1, the second group in $2, and so on. By looking at these variables, which we call the backreference variables, we can see what triggered various parts of our match, and we can also extract portions of the data for later use.
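
As a small illustration of extracting data for later use, the sketch below pulls a name and a number out of a made-up configuration-style string. The string and the variable names are assumptions for this example only.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical input; not taken from the book.
my $entry = 'port=8080';

my ($name, $value);
if ($entry =~ /(\w+)=(\d+)/) {
    # Copy $1 and $2 straight away: a later successful match overwrites them.
    ($name, $value) = ($1, $2);
    print "name is '$name', value is '$value'\n";
} else {
    print "'$entry' did not look like name=number\n";
}
```

Copying the backreference variables into ordinary variables as soon as the match succeeds is a common precaution, since only the most recent successful match is remembered.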

First, though, let's rewrite our test program so that we can see what's in those variables:

Try It Out: A Second Pattern Tester

#!/usr/bin/perl
# matchtest2.plx
use warnings;
use strict;
$_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';
print "Enter a regular expression: ";
my $pattern = <STDIN>;
chomp($pattern);

if (/$pattern/) {
    print "The text matches the pattern '$pattern'.\n";
    print "\$1 is '$1'\n" if defined $1;
    print "\$2 is '$2'\n" if defined $2;
    print "\$3 is '$3'\n" if defined $3;
    print "\$4 is '$4'\n" if defined $4;
    print "\$5 is '$5'\n" if defined $5;
} else {
    print "'$pattern' was not found.\n";
}

Note that we use a backslash to escape the first 'dollar' symbol in each print statement, thus displaying the actual symbol, while leaving the second in each to display the contents of the appropriate variable.

We've got our special variables in place, and we've got a new sentence to do our matching on. Let's see what's been happening:

> perl matchtest2.plx
Enter a regular expression: ([a-z]+)
The text matches the pattern '([a-z]+)'.
$1 is 'silly'

> perl matchtest2.plx
Enter a regular expression: (\w+)
The text matches the pattern '(\w+)'.
$1 is '1'

> perl matchtest2.plx
Enter a regular expression: ([a-z]+)(.*)([a-z]+)
The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
$1 is 'silly'
$2 is ' sentence (495,a) *BUT* one which will be usefu'
$3 is 'l'

> perl matchtest2.plx
Enter a regular expression: e(\w|n\w+)
The text matches the pattern 'e(\w|n\w+)'.
$1 is 'n'

How It Works
By printing out what's in each of the groups, we can see exactly what caused perl to start and stop matching, and when. If we look carefully at these results, we'll find that they can tell us a great deal about how perl handles regular expressions.

How the Engine Works
We've now seen most of the syntax behind regular expression matching and plenty of examples of it in action. The code that does all the matching is called perl's 'regular expression engine'. You might now be wondering about the exact rules applied by this engine when determining whether or not a piece of text matches, and how much of it matches what. From what our examples have shown us, let us make some deductions about the engine's operation.

Our first expression, ([a-z]+), plucked out a set of one-or-more lower-case letters. The first such set that perl came across was 'silly'. The next character after 'y' was a space, and so no longer matched the expression.

Rule one: Once the engine starts matching, it will keep matching a character at a time for as long as it can. Once it sees something that doesn't match, however, it has to stop. In this example, it can never get beyond a character that is not a lower case letter. It has to stop as soon as it encounters one.

Next, we looked for a series of word characters, using (\w+). The engine started looking at the beginning of the string and found one, '1'. The next character was not a word character (it was a colon), and so the engine had to stop.

Rule two: Unlike me, the engine is eager. It's eager to start work and eager to finish: it starts matching as soon as possible in the string, so if it can't match starting from the first character, it tries again from the second, and so on. It then takes every opportunity to finish as quickly as possible.
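
One way to watch the engine start 'as soon as possible' is to ask perl where each group began. The sketch below uses the built-in @- array, whose element $-[1] holds the offset at which the first group started; the variable names are ours, and the string is the one from matchtest2.plx.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

my ($word, $word_at, $lower, $lower_at);

# \w+ can start at the very first character, so it does.
if ($text =~ /(\w+)/) {
    ($word, $word_at) = ($1, $-[1]);
}

# [a-z]+ cannot start until the engine reaches the 's' of 'silly'.
if ($text =~ /([a-z]+)/) {
    ($lower, $lower_at) = ($1, $-[1]);
}

print "(\\w+) matched '$word' at offset $word_at\n";
print "([a-z]+) matched '$lower' at offset $lower_at\n";
```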

Then we tried this: ([a-z]+)(.*)([a-z]+). The result we got with this was a little strange. Let's look at it again:

> perl matchtest2.plx
Enter a regular expression: ([a-z]+)(.*)([a-z]+)
The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
$1 is 'silly'
$2 is ' sentence (495,a) *BUT* one which will be usefu'
$3 is 'l'
>

Our first group was the same as what matched before - nothing new there. When we could no longer match lower case letters, we switched to matching anything we could. Now, this could take up the rest of the string, but that wouldn't allow a match for the third group. We have to leave at least one lower-case letter.

So, the engine started to reverse back along the string, giving characters up one by one. It gave up the closing bracket, the 3, then the opening bracket, and so on, until we got to the first thing that would satisfy all the groups and let the match go ahead - namely a lower-case letter: the 'l' at the end of 'useful'.

From this, we can draw up the third rule:

Rule three: Like me, in this case, the engine is greedy. If you use the + or * operators, they will try and steal as much of the string as possible. If the rest of the expression does not match, it grudgingly gives up a character at a time and tries to match again, in order to find the fullest possible match.

We can turn a greedy match into a non-greedy match by putting the ? operator after either the plus or star. For instance, let's turn this example into a non-greedy version: ([a-z]+)(.*?)([a-z]+) . This gives us an entirely different result:

> perl matchtest2.plx
Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
$1 is 'silly'
$2 is ' '
$3 is 'sentence'
>

Now we've shut off rule three, rule two takes over. The smallest possible match for the second group was a single space. First, it tried to get nothing at all, but then the third group would be faced with a space. This wouldn't match. So, we grudgingly accept the space and try and finish again. This time the third group has some lower case letters, and that can match as well.

What if we turn off greediness in all three groups, and say this: ([a-z]+?)(.*?)([a-z]+?)?

> perl matchtest2.plx
Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
$1 is 's'
$2 is ''
$3 is 'i'
>

What about this? Well, the smallest possible match for the first group is the 's' of silly. We asked it to find one character or more, and so the smallest it could find was one. The second group actually matched no characters at all. This left the third group facing an 'i', which it took to complete the match.
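
The three variations can be compared side by side on a shorter string. This is just a sketch: the text 'abc 123 def' and the labels in the hash are assumptions chosen to keep the output small.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical text, shorter than the tester's sentence.
my $text = 'abc 123 def';

# The same three groups, with greediness switched off step by step.
my %patterns = (
    '1 greedy'      => qr/([a-z]+)(.*)([a-z]+)/,
    '2 lazy middle' => qr/([a-z]+)(.*?)([a-z]+)/,
    '3 all lazy'    => qr/([a-z]+?)(.*?)([a-z]+?)/,
);

my %result;
for my $label (sort keys %patterns) {
    if ($text =~ $patterns{$label}) {
        # Store the groups joined with '|' so empty groups stay visible.
        $result{$label} = join '|', $1, $2, $3;
        print "$label: $result{$label}\n";
    }
}
```

In the greedy version the middle group gives back only the final 'f'; with a lazy middle group it stops at the first point where the third group can match; and with everything lazy, each group settles for the least it can get away with.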

Our last example included an alternation:

> perl matchtest2.plx
Enter a regular expression: e(\w|n\w+)
The text matches the pattern 'e(\w|n\w+)'.
$1 is 'n'
>

The engine took the first branch of the alternation and matched a single character, even though the second branch would actually satisfy greed. This leads us onto the fourth rule:

Rule four: Again like me, the regular expression engine hates decisions. If there are two branches, it will always choose the first one, even though the second one might allow it to gain a longer match.
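
Because the first branch that works wins, the order in which you write the branches matters. The following sketch, using the single word 'sentence' as an assumed test string, swaps the two branches to show the difference.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical one-word test string.
my $text = 'sentence';

my ($short_first, $long_first);

# Original order: the single-character branch comes first and is taken.
$short_first = $1 if $text =~ /e(\w|n\w+)/;

# Swapped order: the longer branch is tried first, and it succeeds.
$long_first = $1 if $text =~ /e(n\w+|\w)/;

print "e(\\w|n\\w+) captured '$short_first'\n";
print "e(n\\w+|\\w) captured '$long_first'\n";
```

If you want the longer alternative to be preferred, a simple approach is to list it first.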

To summarize:

The regular expression engine starts as soon as it can, grabs as much as it can, then tries to finish as soon as it can, while taking the first decision available to it.

© 1999 Wrox Press Limited, US and UK.
