In this second part of a four-part series on parsing and regular expression basics in Perl, you'll learn about quantifiers, modifiers, and more. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).
As you saw in the previous section, you can create a simple regular expression by placing the characters you seek to match, or the name of a variable containing them, between a pair of forward slashes. However, suppose you want to match the same sequence of characters multiple times. You could write out something like this to match three instances of Yes in a row:
/YesYesYes/
But suppose you want to match 100 instances? Typing such an expression would be quite cumbersome. Luckily, the regular expression engine allows you to use quantifiers to accomplish just such a task.
The first quantifier I will discuss takes the form of {number}, where number is the number of times you want the sequence matched. If you really wanted to match Yes 100 times in a row, you could do so with the following regular expression:
/(Yes){100}/
To match the whole term, putting the Yes in parentheses before the quantifier is important; otherwise, you would have matched Ye followed by 100 instances of s, since quantifiers operate only on the unit that is located directly before them in the pattern expression. All the quantifiers operate in a syntax similar to this (that is, the pattern followed by a quantifier); Table 1-1 summarizes some useful ones.
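The difference between the grouped and ungrouped forms can be checked with a short script. This is only a sketch: it uses a count of 3 instead of 100 to keep the strings readable, and the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $text = 'YesYesYes';

# Grouped: the quantifier applies to the whole unit "Yes".
my $grouped = ($text =~ /^(Yes){3}$/) ? 1 : 0;

# Ungrouped: the quantifier applies only to the "s",
# so the pattern means "Ye" followed by three "s" characters.
my $ungrouped = ($text =~ /^Yes{3}$/)   ? 1 : 0;
my $yesss     = ('Yesss' =~ /^Yes{3}$/) ? 1 : 0;

print "grouped: $grouped, ungrouped: $ungrouped, Yesss: $yesss\n";
# prints: grouped: 1, ungrouped: 0, Yesss: 1
```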
Table 1-1. Useful Quantifiers
Quantifier   Effect
X*           Zero or more Xs.
X+           One or more Xs.
X?           X is optional.
X{5}         Five Xs.
X{5,10}      From five to ten Xs.
X{5,}        Five Xs or more.
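To see what each quantifier in Table 1-1 accepts, the following sketch tries every pattern against strings made of 0 to 12 X characters. The helper name accepted_lengths and the upper limit of 12 are assumptions made for this illustration; the patterns are anchored with ^ and $ so that the whole string must match.

```perl
use strict;
use warnings;

# Patterns from Table 1-1, anchored so the entire string must match.
my @patterns = (
    [ 'X*',      qr/^X*$/      ],
    [ 'X+',      qr/^X+$/      ],
    [ 'X?',      qr/^X?$/      ],
    [ 'X{5}',    qr/^X{5}$/    ],
    [ 'X{5,10}', qr/^X{5,10}$/ ],
    [ 'X{5,}',   qr/^X{5,}$/   ],
);

# Returns the lengths (0 through 12) of all-X strings that a pattern accepts.
sub accepted_lengths {
    my ($regex) = @_;
    return grep { ('X' x $_) =~ $regex } 0 .. 12;
}

for my $entry (@patterns) {
    my ($label, $regex) = @$entry;
    printf "%-8s accepts lengths: %s\n", $label,
        join(',', accepted_lengths($regex));
}
```

For example, X? accepts only lengths 0 and 1, while X{5,10} accepts lengths 5 through 10.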
When using quantifiers, it is important to remember that they will always produce the longest possible match unless instructed otherwise. For example, consider the following string:
(123)123(123)
If you asked the regular expression engine to examine this string with an expression such as the following, you would find that the entire string was returned as a match, because . will match any character other than \n and because the string does begin and end with ( and ) as required:
/\(.*\)/
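A minimal sketch of this greedy match in a script follows. The outer pair of unescaped parentheses is an addition for the example: it captures the matched text so it can be printed.

```perl
use strict;
use warnings;

my $string = '(123)123(123)';

# The outer, unescaped parentheses capture whatever the pattern matched.
my ($greedy) = $string =~ /(\(.*\))/;

print "$greedy\n";   # prints (123)123(123)
```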
Note: Parentheses are metacharacters (that is, characters with special meaning to the regular expression engine); therefore, to match either the open or close parenthesis, you must type a backslash before the character. The backslash tells the regular expression engine to treat the character as a normal character (in other words, like a, b, c, 1, 2, 3, and so on) and not interpret it as a metacharacter. Other metacharacters are \, |, [, {, ^, $, *, +, ., and ?.
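The effect of escaping can be illustrated with a short sketch. The sample text is made up for this example. The last part uses Perl's built-in quotemeta function, which the text above does not discuss; it adds the backslashes for you and is shown only as a convenience.

```perl
use strict;
use warnings;

my $text = 'cost: $5.00 (approx.)';

# Unescaped, "." matches any character, so /5.00/ also matches "5x00".
my $loose  = ('5x00' =~ /5.00/)  ? 1 : 0;
my $strict = ('5x00' =~ /5\.00/) ? 1 : 0;

# Escaping by hand: each metacharacter gets its own backslash.
my $by_hand = ($text =~ /\$5\.00 \(approx\.\)/) ? 1 : 0;

# quotemeta() returns the string with its metacharacters backslashed.
my $literal   = quotemeta('$5.00 (approx.)');
my $automatic = ($text =~ /$literal/) ? 1 : 0;

print "loose: $loose, strict: $strict, by hand: $by_hand, automatic: $automatic\n";
# prints: loose: 1, strict: 0, by hand: 1, automatic: 1
```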
It is important to keep in mind that the regular expression engine is greedy by default, which is often not what you want, since situations like the previous example are more common than you may at first think. Besides parentheses, similar issues may arise when you search documents for quotes or for HTML or XML tags, since different elements and nodes often begin and end with the same tags. If you want only the first set of parentheses and its contents to be matched, you need to specify a question mark (?) after your quantifier. For example, if you rewrite the regular expression as follows, you find that (123) is returned as the match:
/\(.*?\)/
Adding ? after the quantifier allows you to control greediness and find the smallest possible match rather than the largest one.
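The non-greedy form can be tried on the same string with the sketch below. As an assumption beyond the text, it also adds the /g modifier in list context to collect every parenthesized group rather than just the first.

```perl
use strict;
use warnings;

my $string = '(123)123(123)';

# Non-greedy: .*? stops at the first closing parenthesis it can.
my ($lazy) = $string =~ /(\(.*?\))/;

# With /g in list context, every non-overlapping match is returned.
my @all = $string =~ /(\(.*?\))/g;

print "$lazy\n";              # prints (123)
print scalar(@all), "\n";     # prints 2
```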