Quantifiers and Other Regular Expression Basics

In this second part of a four-part series on parsing and regular expression basics in Perl, you’ll learn about quantifiers, modifiers, and more. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

Quantifiers 

As you saw in the previous section, you can create a simple regular expression by placing the characters you seek to match, or the name of a variable containing them, between a pair of forward slashes. However, suppose you want to match the same sequence of characters multiple times. You could write out something like this to match three instances of Yes in a row:

/YesYesYes/

But suppose you want to match 100 instances? Typing such an expression would be quite cumbersome. Luckily, the regular expression engine allows you to use quantifiers to accomplish just such a task.

The first quantifier I will discuss takes the form {number}, where number is the number of times you want the sequence matched. If you really wanted to match Yes 100 times in a row, you could do so with the following regular expression:

/(Yes){100}/

To match the whole term, putting the Yes in parentheses before the quantifier is important; otherwise, you would have matched Ye followed by 100 instances of s, since quantifiers operate only on the unit that is located directly before them in the pattern expression. All the quantifiers use a similar syntax (that is, the pattern followed by a quantifier); Table 1-1 summarizes some useful ones.

Table 1-1. Useful Quantifiers

Quantifier  Effect
X*          Zero or more Xs.
X+          One or more Xs.
X?          X is optional.
X{5}        Five Xs.
X{5,10}     From five to ten Xs.
X{5,}       Five Xs or more.
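
The effect of grouping on a quantifier can be checked with a short script. The following is only a sketch: the test strings are made up for illustration, and each pattern is anchored with ^ and $ so that the quantifier has to account for the whole string.

```perl
use strict;
use warnings;

my $three = "YesYesYes";

# The parentheses make the quantifier apply to the whole word.
my $grouped   = $three =~ /^(Yes){3}$/  ? "match" : "no match";
# Without them, {3} applies only to the s, so this pattern means "Yesss".
my $ungrouped = $three =~ /^Yes{3}$/    ? "match" : "no match";
# {2,} asks for two or more repetitions, so three is acceptable.
my $at_least  = $three =~ /^(Yes){2,}$/ ? "match" : "no match";
# ? makes the unit directly before it optional.
my $optional  = "color" =~ /^colou?r$/  ? "match" : "no match";

print "grouped:   $grouped\n";
print "ungrouped: $ungrouped\n";
print "at least:  $at_least\n";
print "optional:  $optional\n";
```

Only the ungrouped pattern should report no match, since it looks for Ye followed by three s characters.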

When using quantifiers, it is important to remember that they will always produce the longest possible match unless instructed otherwise. For example, consider the following string:

(123)123(123)

If you asked the regular expression engine to examine this string with an expression such as the following, you would find that the entire string was returned as a match, because . will match any character other than \n and because the string does begin and end with ( and ) as required:

/\(.*\)/


Note  Parentheses are metacharacters (that is, characters with special meaning to the regular expression engine); therefore, to match either the open or close parenthesis, you must type a backslash before the character. The backslash tells the regular expression engine to treat the character as a normal character (in other words, like a, b, c, 1, 2, 3, and so on) and not interpret it as a metacharacter. Other metacharacters are \, |, [, {, ^, $, *, +, ., and ?.


It is important to keep in mind that the default behavior of the regular expression engine is to be greedy, which is often not wanted, since situations like the previous example are more common than you may at first think. Besides parentheses, similar issues may arise if you are searching documents for quotes or for HTML or XML tags, since different elements and nodes often begin and end with the same tags. If you want only the contents of the first parentheses to be matched, you need to specify a question mark (?) after your quantifier. For example, if you rewrite the regular expression as follows, you find that (123) is returned as the match:

/\(.*?\)/

Adding ? after the quantifier allows you to control greediness and find the smallest possible match rather than the largest one.
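
To see the difference side by side, the following sketch runs a greedy and a nongreedy pattern against the same string. An extra, unescaped pair of parentheses is added around each pattern purely so the matched text can be captured and printed; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = '(123)123(123)';

# Greedy: .* takes as much as it can, so the match runs to the last ).
my ($greedy) = $string =~ /(\(.*\))/;
# Nongreedy: .*? stops at the first ) that lets the pattern succeed.
my ($lazy)   = $string =~ /(\(.*?\))/;

print "greedy: $greedy\n";   # greedy: (123)123(123)
print "lazy:   $lazy\n";     # lazy:   (123)
```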

Predefined Subpatterns

Quantifiers are not the only things that allow you to save some time and typing. The Perl regular expression engine is also able to recognize a variety of predefined subpatterns that you can use to recognize simple but common patterns. For example, suppose you simply want to match any alphanumeric character. You can write an expression containing the pattern [a-zA-Z0-9], or you can simply use the predefined pattern specified by \w (which also matches an underscore). Table 1-2 lists other such useful subpatterns.

Table 1-2. Useful Subpatterns

Specifier  Pattern
\w         Any standard alphanumeric character or an underscore (_)
\W         Any character other than an alphanumeric character or an underscore (_)
\d         Any digit
\D         Any nondigit
\s         Any of \n, \r, \t, \f, and " "
\S         Any other than \n, \r, \t, \f, and " "
.          Any other than \n

These specifiers are quite common in regular expressions, especially when combined with the quantifiers listed in Table 1-1. For example, you can use \w+ to match any word, use \d+ to match any series of digits, or use \s+ to match any type of whitespace. If you want to split the contents of a tab-delimited text file (such as in Figure 1-1) into an array, you can easily perform this task using the split function as well as a regular expression involving \s+. The code for this would be as follows:

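while (<>){
   # split works on $_ by default; each whitespace-separated
   # field of the current line is appended to @Array
   push @Array, split /\s+/;
}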

The regular expression argument provided for the split function tells the function where to split the input data and what elements to leave out of the resultant array. In this case, every time whitespace occurs, it signifies that the next nonwhitespace region should be a distinct element in the resultant array.
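
As a rough illustration of what split returns for a single record, the sketch below uses a made-up tab-delimited line in place of real file input. The field values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record standing in for one line of input.
my $line = "gene1\t1500\tchr2\n";

# Each run of whitespace (here a tab or the trailing newline) separates fields.
my @fields = split /\s+/, $line;

print scalar(@fields), " fields: @fields\n";   # 3 fields: gene1 1500 chr2
```

Because the trailing newline counts as whitespace and split discards trailing empty fields, no chomp is needed here. Keep in mind that \s+ also splits on ordinary spaces, so if a field may itself contain spaces, splitting on /\t/ is the stricter choice.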

POSIX Character Classes

In the previous section, you saw the classic predefined Perl patterns, but more recent versions of Perl also support some predefined subpattern types through a set of POSIX character classes. Table 1-3 summarizes these classes, and I outline their usage after the table.

Table 1-3. POSIX Character Classes

POSIX Class  Pattern
[:alnum:]    Any letter or digit
[:alpha:]    Any letter
[:ascii:]    Any character with a numeric encoding from 0 to 127
[:cntrl:]    Any character with a numeric encoding less than 32
[:digit:]    Any digit from 0 to 9 (\d)
[:graph:]    Any letter, digit, or punctuation character
[:lower:]    Any lowercase letter
[:print:]    Any letter, digit, punctuation, or space character
[:punct:]    Any punctuation character
[:space:]    Any space character (\s)
[:upper:]    Any uppercase letter
[:word:]     Underscore or any letter or digit
[:xdigit:]   Any hexadecimal digit (that is, 0–9, a–f, or A–F)


Note  You can use POSIX character classes in conjunction with Unicode text. When doing this, however, keep in mind that using a class such as [:alpha:] may return more results than you expect, since under Unicode there are many more letters than under ASCII. This likewise holds true for other classes that match letters and digits.


The usage of POSIX character classes is similar to the previous examples where a range of characters was defined, such as [A-F], in that the characters must be enclosed in brackets. This is sometimes a point of confusion for individuals who are new to POSIX character classes, because, as you saw in Table 1-3, all the classes already have brackets. This set of brackets is part of the class name, not part of the Perl regex. Thus, you need a second set, such as in the following regular expression, which will match any number of digits:

/[[:digit:]]*/
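
The double-bracket form can be tried out with a short script. This is a minimal sketch using a made-up sample string; the last line also uses [:^alpha:], Perl's negated form of a POSIX class, which is not covered in Table 1-3.

```perl
use strict;
use warnings;

my $text = 'Order 66 shipped at 0x1F';

# The outer brackets form the character class; the inner [:digit:]
# is the name of the POSIX class.
my ($digits) = $text =~ /([[:digit:]]+)/;

# [:xdigit:] accepts 0-9, a-f, and A-F.
my ($hex) = $text =~ /0x([[:xdigit:]]+)/;

# A ^ after the inner opening bracket negates the class, so this
# substitution deletes everything that is not a letter.
(my $letters_only = $text) =~ s/[[:^alpha:]]//g;

print "digits:  $digits\n";        # digits:  66
print "hex:     $hex\n";           # hex:     1F
print "letters: $letters_only\n";  # letters: OrdershippedatxF
```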

Modifiers

As the name implies, modifiers allow you to alter the behavior of your pattern match in some form. Table 1-4 summarizes the available pattern modifiers.

Table 1-4. Pattern Matching Modifiers

Modifier  Function
/i        Makes insensitive to case
/m        Allows $ and ^ to match near \n (multiline)
/x        Allows insertion of comments and whitespace in expression
/o        Evaluates the expression variable only once
/s        Allows . to match \n (single line)
/g        Allows global matching
/gc       After failed global search, allows continued matching

For example, under normal conditions, regular expressions are case-sensitive. Therefore, ABC is a completely different string from abc . However, with the aid of the pattern modifier /i , you could get the regular expression to behave in a case-insensitive manner. Hence, if you executed the following code, the action contained within the conditional would execute:

if("abc"=~/ABC/i){
   #do something
}

You can use a variety of other modifiers as well. For example, as you will see in the upcoming “Assertions” section, you can use the /m modifier to alter the behavior of the ^ and $ assertions by allowing them to match at line breaks that are internal to a string, rather than just at the beginning and ending of a string. Furthermore, as you saw earlier, the subpattern defined by . normally allows the matching of any character other than the newline metasymbol, \n. If you want to allow . to match \n as well, you simply need to add the /s modifier. In fact, when trying to match any multiline document, it is advisable to try the /s modifier first, since its usage will often result in simpler and faster executing code.
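
The effect of /m and /s is easiest to see on a string with an embedded newline. The sketch below uses a made-up two-line string and counts or tests matches with and without each modifier; the variable names are illustrative.

```perl
use strict;
use warnings;

my $doc = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
# (Assigning to an empty list and then to a scalar counts the matches.)
my $plain = () = $doc =~ /^\w+/g;
# With /m, ^ also matches just after each embedded newline.
my $multi = () = $doc =~ /^\w+/mg;

# Without /s, . cannot cross the newline between the two lines.
my $no_s   = $doc =~ /first.*second/  ? "match" : "no match";
# With /s, . matches \n too, so the pattern can span both lines.
my $with_s = $doc =~ /first.*second/s ? "match" : "no match";

print "without /m: $plain, with /m: $multi\n";    # without /m: 1, with /m: 2
print "without /s: $no_s, with /s: $with_s\n";    # without /s: no match, with /s: match
```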

Another useful modifier that can become increasingly important when dealing with large loops or any situation where you repeatedly call the same regular expression is the /o modifier. Let’s consider the following piece of code:

while($string=~/$pattern/){
   #do something
}

If you executed a segment of code such as this, every time you were about to loop back through the indeterminate loop the regular expression engine would reevaluate the regular expression pattern. This is not necessarily a bad thing, because, as with any variable, the contents of the $pattern scalar may have changed since the last iteration. However, it is also possible that you have a fixed condition. In other words, the contents of $pattern will not change throughout the course of the script’s execution. In this case, you are wasting processing time reevaluating the contents of $pattern on every pass. You can avoid this slowdown by adding the /o modifier to the expression:

while($string=~/$pattern/o){
   #do something
}

In this way, the variable will be evaluated only once; and after its evaluation, it will remain a fixed value to the regular expression engine.


Note  When using the /o modifier, make sure you never need to change the contents of the pattern variable. Any changes you make after /o has been employed will not change the pattern used by the regular expression engine.
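
The caution in the note can be demonstrated directly. In this sketch, which uses made-up one-letter patterns, the same match is attempted on two passes through a loop while the pattern variable changes from a to b.

```perl
use strict;
use warnings;

my (@with_o, @without_o);

for my $pattern ('a', 'b') {
    # /o: $pattern is interpolated on the first pass only, so the
    # second pass still tests "b" against the pattern a.
    push @with_o, ("b" =~ /$pattern/o ? "match" : "no match");

    # No /o: the current contents of $pattern are used on every pass.
    push @without_o, ("b" =~ /$pattern/ ? "match" : "no match");
}

print "with /o:    @with_o\n";      # with /o:    no match no match
print "without /o: @without_o\n";   # without /o: no match match
```

The second pass with /o fails even though $pattern now holds b, which is exactly the situation the note warns about.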


The /x modifier can also be useful when you are creating long or complex regular expressions. This modifier allows you to insert whitespace and comments into your regular expression without the whitespace or # being interpreted as a part of the expression. The main benefit to this modifier is that it can be used to improve the readability of your code, since you could now write /\w+ | \d+/x instead of /\w+|\d+/.
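
The benefit of /x grows with the length of the pattern. As a sketch, the following spreads a date pattern over several lines with a comment for each piece; the date string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $date = '2005-08-15';

# Under /x the whitespace and the # comments inside the pattern are ignored.
my ($year, $month, $day) = $date =~ /
    ^ (\d{4})     # four-digit year
    - (\d{2})     # two-digit month
    - (\d{2})     # two-digit day
    $             # nothing may follow the day
/x;

print "year=$year month=$month day=$day\n";   # year=2005 month=08 day=15
```

One thing to watch for is that a comment inside the pattern must not contain the pattern delimiter (here the forward slash), since Perl would treat it as the end of the expression.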

The /g modifier is also highly useful, since it allows for global matching to occur. That is, you can continue your search throughout the whole string and not just stop at the first match. I will illustrate this with a simple example from bioinformatics: DNA is made up of a series of four nucleotides specified by the letters A, T, C, and G. Scientists are often interested in determining the percentage of G and C nucleotides in a given DNA sequence, since this helps determine the thermostability of the DNA (see the following note).


Note  DNA consists of two complementary strands of the nucleotides A, T, C, and G. The A on one strand is always bonded to a T on the opposing strand, and the G on one strand is always bonded to the C on the opposing strand, and vice versa. One difference is that G and C are connected by three bonds, whereas A and T only two. Consequently, DNA with more GC pairs is bound more strongly and is able to withstand higher temperatures, thereby increasing its thermostability.


Thus, I will illustrate the /g modifier by writing a short script that will determine the %GC content in a given sequence of DNA. Listing 1-3 shows the Perl script I will use to accomplish this.

Listing 1-3. Determining %GC Content

#!/usr/bin/perl

$String="ATGCCGGGAAATTATAGCG";
$Count=0;

while($String=~/G|C/g){
   $Count=$Count+1;
}
$len=length($String);
$GC=$Count/$len;
print "The DNA sequence has $GC %GC Content";

As you can see, you store your DNA sequence in the scalar variable $String and then use an indeterminate loop to step through the character content of the string. Every time you encounter a G or a C in your string, you increment your counter variable ($Count) by 1. After you have completed your iterations, you divide the number of Gs and Cs by the total sequence length and print your answer. Note that the value printed is a fraction of the total; multiply it by 100 to express it as a percentage. For the previous DNA sequence, the output should be as follows:

The DNA sequence has 0.473684210526316 %GC Content

Under normal conditions, when the /g modifier fails to match any more instances of a pattern within a string, the starting position of the next search is reset back to zero. However, if you specified /gc instead of just /g , your next search would not reset back to the beginning of the string, but rather begin from the position of the last match.
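
Perl's built-in pos function reports where the next /g search on a string will start, which makes the difference visible. The following sketch uses a short made-up sequence and runs each search until it fails.

```perl
use strict;
use warnings;

my $dna = 'ATGCCG';

# Plain /g: once the search fails, the position is reset (pos returns
# undef) and the next search starts over from the beginning.
1 while $dna =~ /G/g;
my $after_g = defined(pos($dna)) ? pos($dna) : "undef";

# /gc: a failed search leaves the position where the last successful
# match ended.
1 while $dna =~ /G/gc;
my $after_gc = defined(pos($dna)) ? pos($dna) : "undef";

print "after /g:  $after_g\n";    # after /g:  undef
print "after /gc: $after_gc\n";   # after /gc: 6
```

The last G in the sample sequence occupies the sixth character, so after the /gc loop the position sits at offset 6, the end of the string.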

Please check back next week for the continuation of this article.
