Perl Programming Page 4 - Quantifiers and Other Regular Expression Basics
As the name implies, modifiers allow you to alter the behavior of your pattern match in some form. Table 1-4 summarizes the available pattern modifiers.
For example, under normal conditions, regular expressions are case-sensitive. Therefore, ABC is a completely different string from abc. However, with the aid of the pattern modifier /i, you can make the regular expression behave in a case-insensitive manner. Hence, if you executed the following code, the action contained within the conditional would execute:

    if ("abc" =~ /ABC/i) { ... }

You can use a variety of other modifiers as well. For example, as you will see in the upcoming “Assertions” section, you can use the /m modifier to alter the behavior of the ^ and $ assertions by allowing them to match at line breaks that are internal to a string, rather than just at the beginning and end of the string. Furthermore, as you saw earlier, the subpattern defined by . normally matches any character other than the newline metasymbol, \n. If you want . to match \n as well, you simply need to add the /s modifier. In fact, when trying to match any multiline document, it is advisable to try the /s modifier first, since its use often results in simpler and faster-executing code.

Another useful modifier, which can become increasingly important when dealing with large loops or any situation where you repeatedly call the same regular expression, is the /o modifier. Consider the following piece of code:

    while ($string =~ /$pattern/) { ... }

If you executed a segment of code such as this, the regular expression engine would reevaluate the interpolated pattern every time the indeterminate loop came back around to its condition. This is not necessarily a bad thing because, as with any variable, the contents of the $pattern scalar may have changed since the last iteration. However, it is also possible that you have a fixed condition. In other words, the contents of $pattern will not change throughout the course of the script’s execution. In this case, you are wasting processing time reevaluating the contents of $pattern on every pass.
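The effect of the /i, /m, and /s modifiers described above can be checked with a short sketch. The test strings and variable names below are illustrative assumptions rather than part of the original text; each pair of matches runs the same pattern with and without the modifier so the difference is visible.

```perl
use strict;
use warnings;

# /i: ignore case when matching
my $ci = ("abc" =~ /ABC/i) ? 1 : 0;   # matches
my $cs = ("abc" =~ /ABC/)  ? 1 : 0;   # does not match

# /m: let ^ and $ match at line breaks inside the string
my $text      = "first line\nsecond line\n";   # hypothetical two-line string
my $with_m    = ($text =~ /^second/m) ? 1 : 0; # matches at the internal line start
my $without_m = ($text =~ /^second/)  ? 1 : 0; # ^ only matches at the string start

# /s: let . match the newline character as well
my $with_s    = ("a\nb" =~ /a.b/s) ? 1 : 0;    # . matches \n
my $without_s = ("a\nb" =~ /a.b/)  ? 1 : 0;    # . refuses \n

print "/i: $ci (with) vs $cs (without)\n";
print "/m: $with_m (with) vs $without_m (without)\n";
print "/s: $with_s (with) vs $without_s (without)\n";
```

Each line of output should show 1 for the match with the modifier and 0 for the match without it.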
You can avoid the repeated reevaluation of $pattern by adding the /o modifier to the expression:

    while ($string =~ /$pattern/o) { ... }

In this way, the variable will be evaluated only once, and after its evaluation it will remain a fixed value to the regular expression engine.

Note: When using the /o modifier, make sure you never need to change the contents of the pattern variable. Any changes you make after /o has been employed will not change the pattern used by the regular expression engine. In current versions of Perl, the engine already skips recompilation when the interpolated value has not changed, so /o seldom brings a measurable gain; precompiling the pattern with qr// is the usual alternative.

The /x modifier can also be useful when you are creating long or complex regular expressions. This modifier allows you to insert whitespace and comments into your regular expression without the whitespace or # being interpreted as part of the expression. The main benefit of this modifier is that it improves the readability of your code, since you can now write /\w+ | \d+ /x instead of /\w+|\d+/.

The /g modifier is also highly useful, since it allows global matching to occur. That is, you can continue your search throughout the whole string rather than stopping at the first match. I will illustrate this with a simple example from bioinformatics. DNA is made up of a series of four nucleotides specified by the letters A, T, C, and G. Scientists are often interested in determining the percentage of G and C nucleotides in a given DNA sequence, since this helps determine the thermostability of the DNA (see the following note).

Note: DNA consists of two complementary strands of the nucleotides A, T, C, and G. An A on one strand is always bonded to a T on the opposing strand, and a G on one strand is always bonded to a C on the opposing strand, and vice versa. One difference is that G and C are connected by three hydrogen bonds, whereas A and T are connected by only two. Consequently, DNA with more GC pairs is bound more strongly and is able to withstand higher temperatures, which increases its thermostability.
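The /x and /g modifiers introduced above can be tried together in a small sketch. It rests on several assumptions: the sample sentence and variable names are hypothetical, the letter class [A-Za-z]+ stands in for \w+ so that the two alternatives do not overlap, and the pattern is precompiled with qr// rather than marked with /o.

```perl
use strict;
use warnings;

# Precompile the pattern once with qr//. The /x modifier lets the
# pattern carry whitespace and comments that are not matched literally.
my $token = qr/
    [A-Za-z]+   # a run of letters
    |           # or
    \d+         # a run of digits
/x;

my $text = "order 66 shipped in 3 days";   # hypothetical input

# In list context, /g returns every match instead of only the first.
my @tokens = $text =~ /$token/g;

# Assigning through an empty list counts the matches.
my $number_count = () = $text =~ /\d+/g;

print join(",", @tokens), "\n";                 # order,66,shipped,in,3,days
print "numbers found: $number_count\n";         # numbers found: 2
```

Without /g, the list assignment would capture only the first match, "order".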
A short script that determines the %GC content of a given DNA sequence illustrates the /g modifier. Listing 1-3 shows the Perl script I will use to accomplish this.

Listing 1-3. Determining %GC Content

    #!/usr/bin/perl
    $String = "ATGCCGGGAAATTATAGCG";
    while ($String =~ /G|C/g) {
    ...

As you can see, you store your DNA sequence in the scalar variable $String and then use an indeterminate loop to step through the character content of the string. Every time you encounter a G or a C in your string, you increment your counter variable ($Count) by 1. After you have completed your iterations, you divide the number of Gs and Cs by the total sequence length and print your answer. For the previous DNA sequence, the output should be as follows:

    The DNA sequence has 0.473684210526316 %GC Content

Note that the printed value is a fraction (9 of the 19 bases are G or C); multiplying it by 100 gives roughly 47.4 percent.

Under normal conditions, when a /g match fails to find any more instances of a pattern within a string, the starting position of the next search is reset to the beginning of the string. However, if you specify /gc instead of just /g, a failed match does not reset the position, and the next search begins where the last successful match ended.
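Because only the opening lines of Listing 1-3 appear above, the following is a minimal sketch of how such a script could be completed, based on the description in the text. The sequence, the $Count variable, and the wording of the printed line come from the article; the loop body, the $fraction variable, and the second half (which uses a hypothetical string and pos() to show the difference between /g and /gc) are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $String = "ATGCCGGGAAATTATAGCG";
my $Count  = 0;

# In scalar context, each /g match resumes where the previous one ended,
# so the loop body runs once per G or C in the sequence.
while ($String =~ /G|C/g) {
    $Count++;
}

# Fraction of G and C bases (multiply by 100 for a percentage).
my $fraction = $Count / length($String);
print "The DNA sequence has $fraction %GC Content\n";

# /g versus /gc after a failed match, on a hypothetical string.
my $seq = "GGAT";

1 while $seq =~ /G/g;        # match until failure
my $after_g = pos($seq);     # undefined: the position was reset

1 while $seq =~ /G/gc;       # same loop, but /c keeps the position
my $after_gc = pos($seq);    # 2: just past the last G that matched

print "pos after /g:  ", (defined $after_g ? $after_g : "undef"), "\n";
print "pos after /gc: $after_gc\n";
```

For the sequence above, the first print statement reproduces the output quoted in the article.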