After looking at the previous example, you might be wondering how you were able to capture the recognized phone number in order to print it. The output and the print statement itself suggest that it had something to do with the variable $1, and indeed it did. Earlier in the chapter, I noted that parentheses serve two purposes within Perl regular expressions. The first is to define subpatterns, and the second is to capture the substring that matches the given subpattern. These captured substrings are stored in the variables $1, $2, $3, and so on. The contents of the first set of parentheses go into $1, the second into $2, the third into $3, and so on. Thus, in the previous example, placing the phone number regular expression in parentheses let you capture the phone number and print it by using the $1 variable.
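As a minimal sketch of this numbering, assume a phone number written as three hyphen-separated groups of digits. The sample text, the pattern, and the variable names below are illustrative assumptions rather than the chapter's own listing:

```perl
use strict;
use warnings;

# Hypothetical sample text; the NNN-NNN-NNNN format is an assumption.
my $text = "Call 555-123-4567 for details";

my ($area, $exchange, $line);
if ($text =~ /(\d{3})-(\d{3})-(\d{4})/) {
    # Each set of parentheses fills the next numbered variable in turn.
    ($area, $exchange, $line) = ($1, $2, $3);
    print "Area code: $area\n";
    print "Exchange:  $exchange\n";
    print "Line:      $line\n";
}
```

Copying $1, $2, and $3 into named variables right after the match is a defensive habit, since the next successful match will overwrite them.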
When using nested parentheses, it is important to remember that the sets of parentheses are numbered from left to right according to where each open parenthesis occurs. As a result, the substring enclosed by the first open parenthesis encountered and its corresponding close parenthesis is assigned to $1, even if it is not the first complete group to finish matching. For example, if you instead wrote the phone number regular expression as follows, the first set of parentheses would capture the entire phone number as before:
The second set would capture the area code in $2, and the third set would put the remainder of the phone number into $3.
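One way to picture this nesting, again assuming a NNN-NNN-NNNN phone number (the pattern and sample text are assumptions made for illustration, not the chapter's exact expression):

```perl
use strict;
use warnings;

# Hypothetical sample text; the phone number format is an assumption.
my $message = "Call 555-123-4567 for details";

my ($whole, $area_code, $rest);
# Groups are numbered by the position of their open parenthesis:
#   1st "(" wraps the whole number, 2nd the area code, 3rd the remainder.
if ($message =~ /((\d{3})-(\d{3}-\d{4}))/) {
    ($whole, $area_code, $rest) = ($1, $2, $3);
    print "Whole number: $whole\n";
    print "Area code:    $area_code\n";
    print "Remainder:    $rest\n";
}
```

Although the area code group finishes matching before the outer group does, the outer group still receives $1 because its open parenthesis comes first.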
Note If you do not want a set of parentheses to capture a value but only to specify a subpattern, you can place ?: right after the open parenthesis but before the subpattern (for example, (?:abc)).
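The difference can be seen by running the same match with and without ?:. This is only a sketch, and the input string is made up for the purpose:

```perl
use strict;
use warnings;

# Hypothetical input string chosen to exercise a repeated subpattern.
my $input = "abcabc-xyz";

# In list context, a successful match returns the captured substrings.
my @capturing     = $input =~ /(abc)+-(\w+)/;    # two capturing groups
my @non_capturing = $input =~ /(?:abc)+-(\w+)/;  # only one capturing group

print "With capturing group:    @capturing\n";
print "With non-capturing group: @non_capturing\n";
```

With (?:abc), the trailing word becomes the first (and only) capture, so it is available in $1 rather than $2.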
Parentheses are not the only way to capture portions of a string after a regular expression matching operation. In addition to storing the contents of parentheses in variables such as $1, the regular expression engine also assigns values to the variables $`, $&, and $'. $& is assigned the portion of the string that the regular expression actually matched. $` is assigned everything to the left of the match, and $' is assigned everything to the right of the match (see Table 1-6).
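A short sketch of these three variables, using an arbitrary sentence and pattern chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $sentence = "The quick brown fox";

my ($before, $matched, $after);
if ($sentence =~ /quick/) {
    # Copy the values immediately; a later match would overwrite them.
    ($before, $matched, $after) = ($`, $&, $');
    print "Left of match:  [${before}]\n";
    print "Match:          [${matched}]\n";
    print "Right of match: [${after}]\n";
}
```

Note that the surrounding spaces stay with the left and right portions, since only the text matched by the pattern itself goes into $&.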
Caution When dealing with situations that involve large amounts of pattern matching, it may not be advisable to use $&, $`, and $', since if they are used even once they will be generated for every match until the Perl program terminates, which can lead to a significant increase in the program's execution time.
Let's take some time now to explore both types of capturing in greater depth by considering the medical informatics example, mentioned earlier, of mining medical literature for chemical interactions. Listing 1-6 shows a short script that will search for predefined interaction terms and then capture the names of the chemicals involved in the interaction.
Listing 1-6. Capturing Substrings
The script begins by searching through the text until it reaches one of the predefined interaction terms. Rather than using a dictionary-type list with numerous interaction terms, alternation of the two terms found in the text is used for simplicity. When one of the interaction terms is identified, the variable $rxn is set equal to this term, and $left and $right are set equal to the left and right sides of the match, respectively. Conditional statements and parentheses-based string capturing are then used to capture the word before and the word after the interaction term, since these correspond to the chemical names. It is also important to note the use of the \z assertion in order to match the word before the interaction term, since this word is located at the end of the $left string. If you run this script, you see that the output describes the interactions explained in the initial text:
ChemicalA reacts with ChemicalB
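A script along these lines might look like the following sketch. The input sentence, the second interaction term (binds to), and the helper variables $chem1, $chem2, and $result are assumptions made for illustration; only $rxn, $left, $right, and the use of \z come from the description above:

```perl
use strict;
use warnings;

# Hypothetical input text; the chapter's original sample is an unknown here.
my $text = "Studies show that ChemicalA reacts with ChemicalB under heat.";

my $result;
# Alternation of two interaction terms ("binds to" is an assumed example).
if ($text =~ /reacts with|binds to/) {
    # Save the match and both sides before any further matching.
    my ($rxn, $left, $right) = ($&, $`, $');
    my ($chem1, $chem2);

    # The word before the term sits at the very end of $left, hence \z.
    if ($left =~ /(\w+)\s+\z/) {
        $chem1 = $1;
    }
    # The word after the term sits at the very start of $right, hence \A.
    if ($right =~ /\A\s+(\w+)/) {
        $chem2 = $1;
    }

    if (defined $chem1 && defined $chem2) {
        $result = "$chem1 $rxn $chem2";
        print "$result\n";
    }
}
```

The values of $&, $`, and $' are copied into ordinary variables first because the two follow-up matches would otherwise replace them.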
Earlier I mentioned that in addition to basic pattern matching, you can use the =~ and !~ operations to perform substitution. The operator for this operation is s///. Substitution is similar to basic pattern matching in that it will initially seek to match a specified pattern. However, once a matching pattern is identified, the substitution will replace the part of the string that matches the pattern with another string. Consider the following:
If you execute this code, the string a123def will be printed. In other words, the pattern recognized by /abc/ is replaced with 123.
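A minimal sketch of such a substitution follows. The starting string aabcdef is an assumption, chosen because replacing abc with 123 in it yields the printed result described above:

```perl
use strict;
use warnings;

# Assumed starting string; substituting "abc" leaves the leading "a" intact.
my $str = "aabcdef";

# s/PATTERN/REPLACEMENT/ replaces the first match of PATTERN in $str.
$str =~ s/abc/123/;

print "$str\n";
```

Without the /g modifier, s/// replaces only the first occurrence of the pattern in the string.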