Capturing Substrings - Perl

In this third part to a four-part series on parsing and regular expressions in Perl, you will learn about cloistered pattern modifiers, boundary assertions, troubleshooting regular expressions, and more. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

TABLE OF CONTENTS:
  1. Modifiers, Boundaries, and Regular Expressions
  2. Boundary Assertions
  3. Capturing Substrings
  4. Troubleshooting Regexes
By: Apress Publishing
June 03, 2010

After looking at the previous example, you might be wondering how you were able to capture the recognized phone number in order to print it. The output and the print statement itself suggest that it had something to do with the variable $1, and indeed it did. Earlier in the chapter, I noted that parentheses serve two purposes within Perl regular expressions. The first is to define subpatterns, and the second is to capture the substring that matches the given subpattern. These captured substrings are stored in the variables $1, $2, $3, and so on. The contents of the first set of parentheses go into $1, the second into $2, the third into $3, and so on. Thus, in the previous example, by placing the phone number regular expression in parentheses, you are able to capture the phone number and print it by using the $1 variable.
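As a minimal sketch of this numbering, the following script captures two pieces of a string with two sets of parentheses. The sample text and the variable names $text, $prefix, and $line are assumptions made for illustration; they do not come from the chapter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen for illustration.
my $text = "Call 555-1234 before 9 pm";

my ($prefix, $line);
# Two sets of parentheses: the first fills $1, the second fills $2.
if ($text =~ /(\d{3})-(\d{4})/) {
    ($prefix, $line) = ($1, $2);
    print "prefix: $prefix\n";    # prefix: 555
    print "line:   $line\n";      # line:   1234
}
```

Copying $1 and $2 into named variables right after the match is a common habit, since the next successful match overwrites them.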

When using nested parentheses, it is important to remember that the sets of parentheses are numbered from left to right according to where each open parenthesis occurs. As a result, the substring enclosed by the first open parenthesis encountered and its corresponding close parenthesis is assigned to $1, even if it is not the first complete subpattern to finish matching. For example, if you instead wrote the phone number regular expression as follows, the first set of parentheses would capture the entire phone number as before:

=~/(\s?(\(?\d{3}\)?)[-\s.](\d{3}[-.]\d{4}))/

The second set would capture the area code in $2, and the third set would put the remainder of the phone number into $3.
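The numbering of nested groups can be checked with a short script. This is only a sketch: the sample text is hypothetical, and the pattern is written with a plain capturing group around the last seven digits so that it compiles and yields the three captures described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen for illustration.
my $text = "800-555-1212 is the help desk";

my ($whole, $area, $rest);
# Groups are numbered by the position of their open parenthesis.
if ($text =~ /(\s?(\(?\d{3}\)?)[-\s.](\d{3}[-.]\d{4}))/) {
    ($whole, $area, $rest) = ($1, $2, $3);
    print "\$1 (whole number): $whole\n";    # 800-555-1212
    print "\$2 (area code):    $area\n";     # 800
    print "\$3 (remainder):    $rest\n";     # 555-1212
}
```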


Note  If you do not want to capture any values with a set of parentheses but only specify a subpattern, you can place ?: right after the ( but before the subpattern (for example, (?:abc)).
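A brief sketch of the difference: in the script below, the alternation of titles is grouped with (?:...) and therefore does not consume a capture number, so the surname lands in $1. The sample sentence and the variable names are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen for illustration.
my $text = "Dr. Smith reviewed the results";

my $surname;
# (?:Dr|Mr|Ms) groups the alternation without capturing it,
# so the only numbered capture is the surname in $1.
if ($text =~ /(?:Dr|Mr|Ms)\.?\s+(\w+)/) {
    $surname = $1;
    print "surname: $surname\n";    # surname: Smith
}
```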


Parentheses are not the only way to capture portions of a string after a regular expression matching operation. In addition to storing the contents of parentheses in variables such as $1, the regular expression engine also assigns values to the variables $`, $&, and $'. $& is assigned the portion of the string that the regular expression was actually able to match. $` is assigned all the contents to the left of the match, and $' is assigned all the contents to the right of the match (see Table 1-6).


Caution  When dealing with situations that involve large amounts of pattern matching, it may not be advisable to use $&, $`, and $'. Once any of them is used, they are generated for every match until the Perl program terminates, which can substantially increase the program's execution time.
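One way to limit that cost, not covered in this excerpt, is the /p modifier available in Perl 5.10 and later. With /p, the variables ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} hold the same three substrings for that match only. The following is a sketch under that assumption; the sample text and variable names are hypothetical.

```perl
#!/usr/bin/perl
use 5.010;    # /p and ${^MATCH} require Perl 5.10 or later
use strict;
use warnings;

# Hypothetical sample text, chosen for illustration.
my $text = "ChemicalC inhibits ChemicalA.";

my ($left, $rxn, $right);
# /p preserves the pre-match, match, and post-match strings
# for this match without involving $`, $&, and $'.
if ($text =~ /inhibits/p) {
    ($left, $rxn, $right) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    print "left:  [$left]\n";     # left:  [ChemicalC ]
    print "match: [$rxn]\n";      # match: [inhibits]
    print "right: [$right]\n";    # right: [ ChemicalA.]
}
```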

 


 

 

Table 1-6. Substring Capturing Variables

Variable           Use
$1, $2, $3, ...    Stores captured substrings contained in parentheses
$&                 Stores the substring that matched the regex
$`                 Stores the substring to the left of the matching regex
$'                 Stores the substring to the right of the matching regex

 

Let's take some time now to explore both types of capturing in greater depth by considering the medical informatics example, mentioned earlier, of mining medical literature for chemical interactions. Listing 1-6 shows a short script that will search for predefined interaction terms and then capture the names of the chemicals involved in the interaction.

Listing 1-6. Capturing Substrings

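#!/usr/bin/perl
use strict;
use warnings;

my $String = <<'ABOUTA';
   ChemicalA is used to treat cancer. ChemicalA
   reacts with ChemicalB which is found in cancer
   cells. ChemicalC inhibits ChemicalA.
ABOUTA

pos($String) = 0;    # start matching at the beginning of the string
while ($String =~ /reacts with|inhibits/ig) {
    # Copy the match variables before the next regex overwrites them
    my $rxn   = $&;    # the interaction term that matched
    my $left  = $`;    # everything before the match
    my $right = $';    # everything after the match
    my ($Chem1, $Chem2) = ('', '');
    if ($left =~ /(\w+)\s+\z/) {    # last word of the left side
        $Chem1 = $1;
    }
    if ($right =~ /(\w+)/) {        # first word of the right side
        $Chem2 = $1;
    }
    print "$Chem1 $rxn $Chem2\n";
}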

The script begins by searching through the text until it reaches one of the predefined interaction terms. Rather than using a dictionary-type list with numerous interaction terms, alternation of the two terms found in the text is used for simplicity. When one of the interaction terms is identified, the variable $rxn is set equal to this term, and $left and $right are set equal to the left and right sides of the match, respectively. Conditional statements and parentheses-based string capturing are then used to capture the word before and the word after the interaction term, since these correspond to the chemical names. It is also important to note the use of the \z assertion in order to match the word before the interaction term, since this word is located at the end of the $left string. If you run this script, you will see that the output describes the interactions explained in the initial text:

ChemicalA reacts with ChemicalB
ChemicalC inhibits ChemicalA
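For comparison, the same two lines can be produced with parentheses alone, which avoids $&, $`, and $' entirely. This is a sketch of an alternative, not the listing's method; the variable names $text and @interactions are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = <<'ABOUTA';
   ChemicalA is used to treat cancer. ChemicalA
   reacts with ChemicalB which is found in cancer
   cells. ChemicalC inhibits ChemicalA.
ABOUTA

my @interactions;
# $1 = word before the term, $2 = the term, $3 = word after the term.
# \s+ also matches the line break between "ChemicalA" and "reacts with".
while ($text =~ /(\w+)\s+(reacts with|inhibits)\s+(\w+)/ig) {
    push @interactions, "$1 $2 $3";
}
print "$_\n" for @interactions;
```

One limitation of this form is that a chemical named on both sides of adjacent interaction terms would be consumed by the first match, so the two-step approach in the listing can be the safer choice for denser text.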

Substitution

Earlier I mentioned that in addition to basic pattern matching, you can use the =~ and !~ operations to perform substitution. The operator for this operation is s///. Substitution is similar to basic pattern matching in that it will initially seek to match a specified pattern. However, once a matching pattern is identified, the substitution will replace the part of the string that matches the pattern with another string. Consider the following:

$String="aabcdef";
$String=~s/abc/123/;
print $String;

If you execute this code, the string a123def will be printed. In other words, the substring matched by /abc/ is replaced with 123.
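The substitution operator also returns a value that tells you whether anything was replaced: the number of substitutions made, or a false value when the pattern did not match. The sketch below illustrates this; the variable names and the second sample string are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "aabcdef";
# s/// modifies $string in place and returns the number of replacements.
my $count = ($string =~ s/abc/123/);
print "$string ($count replacement)\n";    # a123def (1 replacement)

# Hypothetical string that does not contain the pattern.
my $other = "xyz";
my $changed = ($other =~ s/abc/123/);
print "no match, string left as $other\n" unless $changed;
```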



 
 