Troubleshooting Regexes - Perl

In this third part to a four-part series on parsing and regular expressions in Perl, you will learn about cloistered pattern modifiers, boundary assertions, troubleshooting regular expressions, and more. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

By: Apress Publishing
June 03, 2010

The previous examples clearly demonstrate that regular expressions are a powerful and flexible programming tool and are thus widely applicable to a wealth of programming tasks. As you can imagine, however, all this power and flexibility can often make constructing complex regular expressions quite difficult, especially when certain positions within the expression are allowed to match multiple characters and/or character combinations. The construction of robust regular expressions is something that takes practice; but while you are gaining that experience, you should keep in mind a few common types of mistakes:

  1. Make sure you choose the right quantifier: For example, if you must have one or more of a given character, make sure to use the quantifier + and not *, since * will also match when the character is missing.
  2. Watch out for greediness: Remember to control greediness with ? when appropriate. Appending ? to a quantifier (as in .*?) makes it match as little as possible.
  3. Make sure to check your case (for example, upper or lowercase): For example, typing \W when you mean \w will match exactly the characters you wanted to exclude.
  4. Watch out for metacharacters ( \, (, |, [, {, ^, $, *, +, ., and ?): If a metacharacter is part of your pattern, make sure you turn off its special meaning by prefixing it with \.
  5. Check your | conditions carefully: Make sure all the possible paths are appropriate.
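
The five pitfalls above can be reproduced in a few lines of core Perl. The following is only a minimal sketch: the sample strings and variable names are made up for illustration and do not come from the book.

```perl
use strict;
use warnings;

# 1. Quantifier choice: * accepts zero occurrences, + requires at least one.
my $star_matches = 'color' =~ /colou*r/ ? 1 : 0;   # matches: "u" may be absent
my $plus_matches = 'color' =~ /colou+r/ ? 1 : 0;   # fails: "u" is required

# 2. Greediness: .* takes as much as it can, .*? as little as it can.
my $html = '<b>bold</b> and <i>italic</i>';
my ($greedy) = $html =~ /<(.*)>/;    # runs to the last ">" in the string
my ($lazy)   = $html =~ /<(.*?)>/;   # stops at the first ">"

# 3. Case of escapes: \w is a word character, \W is everything else.
my @word_chars     = 'a-b' =~ /(\w)/g;
my @non_word_chars = 'a-b' =~ /(\W)/g;

# 4. Metacharacters: an unescaped + is a quantifier, not a plus sign.
my $unescaped = '1+1=2' =~ /1+1=2/  ? 1 : 0;   # fails
my $escaped   = '1+1=2' =~ /1\+1=2/ ? 1 : 0;   # matches
my $literal   = quotemeta('1+1=2');            # escapes the string for you
my $quoted    = '1+1=2' =~ /$literal/ ? 1 : 0; # matches

# 5. Alternation: | binds loosely, so group the alternatives.
my $loose = 'category' =~ /^cat|dog$/     ? 1 : 0;   # matches: read as ^cat OR dog$
my $tight = 'category' =~ /^(?:cat|dog)$/ ? 1 : 0;   # fails: whole string must be cat or dog

print "star=$star_matches plus=$plus_matches\n";
print "greedy=[$greedy] lazy=[$lazy]\n";
print "word=[@word_chars] non-word=[@non_word_chars]\n";
print "unescaped=$unescaped escaped=$escaped quoted=$quoted\n";
print "loose=$loose tight=$tight\n";
```

Comparing each pair side by side is a quick way to check which of the two behaviors a pattern under test actually has.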

Even with these guidelines, debugging a complex regular expression can still be a challenge, and one of the best, although time-consuming, ways to do this can be to actually draw a visual representation of how the regular expression should work, similar to that found in the state machine figures presented earlier in the chapter (Figure 1-2 through Figure 1-8). If drawing this type of schematic seems too arduous a task, you may want to consider using the GraphViz::Regex module.

GraphViz::Regex

GraphViz is a graphing program developed by AT&T for the purpose of creating visual representations of structured information such as computer code (http://www.research.att.com/sw/tools/graphviz/). Leon Brocard wrote the GraphViz Perl module, which serves as a Perl-based interface to the GraphViz program. GraphViz::Regex can be useful when coding complex regular expressions, since this module is able to create visual representations of regular expressions via GraphViz. The syntax for using this module is quite straightforward and is demonstrated in the following code snippet:

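use GraphViz::Regex;

# The pattern to visualize, passed to the constructor as a string.
my $regex = '(123|ab(c|C))';
my $graph = GraphViz::Regex->new($regex);
print $graph->as_jpeg;   # writes the JPEG image data to standard output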

When you first employ the GraphViz::Regex module, you place a call to the new constructor, which requires a string containing the regular expression that you want a graphical representation of. The new method then creates a GraphViz object that corresponds to this representation, and the object is assigned to $graph. Lastly, you print the graphical representation you created. This example writes the image to standard output as JPEG data, so you will normally redirect it to a file, but numerous other file types are supported, including GIF, PostScript, PNG, and bitmap.


Caution  The author of the module reports that there are incompatibilities between this module and Perl versions 5.005_03 and 5.7.1.


Tip  Another great tool for debugging regular expressions comes as a component of ActiveState’s programming IDE Komodo. Komodo contains the Rx Toolkit, which allows you to enter a regular expression and a string into each of its fields and which tells you if they do or do not match as you type. This can be a rapid way to determine how well a given expression will match a given string.


Using Regexp::Common

As you can imagine, certain patterns are fairly commonplace and will likely be repeatedly utilized. This is the basis behind Regexp::Common, which is a Perl module originally authored by Damian Conway and maintained by Abigail that provides a means of accessing a variety of regular expression patterns. Since writing regular expressions can often be tricky, you may want to check this module and see if a pattern suited to your needs is available. Table 1-7 lists all the expression pattern categories available in version 2.113 of this module.

 

Table 1-7. Regexp::Common Patterns

Pattern Type   Use
Balanced       Matches strings with parenthesized delimiters
Comment        Identifies code comments in 43 languages
Delimited      Matches delimited text
Lingua         Identifies palindromes
List           Works with lists of data
Net            Matches IPv4 and MAC Internet addresses
Number         Works with integers and reals
Profanity      Identifies obscene terms
URI            Identifies diversity of URI types
Whitespace     Matches leading and trailing whitespace
Zip            Matches ZIP codes

 

Although Table 1-7 provides a general idea of the different types of patterns, it is a good idea to look at the module description available at CPAN (http://www.cpan.org/). The module operates by generating hash values that correspond to different patterns, and these patterns are stored in the hash %RE. When using this module, you can access its predefined subpatterns by referencing the scalar value of a particular hash element. So, if you want to search for Perl comments in a file, you can employ the hash value stored in $RE{comment}{Perl}; or, if you want to search for real numbers, you can use $RE{num}{real}. This two-layer hash-of-hashes structure is fine for specifying most pattern types, but deeper layers are available in many cases. These deeper hash layers represent flags that modify the basic pattern in some form. For example, with numbers, in addition to specifying real or integer, you can also set a separator so that 1,234 is interpreted as a valid number pattern rather than just 1234. I will briefly cover some types of patterns, but complete coverage of every possible option could easily fill a small book on its own. I recommend you look up the module on CPAN and refer to the descriptions of the pattern types offered by each component module.
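
The following is a minimal sketch of the %RE lookups described above, assuming Regexp::Common is installed from CPAN. The sample strings and the %result hash are hypothetical, and the \A and \z anchors are added here so that each test must match the whole string.

```perl
use strict;
use warnings;
use Regexp::Common;   # CPAN module, not part of core Perl

# Two-layer lookups: pattern category, then pattern name.
my $real    = $RE{num}{real};
my $comment = $RE{comment}{Perl};

# A deeper layer is a flag: here, "," is allowed as a digit-group separator.
my $int_plain = $RE{num}{int};
my $int_sep   = $RE{num}{int}{-sep => ','};

my %result = (
    real_matches      => ('3.14'  =~ /\A$real\z/      ? 1 : 0),
    plain_takes_comma => ('1,234' =~ /\A$int_plain\z/ ? 1 : 0),
    sep_takes_comma   => ('1,234' =~ /\A$int_sep\z/   ? 1 : 0),
    comment_found     => ("my \$x = 1; # counter\n" =~ /$comment/ ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Storing a pattern in a scalar first, as done here, keeps the regular expression itself short and lets the same pattern be reused in several matches.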

Regexp::Common::Balanced

This namespace generates regular expressions that are able to match sequences located between balanced parentheses or brackets. The basic syntax needed to access these regular expressions is as follows:

$RE{balanced}{-parens=>'()[]{}'}

The first part of this hash value refers to the basic regular expression structure needed to match text between balanced delimiters. The second part is a flag that specifies the types of parentheses you want the regular expression to recognize. In this case, it is set to work with (), [], and {}. One application of such a regular expression is in the preparation of publications that contain citations, such as “(Smith et al., 1999).” An author may want to search a document for in-text citations in order to ensure they did not miss adding any to their list of references. You can easily accomplish this by passing the filename of the document to the segment of code shown in Listing 1-7.

Listing 1-7. Pulling Out the Contents of () from a Document

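#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;

# Read each line of the files named on the command line (or STDIN).
while (<>) {
    # With -keep, $1 holds the whole balanced match, delimiters included.
    /$RE{balanced}{-parens=>'()'}{-keep}/
        and print "$1\n";
}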
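
Listing 1-7 reports only the first parenthesized group on each line. If a line may hold several citations, one possible variation (a sketch only, assuming Regexp::Common is installed; the sample sentence and the @citations array are hypothetical) is to repeat the match with the /g modifier:

```perl
use strict;
use warnings;
use Regexp::Common;   # CPAN module, not part of core Perl

# Hypothetical sample line containing two in-text citations.
my $line = 'Early work (Smith et al., 1999) was later extended (Jones, 2001).';

# With /g, each pass of the loop resumes after the previous match, so
# every parenthesized group on the line is visited, not just the first.
my @citations;
while ($line =~ /$RE{balanced}{-parens=>'()'}{-keep}/g) {
    push @citations, $1;   # with -keep, $1 is the whole match, parentheses included
}

print "$_\n" for @citations;
```

Collecting the matches in an array rather than printing them immediately makes it easy to compare them against a reference list afterward.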


Note  A more detailed description of the module’s usage will follow in the sections “Standard Usage” and “Subroutine-Based Usage,” since each of the expression types can be accessed through code in the same manner.

 


 

Please check back next week for the conclusion to this article.



 
 