Completing Regular Expression Basics

In this conclusion to a four-part series on parsing and regular expression basics in Perl, we finish our study of regular expressions; you’ll even learn how to create your own. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

Regexp::Common::Comments

This module generates regular expressions that match comments in code written in a variety of programming languages (currently 43). The syntax for these regular expressions is as follows, where {comment} selects the comment-matching functionality and {LANGUAGE} names the particular programming language:

$RE{comment}{LANGUAGE}

For example, to match Perl and C++ comments, you can use the following (the C++ key is quoted because it is not a valid bareword):

$RE{comment}{Perl}
$RE{comment}{'C++'}
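As a minimal sketch of the comment patterns in use, the following script collects the Perl comments from a block of text. It assumes Regexp::Common is installed from CPAN and uses the key spelled comment, as in the module's own documentation; the sample text and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

# Hypothetical sample text containing two Perl comments.
my $source = <<'END';
my $total = 0;   # running total
$total += 5;
# print the result
print "$total\n";
END

# In list context with /g, the match returns every captured comment.
my @comments = $source =~ /($RE{comment}{Perl})/g;

for my $comment (@comments) {
    $comment =~ s/\s+\z//;    # trim any trailing whitespace or newline
    print "Comment: $comment\n";
}
```

Because the pattern only looks for a # and the text that follows it, a # inside a string literal would also be reported, so treat this as a quick scan rather than a full parser.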

Regexp::Common::Delimited

This base module provides the functionality required to match delimited strings. The syntax is similar to that shown for the Text::Balanced module:

$RE{delimited}{-delim=>'"'}

In this case, the -delim flag specifies the delimiter that the regular expression will search for and is a required flag, since the module does not have a default delimiter.


Note  Table 1-8 summarizes all the Regexp::Common flags.


Regexp::Common::List

The List module can match lists of data such as tab-separated lists, lists of numbers, lists of words, and so on. The type of list matched depends on the flags specified in the expression. Its syntax is as follows:

$RE{list}{-pat}{-sep}{-lastsep}

The -pat flag specifies the pattern that will correspond to each substring contained in the list. The pattern can be a regular expression such as \w+ or another hash value created by the Regexp::Common module. The -sep flag defines the separator that may be present between consecutive list elements, such as a tab or a space; by default, it is a comma optionally surrounded by whitespace. The -lastsep flag specifies a separator that may be present between the last two elements in the list. By default, this value is the same as that specified by -sep. As an example, if you wanted to search a document for lists written in the Item A, Item B, …, and Item N format, you could identify them using the following expression:

$RE{list}{-pat}{-sep=>', '}{-lastsep=>', and '}
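The following sketch shows one way such a list pattern might be applied. It assumes Regexp::Common is installed; the sample sentence, the \w+ element pattern, and the variable names are illustrative choices, and the -keep flag is added so that the whole list lands in $1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

# Hypothetical sentence containing an "A, B, and C" style list.
my $text = 'Please buy apples, oranges, and pears today.';

# Each element is a run of word characters; -keep captures the whole list.
my $list_re = $RE{list}{-pat => '\w+'}{-sep => ', '}{-lastsep => ', and '}{-keep};

my $found;
if ($text =~ /$list_re/) {
    $found = $1;
    print "Found list: $found\n";
}
```

Since \w+ cannot span a space, multiword items such as "Item A" would need a broader element pattern.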

Regexp::Common::Net

The Net module generates hash values that contain patterns designed to match IPv4 and MAC addresses, and the first hash key specifies which type to match. The next hash key allows you to specify whether the address will be decimal (default), hexadecimal, or octal. You can also use the -sep flag to specify a separator, if required. The following is a sample:

$RE{net}{IPv4}{hex}

This module comes in handy if you want to track where the e-mails you receive originated. This information is found in most e-mail headers in a format similar to the following:

from [64.12.116.134] by web51102.mail.yahoo.com via HTTP;
Mon, 29 Nov 2004 23:33:11 -0800 (PST)

You can easily parse this header information to find the IPv4 address 64.12.116.134 by using the following expression:

$RE{net}{IPv4}
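A short sketch of that header parse might look like the following. It assumes Regexp::Common is installed and wraps the pattern in ordinary capturing parentheses; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

# The sample header shown above.
my $header = <<'END';
from [64.12.116.134] by web51102.mail.yahoo.com via HTTP;
Mon, 29 Nov 2004 23:33:11 -0800 (PST)
END

# Collect every IPv4 address that appears in the header text.
my @addresses = $header =~ /($RE{net}{IPv4})/g;

print "Originating address: $_\n" for @addresses;
```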

Regexp::Common::Number

The Number module can match a variety of number types, including integers, reals, hexadecimals, octals, binaries, and even Roman numerals. The base syntax takes the following form, but you should also be aware of the flags it accepts:

$RE{num}{real}

For example, you can apply the -base flag to change the base of the number to something other than the default of base 10. The -radix flag specifies the pattern that will serve as the decimal point in case you desire something other than the default value ( . ). If you are dealing with significant figures, you may find the -places flag useful, since it can specify the number of places after the decimal point. As in previous modules, -sep specifies separators; however, in this module, you can also specify the appropriate number of digits that should be present between separators using the -group flag. The default value for this flag is 3 , so if you specified a comma ( , ) as your separator, your expression would be able to recognize values such as 123,456,789 . The -expon flag specifies the pattern that will be used to specify that an exponent is present. The default value for this property is [Ee] .
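To make a few of these flags concrete, here is a small sketch that tests sample strings against three variations of the real-number pattern. It assumes Regexp::Common is installed; the sample values and variable names are illustrative, and each pattern is anchored so the entire string must match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

my $plain   = $RE{num}{real};                  # default radix and exponent
my $grouped = $RE{num}{real}{-sep => ','};     # comma between groups of 3 digits
my $comma   = $RE{num}{real}{-radix => ','};   # comma as the decimal point

# Each case pairs a sample string with the pattern it is tested against.
my @cases = (
    [ '1.5e10',         $plain   ],
    [ '123,456,789.25', $grouped ],
    [ '1234,56',        $grouped ],
    [ '3,14',           $comma   ],
);

my @results;
for my $case (@cases) {
    my ($string, $re) = @$case;
    my $ok = $string =~ /^$re$/ ? 1 : 0;
    push @results, $ok;
    printf "%-16s %s\n", $string, $ok ? 'matches' : 'does not match';
}
```

The third case is there to show that, with a separator in force, digits must fall into groups of the size given by -group.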

Universal Flags

As you saw in the previous sections, many of the base modules have their own flags, which can be used to further refine the pattern your regular expression will match (see Table 1-8). You can use two additional flags, however, with almost all base modules. These flags are the -i flag and the -keep flag. The -i flag makes the regular expression insensitive to alphabetic case so the expression can match both lowercase and capital letters. You can use the -keep flag for pattern capturing. If you specify -keep , the entire match to the pattern is generally stored in $1 . In many cases, $2 , $3 , and other variables are also set, but these are set in a module-specific manner.

 

Table 1-8. Regexp::Common Flags

Flag | Use | Module(s)
-sep | Specifies a separator | Net, List, and Number
-lastsep | Specifies the last separator of a list | List
-base | For numbers, makes the base something other than base 10 | Number
-radix | Makes a decimal point something other than . | Number
-places | Specifies the number of places after a decimal point | Number
-group | For numbers, specifies the number of digits that should be present between separators | Number
-expon | Specifies the exponent pattern | Number
-i | Makes the regular expressions case insensitive | All
-keep | Enables substring capturing | All
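As a brief illustration of -keep and its module-specific captures, the sketch below applies it to the IPv4 pattern, where $1 holds the whole address and $2 through $5 hold the individual octets. It assumes Regexp::Common is installed; the header line and variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

# Sample header line containing a single IPv4 address.
my $header = 'from [64.12.116.134] by web51102.mail.yahoo.com via HTTP;';

my ($address, @octets);
if ($header =~ /$RE{net}{IPv4}{-keep}/) {
    # With -keep, $1 is the entire match and $2..$5 are the four octets.
    ($address, @octets) = ($1, $2, $3, $4, $5);
    print "Address: $address\n";
    print "Octets:  @octets\n";
}
```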


Standard Usage

You can use the module's patterns in your source code in a couple of ways. The first is referred to as the standard usage method, and its syntax resembles the regular expressions you have already seen: the expression is placed between the slashes of the match operator (//). The only difference is that rather than placing your own regular expression between the slashes, you place one of the module's hash values. Consider the following segment of text:

Bob said "Hello". James
responded "Hi, how are you".
Bob replied "Fine and you".

Now let’s save this text to a file and execute the Perl code shown in Listing 1-8, making sure to pass the name of the file you just saved as an argument.

Listing 1-8. Pulling Quotes Out of a Document

#!/usr/bin/perl -w
use Regexp::Common;

# Print the first double-quoted string found on each input line.
while(<>){
    /$RE{delimited}{-delim=>'"'}{-keep}/
    and print "$1\n";
}

This short piece of code reads through the contents of the file and identifies the quotes present in the text. Since you also specified the -keep flag, you are able to capture the quotes and print them. Thus, the output for this script should be similar to the following:

"Hello"
"Hi, how are you"
"Fine and you"

Subroutine-Based Usage

In addition to the standard usage, you can also access the functionality of this module through a subroutine-based interface, which allows you to perform a matching operation with a syntax similar to a procedural call. If you were to recode the previous example using this alternative syntax, it would look like Listing 1-9.

Listing 1-9. Pulling Quotes Out via a Subroutine

#!/usr/bin/perl -w
use Regexp::Common 'RE_ALL';

# Same task as Listing 1-8, using the subroutine-based interface.
while(<>){
    $_ =~ RE_delimited(-delim=>'"',-keep)
    and print "$1\n";
}

You should note several important things if you choose to use this syntax instead. First, when you load the Regexp::Common module, you must add RE_ALL to the use line so that Perl can recognize the alternative syntax. Without it, you will receive a compilation error saying the subroutines are undefined. Second, you must explicitly write $_ =~ to perform the matching operation. Lastly, the flags are passed as arguments separated by commas. Accessing the regular expressions this way can lead to faster execution times, since this method returns actual regular expressions rather than objects that must be interpolated.

In-Line Matching and Substitution

I will cover these two methods together since they have similar syntax and use an object-oriented interface. In terms of basic pattern matching, they offer no real advantage other than allowing you to write code that may be somewhat easier to read; their syntax is as follows:

if($RE{num}{int}->matches($SomeNumber)){
    print "$SomeNumber is an Integer";
}

This interface also allows you to easily perform substitutions on a string without changing the original string. For example:

$SubstitutedString=$RE{num}{real}->subs($Original=>$Substitution);

In this case, $SubstitutedString is a new string that is going to be assigned the value of the $Original string with all substitutions already made, and the $Substitution string specifies the string that is going to be put in place of the characters that were able to match the pattern.
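The two methods can be tried together in a short sketch like the one below. It assumes Regexp::Common is installed; the sample string, the replacement text, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;    # assumes the CPAN module is installed

# Hypothetical input containing a single real number.
my $original = 'Total: 42.5 units';

# matches() reports whether the pattern is found in the string.
if ($RE{num}{real}->matches($original)) {
    print "Found a number in: $original\n";
}

# subs() returns a modified copy and leaves $original untouched.
my $masked = $RE{num}{real}->subs($original => 'N');

print "Original: $original\n";
print "Masked:   $masked\n";
```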

Creating Your Own Expressions

The Regexp::Common module does not limit you to just the patterns that come with it. You also have the ability to create your own regular expressions, at run time, for use within the Regexp::Common module. For example, Regexp::Common does not yet support phone numbers, so let’s begin to create a Regexp::Common phone number entry (see Listing 1-10).

Listing 1-10. Creating Your Own Regexp::Common Expression

#!/usr/bin/perl -w
use Regexp::Common qw /pattern/;

# Register a phone number pattern such as (555) 123-4567.
pattern name=>[qw(phone)],
    create=>q/(?k:\s?(\(\d{3}\))[-\s.](\d{3}[-.]\d{4}))/;

while(<>){
    /$RE{phone}{-keep}/ and print "$1\n";
}


Note  You may have noticed that the pattern contains the sequence of characters ?k: in it. Under normal circumstances, capturing through parentheses is not preserved in Regexp::Common , since capturing parentheses are processed out. The ?k: sequence tells the module not to process out these parentheses when the -keep flag is present. This is why you were able to print phone numbers by using $1 in the previous example.


To begin, you must first tell Perl you are going to utilize the pattern subroutine of the Regexp::Common module. Next, you must create a name argument that will specify the name of the pattern and any flags it may take. In this case, the pattern is named phone . If you want to add additional names and/or flags, you can specify them as follows:

pattern name=>[qw(phone book -flag)]

This specifies an entry of $RE{phone}{book}{-flag} .

After you name your pattern, you must next specify a value for the create argument. This argument is the only other required argument and can take either a string that is to be returned as a pattern (as previously) or a reference to a subroutine that will create the pattern. Two optional arguments also take subroutine references. These arguments are match and subs, and the provided subroutines dictate what occurs when the matching and substitution methods, matches and subs (respectively), are called. Lastly, one more optional argument, version, can be assigned a Perl version number. If the version of Perl is older than the supplied argument, the script will not run and a fatal error will be returned.
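As a sketch of the subroutine form of create, the example below registers a hypothetical $RE{phone}{local} pattern whose separator comes from a -sep flag with a default of a hyphen. It assumes Regexp::Common is installed and relies on the module passing the hash of flag values as the second argument to the create subroutine; the pattern name, flag, and sample numbers are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common qw(pattern);    # assumes the CPAN module is installed

# Hypothetical pattern: a seven-digit local number such as 555-1234.
# The '-sep=-' entry declares a -sep flag whose default value is a hyphen.
pattern name   => [ qw(phone local), '-sep=-' ],
        create => sub {
            my $flags = $_[1];                     # hash ref of flag values
            my $sep   = quotemeta $flags->{-sep};  # treat the separator literally
            return '(?k:[0-9]{3}' . $sep . '[0-9]{4})';
        };

my $hyphen_re = $RE{phone}{local};                 # uses the default separator
my $dot_re    = $RE{phone}{local}{-sep => '.'};    # overrides the separator

for my $number ('555-1234', '555.1234') {
    print "$number: ",
        ($number =~ /^$hyphen_re$/ ? 'hyphen form' : 'not hyphen form'), ', ',
        ($number =~ /^$dot_re$/    ? 'dot form'    : 'not dot form'), "\n";
}
```

Building the pattern in a subroutine is what lets a single entry react to its flags; a plain string, as in Listing 1-10, is enough when no flags are involved.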

Summary

This chapter covered how to syntactically construct regular expressions and how you can call upon these expressions within your Perl scripts. Furthermore, I discussed the roles of the different quantifiers, assertions, and predefined subpatterns, as well as how best to debug regular expressions. Lastly, the chapter covered how the Perl module Regexp::Common works and how you can utilize it to locate elements of interest.

Now that you have an idea of how you can use regular expressions to match, and hence identify, portions of strings, you are more prepared to tackle the topics of tokens and grammars in greater depth as you delve into the next chapter. Chapter 2 will introduce you to the idea of generative grammars by covering the Chomsky hierarchy of grammars. The upcoming chapter will also demonstrate how you can use Perl code in conjunction with a grammar to generate sentences that comply with the rules specified in the grammar.  
