Parsing and Regular Expression Basics

Parsing helps us to extract the information we need from the mounds of data available. Regular expressions assist us in the hunt. This four-part article series will show you how to use Perl with regular expressions to accomplish your parsing tasks. It is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

The dawn of a new age is upon us, an information age, in which an ever-increasing and seemingly endless stream of new information is continuously generated. Information discovery and knowledge advancements occur at such rates that an ever-growing number of specialties is appearing, and in many fields it is impossible even for experts to master everything there is to know. Anyone who has ever typed a query into an Internet search engine has been a firsthand witness to this information explosion. Even the most mundane terms will likely return hundreds, if not thousands, of hits. The sciences, especially in the areas of genomics and proteomics, are generating seemingly insurmountable mounds of data.

Yet, one must also consider that this generated data, while not easily accessible to all, is often put to use, resulting in the creation of new ideas to generate even more knowledge or in the creation of more efficient means of data generation. Although the old adage “knowledge is power” holds true, and almost no one will deny that the knowledge gained has been beneficial, the sheer volume of information has created quite a quandary. Finding information that is exactly relevant to your specific needs is often not a simple task. Take a minute to think about how many searches you performed in which all the hits returned were both useful and easily accessible (for example, were among the top matches, were valid links, and so on). More than likely, your search attempts did not run this smoothly, and you needed to either modify your query or buckle down and begin to dig for the resources of interest.

Thus, one of the pressing questions of our time has been how do we deal with all of this data so we can efficiently find the information that is currently of interest to us? The most obvious answer to this question has been to use the power of computers to store these giant catalogs of information (for example, databases) and to facilitate searches through this data. This line of reasoning has led to the birth of various fields of informatics (for example, bioinformatics, health informatics, business informatics, and so on). These fields are geared around the purpose of developing powerful methods for storing and retrieving data as well as analyzing it.

In this book, I will explain one of the most fundamental techniques required to perform this type of data extraction and analysis, the technique of parsing. To do this, I will show how to utilize the Perl programming language, which has a rich history as a powerful text processing language. Furthermore, Perl is already widely used in many fields of informatics, and many robust parsing tools are readily available for Perl programmers in the form of CPAN modules. In addition to examining the actual parsing methods themselves, I will also cover many of these modules.

Parsing and Lexing

Before I begin covering how you can use Perl to accomplish your parsing tasks, it is essential to have a clear understanding of exactly what parsing is and how you can utilize it. Therefore, I will define parsing as the action of splitting up a data set into smaller, more meaningful units and uncovering some form of meaningful structure from the sequence of these units. To understand this point, consider the structure of a tab-delimited data file. In this type of file, data is stored in columns, and a tab separates consecutive columns (see Figure 1-1).


Figure 1-1.  A tab-delimited file

Reviewing this file, your eyes most likely focus on the numbers in each column and ignore the whitespace found between the columns. In other words, your eyes perform a parsing task by allowing you to visualize distinct columns of data. Rather than just taking the whole data set as a unit, you are able to break up the data set into columns of numbers that are much more meaningful than a giant string of numbers and tabs. While this example is simplistic, we carry out parsing actions such as this every day. Whenever we see, read, or hear anything, our brains must parse the input in order to make some kind of logical sense out of it. This is why parsing is such a crucial technique for a computer programmer—there will often be a need to parse data sets and other forms of input so that applications can work with the information presented to them.
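Perl makes this sort of column splitting trivial with the built-in split function. The following is a minimal sketch of the idea; the numeric values are hypothetical stand-ins for the columns of Figure 1-1:

#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical row from a tab-delimited file like that in Figure 1-1
my $line = "0.495\t0.637\t0.321";

# split breaks the row into distinct columns at each tab
my @columns = split /\t/, $line;
print join(" | ", @columns), "\n";   # prints: 0.495 | 0.637 | 0.321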

The following are common types of parsed data:

  1. Text files
  2. CSV files
  3. HTML
  4. XML
  5. RSS files
  6. Command-line arguments
  7. E-mail/Web page headers
  8. HTTP headers
  9. POP3 headers
  10. SMTP headers
  11. IMAP headers

To get a better idea of just how parsing works, you first need to consider that in order to parse data you must classify the data you are examining into units. These units are referred to as tokens, and their identification is called lexing. In Figure 1-1, the units are numbers, and a tab separates each unit; for many lexing tasks, such whitespace identification is adequate. However, for certain sophisticated parsing tasks, this breakdown may not be as straightforward. A recursive approach may also be warranted, since in more nested structures it becomes possible to find units within units. Math equations such as 4*(3+2) provide an ideal example of this. Within the parentheses, 3 and 2 behave as their own distinct units; however, when it comes time to multiply by 4, (3+2) can be considered as a single unit. In fact, it is in dealing with nested structures such as this example that full-scale parsers prove their worth. As you will see later in the “Using Regular Expressions” section, simpler parsing tasks (in other words, those with a known finite structure) often do not require full-scale parsers but can be accomplished with regular expressions and other like techniques.
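As a small preview of the regular expression approach, the following sketch tokenizes the previous equation with a single global match; the token pattern here is my own illustration rather than a full lexer:

#!/usr/bin/perl
use strict;
use warnings;

# Pull out numbers, operators, and parentheses as individual tokens
my @tokens = ('4*(3+2)' =~ /(\d+|[-+*\/^()])/g);
print "@tokens\n";   # prints: 4 * ( 3 + 2 )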


Note  Well-known examples of a lexer and a parser are the C-based Lex and Yacc programs that generally come bundled with Unix-based operating systems.


Parse::Lex

Before moving on to more in-depth discussions of parsers, I will introduce the Perl module Parse::Lex, which you can use to perform lexing tasks such as tokenizing the math equation listed previously.


Tip  Parse::Lex and the other Perl modules used in this book are all available from CPAN (http://www.cpan.org). If you are unfamiliar with working with CPAN modules, you can find information about downloading and installing Perl modules on a variety of operating systems at http://search.cpan.org/~jhi/perl-5.8.0/pod/perlmodinstall.pod. If you are using an ActiveState Perl distribution, you can also install Perl modules using the Perl Package Manager (PPM). You can obtain information about its use at http://aspn.activestate.com/ASPN/docs/ActivePerl/faq/ActivePerl-faq2.html.

For more detailed information about CPAN and about creating and using Perl modules, you will find that Writing Perl Modules for CPAN (Apress, 2002) by Sam Tregar is a great reference.


Philippe Verdret authored this module; the most current version as of this book’s publication is 2.15. Parse::Lex is an object-oriented lexing module that allows you to split input into various tokens that you define. Take a look at the basics of how this module works by examining Listing 1-1, which will lex simple math equations, such as 18.2+43/6.8.

Listing 1-1. Using Parse::Lex

#!/usr/bin/perl

use Parse::Lex;
#defines the tokens
@token=qw(
   BegParen [\(]
   EndParen [\)]
   Operator [-+*/^]
   Number   -?\d+\.?\d*
   );
$lexer=Parse::Lex->new(@token); #Specifies the lexer
$lexer->from(\*STDIN); #Specifies the input source (a filehandle reference)

TOKEN:
while(1){ #1 will be returned unless EOI
   $token=$lexer->next;
   if(not $lexer->eoi){
      print $token->name . " " . $token->text . "\n";
   }
   else {last TOKEN;}
}

The first step in using this module is to create definitions of what constitutes an acceptable token. Token arguments for this module usually consist of a token name, such as BegParen previously, followed by a regular expression. Within the module itself, these tokens are stored as instances of the Parse::Token class. After you specify your tokens, you next need to specify how your lexer will operate. You can accomplish this by passing a list of arguments to the lexer via the new method. In Listing 1-1, this list of arguments is contained in the @token array. When creating the argument list, it is important to consider the order in which the token definitions are placed, since an input value will be classified as a token of the first type it is able to match. Thus, when using this module, it is good practice to list the strictest definitions first and then move on to the more general definitions. Otherwise, the general definitions may match values before the stricter comparisons even get a chance to be made.
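To illustrate this ordering principle, consider the following sketch; the token names are hypothetical, and it assumes (as the Parse::Lex documentation indicates) that the from method also accepts a literal string:

#!/usr/bin/perl

use Parse::Lex;
# The stricter Decimal definition is listed before the general Integer
# definition; reversed, "43.4" would lex as Integer 43, stranding ".4"
@token=qw(
   Decimal  \d+\.\d+
   Integer  \d+
   );
$lexer=Parse::Lex->new(@token);
$lexer->from("43.4");
$token=$lexer->next;
print $token->name . " " . $token->text . "\n";   # prints: Decimal 43.4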

Once you have specified the criteria that your lexer will operate on, you next define the source of input into the lexer by using the from method. The default for this property is STDIN, but it could also be a filename, a file handle, or a string of text (in quotes). Next you loop through the values in your input until you reach the eoi (end of input) condition and print each token and its corresponding type. If, for example, you entered 43.4*15^2 as input, the output should look like this:

Number 43.4
Operator *
Number 15
Operator ^
Number 2

In Chapter 3, where you will closely examine the workings of full-fledged parsers, I will employ a variant of this routine to aid in building a math equation parser.

Regular expressions are one of the most useful tools for lexing, but they are not the only method. As mentioned earlier, for some cases you can use whitespace identification, and for others you can bring dictionary lists into play. The choice of lexing method depends on the application. For applications where all tokens are of a similar type, like the tab-delimited text file discussed previously, whitespace pattern matching is probably the best bet. For cases where multiple token types may be employed, regular expressions or dictionary lists are better choices. Of the two, regular expressions are the more versatile and will suit most cases, while dictionary lists are better suited to specialized types of lexing, where it is important to identify only select tokens.

One example of a case where a dictionary list is useful is the recent bioinformatics trend of mining medical literature for chemical interactions. For instance, many scientists are interested in the following:

<Chemical A> <operates on> <Chemical B>

In other words, they just want to determine how chemical A interacts with chemical B. When considering this, it becomes obvious that it is not necessary to tokenize and parse the entire textual content of any one scientific paper. Thus, an informatician coding such a routine might want to use dictionary lists to identify the chemicals as well as the terms that describe the interaction. A dictionary list is simply a listing of all the possible values for a given element of a sentence. For example, rather than operates on, I could also fill in reacts with, interacts with, or a variety of other terms and have a program check for the occurrence of any of those terms. Later, in the section “Capturing Substrings,” I will cover this example in more depth.
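A minimal sketch of this idea might build a single alternation out of the dictionary entries and capture the chemicals on either side; the sentence and the term list here are hypothetical:

#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical dictionary list of interaction terms
my @terms = ('operates on', 'reacts with', 'interacts with');

# Join the entries into one alternation, quoting any metacharacters
my $interaction = join '|', map { quotemeta } @terms;

my $sentence = 'Aspirin interacts with warfarin.';
if ($sentence =~ /(\w+)\s+($interaction)\s+(\w+)/) {
    print "Chemical A: $1\n";
    print "Interaction: $2\n";
    print "Chemical B: $3\n";
}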

Using Regular Expressions

As you saw in the previous Parse::Lex example, regular expressions provide a robust tool for token identification, but their usefulness goes far beyond that. In fact, for many simple parsing tasks, a regular expression alone may be adequate to get the job done. For example, if you want to perform a simple parsing/data extraction task such as pulling an e-mail address out of a Web page, you can easily accomplish this with a regular expression that identifies a pattern similar to the following:

[alphanumeric characters]@[alphanumeric characters.com]


Caution  The previous expression is a simplification provided to illustrate the types of pattern matching for which you can use regular expressions. A real-world e-mail-matching expression would need to be more complex to account for other factors, such as alternate endings (for example, .net, .gov) as well as the presence of metacharacters in either alphanumeric string. Additionally, a variety of less common alternative e-mail address formats may also warrant consideration.
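In Perl, a correspondingly simplified match might look like the following sketch; the address and pattern are illustrative only and carry all the caveats noted in the caution:

#!/usr/bin/perl
use strict;
use warnings;

# A deliberately naive pattern: alphanumerics, an @, alphanumerics, ".com"
my $text = 'Send questions to webmaster@example.com anytime.';
if ($text =~ /([A-Za-z0-9]+\@[A-Za-z0-9]+\.com)/) {
    print "Found address: $1\n";
}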


The following sections will explain how to create such regular expressions in the format Perl is able to interpret. To make regular expressions and their operation a little less mysterious, however, I will approach this topic by first explaining how Perl’s regular expression engine operates. Perl’s regular expression engine functions by using a programming paradigm known as a state machine, described in depth next.

A State Machine

A simple definition of a state machine is one that will sequentially read in the symbols of an input word. After reading in a symbol, it will decide whether the current state of the machine is one of acceptance or nonacceptance. The machine will then read in the next symbol and make another state decision based upon the previous state and the current symbol. This process will continue until all symbols in the word are considered. Perl’s regular expression engine operates as a state machine (sometimes referred to as an automaton) for a given string sequence (that is, the word). In order to match the expression, all of the acceptable states (that is, characters defined in the regular expression) in a given path must be determined to be true. Thus, when you write a regular expression, you are really providing the criteria the differing states of the automaton need to match in order to find a matching string sequence. To clarify this, let’s consider the pattern /123/ and the string 123 and manually walk through the procedure the regular expression engine would perform. Such a pattern is representative of the simplest type of case for your state machine. That is, the state machine will operate in a completely linear manner. Figure 1-2 shows a graphical representation of this state machine.


Note  It is interesting to note that a recursive descent parser evaluates the regular expressions you author. For more information on recursive descent parsers, see Chapter 5.



Figure 1-2.  A state machine designed to match the pattern /123/

In this case, the regular expression engine begins by examining the first character of the string, which is a 1. The required first state of the automaton is also a 1; therefore, a match is found, and the engine moves on by comparing the second character, which is a 2, to the second state. Here too a match is found, so the third character is examined and another match is made. When this third match is made, all states in the state machine are satisfied, and the string is deemed a match to the pattern.

In this simple case, the string, as written, provided an exact match to the pattern. Yet, this is hardly typical in the real world, so it is important to also consider how the regular expression will operate when the character in question does not match the criterion of a particular state in the state machine. In this instance, I will use the same pattern (/123/) and hence the same state machine as in the previous example, only this time I will try to find a match within the string 4512123 (see Figure 1-3).

This time the regular expression engine begins by comparing the first character in the string, 4, with the first state criterion. Since the criterion is a 1, no match is found. When this mismatch occurs, the regular expression starts over by trying to compare the string contents beginning with the character in the second position (see Figure 1-4).

As in the first case, no match is found between the criterion for the first state and the character in question (5), so the engine moves on to make a comparison beginning with the third character in the string (see Figure 1-5).


Figure 1-3.  The initial attempt at comparing the string 4512123 to the pattern /123/


Figure 1-4.  The second attempt at comparing the string 4512123 to the pattern /123/


Figure 1-5.  The third attempt at comparing the string 4512123 to the pattern /123/

In this case, since the third character is a 1, the criterion for the first state is satisfied, and thus the engine is able to move on to the second state. The criterion for the second state is also satisfied, so therefore the engine will next move on to the third state. The 1 in the string, however, does not match the criterion for state 3, so the engine then tries to match the fourth character of the string, 2, to the first state (see Figure 1-6).


Figure 1-6.  The fourth attempt at comparing the string 4512123 to the pattern /123/

As in previous cases, the first criterion is not satisfied by the 2, and consequently the regular expression engine will begin to examine the string beginning with the fifth character. The fifth character satisfies the criterion for the first state, and therefore the engine proceeds on to the second state. In this case, a match for the criterion is also present, and the engine moves on to the third state. The final character in the string matches the third state criterion, and hence a match to the pattern is made (see Figure 1-7).


Figure 1-7.  A match is made to the pattern /123/.
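You can confirm this restart-and-advance behavior from Perl itself; in the following sketch, the built-in @- array reports the offset at which the successful match finally began:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "4512123";
if ($string =~ /123/) {
    # $-[0] holds the zero-based offset where the match began
    print "Match begins at offset $-[0]\n";   # prints: Match begins at offset 4
}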

The previous two examples deal with a linear state machine. However, you are not limited to this type of regular expression setup. It is possible to establish alternate paths within the regular expression engine. You can set up these alternate paths by using the alternation (“or”) operator (|) and/or parentheses, which define subpatterns. I will cover more about the specific elements of regular expression syntax in the upcoming sections “Pattern Matching,” “Quantifiers,” and “Predefined Subpatterns.” For now, consider the expression /123|1b(c|C)/, which specifies that the matching pattern can be 123, 1bc, or 1bC (see Figure 1-8).


Figure 1-8.  The state machine defined by the pattern /123|1b(c|C)/


Note  Parentheses not only define subpatterns but can also capture substrings, which I will discuss in the upcoming “Capturing Substrings” section.


As you can see, this state machine can follow multiple paths to reach the goal of a complete match. It can choose to take the top path of 123, or it can take one of the bottom paths of 1bc or 1bC. To get an idea of how this works, consider the string 1bc and see how the state machine would determine this to be a match. It would first find that the 1 matches the first state condition, so it would then proceed to match the next character (b) to the second state condition of the top path (2). Since this is not a match, the regular expression engine will backtrack to the location of the true state located before the “or” condition. The engine will backtrack further, in this case to the starting point, only if all the available paths are unable to provide a correct match. From this point, the regular expression engine will proceed down an alternate path, in this case the bottom one. As the engine traverses down this path, the character b is a match for the second state of the bottom path. At this point, you have reached a second “or” condition, so the engine will check for matches along the top path first. In this case, the engine is able to match the character c with the required state c, so no further backtracking is required, and the string is considered a perfect match.
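A short sketch makes these alternate paths concrete; note that I anchor the pattern with ^ and $ (my own addition) so that each test string must match in full:

#!/usr/bin/perl
use strict;
use warnings;

# One string for each path through Figure 1-8, plus one mismatch
for my $string (qw(123 1bc 1bC 1bd)) {
    if ($string =~ /^(123|1b(c|C))$/) {
        print "$string matches\n";
    }
    else {
        print "$string does not match\n";
    }
}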

When specifying regular expression patterns, it is also beneficial to be aware of the notations [] and [^], since these allow you to specify ranges of characters that will serve as an acceptable match or an unacceptable one. For instance, if you had a pattern containing [ABCDEF] or [A-F], then A, B, C, D, E, and F would all be acceptable matches. However, a or G would not be, since neither is included in the acceptable range.


Tip  Perl’s regular expression patterns are case-sensitive by default. So, A is different from a unless a modifier is used to declare the expression case-insensitive. See the “Modifiers” section for more details.


If you want to specify characters that would be unacceptable, you can use the [^] syntax. For example, if you want the expression to be true for any character but A, B, C, D, E, and F, you can use one of the following expressions: [^ABCDEF] or [^A-F].
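The following sketch exercises both class styles on a few sample characters:

#!/usr/bin/perl
use strict;
use warnings;

# [A-F] accepts only A through F; [^A-F] accepts any character but those
for my $char (qw(A F a G)) {
    print "$char matches [A-F]\n"  if $char =~ /[A-F]/;
    print "$char matches [^A-F]\n" if $char =~ /[^A-F]/;
}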

Pattern Matching

Now that you know how the regular expression engine functions, let’s look at how you can invoke this engine to perform pattern matches within Perl code. To perform pattern matches, you first need to acquaint yourself with the binding operators, =~ and !~. The string you seek to bind (match) goes on the left, and the operator it is going to be bound to goes on the right. You can employ three types of operators on the right side of this statement. The first is the pattern match operator, m//, or simply // (the m is implied and can be left out), which will test whether the string value matches the supplied expression, such as 123 matching /123/, as shown in Listing 1-2. The remaining two are s/// and tr///, which allow for substitution and transliteration, respectively. For now, I will focus solely on matching and discuss the other two alternatives later. A =~ operation returns a value indicating whether the regular expression operator was able to successfully match the string. The !~ operator functions in an identical manner, but it checks whether the string is unable to match the specified operator. Therefore, if a =~ operation returns that a match was successful, the corresponding !~ operation will not return a successful result, and vice versa. Let’s examine this a little closer by considering the simple Perl script in Listing 1-2.

Listing 1-2.  Performing Some Basic Pattern Matching 

#!/usr/bin/perl

$string1="123";
$string2="ABC";
$pattern1="123";

if($string1=~/$pattern1/){
    print "123=123n";
}

if($string2!~/123/){
    print "ABC does not match /123/n";
}

if("234"=~/$pattern1|ABC/){
    print "This is 123 or ABCn";
}
else{print "This is neither 123 nor ABC";}

The script begins by declaring three different scalar variables; the first two hold string values that will be matched against various regular expressions, and the third serves as storage for a regular expression pattern. Next you use a series of conditional statements to evaluate the strings against a series of regular expressions. In the first conditional, the value stored in $string1 matches the pattern stored in $pattern1, so the print statement is able to execute successfully. In the next conditional, $string2 does not match the supplied pattern, but the operation was conducted using the !~ operator, which tests for mismatches, and thus this print statement can also execute. The third conditional does not return a match, since the string 234 matches neither alternative in the regular expression. Accordingly, in this case the print statement of the else condition will execute instead. A quick look at the output of this script confirms that the observed behavior is in agreement with what was anticipated:

123=123
ABC does not match /123/
This is neither 123 nor ABC

Operations similar to these serve as the basis of pattern matching in Perl. However, the basic types of patterns you have learned to create so far have only limited usefulness. To gain more robust pattern matching capabilities, you will now build on these basic concepts by further exploring the richness of the Perl regular expression syntax.

Please check back next week for the continuation of this article.

