Before I begin covering how you can use Perl to accomplish your parsing tasks, it is essential to have a clear understanding of exactly what parsing is and how you can utilize it. Therefore, I will define parsing as the action of splitting up a data set into smaller, more meaningful units and uncovering some form of meaningful structure from the sequence of these units. To understand this point, consider the structure of a tab-delimited data file. In this type of file, data is stored in columns, and a tab separates consecutive columns (see Figure 1-1).
Reviewing this file, your eyes most likely focus on the numbers in each column and ignore the whitespace found between the columns. In other words, your eyes perform a parsing task by allowing you to visualize distinct columns of data. Rather than just taking the whole data set as a unit, you are able to break up the data set into columns of numbers that are much more meaningful than a giant string of numbers and tabs. While this example is simplistic, we carry out parsing actions such as this every day. Whenever we see, read, or hear anything, our brains must parse the input in order to make some kind of logical sense out of it. This is why parsing is such a crucial technique for a computer programmer—there will often be a need to parse data sets and other forms of input so that applications can work with the information presented to them.
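The column-splitting your eyes perform on a tab-delimited file is trivial to express in Perl. The following is a minimal sketch (the sample data here is invented for illustration, standing in for the file in Figure 1-1) that recovers the columns by splitting each line on the tab character:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample tab-delimited lines, standing in for the file in Figure 1-1
my @lines = ("12\t34\t56", "78\t90\t11");

for my $line (@lines) {
    # Split on the tab character to recover the individual columns
    my @columns = split /\t/, $line;
    print join(" | ", @columns), "\n";
}
# Output:
# 12 | 34 | 56
# 78 | 90 | 11
```

Each line becomes a list of column values, which is exactly the "breaking the data set into meaningful units" that the text describes.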
The following are common types of parsed data:
To get a better idea of just how parsing works, you first need to consider that in order to parse data you must classify the data you are examining into units. These units are referred to as tokens, and their identification is called lexing. In Figure 1-1, the units are numbers, and a tab separates each unit; for many lexing tasks, such whitespace identification is adequate. However, for certain sophisticated parsing tasks, this breakdown may not be as straightforward. A recursive approach may also be warranted, since in more nested structures it becomes possible to find units within units. Math equations such as 4*(3+2) provide an ideal example of this. Within the parentheses, 3 and 2 behave as their own distinct units; however, when it comes time to multiply by 4, (3+2) can be considered as a single unit. In fact, it is in dealing with nested structures such as this example that full-scale parsers prove their worth. As you will see later in the “Using Regular Expressions” section, simpler parsing tasks (in other words, those with a known finite structure) often do not require full-scale parsers but can be accomplished with regular expressions and other like techniques.
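Before turning to a dedicated lexing module, it is worth seeing that a plain regular expression can already perform the token-identification step for an equation like the one above. In the following sketch, the token type names (Number, Operator, Paren) are my own labels chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $equation = '4*(3+2)';

# Match one unit at a time; \G anchors each match where the last one ended.
# The order of alternatives matters, just as token order matters in a lexer.
while ($equation =~ /\G(\d+(?:\.\d+)?|[-+*\/^]|[()])/g) {
    my $token = $1;
    my $type  = $token =~ /^\d/    ? 'Number'
              : $token =~ /^[()]$/ ? 'Paren'
              :                      'Operator';
    print "$type: $token\n";
}
# Output:
# Number: 4
# Operator: *
# Paren: (
# Number: 3
# Operator: +
# Number: 2
# Paren: )
```

Note that this only identifies the units; recognizing that (3+2) forms a nested sub-unit is the parser's job, not the lexer's.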
Note Well-known examples of a lexer and a parser are the C-based Lex and Yacc programs, which generally come bundled with Unix-based operating systems.

Before moving on to more in-depth discussions of parsers, I will introduce the Perl module Parse::Lex, which you can use to perform lexing tasks such as lexing the math equation listed previously.
Tip Parse::Lex and the other Perl modules used in this book are all available from CPAN (http://www.cpan.org). If you are unfamiliar with working with CPAN modules, you can find information about downloading and installing Perl modules on a variety of operating systems at http://search.cpan.org/~jhi/perl-5.8.0/pod/perlmodinstall.pod. If you are using an ActiveState Perl distribution, you can also install Perl modules using the Perl Package Manager (PPM). You can obtain information about its use at http://aspn.activestate.com/ASPN/docs/ActivePerl/faq/ActivePerl-faq2.html.
For more detailed information about CPAN and about creating and using Perl modules, you will find that Writing Perl Modules for CPAN (Apress, 2002) by Sam Tregar is a great reference.
Philippe Verdret authored this module; the most current version as of this book’s publication is version 2.15. Parse::Lex is an object-oriented lexing module that allows you to split input into various tokens that you define. Take a look at the basics of how this module works by examining Listing 1-1, which will parse simple math equations, such as 18.2+43/6.8.
Listing 1-1. Using Parse::Lex
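The code of Listing 1-1 is not reproduced in this excerpt, so the following is only a sketch of how a Parse::Lex-based equation lexer along the lines described below might look. The token name BegParen comes from the discussion that follows; the other token names and regular expressions are my own assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::Lex;    # available from CPAN

# Token definitions: a name followed by a regular expression.
# Stricter, more specific definitions should come before general ones.
my @token = (
    BegParen => '[(]',
    EndParen => '[)]',
    Operator => '[-+*/^]',
    Number   => '\d+(?:\.\d+)?',
);

my $lexer = Parse::Lex->new(@token);
$lexer->from(shift @ARGV);    # lex the equation given on the command line

while (1) {
    my $token = $lexer->next;
    last if $lexer->eoi;      # stop at the end-of-input condition
    print $token->name, " ", $token->text, "\n";
}
```

Running this sketch with an argument such as 18.2+43/6.8 should print one name/text pair per token, in the order the tokens appear in the input.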
The first step in using this module is to create definitions of what constitutes an acceptable token. Token arguments for this module usually consist of a token name argument, such as the previous BegParen, followed by a regular expression. Within the module itself, these tokens are stored as instances of the Parse::Token class. After you specify your tokens, you next need to specify how your lexer will operate. You can accomplish this by passing a list of arguments to the lexer via the new method. In Listing 1-1, this list of arguments is contained in the @token array. When creating the argument list, it is important to consider the order in which the token definitions are placed, since an input value will be classified as a token of the type that it is first able to match. Thus, when using this module, it is good practice to list the strictest definitions first and then move on to the more general definitions. Otherwise, the general definitions may match values before the stricter comparisons even get a chance to be made.
Once you have specified the criteria that your lexer will operate on, you next define the source of input into the lexer by using the from method. The default for this property is STDIN, but it could also be a filename, a file handle, or a string of text (in quotes). Next you loop through the values in your input until you reach the eoi (end of input) condition and print the token and corresponding type. If, for example, you entered the command-line argument 43.4*15^2, the output should look like this:
In Chapter 3, where you will closely examine the workings of full-fledged parsers, I will employ a variant of this routine to aid in building a math equation parser.
Regular expressions are one of the most useful tools for lexing, but they are not the only method. As mentioned earlier, for some cases you can use whitespace identification, and for others you can bring dictionary lists into play. The choice of lexing method depends on the application. For applications where all tokens are of a similar type, like the tab-delimited text file discussed previously, whitespace pattern matching is probably the best choice. Where multiple token types may be employed, regular expressions or dictionary lists are the stronger options. In most cases, regular expressions win out, since they are the most versatile. Dictionary lists are better suited to more specialized types of lexing, where it is important to identify only select tokens.
One such example where a dictionary list is useful is in regard to the recent bioinformatics trend of mining medical literature for chemical interactions. For instance, many scientists are interested in the following:
<Chemical A> <operates on> <Chemical B>
In other words, they just want to determine how chemical A interacts with chemical B. When considering this, it becomes obvious that the entire textual content of any one scientific paper is not necessary to tokenize and parse. Thus, an informatician coding such a routine might want to use dictionary lists to identify the chemicals as well as to identify terms that describe the interaction. A dictionary list is simply a listing of all the possible values for a given element of a sentence. For example, rather than operates on, I could also fill in reacts with, interacts with, or a variety of other terms and have a program check for the occurrence of any of those terms. Later, in the section “Capturing Substrings,” I will cover this example in more depth.
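A dictionary-list check of this kind can be sketched by joining the list of interaction terms into a single regular-expression alternation. The interaction terms below come from the text; the sample sentence and the chemical names in it are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dictionary list of terms that describe a chemical interaction
my @interactions = ('operates on', 'reacts with', 'interacts with');

# Build one alternation from the dictionary entries,
# escaping any regex metacharacters in the terms
my $pattern = join '|', map { quotemeta } @interactions;

my $sentence = 'Nitric oxide reacts with hemoglobin.';

if ($sentence =~ /\b($pattern)\b/i) {
    print "Interaction term found: $1\n";
}
# Output:
# Interaction term found: reacts with
```

Because only the dictionary terms can match, everything else in the sentence is ignored, which is precisely why this approach suits tasks where only select tokens matter.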