As you saw in the previous Parse::Lex example, regular expressions provide a robust tool for token identification, but their usefulness goes far beyond that. In fact, for many simple parsing tasks, a regular expression alone may be adequate to get the job done. For example, if you want to perform a simple parsing/data extraction task such as parsing out an e-mail address found on a Web page, you can easily accomplish this by using a regular expression. All you need is to create a regular expression that identifies a pattern similar to the following:
[alphanumeric characters]@[alphanumeric characters.com]
Caution The previous expression is a simplification provided to illustrate the types of pattern matching for which you can use regular expressions. A more real-world e-mail matching expression would need to be more complex to account for other factors such as alternate endings (for example,.net,.gov) as well as the presence of metacharacters in either alphanumeric string. Additionally, a variety of less-common alternative e-mail address formats may also warrant consideration.
The following sections will explain how to create such regular expressions in the format Perl is able to interpret. To make regular expressions and their operation a little less mysterious, however, I will approach this topic by first explaining how Perl’s regular expression engine operates. Perl’s regular expression engine functions by using a programming paradigm known as a state machine, described in depth next.A State Machine
A simple definition of a state machine is one that will sequentially read in the symbols of an input word. After reading in a symbol, it will decide whether the current state of the machine is one of acceptance or nonacceptance. The machine will then read in the next symbol and make another state decision based upon the previous state and the current symbol. This process will continue until all symbols in the word are considered. Perl’s regular expression engine operates as a state machine (sometimes referred to as an automaton) for a given string sequence (that is, the word). In order to match the expression, all of the acceptable states (that is, characters defined in the regular expression) in a given path must be determined to be true. Thus, when you write a regular expression, you are really providing the criteria the differing states of the automaton need to match in order to find a matching string sequence. To clarify this, let’s consider the pattern /123/ and the string 123 and manually walk through the procedure the regular expression engine would perform. Such a pattern is representative of the simplest type of case for your state machine. That is, the state machine will operate in a completely linear manner. Figure 1-2 shows a graphical representation of this state machine.
Note It is interesting to note that a recursive descent parser evaluates the regular expressions you author. For more information on recursive descent parsers, see Chapter 5.
In this case, the regular expression engine begins by examining the first character of the string, which is a1. In this case, the required first state of the automaton is also a1. Therefore, a match is found, and the engine moves on by comparing the second character, which is a2, to the second state. Also in this case, a match is found, so the third character is examined and another match is made. When this third match is made, all states in the state machine are satisfied, and the string is deemed a match to the pattern.
In this simple case, the string, as written, provided an exact match to the pattern. Yet, this is hardly typical in the real world, so it is important to also consider how the regular expression will operate when the character in question does not match the criterion of a particular state in the state machine. In this instance, I will use the same pattern (/123/) and hence the same state machine as in the previous example, only this time I will try to find a match within the string4512123(see Figure 1-3).
This time the regular expression engine begins by comparing the first character in the string,4, with the first state criterion. Since the criterion is a1, no match is found. When this mismatch occurs, the regular expression starts over by trying to compare the string contents beginning with the character in the second position (see Figure 1-4).
As in the first case, no match is found between criterion for the first state and the character in question (5), so the engine moves on to make a comparison beginning with the third character in the string (see Figure 1-5).
In this case, since the third character is a1, the criterion for the first state is satisfied, and thus the engine is able to move on to the second state. The criterion for the second state is also satisfied, so therefore the engine will next move on to the third state. The1in the string, however, does not match the criterion for state 3, so the engine then tries to match the fourth character of the string,2, to the first state (see Figure 1-6).
As in previous cases, the first criterion is not satisfied by the2, and consequently the regular expression engine will begin to examine the string beginning with the fifth character. The fifth character satisfies the criterion for the first state, and therefore the engine proceeds on to the second state. In this case, a match for the criterion is also present, and the engine moves on to the third state. The final character in the string matches the third state criterion, and hence a match to the pattern is made (see Figure 1-7).
The previous two examples deal with a linear state machine. However, you are not limited to this type of regular expression setup. It is possible to establish alternate paths within the regular expression engine. You can set up these alternate paths by using the alternation (“or”) operator (|) and/or parentheses, which define subpatterns. I will cover more about the specific meanings of regular expression syntaxes in the upcoming sections “Pattern Matching,” “Quantifiers,” and “Predefined Subpatterns.” For now, consider the expression/123|1b(c|C)/, which specifies that the matching pattern can be123,1bc, or1bC(see Figure 1-8).
Note Parentheses not only define subpatterns but can also capture substrings, which I will discuss in the upcoming “Capturing Substrings” section.
As you can see, this state machine can follow multiple paths to reach the goal of a complete match. It can choose to take the top path of123,or can choose to take one of the bottom paths of1bc or1bC. To get an idea of how this works, consider the string1bc and see how the state machine would determine this to be a match. It would first find that the1 matches the first state condition, so it would then proceed to match the next character (b) to the second state condition of the top path (2). Since this is not a match, the regular expression engine will backtrack to the location of the true state located before the “or” condition. The engine will backtrack further, in this case to the starting point, only if all the available paths are unable to provide a correct match. From this point, the regular expression engine will proceed down an alternate path, in this case the bottom one. As the engine traverses down this path, the characterbis a match for the second state of the bottom path. At this point, you have reached a second “or” condition, so the engine will check for matches along the top path first. In this case, the engine is able to match the charactercwith the required statec, so no further backtracking is required, and the string is considered a perfect match.
When specifying regular expression patterns, it is also beneficial to be aware of the notations and[^], since these allow you to specify ranges of characters that will serve as an acceptable match or an unacceptable one. For instance, if you had a pattern containing[ABCDEF]or[A-F], thenA,B,C,D,E, andFwould all be acceptable matches. However,aorG would not be, since both are not included in the acceptable range.
Tip Perl’s regular expression patterns are case-sensitive by default. So,Ais different fromaunless a modifier is used to declare the expression case-insensitive. See the “Modifiers” section for more details.
If you want to specify characters that would be unacceptable, you can use the[^]syntax. For example, if you want the expression to be true for any character butA,B,C,D,E, andF, you can use one of the following expressions:[^ABCDEF]or[^A-F].
blog comments powered by Disqus