Beginning Perl
By: Dev Shed
    2000-07-14


    Table of Contents:
  • Beginning Perl
  • Regular Expressions
  • Escaping Special Characters
  • Repetition
  • Working with RegExps
  • More Advanced Topics



    Beginning Perl - Repetition
    (Page 4 of 6 )

    We've now moved from matching a specific character to matching a more general type of character, for when we don't know (or don't care) exactly what the character will be. Now we're going to see what happens when we want to talk about a more general quantity of characters: more than three digits in a row, two to four capital letters, and so on. The metacharacters that we use to deal with a number of characters in a row are called quantifiers.

    Indefinite Repetition
    The easiest of these is the question mark. It should suggest uncertainty - something may be there, or it may not. That's exactly what it does: it states that the immediately preceding character (or metacharacter) may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either 'he' or 'she', you can put:

    > perl matchtest.plx
    Enter some text to find: \bs?he\b
    The text matches the pattern '\bs?he\b'.
    >

    To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word )?is
    The text matches the pattern 'what the Entish (word )?is'.
    >

    Notice that we had to put the space inside the group. Otherwise, when 'word' is absent, the pattern demands two spaces between 'Entish' and 'is', whereas our text only has one:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word)? is
    'what the Entish (word)? is' was not found.
    >
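The effect of where the space sits can be checked outside the tester with a short sketch like the one below. The two sample strings and the variable names are assumptions made up for this illustration; they are not the tester's own text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample strings are assumptions made up for this sketch.
my @samples = ('what the Entish is', 'what the Entish word is');

# The space lives inside the optional group, so both phrasings match.
my $inside  = qr/what the Entish (word )?is/;
# With the space outside the group, the shorter phrasing would need two spaces.
my $outside = qr/what the Entish (word)? is/;

my %result;
for my $text (@samples) {
    $result{$text} = [ ($text =~ $inside ? 1 : 0), ($text =~ $outside ? 1 : 0) ];
    printf "%-26s inside: %d  outside: %d\n", "'$text'", @{ $result{$text} };
}
```

The shorter phrasing matches only the first pattern, while the longer one matches both.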

    As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign - to match an entire word without specifying how long it should be, you can say:

    > perl matchtest.plx
    Enter some text to find: \b\w+\b
    The text matches the pattern '\b\w+\b'.
    >

    In this case, we match the first available word - I.

    If, on the other hand, you have something which may be there any number of times but might not be there at all - zero or one or many - you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any - but possibly no - spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:

    > perl matchtest.plx
    Enter some text to find: ^\s*[A-Z]
    '^\s*[A-Z]' was not found.
    >

    Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.
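To see the pattern succeed and fail, here is a small sketch run against a few strings. The sample strings are hypothetical and chosen only to cover the cases discussed: no leading space, some leading space, a leading quote, and no capital at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample strings, not the tester's own text.
my @samples = ('Hello', '   Hello', '"Hello', '  hello');

my %starts_with_capital;
for my $text (@samples) {
    # Start of string, any amount of whitespace (possibly none), then a capital.
    $starts_with_capital{$text} = ($text =~ /^\s*[A-Z]/) ? 1 : 0;
    print "'$text': ", ($starts_with_capital{$text} ? 'matches' : 'no match'), "\n";
}
```

Only the first two strings match: the third starts with a quote, and the fourth has no capital letter after its spaces.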
    Let's review the three quantifiers:

    /bea?t/    Matches either 'beat' or 'bet'
    /bea+t/    Matches 'beat', 'beaat', 'beaaat'...
    /bea*t/    Matches 'bet', 'beat', 'beaat'...
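The review above can be checked with a quick sketch. The word list is an assumption for illustration, and the \A and \z anchors are an addition that is not part of the patterns above: they make each pattern cover the whole word rather than just part of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list for trying the three quantifiers.
my @words = qw(bt bet beat beaat);

my %matched;
for my $pattern ('bea?t', 'bea+t', 'bea*t') {
    # \A and \z pin the pattern to the start and end of the word, so the
    # whole word has to match, not just a piece of it.
    $matched{$pattern} = [ grep { /\A$pattern\z/ } @words ];
    print "/$pattern/ matches: @{ $matched{$pattern} }\n";
}
```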

    Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:

    A regular expression should hardly ever start or finish with a starred character.

    You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
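One reason for the warning about starred characters is that a starred pattern is satisfied by nothing at all. The sketch below, using a made-up sample string, shows \d* succeeding with an empty match at the very start of the string, and the effect that has on a substitution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'abc 123';    # hypothetical sample string

# \d* is satisfied by zero digits, so it succeeds at the very first position
# and captures an empty string.
my ($star) = $text =~ /(\d*)/;

# \d+ needs at least one digit, so the engine moves along until it finds some.
my ($plus) = $text =~ /(\d+)/;

# The same surprise in a substitution: the empty match at the start is
# what gets replaced, and the digits are left alone.
(my $replaced = $text) =~ s/\d*/X/;

print "star: '$star'\nplus: '$plus'\nreplaced: '$replaced'\n";
```

Here the star version captures an empty string, the plus version captures '123', and the substitution yields 'Xabc 123'.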

    Well-Defined Repetition
    If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the minimum and maximum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:

    > perl matchtest.plx
    Enter some text to find: \s{2,3}
    '\s{2,3}' was not found.
    >

    So we have no doubled or trebled spaces in our string. Notice how we construct that - the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more': {2,} denotes '2 or more'. Omitting the minimum to mean 'or fewer' is less portable: {,3} is only treated as '3 or fewer' from Perl 5.34 onwards, and older perls match it as the literal text '{,3}', so write {0,3} if your code has to run on them. In these cases, the same warnings apply as for the star operator.
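As a rough illustration of the brace forms, the sketch below counts spaces between two letters. The strings, labels, and the surrounding 'a' and 'b' are assumptions for this example; the letters are there so that a run of four spaces cannot satisfy {2,3} by matching only part of the run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical samples with one, two and four spaces between the letters.
my %spaces = (one => 'a b', two => 'a  b', four => 'a    b');

my %patterns = (
    'two or three' => qr/a\s{2,3}b/,
    'two or more'  => qr/a\s{2,}b/,
    'up to three'  => qr/a\s{0,3}b/,    # spelled {0,3} so older perls agree
);

my %hits;
for my $label (sort keys %patterns) {
    $hits{$label} = [ grep { $spaces{$_} =~ $patterns{$label} } qw(one two four) ];
    print "$label: @{ $hits{$label} }\n";
}
```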

    Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:

    > perl matchtest.plx
    Enter some text to find: \b\w{5}\b
    '\b\w{5}\b' was not found.
    >
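The word boundaries matter in that pattern. A minimal sketch with a hypothetical word list shows the difference: without \b, any word of five or more letters contains a run of five word characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list: one four-letter, three five-letter, one six-letter.
my @words = qw(Perl makes regex magic happen);

# With boundaries: the run of five word characters must be a whole word.
my @exactly_five = grep { /\b\w{5}\b/ } @words;

# Without boundaries: five word characters anywhere inside the word will do.
my @five_somewhere = grep { /\w{5}/ } @words;

print "exactly five:   @exactly_five\n";
print "five somewhere: @five_somewhere\n";
```

The six-letter 'happen' slips through the second pattern but not the first.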

    Summary Table
    To refresh your memory, here are the various metacharacters we've seen so far:

    Metacharacter   Meaning
    [abc]           any one of the characters a, b, or c.
    [^abc]          any one character other than a, b, or c.
    [a-z]           any one ASCII character between a and z.
    \d \D           a digit; a non-digit.
    \w \W           a 'word' character; a non-'word' character.
    \s \S           a whitespace character; a non-whitespace character.
    \b              the boundary between a \w character and a \W character.
    .               any character (apart from a new line).
    (abc)           the phrase 'abc' as a group.
    ?               preceding character or group may be present 0 or 1 times.
    +               preceding character or group is present 1 or more times.
    *               preceding character or group may be present 0 or more times.
    {x,y}           preceding character or group is present between x and y times.
    {,y}            preceding character or group is present at most y times (Perl 5.34 and later; write {0,y} on older perls).
    {x,}            preceding character or group is present at least x times.
    {x}             preceding character or group is present x times.

    Backreferences
    What if we want to know what a certain regular expression matched? It was easy when we were matching literal strings: we knew that 'Case' was going to match those four letters and nothing else. But now, what matches? If we have /\w{3}/, which three word characters are getting matched?

    Perl has a series of special variables in which it stores anything that's matched with a group in parentheses. Each time it sees a set of parentheses, it copies the matched text inside into a numbered variable - the first matched group goes in $1, the second group in $2, and so on. By looking at these variables, which we call the backreference variables, we can see what triggered various parts of our match, and we can also extract portions of the data for later use.
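As a small sketch of extracting data this way, the following pulls the parts of a date out of a string. The date string and the variable names are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $date = '2000-07-14';    # hypothetical sample string

my ($year, $month, $day);
if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
    # Each pair of parentheses filled one numbered variable, left to right.
    ($year, $month, $day) = ($1, $2, $3);
    print "year $year, month $month, day $day\n";
}
```

Copying $1, $2 and $3 into named variables straight away is a cautious habit, since the next successful match overwrites them.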

    First, though, let's rewrite our test program so that we can see what's in those variables:

    Try it out: A Second Pattern Tester

    #!/usr/bin/perl
    # matchtest2.plx
    use warnings;
    use strict;

    # The text every pattern is tested against.
    $_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';
    print "Enter a regular expression: ";
    my $pattern = <STDIN>;
    chomp($pattern);

    if (/$pattern/) {
        print "The text matches the pattern '$pattern'.\n";
        # Show the contents of each group that took part in the match.
        print "\$1 is '$1'\n" if defined $1;
        print "\$2 is '$2'\n" if defined $2;
        print "\$3 is '$3'\n" if defined $3;
        print "\$4 is '$4'\n" if defined $4;
        print "\$5 is '$5'\n" if defined $5;
    } else {
        print "'$pattern' was not found.\n";
    }

    Note that we use a backslash to escape the first 'dollar' symbol in each print statement, thus displaying the actual symbol, while leaving the second in each to display the contents of the appropriate variable.

    We've got our special variables in place, and we've got a new sentence to do our matching on. Let's see what's been happening:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)
    The text matches the pattern '([a-z]+)'.
    $1 is 'silly'

    > perl matchtest2.plx
    Enter a regular expression: (\w+)
    The text matches the pattern '(\w+)'.
    $1 is '1'

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'

    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'

    How It Works
    By printing out what's in each of the groups, we can see exactly what caused perl to start and stop matching, and when. If we look carefully at these results, we'll find that they can tell us a great deal about how perl handles regular expressions.

    How the Engine Works
    We've now seen most of the syntax behind regular expression matching and plenty of examples of it in action. The code that does all the matching is called perl's 'regular expression engine'. You might now be wondering about the exact rules applied by this engine when determining whether or not a piece of text matches, and how much of it matches what. From what our examples have shown us, let us make some deductions about the engine's operation.
    Our first expression, ([a-z]+), plucked out a set of one-or-more lower-case letters. The first such set that perl came across was 'silly'. The next character after 'y' was a space, and so no longer matched the expression.

    Rule one: Once the engine starts matching, it will keep matching a character at a time for as long as it can. Once it sees something that doesn't match, however, it has to stop. In this example, it can never get beyond a character that is not a lower case letter. It has to stop as soon as it encounters one.

    Next, we looked for a series of word characters, using (\w+). The engine started looking at the beginning of the string and found one, '1'. The next character was not a word character (it was a colon), and so the engine had to stop.

    Rule two: Unlike me, the engine is eager. It's eager to start work and eager to finish, and it starts matching as soon as possible in the string; if the first character doesn't match, it tries to start matching from the second. Then it takes every opportunity to finish as quickly as possible.
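The first two rules can be watched in action by recording where each match starts. This is only a sketch: it reuses the sentence from matchtest2.plx, and it relies on Perl's built-in @- array, whose first element holds the offset at which the last successful match began. The variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

my %found;
for my $pattern ('([a-z]+)', '(\w+)') {
    if ($text =~ /$pattern/) {
        my $matched = $1;       # what the group captured
        my $offset  = $-[0];    # where the whole match started
        $found{$pattern} = { text => $matched, offset => $offset };
        print "$pattern matched '$matched' starting at offset $offset\n";
    }
}
```

The lower-case pattern cannot start until offset 5, the 's' of 'silly', while \w+ starts straight away at offset 0 and stops at the colon.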

    Then we tried this: ([a-z]+)(.*)([a-z]+). The result we got with this was a little strange. Let's look at it again:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'
    >

    Our first group was the same as what matched before - nothing new there. When we could no longer match lower case letters, we switched to matching anything we could. Now, this could take up the rest of the string, but that wouldn't allow a match for the third group. We have to leave at least one lower-case letter.

    So, the engine started to reverse back along the string, giving characters up one by one. It gave up the closing bracket, the 3, then the opening bracket, and so on, until we got to the first thing that would satisfy all the groups and let the match go ahead - namely a lower-case letter: the 'l' at the end of 'useful'.

    From this, we can draw up the third rule:

    Rule three: Like me, in this case, the engine is greedy. If you use the + or * operators, they will try and steal as much of the string as possible. If the rest of the expression does not match, it grudgingly gives up a character at a time and tries to match again, in order to find the fullest possible match.

    We can turn a greedy match into a non-greedy match by putting the ? operator after either the plus or star. For instance, let's turn this example into a non-greedy version: ([a-z]+)(.*?)([a-z]+). This gives us an entirely different result:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
    The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
    $1 is 'silly'
    $2 is ' '
    $3 is 'sentence'
    >

    Now that we've shut off rule three for the second group, rule two takes over. The smallest possible match for the second group was a single space. First, it tried to get nothing at all, but then the third group would be faced with a space. This wouldn't match. So, we grudgingly accept the space and try to finish again. This time the third group has some lower case letters, and that can match as well.
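The same contrast shows up whenever a pattern has a closing delimiter that occurs more than once. In this sketch the markup string is a made-up sample, and the two variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';    # hypothetical sample string

# Greedy: .+ runs to the end of the string, then gives characters back
# until a '>' can match, which is the very last one.
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? takes one character, then stops at the first '>' it can.
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

The greedy version captures 'b>bold</b> and <i>italic</i', while the non-greedy version captures just 'b'.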

    What if we turn off greediness in all three groups, and say this: ([a-z]+?)(.*?)([a-z]+?)

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
    The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
    $1 is 's'
    $2 is ''
    $3 is 'i'
    >

    What about this? Well, the smallest possible match for the first group is the 's' of silly. We asked it to find one character or more, and so the smallest it could find was one. The second group actually matched no characters at all. This left the third group facing an 'i', which it took to complete the match.

    Our last example included an alternation:

    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'
    >

    The engine took the first branch of the alternation and matched a single character, even though the second branch would actually satisfy greed. This leads us onto the fourth rule:

    Rule four: Again like me, the regular expression engine hates decisions. If there are two branches, it will always choose the first one that lets the match succeed, even though the second one might allow it to gain a longer match.
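Because the first workable branch wins, the order of the branches changes the result. The sketch below tries the pattern from the example, and then the same branches swapped, against a single made-up word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $word = 'sentence';    # hypothetical sample string

# Original order: the single-character branch comes first and is taken.
my ($short_first) = $word =~ /e(\w|n\w+)/;

# Swapped order: the longer branch is tried first, and it succeeds.
my ($long_first) = $word =~ /e(n\w+|\w)/;

print "short branch first: '$short_first'\n";
print "long branch first:  '$long_first'\n";
```

With the short branch first the group holds 'n'; with the long branch first it holds 'ntence'. Putting the longer or more specific branch first is the usual way to get the longer match.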

    To summarize:

    The regular expression engine starts as soon as it can, grabs as much as it can, then tries to finish as soon as it can, while taking the first decision available to it.

    ©1999 Wrox Press Limited, US and UK.



     
     