Beginning Perl Part 3 - Repetition
By: Developer Shed
    2004-04-19

    by Wrox Books

    The metacharacters that we use to deal with a number of characters in a row are called quantifiers.

    Indefinite Repetition

    The easiest of these is the question mark. It should suggest uncertainty - something may be there, or it may not. That's exactly what it does: it states that the immediately preceding character, metacharacter, or group may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either 'he' or 'she', you can put:
    > perl matchtest.plx
    Enter some text to find: \bs?he\b
    The text matches the pattern '\bs?he\b'.
    >
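
    The same test can be sketched as a stand-alone script. The word list below is invented for illustration and is not the article's test string; it only shows which words the optional 's' lets through.

```perl
use strict;
use warnings;

# Hypothetical word list, chosen to include two near misses.
my @words = ('he', 'she', 'the', 'shed');

# s? makes the leading 's' optional; \b pins both ends to word boundaries,
# so 'the' and 'shed' are rejected.
my @hits = grep { /\bs?he\b/ } @words;

print "matched: @hits\n";
```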
    

    To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word )?is
    The text matches the pattern 'what the Entish (word )?is'.
    >
    

    Notice that we had to put the space inside the group: otherwise, when 'word' is absent, the pattern demands two spaces between 'Entish' and 'is', whereas our text has only one:

    > perl matchtest.plx
    Enter some text to find: what the Entish (word)? is
    'what the Entish (word)? is' was not found.
    >
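
    A small sketch makes the difference visible. It tries both phrasings quoted above against both patterns; the %result hash is just a convenience introduced here.

```perl
use strict;
use warnings;

# Both phrasings quoted above; each is tried against both patterns.
my @phrases = ('what the Entish is', 'what the Entish word is');

my %result;
for my $phrase (@phrases) {
    # Space inside the group: it disappears together with 'word'.
    $result{$phrase}{inside}  = $phrase =~ /what the Entish (word )?is/ ? 1 : 0;
    # Space outside the group: it is always required.
    $result{$phrase}{outside} = $phrase =~ /what the Entish (word)? is/ ? 1 : 0;
    print "$phrase: inside=$result{$phrase}{inside} outside=$result{$phrase}{outside}\n";
}
```

    Only the short phrasing fails, and only against the pattern that keeps the space outside the group.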
    

    As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign - to match an entire word without specifying how long it should be, you can say:

    > perl matchtest.plx
    Enter some text to find: \b\w+\b
    The text matches the pattern '\b\w+\b'.
    >
    

    In this case, we match the first available word - I.

    If, on the other hand, you have something which may be there any number of times but might not be there at all - zero or one or many - you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any - but possibly no - spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:

    > perl matchtest.plx
    Enter some text to find: ^\s*[A-Z]
    '^\s*[A-Z]' was not found.
    >
    

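    Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.

    Let's review the three quantifiers:

    ?    the preceding item may appear once, or not at all
    +    the preceding item must appear one or more times
    *    the preceding item may appear any number of times, including not at all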
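
    To see the three quantifiers side by side, here is a minimal sketch. The candidate words are made up for the purpose, and the patterns are anchored with ^ and $ so that the whole word has to fit.

```perl
use strict;
use warnings;

# Hypothetical candidates: zero, one, and two 'a's between 'be' and 't'.
my @candidates = ('bet', 'beat', 'beaat');

my %accepted = (
    '?' => [ grep { /^bea?t$/ } @candidates ],   # zero or one 'a'
    '+' => [ grep { /^bea+t$/ } @candidates ],   # one or more 'a'
    '*' => [ grep { /^bea*t$/ } @candidates ],   # zero or more 'a'
);

print "$_ accepts: @{ $accepted{$_} }\n" for ('?', '+', '*');
```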

    Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:

    A regular expression should hardly ever start or finish with a starred character.

    You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
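
    Both warnings can be illustrated with a short sketch. The input string is hypothetical, and the character-class alternative shown is one common way round the problem rather than anything the article prescribes.

```perl
use strict;
use warnings;

# Hypothetical input with two parenthesised asides.
my $text = 'keep (drop) keep (drop) keep';

# Greedy: .* runs from the first '(' to the LAST ')',
# so the middle 'keep' is swallowed as well.
(my $greedy = $text) =~ s/\(.*\)//;

# A narrower character class cannot run past the first ')'.
(my $narrow = $text) =~ s/\([^)]*\)//g;

# A pattern that is nothing but a starred character matches anywhere,
# because zero repetitions is always an option.
my $empty_ok = 'xyz' =~ /a*/ ? 1 : 0;

print "greedy: '$greedy'\n";
print "narrow: '$narrow'\n";
print "a* matched 'xyz': $empty_ok\n";
```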

    Well-Defined Repetition

    If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the minimum and maximum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:

    > perl matchtest.plx
    Enter some text to find: \s{2,3}
    '\s{2,3}' was not found.
    >
    

    So we have no doubled or trebled spaces in our string. Notice how we construct that - the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more': {2,} denotes '2 or more'. The mirror form {,3} for '3 or fewer' is only accepted as a quantifier from perl 5.34 onwards; older versions treat it as literal text, so {0,3} is the portable spelling. In these cases, the same warnings apply as for the star operator.
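
    As a rough sketch, the three range forms can be compared on strings with a known number of spaces. The samples are generated for illustration, qr// is used only to keep the patterns in a list, and the upper-bound form is written as {0,3} so that it behaves the same on every perl version.

```perl
use strict;
use warnings;

# Two or three, two or more, three or fewer (explicit minimum of zero).
my @patterns = (qr/a\s{2,3}b/, qr/a\s{2,}b/, qr/a\s{0,3}b/);

my %hit;
for my $n (1 .. 4) {
    # Hypothetical sample: 'a' and 'b' separated by exactly $n spaces.
    my $sample = 'a' . (' ' x $n) . 'b';
    my @flags;
    for my $re (@patterns) {
        push @flags, $sample =~ $re ? 'y' : 'n';
    }
    # One letter per pattern, in the order listed above.
    $hit{$n} = join '', @flags;
    print "$n space(s): $hit{$n}\n";
}
```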

    Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:

    > perl matchtest.plx
    Enter some text to find: \b\w{5}\b
    '\b\w{5}\b' was not found.
    >
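
    The word boundaries matter here. In the sketch below (the fragment is hypothetical), the parentheses simply hand back whatever text matched, so we can see what \w{5} picks up with and without them.

```perl
use strict;
use warnings;

# Hypothetical fragment containing the six-letter word 'wonder'.
my $text = 'I wonder what';

# Without boundaries, \w{5} is satisfied by the first five letters of 'wonder'.
my ($loose) = $text =~ /(\w{5})/;

# With \b on both sides the run must be exactly five long, so nothing matches
# and the list assignment leaves $strict undefined.
my ($strict) = $text =~ /\b(\w{5})\b/;

print "loose:  $loose\n";
print "strict: ", (defined $strict ? $strict : 'no match'), "\n";
```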
    

    Summary Table

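    To refresh your memory, here are the metacharacters used in this part so far:

    .        any single character except a newline
    ^        the start of the string
    \b       a word boundary
    \w       a word character (letter, digit, or underscore)
    \s       a whitespace character
    [A-Z]    any one character in the given range
    ( )      a group
    ?        zero or one of the preceding item
    +        one or more of the preceding item
    *        zero or more of the preceding item
    {n}      exactly n of the preceding item
    {n,}     n or more of the preceding item
    {n,m}    at least n and at most m of the preceding item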

    Backreferences

    What if we want to know what a certain regular expression matched? It was easy when we were matching literal strings: we knew that 'Case' was going to match those four letters and nothing else. But now, what matches? If we have /\w{3}/, which three word characters are getting matched?

    Perl has a series of special variables in which it stores anything that's matched with a group in parentheses. Each time it sees a set of parentheses, it copies the matched text inside into a numbered variable - the first matched group goes in $1, the second group in $2, and so on. By looking at these variables, which we call the backreference variables, we can see what triggered various parts of our match, and we can also extract portions of the data for later use.
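
    As a quick illustration, the sketch below pulls two pieces out of the sentence that the article's next test program uses. The pattern itself is an assumption made for this example: \( and \) match literal brackets, while the unescaped parentheses are the capturing groups.

```perl
use strict;
use warnings;

# The sentence used by the article's second pattern tester.
my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

my ($number, $letter);
if ($sentence =~ /\((\d+),(\w)\)/) {
    # The first pair of capturing parentheses fills $1, the second fills $2.
    ($number, $letter) = ($1, $2);
    print "\$1 is '$number'\n";
    print "\$2 is '$letter'\n";
}
```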

    First, though, let's rewrite our test program so that we can see what's in those variables:

    Try it out : A Second Pattern Tester

    #!/usr/bin/perl
    # matchtest2.plx
    use warnings;
    use strict;

    # The sentence every pattern is tested against.
    $_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';
    print "Enter a regular expression: ";
    my $pattern = <STDIN>;    # read the pattern typed by the user
    chomp($pattern);
    if (/$pattern/) {
        print "The text matches the pattern '$pattern'.\n";
        # Show whichever groups took part in the match.
        print "\$1 is '$1'\n" if defined $1;
        print "\$2 is '$2'\n" if defined $2;
        print "\$3 is '$3'\n" if defined $3;
        print "\$4 is '$4'\n" if defined $4;
        print "\$5 is '$5'\n" if defined $5;
    } else {
        print "'$pattern' was not found.\n";
    }
    

    Note that we use a backslash to escape the first 'dollar' symbol in each print statement, thus displaying the actual symbol, while leaving the second in each to display the contents of the appropriate variable.

    We've got our special variables in place, and we've got a new sentence to do our matching on. Let's see what's been happening:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)
    The text matches the pattern '([a-z]+)'.
    $1 is 'silly'
    > perl matchtest2.plx
    Enter a regular expression: (\w+)
    The text matches the pattern '(\w+)'.
    $1 is '1'
    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'
    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'
    

    How It Works

    By printing out what's in each of the groups, we can see exactly what caused perl to start and stop matching, and when. If we look carefully at these results, we'll find that they can tell us a great deal about how perl handles regular expressions.

    How the Engine Works

    We've now seen most of the syntax behind regular expression matching and plenty of examples of it in action. The code that does all the matching is called perl's 'regular expression engine'. You might now be wondering about the exact rules applied by this engine when determining whether or not a piece of text matches, and which part of the text each piece of the pattern matches. From what our examples have shown us, let us make some deductions about the engine's operation.

    Our first expression, ([a-z]+), plucked out a set of one-or-more lower-case letters. The first such set that perl came across was 'silly'. The next character after the 'y' was a space, and so no longer matched the expression.

    Rule one: Once the engine starts matching, it will keep matching a character at a time for as long as it can. Once it sees something that doesn't match, however, it has to stop. In this example, it can never get beyond a character that is not a lower case letter. It has to stop as soon as it encounters one.

    Next, we looked for a series of word characters, using (\w+). The engine started looking at the beginning of the string and found one, '1'. The next character was not a word character (it was a colon), and so the engine had to stop.

    Rule two: Unlike me, the engine is eager. It's eager to start work and eager to finish, and it starts matching as soon as possible in the string; if the first character doesn't match, it tries to start matching from the second. It then takes every opportunity to finish as quickly as possible.
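
    One way to check where a match started and stopped is to look at perl's offset arrays, @- and @+, after a successful match. The sketch below is only an illustration of the first two rules; the variable names are introduced here.

```perl
use strict;
use warnings;

my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# $-[0] is the offset where the last successful match began,
# $+[0] the offset just past where it ended.
my (@lower, @word);

if ($sentence =~ /[a-z]+/) {
    @lower = ($-[0], $+[0]);    # the first run of lower-case letters: 'silly'
}
if ($sentence =~ /\w+/) {
    @word = ($-[0], $+[0]);     # the first run of word characters: '1'
}

print "[a-z]+ matched offsets $lower[0] to $lower[1]\n";
print "\\w+ matched offsets $word[0] to $word[1]\n";
```

    The \w+ match begins at offset 0 because the very first character qualifies, while [a-z]+ has to skip ahead to the first lower-case letter.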

    Then we tried this: ([a-z]+)(.*)([a-z]+). The result we got with this was a little strange. Let's look at it again:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*)([a-z]+)
    The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
    $1 is 'silly'
    $2 is ' sentence (495,a) *BUT* one which will be usefu'
    $3 is 'l'
    >
    

    Our first group was the same as what matched before - nothing new there. When we could no longer match lower case letters, we switched to matching anything we could. Now, this could take up the rest of the string, but that wouldn't allow a match for the third group. We have to leave at least one lower-case letter.

    So, the engine started to reverse back along the string, giving characters up one by one. It gave up the closing bracket, the 3, then the opening bracket, and so on, until we got to the first thing that would satisfy all the groups and let the match go ahead - namely a lower-case letter: the 'l' at the end of 'useful'.

    From this, we can draw up the third rule:

    Rule three: Like me, in this case, the engine is greedy. If you use the + or * operators, they will try and steal as much of the string as possible. If the rest of the expression does not match, it grudgingly gives up a character at a time and tries to match again, in order to find the fullest possible match.

    We can turn a greedy match into a non-greedy match by putting the ? operator after either the plus or star. For instance, let's turn this example into a non-greedy version: ([a-z]+)(.*?)([a-z]+). This gives us an entirely different result:

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
    The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
    $1 is 'silly'
    $2 is ' '
    $3 is 'sentence'
    >
    

    Now we've shut off rule three, rule two takes over. The smallest possible match for the second group was a single space. First, it tried to get nothing at all, but then the third group would be faced with a space. This wouldn't match. So, we grudgingly accept the space and try and finish again. This time the third group has some lower case letters, and that can match as well.
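
    A typical place where this difference bites is pulling out quoted text. The sketch below uses a made-up string and is meant only to contrast the two behaviors.

```perl
use strict;
use warnings;

# Hypothetical input with two quoted words.
my $text = 'say "one" and "two"';

# Greedy: .* stretches to the last quote it can still finish on.
my ($greedy) = $text =~ /"(.*)"/;

# Non-greedy: .*? stops at the first quote that lets the match finish.
my ($lazy) = $text =~ /"(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```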

    What if we turn off greediness in all three groups, and say this: ([a-z]+?)(.*?)([a-z]+?)?

    > perl matchtest2.plx
    Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
    The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
    $1 is 's'
    $2 is '' 
    $3 is 'i' 
    >
    

    What about this? Well, the smallest possible match for the first group is the 's' of silly. We asked it to find one character or more, and so the smallest it could find was one. The second group actually matched no characters at all. This left the third group facing an 'i', which it took to complete the match.

    Our last example included an alternation:

    > perl matchtest2.plx
    Enter a regular expression: e(\w|n\w+)
    The text matches the pattern 'e(\w|n\w+)'.
    $1 is 'n'
    >
    

    The engine took the first branch of the alternation and matched a single character, even though the second branch would actually satisfy greed. This leads us onto the fourth rule:

    Rule four: Again like me, the regular expression engine hates decisions. If there are two branches, it will always choose the first one that lets the match succeed, even though a later one might allow it to gain a longer match.
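
    Swapping the branches shows that order, not length, decides the outcome. This is a small sketch against the same sentence; the swapped pattern is an addition for comparison.

```perl
use strict;
use warnings;

my $sentence = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# Branch order as in the article: \w is tried first and succeeds on 'n'.
my ($short_first) = $sentence =~ /e(\w|n\w+)/;

# Same branches swapped: n\w+ now gets the first try and runs to the end of the word.
my ($long_first) = $sentence =~ /e(n\w+|\w)/;

print "short branch first: $short_first\n";
print "long branch first:  $long_first\n";
```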

    To summarize:

    The regular expression engine starts as soon as it can, grabs as much as it can, then tries to finish as soon as it can, while taking the first decision available to it.

