Beginning Perl Part 2 - Escaping Special Characters
By: Developer Shed
    2004-04-19

    Beginning Perl Part 2 - Escaping Special Characters
    by Wrox Books

    At this stage, we might not want to use their special meanings - we may want to literally match the characters themselves. As you've already seen with double-quoted strings, we can use a backslash to escape these characters' special meanings. Hence, if you want to match '... ' in the above text, you need your pattern to say '\.\.\. '. For example:
    > perl matchtest.plx
    Enter some text to find: Ent+
    The text matches the pattern 'Ent+'.
    > perl matchtest.plx
    Enter some text to find: Ent\+
    'Ent\+' was not found.
    

    We'll see later why the first one matched - due to the special meaning of +.

    These are the characters that are given special meaning within a regular expression, which you will need to backslash if you want to use them literally:

    . * ? + [ ] ( ) { } ^ $ | \

    Any other characters automatically assume their literal meanings.

    You can also turn off the special meanings using the escape sequence \Q. After perl sees \Q, the 14 special characters above will automatically assume their ordinary, literal meanings. This remains the case until perl sees either \E or the end of the pattern.

    For instance, if we wanted to adapt our matchtest program just to look for literal strings, instead of regular expressions, we could change it to look like this:

    if (/\Q$pattern\E/) {  
    

    Now the meaning of + is turned off:

    > perl matchtest.plx
    Enter some text to find: Ent+
    'Ent+' was not found.
    > 
    

    Note that all \Q does is turn off the regular expression magic of those 14 characters above - it doesn't stop, for example, variable interpolation.
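As a minimal sketch of the difference, the snippet below runs the same user-supplied pattern with and without \Q...\E. It assumes the test string shown later in this section for matchtest.plx; the variable names $as_regex and $as_literal are made up for illustration.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text    = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
my $pattern = 'Ent+';

# Unquoted: + keeps its special meaning, so 'Ent' in 'Entish' matches.
my $as_regex = ($text =~ /$pattern/) ? 1 : 0;

# \Q...\E: $pattern is still interpolated, but + must now match literally.
my $as_literal = ($text =~ /\Q$pattern\E/) ? 1 : 0;

print "as regex: $as_regex, as literal: $as_literal\n";
```

The variable is interpolated in both cases; only the treatment of the special characters changes.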

    Don't forget to change this back again: We'll be using matchtest.plx throughout the chapter, to demonstrate the regular expressions we look at. We'll need that magic fully functional!

    Anchors

    So far, our patterns have all tried to find a match anywhere in the string. The first way we'll extend our regular expressions is by dictating to perl where the match must occur. We can say 'these characters must match the beginning of the string' or 'this text must be at the end of the string'. We do this by anchoring the match to either end.

    The two anchors we have are ^, which appears at the beginning of the pattern and anchors a match to the beginning of the string, and $, which appears at the end of the pattern and anchors it to the end of the string. So, to see if our quotation ends in a full stop - and remember that the full stop is a special character - we say something like this:

    >perl matchtest.plx
    Enter some text to find: \.$
    The text matches the pattern '\.$'.
    

    That's a full stop (which we've escaped to prevent it being treated as a special character) and a dollar sign at the end of our pattern - to show that this must be the end of the string.

    Try, if you can, to get into the habit of reading out regular expressions in English. Break them into pieces and say what each piece does. Also remember to say that each piece must immediately follow the other in the string in order to match. For instance, the above could be read 'match a full stop immediately followed by the end of the string'.

    If you can get into this habit, you'll find that reading and understanding regular expressions becomes a lot easier, and you'll be able to 'translate' back into Perl more naturally as well.

    Here's another example: do we have a capital I at the beginning of the string?

    > perl matchtest.plx
    Enter some text to find: ^I
    '^I' was not found.
    >
    

    We use ^ to mean 'beginning of the string', followed by an I. In our case, though, the character at the beginning of the string is a ", so our pattern does not match. If you know that what you're looking for can only occur at the beginning or the end of the string, it's much more efficient to use anchors. Instead of searching through the whole string to see whether the match succeeded, perl only needs to look at a small portion and can give up immediately if even the first character does not match.
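The two anchors can be tried outside matchtest.plx with a short sketch like the following. It assumes the same test string used by matchtest.plx; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# 'a full stop immediately followed by the end of the string'
my $ends_with_stop = ($text =~ /\.$/) ? 1 : 0;

# 'the beginning of the string immediately followed by a capital I'
my $starts_with_i = ($text =~ /^I/) ? 1 : 0;

# The string actually begins with a double quote.
my $starts_with_quote = ($text =~ /^"/) ? 1 : 0;

print "ends with stop: $ends_with_stop\n";
print "starts with I: $starts_with_i\n";
print "starts with quote: $starts_with_quote\n";
```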

    Let's see one more example of this, where we'll combine looking for matches with looking through the lines in a file:

    Try It Out: Rhyming Dictionary

    Imagine yourself as a poor poet. In fact, not just poor, but downright bad - so bad, you can't even think of a rhyme for 'pink'. So, what do you do? You do what every sensible poet does in this situation, and you write the following Perl program:

    #!/usr/bin/perl
    # rhyming.plx
    use warnings;
    use strict;
    my $syllable = "ink";
    while (<>) {
        print if /$syllable$/;
    }
    

    We can now feed it a file of words, and find those that end in 'ink':

    >perl rhyming.plx wordlist.txt
    blink
    bobolink
    brink
    chink
    clink
    >
    

    For a really thorough result, you'll need to use a file containing every word in the dictionary - be prepared to wait though if you do! For the sake of the example however, any text-based file will do (though it'll help if it's in English). A bobolink, in case you're wondering, is a migratory American songbird, otherwise known as a ricebird or reedbird.

    How It Works

    With the loops and tests we learned in the last chapter, this program is really very easy:

    while (<>) { print if /$syllable$/;}  
    

    We've not looked at file access yet, so you may not be familiar with the while(<>){...} construction used here. In this example it opens a file that's been specified on the command line, and loops through it, one line at a time, feeding each one into the special variable $_ - this is what we'll be matching.

    Once each line of the file has been fed into $_, we test to see if it matches the pattern, which is our syllable, 'ink', anchored to the end of the line (with $). If so, we print it out.

    The important thing to note here is that perl treats the 'ink' as the last thing on the line, even though there is a newline at the end of $_. By default, $ matches either at the very end of the string or just before a final newline - we'll look at this behavior in more detail later.
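To see the trailing-newline behavior without needing a word list file, here is a small sketch that applies the same pattern to an in-memory list of lines. The rhymes subroutine and the sample words are hypothetical and not part of rhyming.plx.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that end in the given syllable.
# Each line keeps its trailing newline, as it would when read with <>.
sub rhymes {
    my ($syllable, @lines) = @_;
    return grep { /$syllable$/ } @lines;
}

my @lines = ("blink\n", "inkwell\n", "brink\n", "pinky\n");
my @found = rhymes('ink', @lines);

# The matching lines still carry their newlines, so print them as they are.
print @found;
```

Only the lines where 'ink' sits directly before the final newline are kept, so 'inkwell' and 'pinky' are skipped.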

    Shortcuts and Options

    This is all very well if we know exactly what it is we're trying to find, but finding patterns means more than just locating exact pieces of text. We may want to find a three-digit number, the first word on the line, four or more letters all in capitals, and so on.

    We can begin to do this using character classes - these aren't just single characters, but something that signifies that any one of a set of characters is acceptable. To specify this, we put the characters we consider acceptable inside square brackets. Let's go back to our matchtest program, using the same test string:

    $_ = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

    > perl matchtest.plx
    Enter some text to find: w[aoi]nder
    The text matches the pattern 'w[aoi]nder'.
    >
    

    What have we done? We've tested whether the string contains a 'w', followed by either an 'a', an 'o', or an 'i', followed by 'nder'; in effect, we're looking for either of 'wander', 'wonder', or 'winder'. Since the string contains 'wonder', the pattern is matched.

    Conversely, we can say that everything is acceptable except a given set of characters - we can 'negate the character class'. To do this, the character class should start with a ^, like so:

    > perl matchtest.plx
    Enter some text to find: th[^eo]
    'th[^eo]' was not found.
    > 
    

    So, we're looking for 'th' followed by something that is neither an 'e' nor an 'o'. But all we have is 'the' and 'thought', so this pattern does not match.

    If the characters you wish to match form a sequence in the character set you're using - ASCII or Unicode, depending on your perl version - you can use a hyphen to specify a range of characters, rather than spelling out the entire range. For instance, the numerals can be represented by the character class [0-9]. A lower case letter can be matched with [a-z]. Are there any digits in our quote?

    > perl matchtest.plx
    Enter some text to find: [0-9]
    '[0-9]' was not found.
    >
    

    You can use one or more of these ranges alongside other characters in a character class, so long as they stay inside the brackets. If you wanted to match a digit and then a letter from 'A' to 'F', you would say [0-9][A-F]. However, to match a single hexadecimal digit, you would write [0-9A-F], or [0-9A-Fa-f] if you wished to include lower-case letters.
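The character-class examples so far can be collected into one short sketch. It assumes the matchtest.plx test string; the %found hash and the list of hexadecimal candidates are made up for illustration.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Record whether each character-class pattern is found in the text.
my %found;
for my $pattern ('w[aoi]nder', 'th[^eo]', '[0-9]') {
    $found{$pattern} = ($text =~ /$pattern/) ? 1 : 0;
}

# Keep only the candidates that are exactly one hexadecimal digit.
my @hex_digits = grep { /^[0-9A-Fa-f]$/ } qw(7 c F g 10);

print "$_ => $found{$_}\n" for sort keys %found;
print "hex digits: @hex_digits\n";
```

The anchors around [0-9A-Fa-f] are what restrict each candidate to a single character; without them '10' would also pass, since it contains a hexadecimal digit.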

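    Some character classes are going to come up again and again: the digits, the letters, and the various types of whitespace. Perl provides us with some neat shortcuts for these. Here are the most common ones, and what they represent (the equivalents shown apply to plain ASCII text):

    Shortcut    Rough equivalent    Description
    \d          [0-9]               A digit
    \w          [0-9A-Za-z_]        A 'word' character (letter, digit, or underscore)
    \s          [ \t\n\r\f]         A whitespace character

    Also, the negative forms of the above:

    Shortcut    Rough equivalent    Description
    \D          [^0-9]              Any non-digit
    \W          [^0-9A-Za-z_]       Any non-'word' character
    \S          [^ \t\n\r\f]        Any non-whitespace character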

    So, if we wanted to see if there was a five-letter word in the sentence, you might think we could do this:

    > perl matchtest.plx
    Enter some text to find: \w\w\w\w\w
    The text matches the pattern '\w\w\w\w\w'.
    >
    

    But that's not right - there are no five-letter words in the sentence! The problem is, we've only asked for five letters in a row, and any word with at least five letters contains five in a row, so it will match that pattern. We actually matched 'wonde', which was the first possible series of five letters in a row. To actually get a five-letter word, we might consider deciding that the word must appear in the middle of the sentence, that is, between two spaces:

    > perl matchtest.plx
    Enter some text to find: \s\w\w\w\w\w\s
    '\s\w\w\w\w\w\s' was not found.
    >
    

    Word Boundaries

    The problem with that is, when we're looking at text, words aren't always between two spaces. They can be followed by or preceded by punctuation, or appear at the beginning or end of a string, or otherwise next to non-word characters. To help us properly search for words in these cases, Perl provides the special \b metacharacter. The interesting thing about \b is that it doesn't actually match any character in particular. Rather, it matches the point between something that isn't a word character (either \W or one of the ends of the string) and something that is (a word character), hence \b for boundary. So, for example, to look for one-letter words:

    > perl matchtest.plx
    Enter some text to find: \s\w\s
    '\s\w\s' was not found.
    > perl matchtest.plx
    Enter some text to find: \b\w\b
    The text matches the pattern '\b\w\b'.
    

    As the I was preceded by a quotation mark, a space wouldn't match it - but a word boundary does the job. Later, we'll learn how to tell perl how many repetitions of a character or group of characters we want to match without spelling it out directly.
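The sketch below repeats the five-letter and one-letter searches and shows which text was actually matched. It assumes the matchtest.plx test string and uses capturing parentheses in list context to retrieve the matched text, a feature this section has not introduced yet; the variable names are illustrative.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Five word characters in a row: captures the first such run.
my ($five_in_a_row) = $text =~ /(\w\w\w\w\w)/;

# A five-letter word delimited by whitespace, then by word boundaries.
my $between_spaces    = ($text =~ /\s\w\w\w\w\w\s/)   ? 1 : 0;
my $between_boundaries = ($text =~ /\b\w\w\w\w\w\b/) ? 1 : 0;

# A one-letter word delimited by whitespace, then by word boundaries.
my $one_between_spaces = ($text =~ /\s\w\s/) ? 1 : 0;
my ($one_letter_word)  = $text =~ /\b(\w)\b/;

print "five in a row: $five_in_a_row\n";
print "five-letter word: $between_spaces (spaces), $between_boundaries (boundaries)\n";
print "one-letter word: $one_letter_word\n";
```

With \b the five-letter search still fails, which is the correct answer for this sentence, while the one-letter search now finds the 'I' next to the opening quotation mark.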

    What, then, if we wanted to match anything at all? You might consider something like [\w\W] or [\s\S], for instance. Actually, this is quite a common operation, so Perl provides an easy way of specifying it - a full stop. (By default, the full stop matches any character except a newline.) What about an 'r' followed by two characters - any two characters - and then an 'h'?

    > perl matchtest.plx
    Enter some text to find: r..h
    The text matches the pattern 'r..h'.
    >
    

    Is there anything after the full stop?

    > perl matchtest.plx
    Enter some text to find: \..
    '\..' was not found.
    >
    

    What's that? One backslashed full stop to mean a full stop, then a plain one to mean 'anything at all'.
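A brief sketch of the full stop in action follows. It assumes the matchtest.plx test string and again uses capturing parentheses to show the matched text; the two-line string at the end is a made-up example of the default newline behavior.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# 'r', any two characters, then 'h': capture what was matched.
my ($r_to_h) = $text =~ /(r..h)/;

# A literal full stop followed by any one character.
my $after_stop = ($text =~ /\../) ? 1 : 0;

# By default the full stop does not match a newline.
my $across_newline = ("a\nb" =~ /a.b/) ? 1 : 0;

print "matched: '$r_to_h'\n";
print "anything after the full stop: $after_stop\n";
print "dot matches newline: $across_newline\n";
```

The match for r..h spans the end of 'wonder' and the start of 'what', since a space counts as 'any character' too.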

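    POSIX and Unicode Classes

    Perl 5.6.0 introduced a few more character classes into the mix - first, those defined by the POSIX (Portable Operating System Interface) standard, which are therefore present in a number of other applications. The more common character classes here include [:alpha:], [:alnum:], [:digit:], [:lower:], [:upper:], and [:punct:]. Each is written inside a bracketed character class, as in [[:digit:]].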

    The Unicode standard also defines 'properties', which apply to some characters. For instance, the 'IsUpper' property can be used to match any upper-case character, in whichever language or alphabet. If you know the property you are trying to match, you can use the syntax \p{} to match it; for instance, the upper-case character is \p{IsUpper}.
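As a rough sketch, POSIX classes and Unicode properties can be tried like this. It assumes the matchtest.plx test string, and the Greek letters (given by code point) are an example chosen here, not taken from the chapter.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# POSIX classes go inside a bracketed character class.
my ($first_upper) = $text =~ /([[:upper:]])/;
my $has_digit     = ($text =~ /[[:digit:]]/) ? 1 : 0;

# Unicode property: Greek capital delta (U+0394) and small delta (U+03B4).
my $capital_delta = "\x{0394}";
my $small_delta   = "\x{03B4}";
my $capital_is_upper = ($capital_delta =~ /\p{IsUpper}/) ? 1 : 0;
my $small_is_upper   = ($small_delta   =~ /\p{IsUpper}/) ? 1 : 0;

print "first upper-case letter: $first_upper\n";
print "has digit: $has_digit\n";
print "capital delta is upper: $capital_is_upper, small delta: $small_is_upper\n";
```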

    Alternatives

    Instead of giving a series of acceptable characters, you may want to say 'match either this or that'. The 'either-or' operator in a regular expression uses the same symbol as the bitwise 'or' operator, |. So, to match either 'yes' or 'maybe' in our example, we could say this:

    > perl matchtest.plx
    Enter some text to find: yes|maybe
    The text matches the pattern 'yes|maybe'.
    >
    

    That's either 'yes' or 'maybe'. But what if we wanted either 'yes' or 'yet'? To get alternatives on part of an expression, we need to group the options. In a regular expression, grouping is always done with parentheses:

    > perl matchtest.plx
    Enter some text to find: ye(s|t)
    The text matches the pattern 'ye(s|t)'.
    >
    

    If we had forgotten the parentheses, we would have tried to match either 'yes' or 't'. In this case, we'd still get a positive match, but it wouldn't be doing what we want - we'd get a match for any string with a 't' in it, whether the words 'yes' or 'yet' were there or not.
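The effect of the missing parentheses is easy to check with a string that contains a 't' but neither word. This is a minimal sketch assuming the matchtest.plx test string; the word 'tea' is a made-up test case.

```perl
use strict;
use warnings;

# Test string assumed from the matchtest.plx example in this chapter.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

my $yes_or_maybe = ($text =~ /yes|maybe/) ? 1 : 0;
my $yes_or_yet   = ($text =~ /ye(s|t)/)   ? 1 : 0;

# Hypothetical input with a 't' but without 'yes' or 'yet'.
my $word = 'tea';
my $ungrouped = ($word =~ /yes|t/)   ? 1 : 0;   # 'yes' or just 't'
my $grouped   = ($word =~ /ye(s|t)/) ? 1 : 0;   # 'yes' or 'yet'

print "yes|maybe: $yes_or_maybe, ye(s|t): $yes_or_yet\n";
print "'tea' against yes|t: $ungrouped, against ye(s|t): $grouped\n";
```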

    You can match either 'this' or 'that' or 'the other' by adding more alternatives:

    > perl matchtest.plx
    Enter some text to find: (this)|(that)|(the other)
    '(this)|(that)|(the other)' was not found.
    > 
    

    However, in this case, it's more efficient to separate out the common elements:

    > perl matchtest.plx
    Enter some text to find: th(is|at|e other)
    'th(is|at|e other)' was not found. 
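A quick way to convince yourself that the factored pattern accepts the same strings is to run both forms over a few samples. This is only a sketch: the sample strings and the @disagreements array are made up for illustration, and it says nothing about the relative speed of the two forms.

```perl
use strict;
use warnings;

# Hypothetical sample strings, some matching and some not.
my @samples = ('this', 'that', 'the other', 'them', 'other');

my @disagreements;
my $matches = 0;
for my $sample (@samples) {
    my $separate = ($sample =~ /(this)|(that)|(the other)/) ? 1 : 0;
    my $factored = ($sample =~ /th(is|at|e other)/)         ? 1 : 0;

    # Collect any sample where the two forms give different answers.
    push @disagreements, $sample if $separate != $factored;
    $matches += $factored;
}

print "matches: $matches, disagreements: ", scalar(@disagreements), "\n";
```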
    

    You can also nest alternatives. Say you want to match one of these patterns:

    'the' followed by whitespace or a letter,
    'or'

    You might put something like this:

    > perl matchtest.plx
    Enter some text to find: (the(\s|[a-z]))|or
    The text matches the pattern '(the(\s|[a-z]))|or'.
    >
    

    It looks fearsome, but break it down into its components. Our two alternatives are:

    the(\s|[a-z])
    or

    The second part is easy, while the first contains 'the' followed by two alternatives: \s and [a-z]. Hence 'either "the" followed by either a whitespace character or a lower case letter, or "or"'. We can, in fact, tidy this up a little, by replacing (\s|[a-z]) with the less cluttered [\sa-z].

    > perl matchtest.plx
    Enter some text to find: (the[\sa-z])|or
    The text matches the pattern '(the[\sa-z])|or'.
    >
    


    © 2003-2018 by Developer Shed. All rights reserved.