10 - Regular Expressions

You can use a regular expression to find patterns in strings: for example, to look for a specific name in a phone list or all of the names that start with the letter a. Pattern matching is one of Perl's most powerful and probably least understood features. But after you read this chapter, you'll be able to handle regular expressions almost as well as a Perl guru. With a little practice, you'll be able to do some incredibly handy things.

There are three main uses for regular expressions in Perl: matching, substitution, and translation. The matching operation uses the m// operator, which evaluates to a true or false value. The substitution operation substitutes one expression for another; it uses the s/// operator. The translation operation translates one set of characters to another and uses the tr/// operator. These operators are summarized in Table 10.1.

Table 10.1 - Perl's Regular Expression Operators
Operator Description
m/PATTERN/ This operator returns true if PATTERN is found in $_.
s/PATTERN/REPLACEMENT/ This operator replaces the substring matched by PATTERN with REPLACEMENT.
tr/CHARACTERS/REPLACEMENTS/ This operator replaces characters specified by CHARACTERS with the characters in REPLACEMENTS.

All three regular expression operators work with $_ as the string to search. You can use the binding operators (see the section "The Binding Operators (=~ and !~)" later in this chapter) to search a variable other than $_.

Both the matching (m//) and the substitution (s///) operators perform variable interpolation on the PATTERN and REPLACEMENT strings. This comes in handy if you need to read the pattern from the keyboard or a file.

If the match pattern evaluates to the empty string, the last successfully matched pattern is used. So, if you see a statement like print if //; in a Perl program, look for the previous regular expression operator to see what the pattern really is. The substitution operator also uses this interpretation of the empty pattern.

In this chapter, you learn about pattern delimiters and then about each type of regular expression operator. After that, you learn how to create patterns in the section "How to Create Patterns". Then, the "Pattern Examples" section shows you some situations and how regular expressions can be used to resolve them.

Pattern Delimiters

Every regular expression operator allows the use of alternative pattern delimiters. A delimiter marks the beginning and end of a given pattern. In the following statement,

m//;
you see two of the standard delimiters - the slashes (//). However, you can use almost any non-whitespace character as the delimiter. This feature is useful if you want to use the slash character inside your pattern. For instance, to match a file name you would normally use:

m/\/root\/home\/random.dat/
This match statement is hard to read because all of the slashes seem to run together (some programmers say they look like teepees). If you use an alternate delimiter, it might look like this:

m!/root/home/random.dat!
or

m{/root/home/random.dat}
You can see that these examples are a little clearer. The last example also shows that if an opening bracket, such as the left curly brace, is used as the starting delimiter, then the ending delimiter must be the matching closing bracket.
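As a quick check, the following sketch runs the same match with all three delimiter styles against $_. The variable names are only illustrative and do not come from the original examples.

```perl
use strict;
use warnings;

# The same pattern written with three different delimiters.
$_ = "/root/home/random.dat";

my $with_slashes = m/\/root\/home\/random.dat/ ? 1 : 0;
my $with_bangs   = m!/root/home/random.dat!    ? 1 : 0;
my $with_braces  = m{/root/home/random.dat}    ? 1 : 0;

# All three forms describe the same pattern, so all three succeed.
print "slashes: $with_slashes, bangs: $with_bangs, braces: $with_braces\n";
```

Only the first form needs the backslashes, because only there is the slash also the delimiter.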

Errata Note
The printed version of this book shows the above examples as m!\/root\/home\/random.dat! and as m{\/root\/home\/random.dat}. While I was writing the book it did not occur to me that the / character was not a metacharacter and only needed to be escaped because of the delimiters. Obviously, if the / character is the delimiter, it needs to be escaped in order to use it inside the pattern. However, if an alternative delimiter is used, it no longer needs to be escaped. This fact was pointed out to me by Garen Deve.

Both the match and substitution operators let you use variable interpolation. You can take advantage of this to use a single-quoted string that does not require the slash to be escaped. For instance:

$file = '/root/home/random.dat';
m/$file/; 
You might find that this technique yields clearer code than simply changing the delimiters.

If you choose the single quote as your delimiter character, then no variable interpolation is performed on the pattern. However, you still need to use the backslash character to escape any of the meta-characters discussed in the "How to Create Patterns" section later in this chapter.
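The difference between slash and single-quote delimiters can be sketched as follows. The $name variable and the sample text are assumptions made up for this illustration.

```perl
use strict;
use warnings;

my $name = "Jackie";
$_ = 'Send mail to $name';   # single-quoted, so $_ holds the literal text $name

# Slash delimiters: $name is interpolated, so the pattern is "Jackie".
my $interpolated = m/$name/ ? "match" : "no match";

# Single-quote delimiters: no interpolation, but the $ meta-character
# must still be escaped to match a literal dollar sign.
my $literal = m'\$name' ? "match" : "no match";

print "interpolated: $interpolated\n";   # interpolated: no match
print "literal:      $literal\n";        # literal:      match
```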

Tip
I tend to avoid delimiters that might be confused with characters in the pattern. For example, using the plus sign as a delimiter (m+abc+) does not help program readability. A casual reader might think that you intend to add two expressions instead of matching them.

Caution
The ? has a special meaning when used as a match pattern delimiter. It works like the / delimiter except that it matches only once between calls to the reset() function. This feature may be removed in future versions of Perl, so avoid using it.

The next few sections look at the matching, substitution, and translation operators in more detail.

The Matching Operator (m//)

The matching operator (m//) is used to find patterns in strings. One of its more common uses is to look for a specific string inside a data file. For instance, you might look for all customers whose last name is "Johnson" or you might need a list of all names starting with the letter s.

By default, the matching operator searches the $_ variable. This makes the match statement shorter because you don't need to specify where to search. Here is a quick example:

$_ = "AAA bbb AAA";
print "Found bbb\n" if  m/bbb/;
The print statement is executed only if the bbb character sequence is found in the $_ variable. In this particular case, bbb will be found, so the program will display the following:

Found bbb
The matching operator allows you to use variable interpolation in order to create the pattern. For example:

$needToFind = "bbb";
$_ = "AAA bbb AAA";
print "Found bbb\n" if  m/$needToFind/;
Using the matching operator is so commonplace that Perl allows you to leave off the m from the matching operator as long as slashes are used as delimiters:

$_ = "AAA bbb AAA";
print "Found bbb\n" if  /bbb/;
Using the matching operator to find a string inside a file is very easy because the defaults are designed to facilitate this activity. For example:

$target = "M";

# Stop with a message if the data file cannot be opened.
open(INPUT, "<findstr.dat") or die "Cannot open findstr.dat: $!";

while (<INPUT>) {
     if (/$target/) {
         print "Found $target on line $.\n";
     }
}
close(INPUT);
Note
The $. special variable keeps track of the record number. Every time the diamond operators read a line, this variable is incremented.

This example reads every line in the input file, searching for the letter M. When an M is found, the print statement is executed. The print statement prints the letter that is found and the line number it was found on.

The Matching Options

The matching operator has several options that enhance its utility. The most useful options are probably the capability to ignore case and the capability to create an array of all matches in a string. Table 10.2 shows the options you can use with the matching operator.

Table 10.2 - Options for the Matching Operator
Option Description
g This option finds all occurrences of the pattern in the string. A list of matches is returned or you can iterate over the matches using a loop statement.
i This option ignores the case of characters in the string.
m This option treats the string as multiple lines. The ^ and $ anchors then match at the beginning and end of every line inside the string instead of only at the beginning and end of the whole string. Use this option if you know that $_ contains multiple newline characters and you need to match line by line.
o This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s This option treats the string as a single line, so the period meta-character also matches the newline character.
x This option lets you use extended regular expressions. Basically, this means that Perl will ignore whitespace that's not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.

All options are specified after the last pattern delimiter. For instance, if you want the match to ignore the case of the characters in the string, you can do this:

$_ = "AAA BBB AAA";
print "Found bbb\n" if  m/bbb/i;
This program finds a match even though the pattern uses lowercase and the string uses uppercase because the /i option was used, telling Perl to ignore the case.

The result from a global pattern match can be assigned to an array variable or used inside a loop. This feature comes in handy after you learn about meta-characters in the section called "How to Create Patterns" later in this chapter.
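As a small illustration of the /g and /i options working together, the sketch below collects every match into an array and then counts the matches again with a loop. The sample string and variable names are assumptions for this example only.

```perl
use strict;
use warnings;

$_ = "AAA bbb aaa BBB bbb";

# In list context, a /g match returns every matched substring.
my @found = m/bbb/gi;
my $count = scalar @found;

print "Found $count matches: @found\n";   # Found 3 matches: bbb BBB bbb

# In a loop, each /g match continues where the previous one stopped
# and returns false when no more matches are left.
my $loop_count = 0;
$loop_count++ while m/bbb/gi;

print "Loop saw $loop_count matches\n";   # Loop saw 3 matches
```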

The Substitution Operator (s///)

The substitution operator (s///) is used to change strings. It requires two operands, like this:

s/a/z/;
This statement changes the first a in $_ into a z. Not too complicated, huh? Things won't get complicated until we start talking about regular expressions in earnest in the section "How to Create Patterns" later in the chapter.

You can use variable interpolation with the substitution operator just as you can with the matching operator. For instance:

$needToReplace   = "bbb";
$replacementText = "1234567890";
$_ = "AAA bbb AAA";
$result = s/$needToReplace/$replacementText/;
Note
You can use variable interpolation in the replacement pattern as shown here, but none of the meta-characters described later in the chapter can be used in the replacement pattern.

This program changes the $_ variable to hold "AAA 1234567890 AAA" instead of its original value, and the $result variable will be equal to 1 - the number of substitutions made.

Frequently, the substitution operator is used to remove substrings. For instance, if you want to remove the "bbb" sequence of characters from the $_ variable, you could do this:

s/bbb//;
By replacing the matched string with nothing, you have effectively deleted it.

If brackets of any type are used as delimiters for the search pattern, you need to use a second set of brackets to enclose the replacement pattern. For instance:

$_ = "AAA bbb AAA";
$result = s{bbb}{1234567890};

The Substitution Options

Like the matching operator, the substitution operator has several options. One interesting option is the capability to evaluate the replacement pattern as an expression instead of a string. You could use this capability to find all numbers in a file and multiply them by a given percentage, for instance. Or you could repeat matched strings by using the string repetition operator. Table 10.3 shows all of the options you can use with the substitution operator.

Table 10.3 - Options for the Substitution Operator
Option Description
e This option forces Perl to evaluate the replacement pattern as an expression.
g This option replaces all occurrences of the pattern in the string.
i This option ignores the case of characters in the string.
m This option treats the string as multiple lines. The ^ and $ anchors then match at the beginning and end of every line inside the string instead of only at the beginning and end of the whole string. Use this option if you know that $_ contains multiple newline characters and you need to match line by line.
o This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s This option treats the string as a single line, so the period meta-character also matches the newline character.
x This option lets you use extended regular expressions. Basically, this means that Perl ignores whitespace that is not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.

The /e option also changes how single-quote delimiters are interpreted. Normally, single quotes turn off variable interpolation in the replacement pattern; when /e is used, the replacement is evaluated as a Perl expression anyway. Note that Perl 5 treats back quotes as ordinary delimiters: the replacement pattern is not executed as a DOS or UNIX command.
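The two uses of /e mentioned above (scaling numbers and repeating matched strings) might look something like the following sketch. The sample data and the 1.5 multiplier are assumptions; the (\d+) component, which stores a run of digits in $1, is explained later in this chapter.

```perl
use strict;
use warnings;

$_ = "apples 10 pears 20";

# /e evaluates the replacement as Perl code; /g applies it to every number.
my $changed = s/(\d+)/$1 * 1.5/ge;
my $scaled  = $_;

print "$changed substitutions: $scaled\n";   # 2 substitutions: apples 15 pears 30

# The string repetition operator (x) can be used the same way.
$_ = "ab-";
s/(ab)/$1 x 3/e;
my $repeated = $_;

print "$repeated\n";                          # ababab-
```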

The Translation Operator (tr///)

The translation operator (tr///) is used to change individual characters in the $_ variable. It requires two operands, like this:

tr/a/z/;
This statement translates all occurrences of a into z. If you specify more than one character in the match character list, you can translate multiple characters at a time. For instance:

tr/ab/z/;
translates all a and all b characters into the z character. If the replacement list of characters is shorter than the target list of characters, the last character in the replacement list is repeated as often as needed. However, if more than one replacement character is given for a matched character, only the first is used. For instance:

tr/WWW/ABC/;
results in all W characters being converted to an A character. The rest of the replacement list is ignored.
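Here is a short sketch of the short-replacement-list rule in action. The sample string is an assumption chosen for this illustration; the return value of tr is the number of characters that matched.

```perl
use strict;
use warnings;

$_ = "banana bread";

# Both a and b are translated to z because the last replacement
# character is repeated for the rest of the match list.
my $count = tr/ab/z/;

print "$count characters translated: $_\n";   # 6 characters translated: zznznz zrezd
```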

Unlike the matching and substitution operators, the translation operator doesn't perform variable interpolation.

Note
The tr operator gets its name from the UNIX tr utility. If you are familiar with the tr utility, then you already know how to use the tr operator.

The UNIX sed utility uses a y to indicate translations. To make learning Perl easier for sed users, y is supported as a synonym for tr.

The Translation Options

The translation operator has options different from the matching and substitution operators. You can delete matched characters, replace repeated characters with a single character, and translate only characters that don't match the character list. Table 10.4 shows the translation options.

Table 10.4 - Options for the Translation Operator
Option Description
c This option complements the match character list. In other words, the translation is done for every character that does not match the character list.
d This option deletes any character in the match list that does not have a corresponding character in the replacement list.
s This option reduces repeated instances of matched characters to a single instance of that character.

Normally, if the match list is longer than the replacement list, the last character in the replacement list is used as the replacement for the extra characters. However, when the d option is used, the matched characters are simply deleted.

If the replacement list is empty, then no translation is done. The operator will still return the number of characters that matched, though. This is useful when you need to know how often a given letter appears in a string. This feature also can compress repeated characters using the s option.
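The counting, squeezing, and deleting behaviors described above can be sketched like this. The sample word and the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

$_ = "Mississippi";

# An empty replacement list changes nothing but still counts matches.
my $count_s     = tr/s//;     # number of s characters
my $count_other = tr/s//c;    # c option: number of characters that are not s
my $unchanged   = $_;

# s option: runs of a matched character collapse to a single character.
tr/sp//s;
my $squeezed = $_;

# d option: matched characters with no replacement are deleted.
tr/i//d;
my $deleted = $_;

print "$count_s $count_other $unchanged\n";   # 4 7 Mississippi
print "$squeezed\n";                          # Misisipi
print "$deleted\n";                           # Mssp
```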

Tip
UNIX programmers may be familiar with using the tr utility to convert lowercase characters to uppercase characters, or vice versa. Perl now has the lc() and uc() functions that can do this much quicker.

The Binding Operators (=~ and !~)

The search, modify, and translation operations work on the $_ variable by default. What if the string to be searched is in some other variable? That's where the binding operators come into play. They let you bind the regular expression operators to a variable other than $_. There are two forms of the binding operator: the regular =~ and its complement !~. The following small program shows the syntax of the =~ operator:

$scalar       = "The root has many leaves";
$match        = $scalar =~ m/root/;
$substitution = $scalar =~ s/root/tree/;
$translate    = $scalar =~ tr/h/H/;

print("\$match        = $match\n");
print("\$substitution = $substitution\n");
print("\$translate    = $translate\n");
print("\$scalar       = $scalar\n");
This program displays the following:

$match        = 1
$substitution = 1
$translate    = 2
$scalar       = THe tree Has many leaves
This example uses all three of the regular expression operators with the regular binding operator. Each of the regular expression operators was bound to the $scalar variable instead of $_. This example also shows the return values of the regular expression operators. If you don't need the return values, you could do this:

$scalar = "The root has many leaves";
print("String has root.\n") if $scalar =~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:

String has root.
$scalar = THe tree Has many leaves
The left operand of the binding operator is the string to be searched, modified, or transformed; the right operand is the regular expression operator to be evaluated.

The complementary binding operator is valid only when used with the matching regular expression operator. If you use it with the substitution or translation operator, you get the following message if you're using the -w command-line option to run Perl:

Useless use of not in void context at test.pl line 4.
You can see that the !~ is the opposite of =~ by replacing the =~ in the previous example:

$scalar = "The root has many leaves";
print("String has root.\n") if $scalar !~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:

$scalar = THe tree Has many leaves
The first print line does not get executed because the complementary binding operator returns false.
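One place where the two binding operators are handy is in filtering a list. The sketch below uses Perl's built-in grep function, which is not covered in this section; the @lines data is an assumption made up for the example.

```perl
use strict;
use warnings;

my @lines = ("root: ok", "leaf: ok", "root: bad");

# =~ keeps the lines that contain the pattern...
my @with_root    = grep { $_ =~ m/root/ } @lines;

# ...and !~ keeps the lines that do not.
my @without_root = grep { $_ !~ m/root/ } @lines;

print "with:    @with_root\n";      # with:    root: ok root: bad
print "without: @without_root\n";   # without: leaf: ok
```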

How to Create Patterns

So far in this chapter, you've read about the different operators used with regular expressions, and you've seen how to match simple sequences of characters. Now we'll look at the wide array of meta-characters that are used to harness the full power of regular expressions. Meta-characters are characters that have an additional meaning above and beyond their literal meaning. For example, the period character can have two meanings in a pattern. First, it can be used to match a period character in the searched string - this is its literal meaning. And second, it can be used to match any character in the searched string except for the newline character - this is its meta-meaning.

When creating patterns, the meta-meaning will always be the default. If you really intend to match the literal character, you need to prefix the meta-character with a backslash. You might recall that the backslash is used to create an escape sequence.

Patterns can have many different components. These components all combine to provide you with the power to match any type of string. The following list of components will give you a good idea of the variety of ways that patterns can be created. The section "Pattern Examples" later in this chapter shows many examples of these rules in action.

The power of patterns is that you don't always know in advance the value of the string that you will be searching. If you need to match the first word in a string that was read in from a file, you probably have no idea how long it might be; therefore, you need to build a pattern. You might start with the \w symbolic character class, which will match any single alphanumeric or underscore character. So, assuming that the string is in the $_ variable, you can match a one-character word like this:

m/\w/;
If you need to match both a one-character word and a two-character word, you can do this:

m/\w|\w\w/;
This pattern says to match a single word character or two consecutive word characters. You could continue to add alternation components to match the different lengths of words that you might expect to see, but there is a better way.

You can use the + quantifier to say that the match should succeed only if the component is matched one or more times. It is used this way:

m/\w+/;
If the value of $_ was "AAA BBB", then m/\w+/; would match the "AAA" in the string. If $_ was blank, full of whitespace, or full of other non-word characters, a false value would be returned.

The preceding pattern will let you determine if $_ contains a word but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components inside parentheses. For example:

m/(\w+)/;
By doing this, you force Perl to store the matched string into the $1 variable. The $1 variable can be considered as pattern memory.

This introduction to pattern components describes most of the details you need to know in order to create your own patterns or regular expressions. However, some of the components deserve a bit more study. The next few sections look at character classes, quantifiers, pattern memory, pattern precedence, and the extension syntax. Then the rest of the chapter is devoted to showing specific examples of when to use the different components.

Example: Character Classes

A character class defines a type of character. The character class [0123456789] defines the class of decimal digits, and [0-9a-f] defines the class of hexadecimal digits. Notice that you can use a dash to define a range of consecutive characters. Character classes let you match any of a range of characters; you don't know in advance which character will be matched. This capability to match non-specific characters is what meta-characters are all about.

You can use variable interpolation inside the character class, but you must be careful when doing so. For example,

$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;
will display

matched
This is because the variable interpolation results in a character class of [ADE]. If you use the variable as one-half of a character range, you need to ensure that the range is still valid - for instance, that you don't mix letters and digits. For example,

$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;
will result in the following error message when executed:

/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
At times, it's necessary to match on any character except for a given character list. This is done by complementing the character class with the caret. For example,

$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;
will display nothing. This match returns true only if a character besides A, B, or C is in the searched string. If you complement a list with just the letter A,

$_ = "AAABBBCCC";
print "matched" if m/[^A]/;
then the string "matched" will be displayed because B and C are part of the string - in other words, a character besides the letter A.
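To tie ranges and complemented classes together, here is a minimal sketch that checks whether strings consist only of hexadecimal digits. The candidate strings are assumptions for this example, and the ^ and $ anchors force the whole string to be tested.

```perl
use strict;
use warnings;

my %is_hex;
for my $candidate ("ff00", "cafe", "12g4") {
    # Every character between the anchors must belong to the class.
    $is_hex{$candidate} = ($candidate =~ m/^[0-9a-f]+$/) ? 1 : 0;
}

# A complemented class finds the first character that is NOT a hex digit.
my ($bad) = "12g4" =~ m/([^0-9a-f])/;

print "ff00: $is_hex{'ff00'}, cafe: $is_hex{'cafe'}, 12g4: $is_hex{'12g4'}\n";
print "first bad character: $bad\n";   # first bad character: g
```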

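Perl has shortcuts for some character classes that are frequently used. Here is a list of what I call symbolic character classes:

\w This symbol matches any alphanumeric character or the underscore character.
\W This symbol matches any character that \w does not match.
\s This symbol matches any whitespace character, such as a space, tab, or newline.
\S This symbol matches any non-whitespace character.
\d This symbol matches any digit.
\D This symbol matches any non-digit character.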

You can use these symbols inside other character classes but not as endpoints of a range. For example, you can do the following:

$_ = "\tAAA";
print "matched" if m/[\d\s]/;
which will display

matched
because the value of $_ includes the tab character.

Tip
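Most meta-characters that appear inside the square brackets that define a character class are used in their literal sense. They lose their meta-meaning. The exceptions are the caret (when it is the first character), the dash (when it is between two characters), the right square bracket, and the backslash. This may be a little confusing at first. In fact, I have a tendency to forget this when evaluating patterns.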

Note
I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a caret, or it could be used to complement a character class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it.

Example: Quantifiers

Perl provides several different quantifiers that let you specify how many times a given component must be present before the match is true. They are used when you don't know in advance how many characters need to be matched. Table 10.6 lists the different quantifiers that can be used.

Table 10.6 - The Six Types of Quantifiers
Quantifier Description
* The component must be present zero or more times.
+ The component must be present one or more times.
? The component must be present zero or one times.
{n} The component must be present n times.
{n,} The component must be present at least n times.
{n,m} The component must be present at least n times and no more than m times.

If you need to match a word whose length is unknown, you need to use the + quantifier. You can't use an * because a zero length word makes no sense. So, the match statement might look like this:

m/^\w+/;
This pattern will match "QQQ" and "AAAAA" but not "" or " BBB ". In order to account for the leading whitespace, which may or may not be at the beginning of a string, you need to use the asterisk (*) quantifier in conjunction with the \s symbolic character class in the following way:

m/\s*\w+/;
Tip
Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The pattern /b*/ will match any string - even one without any b characters.

Errata Note
The printed version of this book has the first match statement as m/\w+/; notice that the pattern anchor was left out.

At times, you may need to match an exact number of components. The following match statement will be true only if five words are present in the $_ variable:

$_ = "AA AB AC AD AE";
m/^(\w+(\W+|$)){5}$/;
In this example, we are matching at least one word character followed by either one or more non-word characters or the end of the string. The alternation is needed because nothing follows the last word: the end of a string is a position, not a non-word character, so \W+ by itself cannot match there. The {5} quantifier is used to ensure that that combination of components is present five times.

Errata Note
The printed version of the book used the pattern m/(\w+\s*){5}/; in order to match the five words. This is incorrect since the pattern \w+\s* can match a single character (remember that * matches zero or more instances of a character). Therefore m/(\w+\s*){5}/; matches "AAAAA" as well as "A A A A A".
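The counted quantifiers can be tried out with a short sketch like the one below. The five-digit code and the other sample strings are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# {n}: exactly five digits between the anchors.
my %five_digits;
for my $code ("12345", "1234", "123456") {
    $five_digits{$code} = ($code =~ m/^\d{5}$/) ? 1 : 0;
}

# {n,m}: at least two and at most four; the quantifier takes as many as it can.
my ($run) = "aaaaaa" =~ m/(a{2,4})/;

# {n,}: at least three, so two characters are not enough.
my $too_short = ("aa" =~ m/a{3,}/) ? 1 : 0;

print "12345: $five_digits{'12345'}, 1234: $five_digits{'1234'}, ",
      "123456: $five_digits{'123456'}\n";       # 12345: 1, 1234: 0, 123456: 0
print "run: $run, too short: $too_short\n";     # run: aaaa, too short: 0
```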

The * and + quantifiers are greedy. They match as many characters as possible. This may not always be the behavior that you need. You can create non-greedy components by following the quantifier with a ?.

Use the following file specification in order to look at the * and + quantifiers more closely:

$_ = '/user/Jackie/temp/names.dat';
The regular expression .* will match the entire file specification. This can be seen in the following small program:

$_ = '/user/Jackie/temp/names.dat';
m/.*/;
print $&;
This program displays

/user/Jackie/temp/names.dat
You can see that the * quantifier is greedy. It matched the whole string. If you add the ? modifier to make the .* component non-greedy, what do you think the program would display?

$_ = '/user/Jackie/temp/names.dat';
m/.*?/;
print $&;
This program displays nothing because the smallest number of characters that the * can match is zero. If we change the * to a +, then the program will display

/
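The difference becomes more practical when something has to follow the quantified component. The following sketch, which borrows parentheses from the pattern memory discussion, pulls a different piece out of the same file specification depending on greediness; the variable names are assumptions for this example.

```perl
use strict;
use warnings;

my $path = '/user/Jackie/temp/names.dat';

# Greedy: .* grabs as much as it can and still leaves a final slash.
my ($greedy) = $path =~ m{^/(.*)/};

# Non-greedy: .*? stops at the first slash that lets the match succeed.
my ($lazy) = $path =~ m{^/(.*?)/};

print "greedy: $greedy\n";   # greedy: user/Jackie/temp
print "lazy:   $lazy\n";     # lazy:   user
```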
Next, let's look at the concept of pattern memory, which lets you keep bits of matched string around after the match is complete.

Example: Pattern Memory

Matching arbitrary numbers of characters is fine, but without the capability to find out what was matched, patterns would not be very useful. Perl lets you enclose pattern components inside parentheses in order to store the string that matched the components into pattern memory. You might also hear pattern memory referred to as pattern buffers. This memory persists after the match statement is finished executing so that you can assign the matched values to other variables.

You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program

$_ =  "AAA BBB CCC";
m/(\w+)/;
print("$1\n");
will display

AAA

You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. The pattern matched by the first set is placed into $1. The pattern matched by the second set is placed into $2. And so on.

If you want to find all the words in the string, you need to use the /g match option. In order to find all the words, you can use a loop statement that loops until the match operator returns false.

$_ =  "AAA BBB CCC";

while (m/(\w+)/g) {
    print("$1\n");
}
The program will display

AAA
BBB
CCC
If looping through the matches is not the right approach for your needs, perhaps you need to create an array consisting of the matches.

$_ =  "AAA BBB CCC";
@matches = m/(\w+)/g;
print("@matches\n");
The program will display

AAA BBB CCC
Perl also has a few special variables to help you know what matched and what did not. These variables will occasionally save you from having to add parentheses to find information.

Tip
If you need to save the value of the matched strings stored in the pattern memory, make sure to assign them to other variables. Pattern memory is local to the enclosing block and lasts only until another successful match is done.
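A minimal sketch of using several buffers at once, and of copying them before the next match overwrites them, might look like this. The phone-list entry and the variable names are assumptions for this illustration.

```perl
use strict;
use warnings;

my $entry = "Johnson 555-1234";
my ($name, $prefix, $number);

# Three sets of parentheses fill $1, $2, and $3 from left to right.
if ($entry =~ m/(\w+)\s+(\d+)-(\d+)/) {
    ($name, $prefix, $number) = ($1, $2, $3);
}

# A later successful match replaces the contents of $1...
my $later;
if ("CCC" =~ m/(\w+)/) {
    $later = $1;
}

# ...but the copies made earlier are still intact.
print "$name $prefix $number\n";   # Johnson 555 1234
print "$later\n";                  # CCC
```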

Example: Pattern Precedence

Pattern components have an order of precedence just as operators do. If you see the following pattern:

m/a|b+/
it's hard to tell if the pattern should be

m/(a|b)+/   # match any sequence of "a" and "b" characters
            # in any order.
or

m/a|(b+)/   # match either the "a" character or the "b" character
            # repeated one or more times.
The order of precedence shown in Table 10.7 is designed to solve problems like this. By looking at the table, you can see that quantifiers have a higher precedence than alternation. Therefore, the second interpretation is correct.

Table 10.7 - The Pattern Component Order of Precedence
Precedence Level Component
1 Parentheses
2 Quantifiers
3 Sequences and Anchors
4 Alternation

Tip
You can use parentheses to affect the order that components are evaluated because they have the highest precedence. However, unless you use the extended syntax, you will be affecting the pattern memory.
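The effect of precedence can be checked with a small sketch like the one below. The test strings are assumptions; the outer parentheses are there only so that the matched text can be captured and examined.

```perl
use strict;
use warnings;

# Quantifiers bind tighter than alternation: a|b+ means a or (b+).
my ($alternation) = "bbba" =~ m/(a|b+)/;

# Parentheses force the alternation to be repeated instead.
my ($grouped) = "bbba" =~ m/((a|b)+)/;

# Anchors also bind tighter than alternation: ^a|b$ means (^a) or (b$).
my $anchored_either = ("axx" =~ m/^a|b$/)   ? 1 : 0;
my $anchored_both   = ("axx" =~ m/^(a|b)$/) ? 1 : 0;

print "alternation: $alternation\n";                        # alternation: bbb
print "grouped:     $grouped\n";                            # grouped:     bbba
print "either: $anchored_either, both: $anchored_both\n";   # either: 1, both: 0
```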

Example: Extension Syntax

The regular expression extensions are a way to significantly add to the power of patterns without adding a lot of meta-characters to the proliferation that already exists. By using the basic (?...) notation, the regular expression capabilities can be greatly extended.

At this time, Perl recognizes five extensions. These vary widely in functionality - from adding comments to setting options. Table 10.8 lists the extensions and gives a short description of each.

Table 10.8 - Five Extension Components
Extension Description
(?# TEXT) This extension lets you add comments to your regular expression. The TEXT value is ignored.
(?:...) This extension lets you add parentheses to your regular expression without causing a pattern memory position to be used.
(?=...) This extension lets you match values without including them in the $& variable.
(?!...) This extension lets you specify what should not follow your pattern. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird".
(?sxi) This extension lets you specify an embedded option in the pattern rather than adding it after the last delimiter. This is useful if you are storing patterns in variables and using variable interpolation to do the matching.
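A few of the extensions in Table 10.8 can be tried with the sketch below. The word list, the "Dear JACKIE" string, and the variable names are assumptions made up for this illustration.

```perl
use strict;
use warnings;

my @words = qw(bluebox bluesy bluebird);

# (?!...): keep the words where "blue" is not followed by "bird".
my @not_bird = grep { m/blue(?!bird)/ } @words;

# (?=...): "bird" must follow, but it is not part of the matched text in $&.
my $matched = "";
if ("bluebird" =~ m/blue(?=bird)/) {
    $matched = $&;
}

# (?i): the ignore-case option travels with a pattern stored in a variable.
my $pattern = "(?i)jackie";
my $found   = ("Dear JACKIE" =~ m/$pattern/) ? 1 : 0;

print "@not_bird\n";   # bluebox bluesy
print "$matched\n";    # blue
print "$found\n";      # 1
```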

By far the most useful feature of extended mode, in my opinion, is the ability to add comments directly inside your patterns. For example, would you rather see a pattern that looks like this:

# Match a string with two words. $1 will be the
# first word. $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;
or one that looks like this:

m/
    (?# This pattern will match any string with two)
    (?# and only two words in it. The matched words)
    (?# will be available in $1 and $2 if the match)
    (?# is successful.)

    ^      (?# Anchor this match to the beginning)
           (?# of the string)

    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    (\w+)  (?# Match the first word, we know it's)
           (?# the first word because of the anchor)
           (?# above. Place the matched word into)
           (?# pattern memory.)

    \W+    (?# Match at least one non-word)
           (?# character, there may be more than one)

    (\w+)  (?# Match another word, put into pattern)
           (?# memory also.)

    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    $      (?# Anchor this match to the end of the)
           (?# string. Because both ^ and $ anchors)
           (?# are present, the entire string will)
           (?# need to match the pattern. A)
           (?# sub-string that fits the pattern will)
           (?# not match.)
/x;
Of course, the commented pattern is much longer, but both patterns take the same amount of time to execute. In addition, it will be much easier to maintain the commented pattern because each component is explained. When you know what each component is doing in relation to the rest of the pattern, it becomes easy to modify its behavior when the need arises.
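
When the /x option is in effect, Perl also treats a # character as the start of a comment that runs to the end of the line, which can be a lighter alternative to the (?# TEXT) notation. The sketch below applies the same two-word pattern to a few sample strings; the samples and the @results array are assumptions made for the example.

```perl
use strict;
use warnings;

my @samples = ('  hello, world ', 'one two three', 'single');
my @results;

foreach my $sample (@samples) {
    if ($sample =~ m/
            ^          # anchor to the beginning of the string
            \s*        # optional leading whitespace
            (\w+)      # first word, saved in pattern memory
            \W+        # one or more non-word characters
            (\w+)      # second word, saved in pattern memory
            \s*        # optional trailing whitespace
            $          # anchor to the end of the string
        /x) {
        push(@results, "$1+$2");
    }
    else {
        push(@results, 'no match');
    }
}

print(join(' | ', @results), "\n");
```

Only the first sample contains exactly two words, so it is the only one that matches. With this style, keep the pattern delimiter out of the comments, because a slash inside a comment would still end the pattern.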

Extensions also let you change the order of evaluation without affecting pattern memory. For example,

m/(?:a|b)+/;
matches the a or b characters repeated one or more times in any order. The pattern memory will not be affected.
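
To see the effect on pattern memory, you can compare a capturing group with its non-capturing equivalent. This is a small sketch; the test string and the trailing (y) component are made up for the comparison.

```perl
use strict;
use warnings;

my $text = 'xxabbay';

# Capturing parentheses: the group takes memory position 1,
# so the trailing (y) lands in $2.
my @captured = $text =~ m/(a|b)+(y)/;

# Non-capturing (?:...): same grouping, but (y) is now $1.
my @uncaptured = $text =~ m/(?:a|b)+(y)/;

print("capturing:     @captured\n");
print("non-capturing: @uncaptured\n");
```

Both patterns match the same text. The difference is only in how many memory positions are filled and which one holds the y.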

At times, you might like to include a pattern component in your pattern without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:

David    Veterinarian 56
Jackie  Orthopedist 34
Karen Veterinarian 28
and you want to find all veterinarians and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example:

while (<>) {
    push(@array, $&) if m/^\w+(?=\s+Vet)/;
}

print("@array\n");
This program will display:

David Karen
Let's look at the pattern with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the pattern because the pattern is part of the if statement modifier. Adding comments in that location would make the comments hard to format. So let's use a different tactic.

$pattern = '^\w+     (?# Match the first word in the string)

            (?=\s+   (?# Use a look-ahead assertion to match)
                     (?# one or more whitespace characters)

               Vet)  (?# In addition to the whitespace, make)
                     (?# sure that the next column starts)
                     (?# with the character sequence "Vet")
           ';

while (<>) {
    push(@array, $&) if m/$pattern/x;
}

print("@array\n");
Here we used a variable to hold the pattern and then used variable interpolation in the pattern with the match operator. You might want to pick a more descriptive variable name than $pattern, however.
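
If you want to see exactly what the look-ahead assertion keeps out of the $& variable, run the same data through a pattern that does not use one. The sketch below holds the sample data in an array instead of reading it from a file; the array names are assumptions made for the example.

```perl
use strict;
use warnings;

my @lines = (
    'David    Veterinarian 56',
    'Jackie  Orthopedist 34',
    'Karen Veterinarian 28',
);
my (@with_assertion, @without_assertion);

foreach (@lines) {
    # The look-ahead checks for "Vet" but leaves it out of $&.
    push(@with_assertion, $&) if m/^\w+(?=\s+Vet)/;

    # Without the assertion, the whitespace and "Vet" become part of $&.
    push(@without_assertion, $&) if m/^\w+\s+Vet/;
}

print(join('|', @with_assertion), "\n");
print(join('|', @without_assertion), "\n");
```

Both patterns select the same two lines, but only the first leaves a clean name in $&.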

Tip
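You are not limited to one look-ahead assertion per pattern, and an assertion does not have to be the last pattern component. Because an assertion consumes no characters, whatever follows it in the pattern is matched starting at the same position in the string that the assertion examined.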

The last extension that we'll discuss is the zero-width negative look-ahead assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not a veterinarian. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.

while (<>) {
    push(@array, $&) if m/^\w+(?!\s+Vet)/;
}

print("@array\n");
Unfortunately, this program displays

Davi Jackie Kare
which is not what you need. The problem is backtracking: when the assertion fails after the complete word, Perl gives up the last character of the word and tries again. The shortened match is followed by a letter rather than by whitespace and Vet, so the assertion succeeds. In order to correctly match the first word, you need to explicitly tell Perl that the first word ends at a word boundary, like this:

while (<>) {
    push(@array, $&) if m/^\w+\b(?!\s+Vet)/;
}

print("@array\n");
This program displays

Jackie
which is correct.
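
The two versions can be compared side by side without an input file. This is a sketch that keeps the sample data in an array; the names @naive and @bounded are my own.

```perl
use strict;
use warnings;

my @lines = (
    'David    Veterinarian 56',
    'Jackie  Orthopedist 34',
    'Karen Veterinarian 28',
);
my (@naive, @bounded);

foreach (@lines) {
    # Without \b, \w+ can give back a character so the assertion passes.
    push(@naive, $&) if m/^\w+(?!\s+Vet)/;

    # With \b, the match must end where the word ends.
    push(@bounded, $&) if m/^\w+\b(?!\s+Vet)/;
}

print("@naive\n");
print("@bounded\n");
```

The first line of output shows the truncated names, and the second shows only Jackie.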

Tip
There are many ways of matching any value. If the first method you try doesn't work, try breaking the value into smaller components and match each boundary. If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.

Pattern Examples

In order to demonstrate many different patterns, I will depart from the standard example format in this section. Instead, I will explain a matching situation in italicized text and then a possible resolution will immediately follow. After the resolution, I'll add some comments to explain how the match is done. In all of these examples, the string to search will be in the $_ variable.

Example: Using the Match Operator

Example: Using the Substitution Operator

Example: Using the Translation Operator

Example: Using the Split() Function

Summary

This chapter introduced you to regular expressions or patterns, regular expression operators, and the binding operators. There are three regular expression operators - m//, s///, and tr/// - which are used to match, substitute, and translate, respectively. Each of them uses the $_ variable as its default operand. The binding operators, =~ and !~, are used to bind the regular expression operators to a variable other than $_.

While the slash character is the default pattern delimiter, you can use any character in its place. This feature is useful if the pattern contains the slash character. If you use an opening bracket or parenthesis as the beginning delimiter, use the closing bracket or parenthesis as the ending delimiter. Using the single-quote as the delimiter will turn off variable interpolation for the pattern.
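
As a quick reminder of how the delimiters behave, here is a small sketch. The path and the $dir variable are invented for the example.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';
my $dir  = 'local';

# Braces as delimiters: the slashes in the pattern need no escaping.
my $has_prefix = ($path =~ m{^/usr/local/}) ? 'yes' : 'no';

# Slash delimiters: $dir is interpolated, so the pattern is "local".
my $interpolated = ($path =~ m/$dir/) ? 'yes' : 'no';

# Single-quote delimiters: $dir is not interpolated. The pattern is an
# end-of-string anchor followed by "dir", which cannot match here.
my $literal = ($path =~ m'$dir') ? 'yes' : 'no';

print("$has_prefix $interpolated $literal\n");
```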

The matching operator has six options: /g, /i, /m, /o, /s, and /x. These options were described in Table 10.2. I've found that the /x option is very helpful for creating maintainable, commented programs. The /g option, used to find all matches in a string, is also very useful. And, of course, the capability to create case-insensitive patterns using the /i option is crucial in many cases.

The substitution operator has the same options as the matching operator and one more - the /e option. The /e option lets you evaluate the replacement pattern and use the new value as the replacement string. If you use back-quotes as delimiters, the replacement pattern will be executed as a DOS or UNIX command, and the resulting output will become the replacement string.
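
The difference that the /e option makes is easiest to see with and without it. This is a minimal sketch; the sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = 'width=10 height=25';

# With /e the replacement is evaluated as Perl code for each match.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

# Without /e the replacement is just an interpolated string.
(my $plain = $text) =~ s/(\d+)/$1 * 2/g;

print("$doubled\n");
print("$plain\n");
```

The first substitution doubles each number, while the second simply inserts the text of the expression.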

The translation operator has three options: /c, /d, and /s. These options are used to complement the match character list, delete matched characters that have no corresponding character in the replacement list, and eliminate repeated characters in a string. If no replacement list is specified, the string is left unchanged and the number of matched characters is returned. This is handy if you need to know how many times a given character appears in a string.
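
Here is a short sketch of the translation operator used for counting, along with each of its options. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'banana bandana';

# Empty replacement list: count the matches, leave the string alone.
my $count = ($text =~ tr/a//);

# /c complements the match list: count everything except a-z.
my $others = ($text =~ tr/a-z//c);

# /d deletes matched characters that have no replacement.
(my $deleted = $text) =~ tr/a//d;

# /s squeezes runs of the same translated character into one.
(my $squeezed = 'aaabbbccc') =~ tr/a-c//s;

print("$count $others '$deleted' '$squeezed'\n");
```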

The binding operators are used to force the matching, substitution, and translation operators to search a variable other than $_. The =~ operator can be used with all three of the regular expression operators, while the !~ operator can be used only with the matching operator.

Quite a bit of space was devoted to creating patterns, and the topic deserves even more space. This is easily one of the more involved features of the Perl language. One key concept is that a character can have multiple meanings. For example, the plus sign can mean a plus sign in one instance (its literal meaning), and in another it means match something one or more times (its meta-meaning).

You learned about regular expression components and that they can be combined in an infinite number of ways. Table 10.5 listed most of the meta-meanings for different characters. You read about character classes, alternation, quantifiers, anchors, pattern memory, word boundaries, and extended components.

The last section of the chapter was devoted to presenting numerous examples of how to use regular expressions to accomplish specific goals. Each situation was described, and a pattern that matched that situation was shown. Some commentary was given for each example.

In the next chapter, you'll read about how to present information by using formats. Formats are used to help relieve some of the programming burden from the task of creating reports.

Review Questions

  1. Can you use variable interpolation with the translation operator?

  2. What happens if the pattern is empty?

  3. What variable does the substitution operator use as its default?

  4. Will the following line of code work?

     m{.*];
  5. What is the /g option of the substitution operator used for?

  6. What does the \d meta-character sequence mean?

  7. What is the meaning of the dollar sign in the following pattern?

    /AA[.<]$]ER/
  8. What is a word boundary?

  9. What will be displayed by the following program?

    $_ = 'AB AB AC';
    print m/c$/i;

Review Exercises

  1. Write a pattern that matches either "top" or "topgun".

  2. Write a program that accepts input from STDIN and changes all instances of the letter a into the letter b.

  3. Write a pattern that stores the first character to follow a tab into pattern memory.

  4. Write a pattern that matches the letter g between 3 and 7 times.

  5. Write a program that finds repeated words in an input file and prints the repeated word and the line number on which it was found.

  6. Create a character class for octal numbers.

  7. Write a program that uses the translation operator to remove repeated instances of the tab character and then replaces the tab character with a space character.

  8. Write a pattern that matches either "top" or "topgun" using a zero-width positive look-ahead assertion.
