Chapter 2. Advanced Regular Expressions

Regular expressions, or just regexes, are at the core of Perl’s text processing, and certainly are one of the features that made Perl so popular. All Perl programmers pass through a stage where they try to program everything as regexes, and when that’s not challenging enough, everything as a single regex. Perl’s regexes have many more features than I can, or want, to present here, so I include those advanced features I find most useful and expect other Perl programmers to know about without referring to perlre, the documentation page for regexes.

Readable Regexes, /x and (?#…)

Regular expressions have a much deserved reputation of being hard to read. Regexes have their own terse language that uses as few characters as possible to represent virtually infinite numbers of possibilities, and that’s just counting the parts that most people use everyday.

Luckily for other people, Perl gives me the opportunity to make my regexes much easier to read. Given a little bit of formatting magic, not only will others be able to figure out what I’m trying to match, but a couple weeks later, so will I. We touched on this lightly in Learning Perl, but it’s such a good idea that I’m going to say more about it. It’s also in Perl Best Practices.

When I add the /x flag to either the match or substitution operators, Perl ignores literal whitespace in the pattern. This means that I spread out the parts of my pattern to make the pattern more discernible. Gisle Aas’s HTTP::Date module parses a date by trying several different regexes. Here’s one of his regular expressions, although I’ve modified it to appear on a single line, arbitrarily wrapped to fit on this page:

/^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])(\d+)(?:(?:\s+|:)
(\d\d?):(\d\d)(?::(\d\d))?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)
[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/

Quick: Can you tell which one of the many date formats that parses? Me neither. Luckily, Gisle uses the /x flag to break apart the regex and add comments to show me what each piece of the pattern does. With /x, Perl ignores literal whitespace and Perl-style comments inside the regex. Here’s Gisle’s actual code, which is much easier to understand:

/^
         (\d\d?)               # day
(?:\s+|[-\/])
         (\w+)                 # month
(?:\s+|[-\/])
         (\d+)                 # year
         (?:
   (?:\s+|:)       # separator before clock
(\d\d?):(\d\d)     # hour:min
(?::(\d\d))?       # optional seconds
         )?                    # optional clock
\s*
         ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
\s*
         (?:\(\w+\))?          # ASCII representation of timezone in parens.
\s*$
/x

Under /x, to match whitespace I have to specify it explicitly, either using \s, which matches any whitespace; any of \f\r\n\t\R; or their octal or hexadecimal sequences, such as \040 or \x20 for a literal space. Likewise, if I need a literal hash symbol, #, I have to escape it too, \#.

I don’t have to use /x to put comments in my regex. The (?#COMMENT) sequence does that for me. It probably doesn’t make the regex any more readable at first glance, though. I can mark the parts of a string right next to the parts of the pattern that represent it. Just because you can use (?#) doesn’t mean you should. I think the patterns are much easier to read with /x:

my $isbn = '0-596-10206-2';

$isbn =~ m/\A(\d+)(?#group)-(\d+)(?#publisher)-(\d+)(?#item)-([\dX])\z/i;

print <<"HERE";
Group code:     $1
Publisher code: $2
Item:           $3
Checksum:       $4
HERE

Those are just Perl features though. It’s still up to me to present the regex in a way that other people can understand it, just as I should do with any other code.

These explicated regexes can take up quite a bit of screen space, but I can hide them like any other code. I can create the regex as a string or create a regular expression object with qr// and return it:

sub isbn_regex {
    qr/ \A
        (\d+)  #group
            -
        (\d+)  #publisher
            -
        (\d+)  #item
            -
        ([\dX])
        \z
    /ix;
    }

I could get the regex and interpolate it into the match or substitution operators:

my $regex = isbn_regex();
if( $isbn =~ m/$regex/ ) {
    print "Matched!\n";
    }

Since I return a regular expression object, I can bind to it directly to perform a match:

my $regex = isbn_regex();
if( $isbn =~ $regex ) {
    print "Matched!\n";
    }

But, I can also just skip the $regex variable and bind to the return value directly. It looks odd but it works:

if( $isbn =~ isbn_regex() ) {
    print "Matched!\n";
    }

If I do that, I can move all of the actual regular expressions out of the way. Not only that, I now should have a much easier time testing the regular expressions since I can get to them much more easily in the test programs.

Global Matching

In Learning Perl we told you about the /g flag that you can use to make all possible substitutions, but it’s more useful than that. I can use it with the match operator, where it does different things in scalar and list context. We told you that the match operator returns true if it matches and false otherwise. That’s still true (we wouldn’t have lied to you), but it’s not just a boolean value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the captures:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

Even though I only have one set of captures in my regular expression, it makes as many matches as it can. Once it makes a match, Perl starts where it left off and tries again. I’ll say more on that in a moment. I often run into another Perl idiom that’s closely related to this, in which I don’t want the actual matches, but just a count:

my $word_count = () = /(\S+)/g;

This uses a little-known but important rule: the result of a list assignment is the number of elements in the list on the righthand side. In this case, that’s the number of elements the match operator returns. This only works for a list assignment, which is assigning from a list on the righthand side to a list on the lefthand side. That’s why I have the extra () in there.

In scalar context, the /g flag does some extra work we didn’t tell you about earlier. During a successful match, Perl remembers its position in the string, and when I match against that same string again, Perl starts where it left off in that string. It returns the result of one application of the pattern to the string:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

while( /(\S+)/g ) { # scalar context
    print "Next word is '$1'\n";
    }

When I match against that same string again, Perl gets the next match:

Next word is 'Just'
Next word is 'another'
Next word is 'Perl'
Next word is 'hacker,'

I can even look at the match position as I go along. The built-in pos() operator returns the match position for the string I give it (or $_ by default). Every string maintains its own position. The first position in the string is 0, so pos() returns undef when it doesn’t find a match and has been reset, and this only works when I’m using the /g flag (since there’s no point in pos() otherwise):

$_ = "Just another Perl hacker,";
my $pos = pos( $_ );            # same as pos()
print "I'm at position [$pos]\n"; # undef

/(Just)/g;
$pos = pos();
print "[$1] ends at position $pos\n"; # 4

When my match fails, Perl resets the value of pos() to undef. If I continue matching, I’ll start at the beginning (and potentially create an endless loop):

my( $third_word ) = /(Java)/g;
print "The next position is " . pos() . "\n";

As a side note, I really hate these print statements where I use the concatenation operator to get the result of a function call into the output. Perl doesn’t have a dedicated way to interpolate function calls, so I can cheat a bit. I call the function in an anonymous array constructor, [ ... ], then immediately dereference it by wrapping @{ ... } around it:

print "The next position is @{ [ pos( $line ) ] }\n";

The pos() operator can also be an lvalue, which is the fancy programming way of saying that I can assign to it and change its value. I can fool the match operator into starting wherever I like. After I match the first word in $line, the match position is somewhere after the beginning of the string. After I do that, I use index to find the next h after the current match position. Once I have the offset for that h, I assign the offset to pos($line) so the next match starts from that position:

my $line = "Just another regex hacker,";

$line =~ /(\S+)/g;
print "The first word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

pos( $line ) = index( $line, 'h', pos( $line) );

$line =~ /(\S+)/g;
print "The next word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

Global Match Anchors

So far, my subsequent matches can “float”, meaning they can start matching anywhere after the starting position. To anchor my next match exactly where I left off the last time, I use the \G anchor. It’s just like the beginning of string anchor, \A, except for where \G anchors at the current match position. If my match fails, Perl resets pos() and I start at the beginning of the string.

In this example, I anchor my pattern with \G. I have a word match, \w+. I use the /x flag to spread out the parts to enhance readability. My match only gets the first four words, since it can’t match the comma (it’s not in \w) after the first hacker. Since the next match must start where I left off, which is the comma, and the only thing I can match is whitespace or word characters, I can’t continue. That next match fails, and Perl resets the match position to the beginning of $line:

my $line = "Just another regex hacker, Perl hacker,";

while( $line =~ /  \G \s* (\w+)  /xg ) {
    print "Found the word '$1'\n";
    print "Pos is now @{ [ pos( $line ) ] }\n";
    }

I have a way to get around Perl resetting the match position. If I want to try a match without resetting the starting point even if it fails, I can add the /c flag, which simply means to not reset the match position on a failed match. I can try something without suffering a penalty. If that doesn’t work, I can try something else at the same match position. This feature is a poor man’s lexer. Here’s a simple-minded sentence parser:

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

while( 1 ) {
    my( $found, $type ) = do {
        if( $line =~ /\G([a-z]+(?:'[ts])?)/igc )
                        { ( $1, "a word"           ) }
        elsif( $line =~ /\G (\n) /xgc             )
                        { ( $1, "newline char"     ) }
        elsif( $line =~ /\G (\s+) /xgc            )
                        { ( $1, "whitespace"       ) }
        elsif( $line =~ /\G ( [[:punct:]] ) /xgc  )
                        { ( $1, "punctuation char" ) }
        else
                        { last;                      }
        };

    print "Found a $type [$found]\n";
    }

Look at that example again. What if I want to add more things I could match? I could add another branch to the decision structure. That’s no fun. That’s a lot of repeated code structure doing the same thing: match something, then return $1 and a description. It doesn’t have to be like that, though. I rewrite this code to remove the repeated structure. I can store the regexes in the @items array. I use qr// to create the regexes, and I put the regexes in the order that I want to try them. The foreach loop goes through them successively until it finds one that matches. When it finds a match, it prints a message using the description and whatever showed up in $1. If I want to add more tokens, I just add their description to @items:

#!/usr/bin/perl
use strict;
use warnings;

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

my @items = (
    [ qr/\G([a-z]+(?:'[ts])?)/i, "word"        ],
    [ qr/\G(\n)/,                "newline"     ],
    [ qr/\G(\s+)/,               "whitespace"  ],
    [ qr/\G([[:punct:]])/,       "punctuation" ],
    );

LOOP: while( 1 ) {
    MATCH: foreach my $item ( @items ) {
        my( $regex, $description ) = @$item;

        next MATCH unless $line =~ /$regex/gc;

        print "Found a $description [$1]\n";
        last LOOP if $1 eq "\n";

        next LOOP;
        }
    }

Look at some of the things going on in this example. All matches need the /gc flags, so I add those flags to the match operator inside the foreach loop. I add it there because those flags don’t affect the pattern, they affect the match operator.

My regex to match a “word”, however, also needs the /i flag. I can’t add that to the match operator because I might have other branches that don’t want it. The code inside the block labeled MATCH doesn’t know how it’s going to get $regex, so I shouldn’t create any code that forces me to form $regex in a particular way.

Recursive Regular Expressions

Perl’s feature that we call “regular expressions” really aren’t; we’ve known this ever since Perl allowed backreferences (\1 and so on). With v5.10, there’s no pretending since we now have recursive regular expressions that can do things such as balance parentheses, parse HTML, and decode JSON. There are several pieces to this that should please the subset of Perlers who tolerate everything else in the language so they can run a single pattern that does everything.

Repeating a Subpattern

Perl v5.10 added the (?PARNO) to refer to the pattern in a particular capture group. When I use that, the pattern in that capture group must match at that spot.

First, I start with a naïve program that tries to match something between quote marks. This program isn’t the way I should do it, but I’ll get to a correct way in a moment:

#!/usr/local/perls/perl-5.18.1/bin/perl
#!/usr/bin/perl
# quotes.pl

use v5.10;

$_ =<<'HERE';
Amelia said "I am a camel"
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    ( ['"] )
    /x;

Here I repeated the subpattern ( ['"] ). In other code, I would probably immediately recognize that as a chance to move repeated code into a subroutine. I might think that I can solve this problem with a simple backreference:

#!/usr/local/perls/perl-5.18.1/bin/perl
#!/usr/bin/perl
# quotes_backreference.pl

use v5.10;

$_ =<<'HERE';
Amelia said "I am a camel"
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    ( \1 )
    /x;

That works in this simple case. The \1 matches exactly the text matched in the first capture group. If it matched a double quote mark, it has to match a double quote mark again. Hold that thought, though, because the target text is not as simple as that. I want to follow a different path. Instead of using the backreference, I’ll refer to a subpattern with the (?PARNO) syntax:

#!/usr/local/perls/perl-5.18.1/bin/perl
#!/usr/bin/perl
# quotes_parno.pl

use v5.10;

$_ =<<'HERE';
Amelia said 'I am a camel'
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    (?1)
    /x;

This works, at least as much as the first try in quotes.pl does. The (?1) uses the same pattern in that capture group, ( ['"] ). I don’t have to repeat the pattern. However, this means that it might match a double quote mark in the first capture group but a single quote mark in the second. Repeating the pattern instead of the matched text might be what you want, but not in this case.

There’s another problem though. If the data have nested quotes, repeating the pattern can get confused:

#!/usr/bin/perl
# quotes_nested.pl

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    (?1)
    /x;

This matches only part of what I want it to match:

% perl quotes_nested.pl
Matched [Amelia said ]!

One problem is that I’m repeating the subpattern outside of the subpattern I’m repeating; it gets confused by the nested quotes. The other problem is that I’m not accounting for nesting. I change the pattern so I can match all of the quotes, assuming that they are nested:

#!/usr/bin/perl
# quotes_nested.pl

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    (?<said>                       # $1
    (?<quote>['"])
            (?:
                [^'"]++
                |
                (?<said> (?1) )
            )*
    \g{quote}
    )
    /x;

say join "\n", @{ $-{said} };

When I run this, I get both quotes:

% perl quotes_nested.pl
Matched ['Amelia said "I am a camel"']!
'Amelia said "I am a camel"'
"I am a camel"

This pattern is quite a change though. First, I use a named capture. The regular expression still makes this available in the numbered capture buffers, so this is also $1:

(?<said>                       # $1
...
)

My next layer is another named capture to match the quote, and a backreference to that name to match the same quote again:

(?<said>                       # $1
(?<quote>['"])
...
\g{quote}
)

Now come the the tricky stuff. I want to match the stuff inside the quote marks, but if I run into another quote, I want to match that on it’s own as if it’s a single element. To do that, I have an alternation I group with noncapturing parentheses:

(?:
    [^'"]++
    |
    (?<said> (?1) )
)*

The [^'"]++ matches one or more characters that aren’t one of those quote marks. The ++ quantifier prevents the regular expression engine from backtracking.

If it doesn’t match a non-quote mark, it tries the other side of the alternation, (?<said> (?1) ). The (?1) repeats the subpattern in the first capture group ($1). However, it repeats that pattern as an independent pattern, which is:

(?<said>                       # $1
...
)

It matches the whole thing again, but when it’s done, it discards its captures, named or otherwise, so it doesn’t affect the superpattern. That means that I can’t remember its captures, so I have to wrap that part in it’s own named capture, reusing the said name.

I modify my string to include levels levels of nesting:

_

I looks like it doesn’t match the inner most quote because it only outputs two of them:

% perl quotes_three_nested.pl
Matched ["Top Level 'Middle Level "Bottom Level" Middle' Outside"]!
"Top Level 'Middle Level "Bottom Level" Middle' Outside"
'Middle Level "Bottom Level" Middle'

However, the pattern repeated in (?1) is independent, so once in there, none of those matches make it into the capture buffers for the whole pattern. I can fix that though. The (?{ CODE }) construct—an experimental feature—allows me to run code during a regular expression. I can use it to output the substring I just matched each time I run the pattern. Along with that, I’ll switch from using (?1), which refers to the first capture group, to (?R), which goes back to the start of the whole pattern:

#!/usr/bin/perl
# nested_show_matches.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

say "Matched [$+{said}]!" if m/
    (?<said>
    (?<quote>['"])
            (?:
                [^'"]++
                |
                (?R)
            )*
    \g{quote}
    )
    (?{ say "Inside regex: $+{said}" })
    /x;

Each time I run the pattern, even through (?R), I output the current value of $+{said}. In the subpatterns, that variable is localized to the subpattern and disappears at the end of the subpattern, although not before I can output it:

% perl nested_show_matches.pl
Inside regex: "Bottom Level"
Inside regex: 'Middle Level "Bottom Level" Middle'
Inside regex: "Top Level 'Middle Level "Bottom Level" Middle' Outside"
Matched ["Top Level 'Middle Level "Bottom Level" Middle' Outside"]!

I can see that in each level, the pattern recurses. It goes deeper into the strings, matches at the bottom level, then works its way back up.

I take this one step further by using the (?(DEFINE)...) feature to create and name subpatterns that I can use later:

#!/usr/bin/perl
# nested_define.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

say "Matched [$+{said}]!" if m/
    (?(DEFINE)
        (?<QUOTE> ['"])
        (?<NOT_QUOTE> [^'"])
    )
    (?<said>
    (?<quote>(?&QUOTE))
            (?:
                (?&NOT_QUOTE)++
                |
                (?R)
            )*
    \g{quote}
    )
    (?{ say "Inside regex: $+{said}" })
    /x;

Inside the (?(DEFINE)...), it looks like I have named captures but those are really named subpatterns. They don’t match until I call them with (?&NAME) later.

I don’t like that say inside the pattern, just as I don’t particularly like subroutines that output anything. Instead of that, I create an array before I use the match operator and push each match onto it. The $^N variable has the substring from the previous capture buffer. It’s handy because I don’t have to count or know names, so I don’t need a named capture for said:

#!/usr/bin/perl
# nested_carat_n.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
    )
    (
    (?<quote>(?&QUOTE_MARK))
        (?:
            (?&NOT_QUOTE_MARK)++
            |
            (?R)
        )*
    \g{quote}
    )
    (?{ push @matches, $^N })
    /x;

say join "\n", @matches;

I get almost the same output:

% perl nested_carat_n.pl
Matched!
"Bottom Level"
'Middle Level "Bottom Level" Middle'
"Top Level 'Middle Level "Bottom Level" Middle' Outside"

If I can define some parts of the pattern with names, I can go even further by giving a name to not just QUOTE_MARK and NOT_QUOTE_MARK, but everything that makes up a quote:

#!/usr/bin/perl
# nested_grammar.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
        (?<QUOTE>
            (
                (?<quote>(?&QUOTE_MARK))

                (?:
                    (?&NOT_QUOTE_MARK)++
                    |
                    (?&QUOTE)
                )*
                \g{quote}
            )
            (?{ push @matches, $^N })
        )
    )

    (?&QUOTE)
    /x;

say join "\n", @matches;

Almost everything is in the (?(DEFINE)...), but nothing happens until I call (?&QUOTE) at the end to actually match the subpattern I defined with that name.

Pause for a moment. While worrying about the features and how they work, you might have missed what just happened. I started with a regular expression; now I have a grammar! I can define tokens and recurse.

I have one more feature to show before I can get to the really good example. The special variable $^R holds the result of the previously evaluated (?{...}). That is, the value of the last evaluated expression in (?{...}) ends up in $^R. Even better, I can affect $^R how I like because it is writeable.

Now that I know that, I can modify my program to build up the array of matches by returning an array reference of all submatches at the end of my (?{...}). Each time I have that (?{...}), I add the substring in $^N to the values I remembered previously. It’s a kludgey way of building an array, but it demonstrates the feature:

#!/usr/bin/perl
# nested_grammar_r.pl

use Data::Dumper;
use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;
local $^R = [];

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
        (?<QUOTE>
            (
                (?<quote>(?&QUOTE_MARK))

                (?:
                    (?&NOT_QUOTE_MARK)++
                    |
                    (?&QUOTE)
                )*
                \g{quote}
            )
            (?{ [ @{$^R}, $^N ] })
        )
    )

    (?&QUOTE) (?{ @matches = @{ $^R } })
    /x;


say join "\n", @matches;

Before the match, I set the value of $^R to be an empty anonymous array. At the end of the QUOTE definition, I create a new anonymous array with the values already inside $^R and the new value in $^N. That new anonymous array is the last evaluated expression and becomes the new value of $^R. At the end of the pattern, I assign the values in $^R to @matches so I have them after the match ends.

Now that I have all of that, I can get to the code I want to show you and which I’m not going to explain. Randal Schwartz used these features to write a minimal JSON parser as a Perl regular expression (but really a grammar) that he posted to Perlmonks as http://www.perlmonks.org/?node_id=995856. He created this as a minimal parser for a very specific client need where the JSON data are compact, appear on a single line, and are limited to ASCII:

#!/usr/bin/perl
use Data::Dumper qw(Dumper);

my $FROM_JSON = qr{

(?&VALUE) (?{ $_ = $^R->[1] })

(?(DEFINE)

(?<OBJECT>
  (?{ [$^R, {}] })
  \{
    (?: (?&KV) # [[$^R, {}], $k, $v]
      (?{  #warn Dumper { obj1 => $^R };
     [$^R->[0][0], {$^R->[1] => $^R->[2]}] })
      (?: , (?&KV) # [[$^R, {...}], $k, $v]
        (?{ # warn Dumper { obj2 => $^R };
       [$^R->[0][0], {%{$^R->[0][1]}, $^R->[1] => $^R->[2]}] })
      )*
    )?
  \}
)

(?<KV>
  (?&STRING) # [$^R, "string"]
  : (?&VALUE) # [[$^R, "string"], $value]
  (?{  #warn Dumper { kv => $^R };
     [$^R->[0][0], $^R->[0][1], $^R->[1]] })
)

(?<ARRAY>
  (?{ [$^R, []] })
  \[
    (?: (?&VALUE) (?{ [$^R->[0][0], [$^R->[1]]] })
      (?: , (?&VALUE) (?{  #warn Dumper { atwo => $^R };
             [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] })
      )*
    )?
  \]
)

(?<VALUE>
  \s*
  (
      (?&STRING)
    |
      (?&NUMBER)
    |
      (?&OBJECT)
    |
      (?&ARRAY)
    |
    true (?{ [$^R, 1] })
  |
    false (?{ [$^R, 0] })
  |
    null (?{ [$^R, undef] })
  )
  \s*
)

(?<STRING>
  (
    "
    (?:
      [^\\"]+
    |
      \\ ["\\/bfnrt]
#    |
#      \\ u [0-9a-fA-f]{4}
    )*
    "
  )

  (?{ [$^R, eval $^N] })
)

(?<NUMBER>
  (
    -?
    (?: 0 | [1-9]\d* )
    (?: \. \d+ )?
    (?: [eE] [-+]? \d+ )?
  )

  (?{ [$^R, eval $^N] })
)

) }xms;

sub from_json {
  local $_ = shift;
  local $^R;
  eval { m{\A$FROM_JSON\z}; } and return $_;
  die $@ if $@;
  return 'no match';
}

local $/;
while (<>) {
  chomp;
  print Dumper from_json($_);
}

There are more than a few interesting things in Randal’s code that I leave to you to explore:

1. Part of his intermediate data structure tells the grammar what he just did

2. It fails very quickly for invalid JSON data, although Randal says with more work it could fail faster.

3. Most interestingly, he replaces the target string with the data structure by assigning to $_ in the last (?{...}).

If you think that’s impressive, you should see Tom Christiansen’s Stackoverflow refutation that a regular expression can’t parse HTML in which he used many of the same features.

Lookarounds

Lookarounds are arbitrary anchors for regexes. We showed several anchors in Learning Perl, such as \A, \z, and \b, and I just showed the \G anchor. Using a lookaround, I can describe my own anchor as a regex, and just like the other anchors, they don’t consume part of the string. They specify a condition that must be true, but they don’t add to the part of the string that the overall pattern matches.

Lookarounds come in two flavors: lookaheads that look ahead to assert a condition immediately after the current match position, and lookbehinds that look behind to assert a condition immediately before the current match position. This sounds simple, but it’s easy to misapply these rules. The trick is to remember that it anchors to the current match position then figure out on which side it applies.

Both lookaheads and lookbehinds have two types: positive and negative. The positive lookaround asserts that its pattern has to match. The negative lookaround asserts that its pattern doesn’t match. No matter which I choose, I have to remember that they apply to the current match position, not anywhere else in the string.

Lookahead Assertions, (?=PATTERN) and (?!PATTERN)

Lookahead assertions let me peek at the string immediately ahead of the current match position. The assertion doesn’t consume part of the string, and if it succeeds, matching picks up right after the current match position.

Positive lookahead assertions

In Learning Perl, we included an exercise to check for both “Fred” and “Wilma” on the same line of input, no matter the order they appeared on the line. The trick we wanted to show to the novice Perler is that two regexes can be simpler than one. One way to do this repeats both Wilma and Fred in the alternation so I can try either order. A second try separates them into two regexes:

#!/usr/bin/perl
# fred_and_wilma.pl

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /Fred.*Wilma|Wilma.*Fred/;
print "Matches: $_\n" if /Fred/ && /Wilma/;

I can make a simple, single regex using a positive lookahead assertion, denoted by (?=PATTERN). This assertion doesn’t consume text in the string, but if it fails, the entire regex fails. In this example, in the positive lookahead assertion I use .*Wilma. That pattern must be true immediately after the current match position:

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /(?=.*Wilma).*Fred/;

Since I used that at the start of my pattern, that means it has to be true at the beginning of the string. Specifically, at the beginning of the string, I have to be able to match any number of characters except a newline followed by Wilma. If that succeeds, it anchors the rest of the pattern to its position (the start of the string). Figure 1 shows the two ways that can work, depending on the order of Fred and Wilma in the string. The .*Wilma anchors where it started matching. The elastic .*, which can match any number of nonnewline characters, anchors at the start of the string.

The positive lookahead assertion (?=.*Wilma) anchors the pattern at the beginning of the string
Figure 2-1. The positive lookahead assertion (?=.*Wilma) anchors the pattern at the beginning of the string

It’s easier to understand lookarounds by seeing when they don’t work, though. I’ll change my pattern a bit by removing the .* from the lookahead assertion. At first it appears to work, but it fails when I reverse the order of Fred and Wilma in the string:

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /(?=Wilma).*Fred/; # Works

$_ = "Here come Fred and Wilma!";
print "Matches: $_\n" if /(?=Wilma).*Fred/; # Doesn't work

Figure 2 shows what happens. In the first case, the lookahead anchors at the start of Wilma. The regex tries the assertion at the start of the string, finds that it doesn’t work, then moves over a position and tries again. It kept doing this until it got to Wilma. When it succeeded it set the anchor. Once it sets the anchor, the rest of the pattern has to start from that position.

In the first case, .*Fred can match from that anchor because Fred comes after Wilma. The second case in Figure 2 does the same thing. The regex tries that assertion at the beginning of the string, finds that it doesn’t work, and moves on to the next position. By the time the lookahead assertion matches, it has already passed Fred. The rest of the pattern has to start from the anchor, but it can’t match.

The positive lookahead assertion (?=Wilma) anchors the pattern at Wilma
Figure 2-2. The positive lookahead assertion (?=Wilma) anchors the pattern at Wilma

Since the lookahead assertions don’t consume any of the string, I can use it in a pattern for split when I don’t really want to discard the parts of the pattern that match. In this example, I want to break apart the words in the studly cap string. I want to split it based on the initial capital letter. I want to keep the initial letter, though, so I use a lookahead assertion instead of a character-consuming string. This is different from the separator retention mode because the split pattern isn’t really a separator; it’s just an anchor:

my @words = split /(?=[A-Z])/, 'CamelCaseString';
print join '_', map { lc } @words; # camel_case_string

Negative lookahead assertions

Suppose I want to find the input lines that contain Perl, but only if that isn’t Perl6 or Perl 6. I might try a negated character class to specify the pattern right after the l in Perl to ensure that the next character isn’t a 6. I also use the word boundary anchors \b because I don’t want to match in the middle of other words, such as “BioPerl” or “PerlPoint”:

#!/usr/bin/perl
# not_perl6.pl

print "Trying negated character class:\n";
while( <> ) {
    print if /\bPerl[^6]\b/;
    }

I’ll try this with some sample input:

# sample input
Perl6 comes after Perl 5.
Perl 6 has a space in it.
I just say "Perl".
This is a Perl 5 line
Perl 5 is the current version.
Just another Perl 5 hacker,
At the end is Perl
PerlPoint is like PowerPoint
BioPerl is genetic

It doesn’t work for all the lines it should. It only finds four of the lines that have Perl without a trailing 6, and a line that has a space between Perl and 6. Note that even in the first line of output, the match still works because it matches the Perl 5 at the end, which is perl, a space, a 5 (a word character), and then the word boundary at the end of the line:

Trying negated character class:
    Perl6 comes after Perl 5.
    Perl 6 has a space in it.
    This is a Perl 5 line
    Perl 5 is the current version.
    Just another Perl 5 hacker,

That doesn’t work because there has to be a character after the l in Perl. Not only that, I specified a word boundary. If that character after the l is a nonword character, such as the " in I just say "Perl", the word boundary at the end fails. If I take off the trailing \b, now PerlPoint matches. I haven’t even tried handling the case where there is a space between Perl and 6. For that I’ll need something much better.

To make this really easy, I can use a negative lookahead assertion. I don’t want to consume a character after the l, and since an assertion doesn’t consume characters, it’s the right tool to use. I just want to say that if there’s anything after perl, it can’t be a 6, even if there is some whitespace between them. The negative lookahead assertion uses (?!PATTERN). To solve this problem, I use \s?6 as my pattern, denoting the optional whitespace followed by a 6:

print "Trying negative lookahead assertion:\n";
while( <> ) {
    print if /\bPerl(?!\s?6)\b/;
    }

Now the output finds all of the right lines:

Trying negative lookahead assertion:
    Perl6 comes after Perl 5.
    I just say "Perl".
    This is a Perl 5 line
    Perl 5 is the current version.
    Just another Perl 5 hacker,
    At the end is Perl

Remember that (?!PATTERN) is a lookahead assertion, so it looks after the current match position. That’s why this next pattern still matches. The lookahead asserts that right before the b in bar that the next thing isn’t foo. Since the next thing is bar, which is not foo, it matches. People often confuse this to mean that the thing before bar can’t be foo, but each uses the same starting match position, and since bar is not foo, they both work:

if( 'foobar' =~ /(?!foo)bar/ ) {
    print "Matches! That's not what I wanted!\n";
    }
else {
    print "Doesn't match! Whew!\n";
    }

Lookbehind Assertions, (?<!PATTERN) and (?<=PATTERN)

Instead of looking ahead at the part of the string coming up, I can use a lookbehind to check the part of the string the regular expression engine has already processed. Due to Perl’s implementation details, the lookbehind assertions have to be a fixed width, so I can’t use variable width quantifiers in them as some other languages can.

Now I can try to match bar that doesn’t follow a foo. In the last section I couldn’t use a negative lookahead assertion because that looks forward in the string. A negative lookbehind, denoted by (?<!PATTERN), looks backward. That’s just what I need. Now I get the right answer:

#!/usr/bin/perl
# correct_foobar.pl

if( 'foobar' =~ /(?<!foo)bar/ ) {
    print "Matches! That's not what I wanted!\n";
    }
else {
    print "Doesn't match! Whew!\n";
    }

Now, since the regex has already processed that part of the string by the time it gets to bar, my lookbehind assertion can’t be a variable width pattern. I can’t use the quantifiers to make a variable width pattern because the engine is not going to backtrack in the string to make the lookbehind work. I won’t be able to check for a variable number of os in fooo:

'foooobar' =~ /(?<!fo+)bar/;

When I try that, I get the error telling me that I can’t do that, and even though it merely says not implemented, don’t hold your breath waiting for it:

Variable length lookbehind not implemented in regex...

The positive lookbehind assertion also looks backward, but its pattern must match. The only time I seem to use these are in substitutions in concert with another assertion. Using both a lookbehind and a lookahead assertion, I can make some of my substitutions easier to read.

For instance, throughout the book I’ve used variations of hyphenated words because I couldn’t decide which one I should use. Should it be “builtin” or “built-in”? Depending on my mood or typing skills, I used either of them[2].

I needed to clean up my inconsistency. I knew the part of the word on the left of the hyphen, and I knew the text on the right of the hyphen. At the position where they meet, there should be a hyphen. If I think about that for a moment, I’ve just described the ideal situation for lookarounds: I want to put something at a particular position, and I know what should be around it. Here’s a sample program to use a positive lookbehind to check the text on the left and a positive lookahead to check the text on the right. Since the regex only matches when those sides meet, that means that it’s discovered a missing hyphen. When I make the substitution, it put the hyphen at the match position, and I don’t have to worry about the particular text:

my @hyphenated = qw( built-in );

foreach my $word ( @hyphenated ) {
    my( $front, $back ) = split /-/, $word;

    $text =~ s/(?<=$front)(?=$back)/-/g;
    }

If that’s not a complicated enough example, try this one. Let’s use the lookarounds to add commas to numbers. Jeffery Friedl shows one attempt in Mastering Regular Expressions, adding commas to the U.S. population. The U.S. Census Bureau has a population clock so you can use the latest number if you’re reading this book a long time from now:

$pop = 316792343;  # that's for Feb 10, 2007

# From Jeffrey Friedl
$pop =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;

That works, mostly. The positive lookbehind (?<=\d) wants to match a number, and the positive lookahead (?=(?:\d\d\d)+$) wants to find groups of three digits all the way to the end of the string. This breaks when I have floating point numbers, such as currency. For instance, my broker tracks my stock positions to four decimal places. When I try that substitution, I get no comma on the left side of the decimal point and one of the fractional side. It’s because of that end of string anchor:

$money = '$1234.5678';

$money =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;  # $1234.5,678

I can modify that a bit. Instead of the end of string anchor, I’ll use a word boundary, \b. That might seem weird, but remember that a digit is a word character. That gets me the comma on the left side, but I still have that extra comma:

$money = '$1234.5678';

$money =~ s/(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5,678

What I really want for that first part of the regex is to use the lookbehind to match a digit, but not when it’s preceded by a decimal point. That’s the description of a negative lookbehind, (?<!\.\d). Since all of these match at the same position, it doesn’t matter that some of them might overlap as long as they all do what I need:

$money = '$1234.5678';

$money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.5678

That looks like it works. Except it doesn’t when I track things to five decimal places:

$money = '$1234.56789';

$money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.56,789

That’s the problem with regular expressions. They work for the cases we try it on, but some person comes along with something different to break what I’m proud of.

I tried for awhile to fix this problem but couldn’t come up with something manageable. I even asked about it on Stackoverflow as a last resort. I couldn’t salvage the example without using another advanced Perl feature.

The \K, added in v5.10, can act like a variable-width negative lookbehind, which Perl doesn’t do. Micheal Carman came up with this regex:

s/(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)/,/g;

The \K allows the pattern before it to match, but not be replaced. The substitution replaces only the part of the string after \K. I can break the pattern up see how it works:

s/
    (?<!\.)(?:\b|\G)\d+?
    \K
    (?=(?:\d\d\d)+\b)
/,/xg;

The second part, after the \K, is the same thing that I was doing before. The magic comes in the first part, which I break into its subparts:

(?<!\.)
(?:\b|\G)
\d+?

There’s a negative lookbehind assertion to check for something other than a dot. After that, there’s an alternation that asserts either a word boundary or the <\G> anchor. That \G is the magic that was missing from my previous tries and let my regular expression float past that decimal point. After that word boundary or current match position, there has to be one of more digits, non-greedily.

Later in this chapter I’ll come back to this example when I show how to debug regular expressions. Before I get that, I want to show a few other simpler regex-demystifying techniques.

Debugging Regular Expressions

While trying to figure out a regex, whether one I found in someone else’s code or one I wrote myself (maybe a long time ago), I can turn on Perl’s regex debugging mode.

The -D Switch

Perl’s -D switch turns on debugging options for the Perl interpreter (not for your program, as in Chapter 4). The switch takes a series of letters or numbers to indicate what it should turn on. The -Dr option turns on regex parsing and execution debugging.

I can use a short program to examine a regex. The first argument is the match string and the second argument is the regular expression. I save this program as explain_regex.pl:

#!/usr/bin/perl
# explain_regex.pl

$ARGV[0] =~ /$ARGV[1]/;

When I try this with the target string Just another Perl hacker, and the regex Just another (\S+) hacker,, I see two major sections of output, which the perldebguts documentation explains at length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It shows the regex nodes, such as EXACT and NSPACE, as well as any optimizations, such as anchored "Just another ". Second, it tries to match the target string, and shows its progress through the nodes. It’s a lot of information, but it shows me exactly what it’s doing:

% perl -Dr explain_regex.pl 'Just another Perl hacker,' 'Just another (\S+) hacker,'
Omitting $` $& $' support (0x0).

EXECUTING...

Compiling REx "Just another (\S+) hacker,"
rarest char k at 4
rarest char J at 0
Final program:
   1: EXACT <Just another > (6)
   6: OPEN1 (8)
   8:   PLUS (10)
   9:     NPOSIXD[\s] (0)
  10: CLOSE1 (12)
  12: EXACT < hacker,> (15)
  15: END (0)
anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
Guessing start of match in sv for REx "Just another (\S+) hacker," against "Just another Perl hacker,"
Found anchored substr "Just another " at offset 0...
Found floating substr " hacker," at offset 17...
Guessed: match at offset 0
Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
   0 <> <Just anoth>         |  1:EXACT <Just another >(6)
  13 <ther > <Perl hacke>    |  6:OPEN1(8)
  13 <ther > <Perl hacke>    |  8:PLUS(10)
                                  NPOSIXD[\s] can match 4 times out of 2147483647...
  17 < Perl> < hacker,>      | 10:  CLOSE1(12)
  17 < Perl> < hacker,>      | 12:  EXACT < hacker,>(15)
  25 <Perl hacker,> <>       | 15:  END(0)
Match successful!
Freeing REx: "Just another (\S+) hacker,"

The re pragma, which comes with Perl, has a debugging mode that doesn’t require a -DDEBUGGING enabled interpreter. Once I turn on use re 'debug', it applies for the rest of the scope. It’s not lexically scoped like most pragmata. I modify my previous program to use the re pragma instead of the command-line switch:

#!/usr/bin/perl

use re 'debug';

$ARGV[0] =~ /$ARGV[1]/;

I don’t have to modify my program to use re since I can also load it from the command line. When I run this program with a regex as its argument, I get almost the same exact output as my previous -Dr example:

% perl -Mre=debug explain_regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
Compiling REx "Just another (\S+) hacker,"
Final program:
   1: EXACT <Just another > (6)
   6: OPEN1 (8)
   8:   PLUS (10)
   9:     NPOSIXD[\s] (0)
  10: CLOSE1 (12)
  12: EXACT < hacker,> (15)
  15: END (0)
anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
Guessing start of match in sv for REx "Just another (\S+) hacker," against "Just another Perl hacker,"
Found anchored substr "Just another " at offset 0...
Found floating substr " hacker," at offset 17...
Guessed: match at offset 0
Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
   0 <> <Just anoth>         |  1:EXACT <Just another >(6)
  13 <ther > <Perl hacke>    |  6:OPEN1(8)
  13 <ther > <Perl hacke>    |  8:PLUS(10)
                                  NPOSIXD[\s] can match 4 times out of 2147483647...
  17 < Perl> < hacker,>      | 10:  CLOSE1(12)
  17 < Perl> < hacker,>      | 12:  EXACT < hacker,>(15)
  25 <Perl hacker,> <>       | 15:  END(0)
Match successful!
Freeing REx: "Just another (\S+) hacker,"

I can follow that output easily because it’s a simple pattern. I can run this with Michael Carman’s comma-adding pattern:

#!/usr/bin/perl
# comma_debug.pl

use re 'debug';

$money = '$1234.56789';

$money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.5678

print $money;

There’s screens and screens of output. I want to go through the highlights because it shows you how to read the output. The first part shows the regex program:

% perl comma_debug.pl
Compiling REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)"
Final program:
   1: UNLESSM[-1] (7)
   3:   EXACT <.> (5)
   5:   SUCCEED (0)
   6: TAIL (7)
   7: BRANCH (9)
   8:   BOUND (12)
   9: BRANCH (FAIL)
  10:   GPOS (12)
  11: TAIL (12)
  12: MINMOD (13)
  13: PLUS (15)
  14:   POSIXD[\d] (0)
  15: KEEPS (16)
  16: IFMATCH[0] (28)
  18:   CURLYM[0] {1,32767} (25)
  20:     POSIXD[\d] (21)
  21:     POSIXD[\d] (22)
  22:     POSIXD[\d] (23)
  23:     SUCCEED (0)
  24:   NOTHING (25)
  25:   BOUND (26)
  26:   SUCCEED (0)
  27: TAIL (28)
  28: END (0)

The next parts represent repeated matches because it’s the substitution operator with the /g flag. The first one matches the entire string and the match position is at 0:

GPOS:0 minlen 1
Matching REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)" against "$1234.56789"

The first group of lines, prefixed with 0, show the regex engine going through the negative lookbehind, the \b, and the \G, matching at the beginning of the string. The first set of <> has nothing, and the second has the rest of the string:

0 <> <$1234.5678>         |  1:UNLESSM[-1](7)
0 <> <$1234.5678>         |  7:BRANCH(9)
0 <> <$1234.5678>         |  8:  BOUND(12)
                                 failed...
0 <> <$1234.5678>         |  9:BRANCH(11)
0 <> <$1234.5678>         | 10:  GPOS(12)
0 <> <$1234.5678>         | 12:  MINMOD(13)
0 <> <$1234.5678>         | 13:  PLUS(15)
                                 POSIXD[\d] can match 0 times out of 1...
                                 failed...
                               BRANCH failed...

The regex engine then moves on, shifting over one position:

1 <$> <1234.56789>        |  1:UNLESSM[-1](7)

It checks the negative lookbehind, and there’s no dot:

0 <> <$1234.5678>         |  3:  EXACT <.>(5)
                                 failed...

It goes through the anchors again and finds digits:

1 <$> <1234.56789>        |  7:BRANCH(9)
1 <$> <1234.56789>        |  8:  BOUND(12)
1 <$> <1234.56789>        | 12:  MINMOD(13)
1 <$> <1234.56789>        | 13:  PLUS(15)
                                 POSIXD[\d] can match 1 times out of 1...
2 <$1> <234.56789>        | 15:    KEEPS(16)
2 <$1> <234.56789>        | 16:      IFMATCH[0](28)
2 <$1> <234.56789>        | 18:        CURLYM[0] {1,32767}(25)

It finds more than one digit in a row, bounded by the decimal point which the regex never crosses:

2 <$1> <234.56789>        | 20:          POSIXD[\d](21)
3 <$12> <34.56789>        | 21:          POSIXD[\d](22)
4 <$123> <4.56789>        | 22:          POSIXD[\d](23)
5 <$1234> <.56789>        | 23:          SUCCEED(0)
                                         subpattern success...
                                       CURLYM now matched 1 times, len=3...
5 <$1234> <.56789>        | 20:          POSIXD[\d](21)
                                         failed...
                                       CURLYM trying tail with matches=1...
5 <$1234> <.56789>        | 25:          BOUND(26)
5 <$1234> <.56789>        | 26:          SUCCEED(0)
                                         subpattern success...

But I need three digits left over on the right and that’s how it matches:

2 <$1> <234.56789>        | 28:      END(0)
    Match successful!

At this point, the substitution operator makes its replacement, putting the comma between the 1 and the 2. It’s ready for another replacement, this time with a shorter string:

Matching REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)" against "234.56789"

I won’t go through that; it’s a long series of trials to find four adjacent digits before the decimal point, which it can’t do. It fails and the string ends up $1,234.56789.

That’s fine for a core-dump style analysis where everything has already happened and I have to sift through the results. There’s a better way to do this, but I can’t show it to you in this book. Damian Conway’s Regexp::Debugger animates the same thing, pointing out and coloring the string as it goes. Like many debuggers (Chapter 4), it allows me to step through a regular expression.

Summary

This chapter covered some of the more useful advanced features of Perl’s regex engine. Some of these features, such as the readable regexes, global matching, and debugging I use all the time. The more complicated ones, like the grammars, I use sparingly no matter how fun they are.

Further Reading

perlre is the main documentation for Perl regexes, and perlretut gives a regex tutorial. Don’t confuse that with perlreftut, the tutorial on references. To make it even more complicated, perlreref is the regex quick reference.

The details for regex debugging shows up in perldebguts. It explains the output of -Dr and re 'debug'.

Perl Best Practices has a section on regexes, and gives the \x “Extended Formatting” pride of place.

Mastering Regular Expressions covers regexes in general, and compares their implementation in different languages. Jeffrey Friedl has an especially nice description of lookahead and lookbehind operators. If you really want to know about regexes, this is the book to get.

Simon Cozens explains advanced regex features in two articles for Perl.com: “Regexp Power” and “Power Regexps, Part II”.

The web site http://www.regular-expressions.info has good discussions about regular expressions and their implementations in different languages.

Michael Carman’s answer to my money commifying example comes from my StackOverflow question about it.

Tom’s Stackoverflow answer about parsing HTML with regular expressions uses many of the features from this chapter.



[2] O’Reilly Media deals with this specifying what I should use: http://www.oreilly.com/oreilly/author/stylesheet.html