=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in
L<perlop/"Regexp Quote-Like Operators">.

=head2 Modifiers

Matching operations can have various modifiers.  Modifiers
that relate to the interpretation of the regular expression inside
are listed below.  Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators">
and L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m

Treat string as multiple lines.  That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s

Treat string as single line.  That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale.  See L<perllocale>.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=item p

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself.  See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash.  Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct.  See below.

The C</x> modifier itself needs a little more explanation.  It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal
or hex escapes.  Taken together, these features go a long way towards
making Perl's regular expressions more readable.  Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early.  See
the C-comment deletion code in L<perlop>.  Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from the ones supplied in
the Version 8 regex routines.  (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)  See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by "^" or "$".  You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline.  At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator.  (Older programs did this by setting C<$*>,
but this practice has been removed in perl 5.9.)
X<^> X<$>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to
pretend the string is a single line--even if it isn't.
X<.>

=head3 Quantifiers

The following standard quantifiers are recognized:
X<*> X<+> X<{n}> X<{n,}> X<{n,m}>

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.  In particular, the lower bound
is not optional.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.
The actual limit can be seen in the error message generated by code
such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":
X<*?> X<+?> X<{n}?> X<{n,}?> X<{n,m}?>

    *?     Match 0 or more times, not greedily
    +?     Match 1 or more times, not greedily
    ??     Match 0 or 1 time, not greedily
    {n}?   Match exactly n times, not greedily
    {n,}?  Match at least n times, not greedily
    {n,m}? Match at least n but not more than m times, not greedily

By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack.  However, this behaviour is
sometimes undesirable.  Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern.  This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack.  For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help.  See the independent subexpression C<< (?>...) >> for more details;
possessive quantifiers are just syntactic sugar for that construct.
For instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

=head3 Escape sequences

Because patterns are processed as double quoted strings, the following
also work:
X<\t> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (example: ESC)
    \x1B        hex char              (example: ESC)
    \x{263a}    long hex char         (example: Unicode SMILEY)
    \cK         control char          (example: VT)
    \N{name}    named Unicode character
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale.  See L<perllocale>.  For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>

    \w       Match a "word" character (alphanumeric plus "_")
    \W       Match a non-"word" character
    \s       Match a whitespace character
    \S       Match a non-whitespace character
    \d       Match a digit character
    \D       Match a non-digit character
    \pP      Match P, named property.  Use \p{Prop} for longer names.
    \PP      Match non-P
    \X       Match eXtended Unicode "combining character sequence",
             equivalent to (?:\PM\pM*)
    \C       Match a single C char (octet) even under Unicode.
             NOTE: breaks up characters into their UTF-8 bytes,
             so you may end up with malformed pieces of UTF-8.
             Unsupported in lookbehind.
    \1       Backreference to a specific group.
             '1' may actually be any positive integer.
    \g1      Backreference to a specific or previous group,
    \g{-1}   number may be negative indicating a previous buffer and may
             optionally be wrapped in curly brackets for safer parsing.
    \g{name} Named backreference
    \k<name> Named backreference
    \K       Keep the stuff left of the \K, don't include it in $&
    \v       Vertical whitespace
    \V       Not vertical whitespace
    \h       Horizontal whitespace
    \H       Not horizontal whitespace
    \R       Linebreak

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word).  If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but they aren't usable
as either end of a range.  If any of them precedes or follows a "-",
the "-" is understood literally.  If Unicode is in effect, C<\s> matches
also "\x{85}", "\x{2028}", and "\x{2029}".  See L<perlunicode> for more
details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
in general.
X<\w> X<\W>

C<\R> will atomically match a linebreak, including the network line-ending
"\x0D\x0A".  Specifically, C<\R> is exactly equivalent to
X<\R>

    (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])

B<Note:> C<\R> has no special meaning inside of a character class;
use C<\v> instead (vertical whitespace).

The POSIX character class syntax

    [:class:]

is also available.  Note that the C<[> and C<]> brackets are I<literal>;
they must always be used within a character class expression.

    # this is correct:
    $string =~ /[[:alpha:]]/;

    # this is not, and will generate a warning:
    $string =~ /[:alpha:]/;

The available classes and their backslash equivalents (if available) are
as follows:

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".

=item [2]

Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.

=item [3]

A Perl extension, see above.

=back

For example use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class.  For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:
X<\p> X<\p{}>

    [[:...:]]   \p{...}         backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
"word" and "blank").

The other named classes are:

=over 4

=item cntrl

Any control character.  Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters.  All characters with ord() less than
32 are usually classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph

Any alphanumeric or punctuation (special) character.

=item print

Any alphanumeric or punctuation (special) character or the space character.

=item punct

Any punctuation (special) character.

=item xdigit

Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'.  This is a Perl extension.  For example:

    POSIX         traditional  Unicode

    [[:^digit:]]    \D         \P{IsDigit}
    [[:^space:]]    \S         \P{IsSpace}
    [[:^word:]]     \W         \P{IsWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class.  The POSIX character classes
[.cc.] and [=cc=] are recognized but B<not> supported and trying to
use them will cause an error.

=head3 Assertions

Perl defines the following zero-width assertions:
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>.  (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary.  To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z>

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string, see the previous reference.  The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>.  Note that the rule for zero-length
matches is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match.  Thus the following
will not match forever:
X<\G>

    $str = 'ABC';
    pos($str) = 1;
    while ($str =~ /.\G/g) {
        print $&;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop.  Take care when using patterns that include C<\G> in an alternation.

=head3 Capture buffers

The bracketing construct C<( ... )> creates capture buffers.  To refer
to the current contents of a buffer later on, within the same pattern,
use \1 for the first, \2 for the second, and so on.
Outside the match use "$" instead of "\".  (The
\ notation works in certain circumstances outside
the match.  See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.

There is no limit to the number of captured substrings that you may
use.  However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc.  (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.)  Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it.  Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it.  And so on.  \1 through \9 are always interpreted as
backreferences.
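
The backreference rules above can be sketched as follows.  This is only an
illustrative example, not code from the Perl distribution: the variable
names C<$text>, C<$doubled>, C<$octal>, C<$ten> and C<$backref> and the
sample strings are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# \1 inside the pattern refers back to whatever group 1 captured.
my $text = "this is is a test";
my $doubled;
if ($text =~ /\b(\w+)\s+\1\b/) {
    $doubled = $1;    # outside the pattern, use $1 rather than \1
}
print "doubled word: $doubled\n";

# Only one group has been opened here, so \10 is read as the octal
# escape \010 (backspace, chr(8)), not as a backreference.
my $octal = ("a\x08" =~ /(a)\10/) ? 1 : 0;

# Ten groups are opened before \10 here, so it refers to group 10.
my $ten     = join '', map { "($_)" } 'a' .. 'j';    # (a)(b)...(j)
my $backref = ("abcdefghijj" =~ /^$ten\10$/) ? 1 : 0;

print "octal: $octal, backref: $backref\n";
```

Because the meaning of C<\10> depends on how many groups precede it, a
pattern assembled from pieces can change meaning when a group is added or
removed.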

X<\g{1}> X<\g{-1}> X<\g{name}>
In order to provide a safer and easier way to construct patterns using
backreferences, Perl provides the C<\g{N}> notation (starting with perl
5.10.0).  The curly brackets are optional, however omitting them is less
safe as the meaning of the pattern can be changed by text (such as digits)
following it.  When N is a positive integer the C<\g{N}> notation is
exactly equivalent to using normal backreferences.  When N is a negative
integer then it is a relative backreference referring to the previous N'th
capturing group.  When the bracket form is used and N is not an integer, it
is treated as a reference to a named buffer.

Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
buffer before that.  For example:

    /
     (Y)            # buffer 1
     (              # buffer 2
        (X)         # buffer 3
        \g{-1}      # backref to buffer 3
        \g{-3}      # backref to buffer 1
     )
    /x

and would match the same as C</(Y) ( (X) \3 \1 )/x>.

Additionally, as of Perl 5.10.0 you may use named capture buffers and named
backreferences.  The notation is C<< (?<name>...) >> to declare and
C<< \k<name> >> to reference.  You may also use apostrophes instead of
angle brackets to delimit the name; and you may use the bracketed
C<< \g{name} >> backreference syntax.  It's possible to refer to a named
capture buffer by absolute and relative number as well.  Outside the
pattern, a named capture buffer is available via the C<%+> hash.
When different buffers within the same pattern have the same name,
C<$+{name}> and C<< \k<name> >> refer to the leftmost defined group.
(Thus it's possible to do things with named capture buffers that would
otherwise require C<(??{})> code to accomplish.)
X<%+> X<$+{name}>

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\1/                         # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\1/                  # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match.  C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string.  (At one point C<$0> did
also, but now it returns the name of the program.)  C<$`> returns
everything before the matched string.  C<$'> returns everything
after the matched string.  And C<$^N> contains whatever was matched by
the most-recently closed group (submatch).  C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match.  This may substantially slow your program.  Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses.  (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.)  But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.
As of 5.005, C<$&> is not so costly as the other two.
X<$&> X<$`> X<$'>

As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
have to tell perl when you want to use them.
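
A minimal sketch of the preserve modifier in use, assuming perl 5.10.0 or
later.  The input string and the names C<$line>, C<$pre>, C<$match> and
C<$post> are hypothetical and chosen only for illustration.

```perl
use 5.010;
use strict;
use warnings;

my $line = "key = value";

# /p asks perl to preserve the match information for this pattern only,
# so no program-wide penalty is incurred.
my ($pre, $match, $post);
if ($line =~ /\s*=\s*/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "pre='$pre' match='$match' post='$post'\n";
```

Copying the three variables into lexicals straight after the match keeps
the values safe from any later successful match in the same scope.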


Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.  This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern.  Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>.  The syntax is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.  Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology...

=over 10

=item C<(?#text)>
X<(?#)>

A comment.  The text is ignored.  If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice.
Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?pimsx-imsx)>
X<(?)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).  This is
particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere.  Consider the case where some patterns want to be case
sensitive and some do not:  The case insensitive ones merely need to
include C<(?i)> at the front of the pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group.  For example,

    ( (?i) blah ) \s+ \1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.

Note that the C