gawk: Regexp Operators
1
1 3.3 Regular Expression Operators
1 ================================
1
1 You can combine regular expressions with special characters, called
1 "regular expression operators" or "metacharacters", to increase the
1 power and versatility of regular expressions.
1
1 The escape sequences described in ⇒Escape Sequences are valid
1 inside a regexp. They are introduced by a '\' and are recognized and
1 converted into corresponding real characters as the very first step in
1 processing regexps.
1
1 Here is a list of metacharacters. All characters that are not escape
1 sequences and that are not listed here stand for themselves:
1
1 '\'
1 This suppresses the special meaning of a character when matching.
1 For example, '\$' matches the character '$'.
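
As a minimal sketch of the escape, assuming that the `awk` on your search path is `gawk` or another POSIX-conforming `awk`, the following keeps only the input lines that contain a literal dollar sign. The shell variable name and the sample input are illustrative only:

```shell
# Keep only the input lines that contain a literal dollar sign.
# Without the backslash, "$" would act as the end-of-string anchor.
with_dollar=$(printf 'cost: $5\nno price here\n' | awk '/\$/')
echo "$with_dollar"
```

Only the first input line should be printed, because it is the only one that contains a '$'.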
1
1 '^'
1 This matches the beginning of a string. '^@chapter' matches
1 '@chapter' at the beginning of a string, for example, and can be
1 used to identify chapter beginnings in Texinfo source files. The
1 '^' is known as an "anchor", because it anchors the pattern to
1 match only at the beginning of the string.
1
1 It is important to realize that '^' does not match the beginning of
1 a line (the point right after a '\n' newline character) embedded in
1 a string. The condition is not true in the following example:
1
1 if ("line1\nLINE 2" ~ /^L/) ...
1
1 '$'
1 This is similar to '^', but it matches only at the end of a string.
1 For example, 'p$' matches a record that ends with a 'p'. The '$'
1 is an anchor and does not match the end of a line (the point right
1 before a '\n' newline character) embedded in a string. The
1 condition in the following example is not true:
1
1 if ("line1\nLINE 2" ~ /1$/) ...
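
The behavior of both anchors on a string with an embedded newline can be checked with a short sketch such as the following. It assumes an `awk` command that is `gawk` or another POSIX-conforming `awk`; the helper function `yn()` and the variable names are hypothetical and exist only for this illustration:

```shell
# The test string holds two lines separated by an embedded newline.
# "^" and "$" anchor to the whole string, not to each embedded line.
anchors=$(awk '
function yn(cond) { return cond ? "yes" : "no" }
BEGIN {
    s = "line1\nLINE 2"
    print yn(s ~ /^L/)    # no:  "L" follows the newline, not the string start
    print yn(s ~ /1$/)    # no:  "1" precedes the newline, not the string end
    print yn(s ~ /^l/)    # yes: "l" is the first character of the string
    print yn(s ~ /2$/)    # yes: "2" is the last character of the string
}')
echo "$anchors"
```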
1
1 '.' (period)
1 This matches any single character, _including_ the newline
1 character. For example, '.P' matches any single character followed
1 by a 'P' in a string. Using concatenation, we can make a regular
1 expression such as 'U.A', which matches any three-character
1 sequence that begins with 'U' and ends with 'A'.
1
1 In strict POSIX mode (⇒Options), '.' does not match the NUL
1 character, which is a character with all bits equal to zero.
1 Otherwise, NUL is just another character. Other versions of 'awk'
1 may not be able to match the NUL character.
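
The following sketch illustrates that '.' stands for exactly one character and that, in `gawk`, the newline qualifies. It assumes that `awk` is `gawk` running outside strict POSIX mode; other implementations are expected to behave the same way here, but that is an assumption rather than a guarantee. The variable name is illustrative:

```shell
dot=$(awk 'BEGIN {
    print ("UxA" ~ /U.A/)     # 1: "x" fills the single-character slot
    print ("UA" ~ /U.A/)      # 0: "." must consume exactly one character
    print ("U\nA" ~ /U.A/)    # 1: the newline counts as a character too
}')
echo "$dot"
```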
1
1 '['...']'
1 This is called a "bracket expression".(1) It matches any _one_ of
1 the characters that are enclosed in the square brackets. For
1 example, '[MVX]' matches any one of the characters 'M', 'V', or 'X'
1 in a string. A full discussion of what can be inside the square
1 brackets of a bracket expression is given in ⇒Bracket Expressions.
1
1 '[^'...']'
1 This is a "complemented bracket expression". The first character
1 after the '[' _must_ be a '^'. It matches any _one_ character
1 _except_ those in the square brackets. For example, '[^awk]'
1 matches any single character that is not an 'a', 'w', or 'k'.
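
To see which characters a bracket expression and its complement select, one approach is to replace every match with an underscore using `gsub()`. This is only a sketch: it assumes a POSIX-conforming `awk`, and the sample string and variable names are made up for the illustration:

```shell
brackets=$(awk 'BEGIN {
    s = "VAX awk"
    t = s; n = gsub(/[MVX]/, "_", t)     # replaces "V" and "X"
    print n, t
    u = s; m = gsub(/[^awk]/, "_", u)    # replaces everything except "a", "w", "k"
    print m, u
}')
echo "$brackets"
```

Matching is case-sensitive by default, so the uppercase 'A' in the sample is not protected by the 'a' in '[^awk]'.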
1
1 '|'
1 This is the "alternation operator" and it is used to specify
1 alternatives. The '|' has the lowest precedence of all the regular
1 expression operators. For example, '^P|[aeiouy]' matches any
1 string that matches either '^P' or '[aeiouy]'. This means it
1 matches any string that starts with 'P' or contains (anywhere
1 within it) a lowercase English vowel.
1
1 The alternation applies to the largest possible regexps on either
1 side.
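
Because '|' has the lowest precedence, the '^' in '^P|[aeiouy]' anchors only the first alternative. The following sketch, which assumes a POSIX-conforming `awk` and uses made-up input lines, filters four words through that regexp:

```shell
# "PHP" starts with "P"; "sed" and "Top" contain a lowercase vowel;
# "TCL" satisfies neither alternative and is dropped.
alternation=$(printf 'PHP\nsed\nTCL\nTop\n' | awk '/^P|[aeiouy]/')
echo "$alternation"
```

Note that 'Top' is accepted even though it does not start with 'P': the second alternative is not anchored.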
1
1 '('...')'
1 Parentheses are used for grouping in regular expressions, as in
1 arithmetic. They can be used to concatenate regular expressions
1 containing the alternation operator, '|'. For example,
1 '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'.
1 (These are Texinfo formatting control sequences. The '+' is
1 explained further on in this list.)
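
The grouping example can be tried from the shell as follows. This is a sketch that assumes `awk` is `gawk` or another POSIX-conforming `awk`; the input lines beyond '@code{foo}' and '@samp{bar}' are invented to show what is rejected:

```shell
# "@var{baz}" uses a command outside the group, and "@code{}" has
# nothing between the braces, so "[^}]+" cannot match there.
texinfo=$(printf '@code{foo}\n@samp{bar}\n@var{baz}\n@code{}\n' |
          awk '/@(samp|code)\{[^}]+\}/')
echo "$texinfo"
```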
1
1 '*'
1 This symbol means that the preceding regular expression may be
1 repeated any number of times, including zero, to find a match. For
1 example, 'ph*' applies the '*' symbol to the preceding 'h' and looks
1 for matches of one 'p' followed by any number of 'h's. This also
1 matches just 'p' if no 'h's are present.
1
1 There are two subtle points to understand about how '*' works.
1 First, the '*' applies only to the single preceding regular
1 expression component (e.g., in 'ph*', it applies just to the 'h').
1 To cause '*' to apply to a larger subexpression, use parentheses:
1 '(ph)*' matches 'ph', 'phph', 'phphph', and so on.
1
1 Second, '*' finds as many repetitions as possible. If the text to
1 be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's.
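
Both points can be observed with the `match()` function, which sets `RSTART` and `RLENGTH` to the position and length of the matched text. The following is a sketch assuming a POSIX-conforming `awk`; the test strings are shortened, made-up variants of the one above:

```shell
star=$(awk 'BEGIN {
    # Greedy: "h*" takes every available "h", not just the first.
    match("phhhooey", /ph*/)
    print substr("phhhooey", RSTART, RLENGTH)
    # Zero repetitions are allowed, so a bare "p" still matches.
    match("pooey", /ph*/)
    print substr("pooey", RSTART, RLENGTH)
    # Parentheses make "*" repeat the whole group "ph".
    match("phphphooey", /(ph)*/)
    print substr("phphphooey", RSTART, RLENGTH)
}')
echo "$star"
```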
1
1 '+'
1 This symbol is similar to '*', except that the preceding expression
1 must be matched at least once. This means that 'wh+y' would match
1 'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all
1 three.
1
1 '?'
1 This symbol is similar to '*', except that the preceding expression
1 can be matched either once or not at all. For example, 'fe?d'
1 matches 'fed' and 'fd', but nothing else.
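
The three repetition operators can be compared side by side. The sketch below assumes a POSIX-conforming `awk` and prints, for each input word, whether it matches 'wh+y', 'wh*y', and 'wh?y' (1 for a match, 0 otherwise); the last regexp is added here for comparison only:

```shell
repeat=$(printf 'wy\nwhy\nwhhy\n' |
         awk '{ print $0, ($0 ~ /wh+y/), ($0 ~ /wh*y/), ($0 ~ /wh?y/) }')
echo "$repeat"
```

The columns show that '+' rejects zero 'h's, '*' accepts any count, and '?' rejects more than one.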
1
1 '{'N'}'
1 '{'N',}'
1 '{'N','M'}'
1 One or two numbers inside braces denote an "interval expression".
1 If there is one number in the braces, the preceding regexp is
1 repeated N times. If there are two numbers separated by a comma,
1 the preceding regexp is repeated N to M times. If there is one
1 number followed by a comma, then the preceding regexp is repeated
1 at least N times:
1
1 'wh{3}y'
1 Matches 'whhhy', but not 'why' or 'whhhhy'.
1
1 'wh{3,5}y'
1 Matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
1
1 'wh{2,}y'
1 Matches 'whhy', 'whhhy', and so on.
1
1 Interval expressions were not traditionally available in 'awk'.
1 They were added as part of the POSIX standard to make 'awk' and
1 'egrep' consistent with each other.
1
1 Initially, because old programs may use '{' and '}' in regexp
1 constants, 'gawk' did _not_ match interval expressions in regexps.
1
1 However, beginning with version 4.0, 'gawk' does match interval
1 expressions by default. This is because compatibility with POSIX
1 has become more important to most 'gawk' users than compatibility
1 with old programs.
1
1 For programs that use '{' and '}' in regexp constants, it is good
1 practice to always escape them with a backslash. Then the regexp
1 constants are valid and work the way you want them to, using any
1 version of 'awk'.(2)
1
1 Finally, when '{' and '}' appear in regexp constants in a way that
1 cannot be interpreted as an interval expression (such as '/q{a}/'),
1 then they stand for themselves.
1
1 In regular expressions, the '*', '+', and '?' operators, as well as
1 the braces '{' and '}', have the highest precedence, followed by
1 concatenation, and finally by '|'. As in arithmetic, parentheses can
1 change how operators are grouped.
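
The precedence rules can be illustrated with a short sketch, assuming a POSIX-conforming `awk`. The test strings and the variable name are invented for the example:

```shell
precedence=$(awk 'BEGIN {
    # Repetition binds tighter than concatenation: "+" repeats only "b".
    match("abab", /ab+/);   print substr("abab", RSTART, RLENGTH)
    # Parentheses regroup, so "+" repeats the whole "ab".
    match("abab", /(ab)+/); print substr("abab", RSTART, RLENGTH)
    # Concatenation binds tighter than "|": this is (^ab)|(cd$).
    print ("abxx" ~ /^ab|cd$/)
    # With parentheses, both anchors apply to either alternative.
    print ("abxx" ~ /^(ab|cd)$/)
}')
echo "$precedence"
```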
1
1 In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for
1 themselves when there is nothing in the regexp that precedes them. For
1 example, '/+/' matches a literal plus sign. However, many other
1 versions of 'awk' treat such a usage as a syntax error.
1
1 If 'gawk' is in compatibility mode (⇒Options), interval
1 expressions are not available in regular expressions.
1
1 ---------- Footnotes ----------
1
1 (1) In other literature, you may see a bracket expression referred to
1 as either a "character set", a "character class", or a "character list".
1
1 (2) Use two backslashes if you're using a string constant with a
1 regexp operator or function.
1