gawk: Regexp Operators

3.3 Regular Expression Operators
================================

You can combine regular expressions with special characters, called
"regular expression operators" or "metacharacters", to increase the
power and versatility of regular expressions.

   The escape sequences described in ⇒Escape Sequences are valid
inside a regexp.  They are introduced by a '\' and are recognized and
converted into corresponding real characters as the very first step in
processing regexps.

   Here is a list of metacharacters.  All characters that are not escape
sequences and that are not listed here stand for themselves:

'\'
     This suppresses the special meaning of a character when matching.
     For example, '\$' matches the character '$'.
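
     The effect of the backslash can be checked from the shell.  The
     following is only a sketch: it assumes that 'awk' on your PATH is
     'gawk' or another POSIX-compliant 'awk', and the function name
     'escape_demo' is made up for this illustration.  Each test prints
     1 for a match and 0 otherwise.

```shell
# Hypothetical helper: compare escaped and unescaped metacharacters.
escape_demo() {
    awk 'BEGIN {
        # \$ matches a literal dollar sign; \. matches a literal period,
        # whereas an unescaped . matches any single character.
        printf "%d %d %d %d\n",
            ("$5.00" ~ /\$/), ("5.00" ~ /\$/),
            ("abc" ~ /\./), ("abc" ~ /./)
    }'
}
escape_demo    # prints: 1 0 0 1
```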

'^'
     This matches the beginning of a string.  '^@chapter' matches
     '@chapter' at the beginning of a string, for example, and can be
     used to identify chapter beginnings in Texinfo source files.  The
     '^' is known as an "anchor", because it anchors the pattern to
     match only at the beginning of the string.

     It is important to realize that '^' does not match the beginning of
     a line (the point right after a '\n' newline character) embedded in
     a string.  The condition is not true in the following example:

          if ("line1\nLINE 2" ~ /^L/) ...

'$'
     This is similar to '^', but it matches only at the end of a string.
     For example, 'p$' matches a record that ends with a 'p'.  The '$'
     is an anchor and does not match the end of a line (the point right
     before a '\n' newline character) embedded in a string.  The
     condition in the following example is not true:

          if ("line1\nLINE 2" ~ /1$/) ...
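
     Both anchors refer to the string as a whole, not to the lines
     inside it.  One possible way to test individual lines is to split
     the string on newlines first.  The sketch below assumes a
     POSIX-compliant 'awk'; 'anchor_demo' and the array name 'line' are
     names chosen for this illustration only.

```shell
# Hypothetical helper: anchors apply to the whole string, not to embedded lines.
anchor_demo() {
    awk 'BEGIN {
        s = "line1\nLINE 2"
        split(s, line, "\n")        # line[1] = "line1", line[2] = "LINE 2"
        printf "%d %d %d %d\n",
            (s ~ /^L/),             # 0: "L" is not at the start of the string
            (s ~ /1$/),             # 0: "1" is not at the end of the string
            (line[2] ~ /^L/),       # 1: matches once the line stands alone
            (line[1] ~ /1$/)        # 1: likewise for the end anchor
    }'
}
anchor_demo    # prints: 0 0 1 1
```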

'.' (period)
     This matches any single character, _including_ the newline
     character.  For example, '.P' matches any single character followed
     by a 'P' in a string.  Using concatenation, we can make a regular
     expression such as 'U.A', which matches any three-character
     sequence that begins with 'U' and ends with 'A'.

     In strict POSIX mode (⇒Options), '.' does not match the NUL
     character, which is a character with all bits equal to zero.
     Otherwise, NUL is just another character.  Other versions of 'awk'
     may not be able to match the NUL character.
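
     As a quick check that '.' also accepts a newline, the following
     sketch tries 'U.A' against three strings.  It assumes a
     POSIX-compliant 'awk'; the function name 'dot_demo' is
     hypothetical.

```shell
# Hypothetical helper: "." needs exactly one character, and newline counts.
dot_demo() {
    awk 'BEGIN {
        printf "%d %d %d\n",
            ("USA" ~ /U.A/),        # 1: "S" fills the middle position
            ("U\nA" ~ /U.A/),       # 1: a newline fills it as well
            ("UA" ~ /U.A/)          # 0: nothing between "U" and "A"
    }'
}
dot_demo    # prints: 1 1 0
```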

'['...']'
     This is called a "bracket expression".(1)  It matches any _one_ of
     the characters that are enclosed in the square brackets.  For
     example, '[MVX]' matches any one of the characters 'M', 'V', or 'X'
     in a string.  A full discussion of what can be inside the square
     brackets of a bracket expression is given in ⇒Bracket
     Expressions.

'[^'...']'
     This is a "complemented bracket expression".  The first character
     after the '[' _must_ be a '^'.  It matches any single character
     _except_ those in the square brackets.  For example, '[^awk]'
     matches any character that is not an 'a', 'w', or 'k'.
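
     Note that a complemented bracket expression still has to match
     some character; it does not mean "the string contains none of
     these".  The sketch below illustrates this, assuming a
     POSIX-compliant 'awk'; 'bracket_demo' is a made-up name.

```shell
# Hypothetical helper: plain versus complemented bracket expressions.
bracket_demo() {
    awk 'BEGIN {
        printf "%d %d %d %d\n",
            ("MAX" ~ /[MVX]/),      # 1: contains "M" (and "X")
            ("abc" ~ /[MVX]/),      # 0: no "M", "V", or "X"
            ("awk" ~ /[^awk]/),     # 0: every character is "a", "w", or "k"
            ("gawk" ~ /[^awk]/)     # 1: "g" is outside the set
    }'
}
bracket_demo    # prints: 1 0 0 1
```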

'|'
     This is the "alternation operator" and it is used to specify
     alternatives.  The '|' has the lowest precedence of all the regular
     expression operators.  For example, '^P|[aeiouy]' matches any
     string that matches either '^P' or '[aeiouy]'.  This means it
     matches any string that starts with 'P' or contains (anywhere
     within it) a lowercase English vowel.

     The alternation applies to the largest possible regexps on either
     side.
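
     Because the alternation takes the largest possible regexp on each
     side, the '^' in '^P|[aeiouy]' belongs to the 'P' alone.  The
     sketch below contrasts this with a parenthesized variant, which is
     not in the text above and is shown only for comparison.  It
     assumes a POSIX-compliant 'awk'; 'alternation_demo' is a
     hypothetical name.

```shell
# Hypothetical helper: "|" splits the whole regexp into two alternatives.
alternation_demo() {
    awk 'BEGIN {
        printf "%d %d %d %d\n",
            ("PHP" ~ /^P|[aeiouy]/),      # 1: starts with "P"
            ("sky" ~ /^P|[aeiouy]/),      # 1: contains "y" somewhere
            ("HTTP" ~ /^P|[aeiouy]/),     # 0: "P" not first, no lowercase vowel
            ("sky" ~ /^(P|[aeiouy])/)     # 0: here "^" applies to both choices
    }'
}
alternation_demo    # prints: 1 1 0 0
```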

'('...')'
     Parentheses are used for grouping in regular expressions, as in
     arithmetic.  They can be used to concatenate regular expressions
     containing the alternation operator, '|'.  For example,
     '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'.
     (These are Texinfo formatting control sequences.  The '+' is
     explained further on in this list.)
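
     The Texinfo regexp above can be tried as follows.  This is a
     sketch that assumes a POSIX-compliant 'awk'; the function name
     'group_demo' and the two non-matching inputs are invented for the
     illustration.

```shell
# Hypothetical helper: grouping an alternation inside a longer regexp.
group_demo() {
    awk 'BEGIN {
        printf "%d %d %d %d\n",
            ("@code{foo}" ~ /@(samp|code)\{[^}]+\}/),   # 1
            ("@samp{bar}" ~ /@(samp|code)\{[^}]+\}/),   # 1
            ("@var{x}" ~ /@(samp|code)\{[^}]+\}/),      # 0: "var" is not listed
            ("@code{}" ~ /@(samp|code)\{[^}]+\}/)       # 0: braces must not be empty
    }'
}
group_demo    # prints: 1 1 0 0
```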

'*'
     This symbol means that the preceding regular expression should be
     repeated as many times as necessary to find a match.  For example,
     'ph*' applies the '*' symbol to the preceding 'h' and looks for
     matches of one 'p' followed by any number of 'h's.  This also
     matches just 'p' if no 'h's are present.

     There are two subtle points to understand about how '*' works.
     First, the '*' applies only to the single preceding regular
     expression component (e.g., in 'ph*', it applies just to the 'h').
     To cause '*' to apply to a larger subexpression, use parentheses:
     '(ph)*' matches 'ph', 'phph', 'phphph', and so on.

     Second, '*' finds as many repetitions as possible.  If the text to
     be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's.
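
     One way to observe both points is to ask the standard 'match()'
     function how long the matched text is; it sets 'RLENGTH' to the
     length of the match.  The sketch below uses shorter sample
     strings than the text above, assumes a POSIX-compliant 'awk', and
     uses the made-up names 'star_demo', 'a', 'b', and 'c'.

```shell
# Hypothetical helper: measure how much text "*" consumes.
star_demo() {
    awk 'BEGIN {
        match("phhhooey", /ph*/);   a = RLENGTH   # "phhh": all three "h"s
        match("pooey", /ph*/);      b = RLENGTH   # "p": zero "h"s is fine
        match("phphph!", /(ph)*/);  c = RLENGTH   # "phphph": group repeated
        printf "%d %d %d\n", a, b, c
    }'
}
star_demo    # prints: 4 1 6
```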

'+'
     This symbol is similar to '*', except that the preceding expression
     must be matched at least once.  This means that 'wh+y' would match
     'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all
     three.

'?'
     This symbol is similar to '*', except that the preceding expression
     can be matched either once or not at all.  For example, 'fe?d'
     matches 'fed' and 'fd', but nothing else.
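
     The examples given for '+' and '?' can be verified with a short
     script.  This sketch assumes a POSIX-compliant 'awk';
     'plus_question_demo' is a hypothetical name, and 'feed' is an
     extra input added here to show a non-match.

```shell
# Hypothetical helper: "+" needs one or more, "?" allows zero or one.
plus_question_demo() {
    awk 'BEGIN {
        printf "%d %d %d %d %d %d %d\n",
            ("why" ~ /wh+y/), ("whhy" ~ /wh+y/), ("wy" ~ /wh+y/),
            ("wy" ~ /wh*y/),
            ("fed" ~ /fe?d/), ("fd" ~ /fe?d/), ("feed" ~ /fe?d/)
    }'
}
plus_question_demo    # prints: 1 1 0 1 1 1 0
```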

'{'N'}'
'{'N',}'
'{'N','M'}'
     One or two numbers inside braces denote an "interval expression".
     If there is one number in the braces, the preceding regexp is
     repeated N times.  If there are two numbers separated by a comma,
     the preceding regexp is repeated N to M times.  If there is one
     number followed by a comma, then the preceding regexp is repeated
     at least N times:

     'wh{3}y'
          Matches 'whhhy', but not 'why' or 'whhhhy'.

     'wh{3,5}y'
          Matches 'whhhy', 'whhhhy', or 'whhhhhy' only.

     'wh{2,}y'
          Matches 'whhy', 'whhhy', and so on.

     Interval expressions were not traditionally available in 'awk'.
     They were added as part of the POSIX standard to make 'awk' and
     'egrep' consistent with each other.

     Initially, because old programs may use '{' and '}' in regexp
     constants, 'gawk' did _not_ match interval expressions in regexps.

     However, beginning with version 4.0, 'gawk' does match interval
     expressions by default.  This is because compatibility with POSIX
     has become more important to most 'gawk' users than compatibility
     with old programs.

     For programs that use '{' and '}' in regexp constants, it is good
     practice to always escape them with a backslash.  Then the regexp
     constants are valid and work the way you want them to, using any
     version of 'awk'.(2)

     Finally, when '{' and '}' appear in regexp constants in a way that
     cannot be interpreted as an interval expression (such as '/q{a}/'),
     then they stand for themselves.

   In regular expressions, the '*', '+', and '?' operators, as well as
the braces '{' and '}', have the highest precedence, followed by
concatenation, and finally by '|'.  As in arithmetic, parentheses can
change how operators are grouped.
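
   These precedence rules can be illustrated with a few tests.  The
following is only a sketch: it assumes a POSIX-compliant 'awk', and the
function name 'precedence_demo' and the sample strings are invented for
this purpose.

```shell
# Hypothetical helper: repetition binds tighter than concatenation,
# which in turn binds tighter than "|".
precedence_demo() {
    awk 'BEGIN {
        printf "%d %d %d %d %d %d\n",
            ("abab" ~ /^ab+$/),       # 0: "+" repeats only the "b"
            ("abbb" ~ /^ab+$/),       # 1
            ("abab" ~ /^(ab)+$/),     # 1: parentheses repeat "ab"
            ("abxx" ~ /^ab|cd$/),     # 1: read as (^ab)|(cd$)
            ("abxx" ~ /^a(b|c)d$/),   # 0: "d" must follow "b" or "c"
            ("acd" ~ /^a(b|c)d$/)     # 1
    }'
}
precedence_demo    # prints: 0 1 1 1 0 1
```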

   In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for
themselves when there is nothing in the regexp that precedes them.  For
example, '/+/' matches a literal plus sign.  However, many other
versions of 'awk' treat such a usage as a syntax error.

   If 'gawk' is in compatibility mode (⇒Options), interval
expressions are not available in regular expressions.

   ---------- Footnotes ----------

   (1) In other literature, you may see a bracket expression referred to
as either a "character set", a "character class", or a "character list".

   (2) Use two backslashes if you're using a string constant with a
regexp operator or function.
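
   As an illustration of footnote (2), the sketch below matches
literal braces once with a regexp constant and once with a string
constant.  It assumes a POSIX-compliant 'awk'; the function name
'brace_demo' and the variable 're' are hypothetical.

```shell
# Hypothetical helper: escaping braces in a regexp constant and in a string.
brace_demo() {
    awk 'BEGIN {
        re = "\\{x\\}"                 # string constant: two backslashes each
        printf "%d %d\n",
            ("set{x}" ~ /\{x\}/),      # regexp constant: one backslash each
            ("set{x}" ~ re)
    }'
}
brace_demo    # prints: 1 1
```

   The string constant is scanned twice, first as a string and then as
a regexp, which is why each backslash has to be doubled.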