gawk: Computed Regexps

3.6 Using Dynamic Regexps
=========================

The righthand side of a '~' or '!~' operator need not be a regexp
constant (i.e., a string of characters between slashes).  It may be any
expression.  The expression is evaluated and converted to a string if
necessary; the contents of the string are then used as the regexp.  A
regexp computed in this way is called a "dynamic regexp" or a "computed
regexp":

     BEGIN { digits_regexp = "[[:digit:]]+" }
     $0 ~ digits_regexp    { print }

This sets 'digits_regexp' to a regexp that describes one or more digits,
and tests whether the input record matches this regexp.
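Because the righthand side may be any expression, a dynamic regexp can
also be assembled from pieces at runtime.  The following is a minimal
sketch rather than part of the original example: the variable 'prefix'
and the sample records are hypothetical, and the command assumes a POSIX
shell with 'awk' on the search path:

```shell
# 'prefix' and the four sample records are hypothetical, for illustration.
# The regexp is built by concatenating "^", the value of prefix, and
# "[0-9]+$"; the resulting string is then used as the regexp.
matches=$(printf 'id42\nid\nx7\nid9x\n' |
    awk -v prefix='id' '$0 ~ ("^" prefix "[0-9]+$") { print }')
printf '%s\n' "$matches"
```

Only 'id42' is printed; the other records lack either the prefix, the
digits, or the end-of-record anchor.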

     NOTE: When using the '~' and '!~' operators, be aware that there is
     a difference between a regexp constant enclosed in slashes and a
     string constant enclosed in double quotes.  If you are going to use
     a string constant, you have to understand that the string is, in
     essence, scanned _twice_: the first time when 'awk' reads your
     program, and the second time when it goes to match the string on
     the lefthand side of the operator with the pattern on the right.
     This is true of any string-valued expression (such as
     'digits_regexp', shown in the previous example), not just string
     constants.

   What difference does it make if the string is scanned twice?  The
answer has to do with escape sequences, and particularly with
backslashes.  To get a backslash into a regular expression inside a
string, you have to type two backslashes.

   For example, '/\*/' is a regexp constant for a literal '*'.  Only one
backslash is needed.  To do the same thing with a string, you have to
type '"\\*"'.  The first backslash escapes the second one so that the
string actually contains the two characters '\' and '*'.
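The two spellings can be compared side by side.  This is a sketch that
assumes a POSIX shell, and the sample input is hypothetical.  The 'awk'
programs are enclosed in single quotes so that the shell passes the
backslashes through to 'awk' unchanged:

```shell
# Hypothetical input: one record containing a literal '*', one without.
input=$(printf 'a*b\nab\n')

# Regexp constant: a single backslash is enough.
from_constant=$(printf '%s\n' "$input" | awk '$0 ~ /\*/')

# String constant: two backslashes, because the string is scanned twice.
from_string=$(printf '%s\n' "$input" | awk '$0 ~ "\\*"')

printf 'constant: %s\nstring:   %s\n' "$from_constant" "$from_string"
```

Both programs select the same record, 'a*b'.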

   Given that you can use both regexp and string constants to describe
regular expressions, which should you use?  The answer is "regexp
constants," for several reasons:

   * String constants are more complicated to write and more difficult
     to read.  Using regexp constants makes your programs less
     error-prone.  Not understanding the difference between the two
     kinds of constants is a common source of errors.

   * It is more efficient to use regexp constants.  'awk' can note that
     you have supplied a regexp and store it internally in a form that
     makes pattern matching more efficient.  When using a string
     constant, 'awk' must first convert the string into this internal
     form and then perform the pattern matching.

   * Using regexp constants is better form; it shows clearly that you
     intend a regexp match.

         Using '\n' in Bracket Expressions of Dynamic Regexps

   Some older versions of 'awk' do not allow the newline character to be
used inside a bracket expression for a dynamic regexp:

     $ awk '$0 ~ "[ \t\n]"'
     error-> awk: newline in character class [
     error-> ]...
     error->  source line number 1
     error->  context is
     error->        $0 ~ "[ >>>  \t\n]" <<<

   But a newline in a regexp constant works with no problem:

     $ awk '$0 ~ /[ \t\n]/'
     here is a sample line
     -| here is a sample line
     Ctrl-d

   'gawk' does not have this problem, and it isn't likely to occur often
in practice, but it's worth noting for future reference.
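For comparison, the dynamic regexp from the failing example can be run
against a current 'awk'.  This is a sketch that assumes the installed
'awk' is 'gawk' or another implementation without the restriction; the
second input record is hypothetical:

```shell
# The first record contains spaces and should match; the second
# (hypothetical) record has no space, TAB, or newline in it.
out=$(printf 'here is a sample line\nnospaces\n' | awk '$0 ~ "[ \t\n]"')
printf '%s\n' "$out"
```

An 'awk' with the older limitation would instead stop with a syntax
error like the one shown earlier.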