gawk: Computed Regexps

3.6 Using Dynamic Regexps
=========================

The righthand side of a '~' or '!~' operator need not be a regexp
constant (i.e., a string of characters between slashes). It may be any
expression. The expression is evaluated and converted to a string if
necessary; the contents of the string are then used as the regexp. A
regexp computed in this way is called a "dynamic regexp" or a "computed
regexp":

     BEGIN { digits_regexp = "[[:digit:]]+" }
     $0 ~ digits_regexp { print }

This sets 'digits_regexp' to a regexp that describes one or more digits,
and tests whether the input record matches this regexp.
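
Because the righthand side may be any expression, a regexp can also be
assembled at runtime from smaller pieces. The following is a minimal
sketch, not taken from the manual; the sample input and the variable
name 'prefix' are assumptions made for illustration:

```shell
# Hypothetical sample input; "prefix" is an illustrative variable name.
# The expression ("^" prefix) is evaluated to the string "^err",
# and the contents of that string are then used as the regexp.
result=$(printf 'error: disk full\nwarning: low space\nerrno 5\n' |
    awk -v prefix='err' '$0 ~ ("^" prefix) { print }')
echo "$result"
```

With this input, only the records that begin with 'err' should be
printed. The parentheses around the concatenation keep the intent
explicit.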

     NOTE: When using the '~' and '!~' operators, be aware that there is
     a difference between a regexp constant enclosed in slashes and a
     string constant enclosed in double quotes. If you are going to use
     a string constant, you have to understand that the string is, in
     essence, scanned _twice_: the first time when 'awk' reads your
     program, and the second time when it goes to match the string on
     the lefthand side of the operator with the pattern on the right.
     This is true of any string-valued expression (such as
     'digits_regexp', shown in the previous example), not just string
     constants.

What difference does it make if the string is scanned twice? The
answer has to do with escape sequences, and particularly with
backslashes. To get a backslash into a regular expression inside a
string, you have to type two backslashes.

For example, '/\*/' is a regexp constant for a literal '*'. Only one
backslash is needed. To do the same thing with a string, you have to
type '"\\*"'. The first backslash escapes the second one so that the
string actually contains the two characters '\' and '*'.
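
The two spellings can be compared side by side. This is a small sketch
under assumed sample input (the two records 'a*b' and 'ab' are
hypothetical), run from a POSIX shell so that the single quotes pass
the backslashes to 'awk' unchanged:

```shell
# Regexp constant: a single backslash makes the '*' literal.
from_constant=$(printf 'a*b\nab\n' | awk '$0 ~ /\*/')

# String constant: the string scan turns "\\*" into the two characters
# \ and *, which the regexp scan then reads as a literal '*'.
from_string=$(printf 'a*b\nab\n' | awk '$0 ~ "\\*"')

echo "$from_constant"
echo "$from_string"
```

Both commands should select only the record that contains a literal
'*'.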

Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:

   * String constants are more complicated to write and more difficult
     to read. Using regexp constants makes your programs less
     error-prone. Not understanding the difference between the two
     kinds of constants is a common source of errors.

   * It is more efficient to use regexp constants. 'awk' can note that
     you have supplied a regexp and store it internally in a form that
     makes pattern matching more efficient. When using a string
     constant, 'awk' must first convert the string into this internal
     form and then perform the pattern matching.

   * Using regexp constants is better form; it shows clearly that you
     intend a regexp match.

Using '\n' in Bracket Expressions of Dynamic Regexps

Some older versions of 'awk' do not allow the newline character to be
used inside a bracket expression for a dynamic regexp:

     $ awk '$0 ~ "[ \t\n]"'
     error-> awk: newline in character class [
     error-> ]...
     error-> source line number 1
     error-> context is
     error-> $0 ~ "[ >>> \t\n]" <<<

But a newline in a regexp constant works with no problem:

     $ awk '$0 ~ /[ \t\n]/'
     here is a sample line
     -| here is a sample line
     Ctrl-d

'gawk' does not have this problem, and it isn't likely to occur often
in practice, but it's worth noting for future reference.