gawk: Gory Details

1 
1 9.1.3.1 More about '\' and '&' with 'sub()', 'gsub()', and 'gensub()'
1 .....................................................................
1 
1      CAUTION: This subsubsection has been reported to cause headaches.
1      You might want to skip it upon first reading.
1 
1    When using 'sub()', 'gsub()', or 'gensub()', and trying to get
1 literal backslashes and ampersands into the replacement text, you need
1 to remember that there are several levels of "escape processing" going
1 on.
1 
1    First, there is the "lexical" level, which is when 'awk' reads your
1 program and builds an internal copy of it to execute.  Then there is the
1 runtime level, which is when 'awk' actually scans the replacement string
1 to determine what to generate.
1 
1    At both levels, 'awk' looks for a defined set of characters that can
1 come after a backslash.  At the lexical level, it looks for the escape
1 sequences listed in ⇒Escape Sequences; one of these is '\\', which
1 stands for a single '\'.  Thus, for every '\' that 'awk' processes at
1 the runtime level, you must type two backslashes at the lexical level.
1 When a character that is not valid for an escape
1 sequence follows the '\', BWK 'awk' and 'gawk' both simply remove the
1 initial '\' and put the next character into the string.  Thus, for
1 example, '"a\qb"' is treated as '"aqb"'.
1 
1    At the runtime level, the various functions handle sequences of '\'
1 and '&' differently.  The situation is (sadly) somewhat complex.
1 Historically, the 'sub()' and 'gsub()' functions treated the
1 two-character sequence '\&' specially; this sequence was replaced in the
1 generated text with a single '&'.  Any other '\' within the REPLACEMENT
1 string that did not precede an '&' was passed through unchanged.  This
1 is illustrated in ⇒Table 9.1.
1 
1       You type         'sub()' sees          'sub()' generates
1       --------         ------------          -----------------
1           '\&'              '&'            The matched text
1          '\\&'             '\&'            A literal '&'
1         '\\\&'             '\&'            A literal '&'
1        '\\\\&'            '\\&'            A literal '\&'
1       '\\\\\&'            '\\&'            A literal '\&'
1      '\\\\\\&'           '\\\&'            A literal '\\&'
1          '\\q'             '\q'            A literal '\q'
1 
1 Table 9.1: Historical escape sequence processing for 'sub()' and
1 'gsub()'
1 
1 This table shows the lexical-level processing, where an odd number of
1 typed backslashes yields the same runtime string as the next lower even
1 number (the unpaired '\' before '&' is simply dropped), as well as the
1 runtime processing done by 'sub()'.  (For the sake of simplicity, the
1 remaining tables show only even numbers of backslashes entered at the
1 lexical level.)
1 
1    The problem with the historical approach is that there is no way to
1 get a literal '\' followed by the matched text.
1 
1    Several editions of the POSIX standard attempted to fix this problem
1 but weren't successful.  The details are irrelevant at this point in
1 time.
1 
1    At one point, the 'gawk' maintainer submitted proposed text for a
1 revised standard that reverts to rules that correspond more closely to
1 the original existing practice.  The proposed rules have special cases
1 that make it possible to produce a '\' preceding the matched text.  This
1 is shown in ⇒Table 9.2.
1 
1       You type         'sub()' sees         'sub()' generates
1       --------         ------------         -----------------
1      '\\\\\\&'           '\\\&'            A literal '\&'
1        '\\\\&'            '\\&'            A literal '\', followed by the matched text
1          '\\&'             '\&'            A literal '&'
1          '\\q'             '\q'            A literal '\q'
1         '\\\\'             '\\'            A literal '\\'
1 
1 Table 9.2: 'gawk' rules for 'sub()' and 'gsub()' backslash processing
1 
1    In a nutshell, at the runtime level, there are now three special
1 sequences of characters ('\\\&', '\\&', and '\&') whereas historically
1 there was only one.  However, as in the historical case, any '\' that is
1 not part of one of these three sequences is not special and appears in
1 the output literally.
1 
1    'gawk' 3.0 and 3.1 follow these rules for 'sub()' and 'gsub()'.  The
1 POSIX standard took much longer to be revised than was expected.  In
1 addition, the 'gawk' maintainer's proposal was lost during the
1 standardization process.  The final rules are somewhat simpler.  The
1 results are similar except for one case.
1 
1    The POSIX rules state that '\&' in the replacement string produces a
1 literal '&', '\\' produces a literal '\', and '\' followed by anything
1 else is not special; the '\' is placed straight into the output.  These
1 rules are presented in ⇒Table 9.3.
1 
1       You type         'sub()' sees         'sub()' generates
1       -----         -------         ----------
1      '\\\\\\&'           '\\\&'            A literal '\&'
1        '\\\\&'            '\\&'            A literal '\', followed by the matched text
1          '\\&'             '\&'            A literal '&'
1          '\\q'             '\q'            A literal '\q'
1         '\\\\'             '\\'            A literal '\'
1 
1 Table 9.3: POSIX rules for 'sub()' and 'gsub()'
1 
1    The only case where the difference is noticeable is the last one:
1 '\\\\' is seen as '\\' and produces '\' instead of '\\'.
1 
1    Starting with version 3.1.4, 'gawk' followed the POSIX rules when
1 '--posix' was specified (⇒Options).  Otherwise, it continued to
1 follow the proposed rules, as that had been its behavior for many years.
1 
1    When version 4.0.0 was released, the 'gawk' maintainer made the POSIX
1 rules the default, breaking well over a decade's worth of backward
1 compatibility.(1)  Needless to say, this was a bad idea, and as of
1 version 4.0.1, 'gawk' resumed its historical behavior, and only follows
1 the POSIX rules when '--posix' is given.
1 
1    The rules for 'gensub()' are considerably simpler.  At the runtime
1 level, whenever 'gawk' sees a '\', if the following character is a
1 digit, then the text that matched the corresponding parenthesized
1 subexpression is placed in the generated output.  Otherwise, no matter
1 what character follows the '\', it appears in the generated text and the
1 '\' does not, as shown in ⇒Table 9.4.
1 
1        You type          'gensub()' sees         'gensub()' generates
1        --------          ---------------         --------------------
1            '&'                    '&'            The matched text
1          '\\&'                   '\&'            A literal '&'
1         '\\\\'                   '\\'            A literal '\'
1        '\\\\&'                  '\\&'            A literal '\', then the matched text
1      '\\\\\\&'                 '\\\&'            A literal '\&'
1          '\\q'                   '\q'            A literal 'q'
1 
1 Table 9.4: Escape sequence processing for 'gensub()'
1 
1    Because of the complexity of the lexical- and runtime-level
1 processing and the special cases for 'sub()' and 'gsub()', we recommend
1 the use of 'gawk' and 'gensub()' when you have to do substitutions.
1 
1    ---------- Footnotes ----------
1 
1    (1) This was rather naive of him, despite there being a note in this
1 subsubsection indicating that the next major version would move to the
1 POSIX rules.
1