gawk: Gory Details

9.1.3.1 More about '\' and '&' with 'sub()', 'gsub()', and 'gensub()'
.....................................................................

CAUTION: This subsubsection has been reported to cause headaches.
You might want to skip it upon first reading.

When using 'sub()', 'gsub()', or 'gensub()', and trying to get
literal backslashes and ampersands into the replacement text, you need
to remember that there are several levels of "escape processing" going
on.

First, there is the "lexical" level, which is when 'awk' reads your
program and builds an internal copy of it to execute. Then there is the
runtime level, which is when 'awk' actually scans the replacement string
to determine what to generate.

At both levels, 'awk' looks for a defined set of characters that can
come after a backslash. At the lexical level, it looks for the escape
sequences listed in ⇒Escape Sequences. Thus, for every '\' that
'awk' processes at the runtime level, you must type two backslashes at
the lexical level. When a character that is not valid for an escape
sequence follows the '\', BWK 'awk' and 'gawk' both simply remove the
initial '\' and put the next character into the string. Thus, for
example, '"a\qb"' is treated as '"aqb"'.
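
The doubling rule can be checked by asking 'gawk' how long a
replacement string is once the lexical level has finished with it. The
following is a minimal sketch, assuming a POSIX shell with 'gawk' on the
search path; the shell variables 'two' and 'four' are illustrative names
only. Shell single quotes leave backslashes alone, so the program text
below is exactly what 'gawk' reads:

```shell
# Two typed backslashes are stored as one; four are stored as two.
two=$(gawk 'BEGIN { r = "\\&";   print length(r), r }')
four=$(gawk 'BEGIN { r = "\\\\&"; print length(r), r }')
printf '%s\n' "$two" "$four"
```

The first line printed should be '2 \&' and the second '3 \\&'. These
stored strings are what 'sub()', 'gsub()', and 'gensub()' later scan at
the runtime level.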

At the runtime level, the various functions handle sequences of '\'
and '&' differently. The situation is (sadly) somewhat complex.
Historically, the 'sub()' and 'gsub()' functions treated the
two-character sequence '\&' specially; this sequence was replaced in the
generated text with a single '&'. Any other '\' within the REPLACEMENT
string that did not precede an '&' was passed through unchanged. This
is illustrated in ⇒Table 9.1.

You type      'sub()' sees   'sub()' generates
--------      ------------   -----------------
'\&'          '&'            The matched text
'\\&'         '\&'           A literal '&'
'\\\&'        '\&'           A literal '&'
'\\\\&'       '\\&'          A literal '\&'
'\\\\\&'      '\\&'          A literal '\&'
'\\\\\\&'     '\\\&'         A literal '\\&'
'\\q'         '\q'           A literal '\q'

Table 9.1: Historical escape sequence processing for 'sub()' and
'gsub()'

This table shows the lexical-level processing, where an odd number of
typed backslashes is treated like the next lower even number (the
unpaired '\' before the '&' is simply removed), as well as the runtime
processing done by 'sub()'. (For simplicity, the remaining tables show
only the case of an even number of backslashes entered at the lexical
level.)

The problem with the historical approach is that there is no way to
get a literal '\' followed by the matched text.

Several editions of the POSIX standard attempted to fix this problem
but weren't successful. The details are irrelevant at this point in
time.

At one point, the 'gawk' maintainer submitted proposed text for a
revised standard that reverts to rules that correspond more closely to
the original existing practice. The proposed rules have special cases
that make it possible to produce a '\' preceding the matched text. This
is shown in ⇒Table 9.2.

You type      'sub()' sees   'sub()' generates
--------      ------------   -----------------
'\\\\\\&'     '\\\&'         A literal '\&'
'\\\\&'       '\\&'          A literal '\', followed by the matched text
'\\&'         '\&'           A literal '&'
'\\q'         '\q'           A literal '\q'
'\\\\'        '\\'           '\\'

Table 9.2: 'gawk' rules for 'sub()' and backslash

In a nutshell, at the runtime level, there are now three special
sequences of characters ('\\\&', '\\&', and '\&') whereas historically
there was only one. However, as in the historical case, any '\' that is
not part of one of these three sequences is not special and appears in
the output literally.
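
The rows of Table 9.2 can be tried out directly. This is a minimal
sketch, assuming a POSIX shell and a 'gawk' that is not running in
'--posix' mode; the target string "abc", the pattern '/b/', and the
shell variable 'out' are made up for illustration. Every replacement is
typed with doubled backslashes, and each comment notes what 'sub()'
sees at the runtime level:

```shell
# The comments avoid apostrophes because the whole program sits inside
# shell single quotes.
out=$(gawk 'BEGIN {
    s = "abc"; sub(/b/, "\\&", s);     print s   # sees \&   : literal &
    s = "abc"; sub(/b/, "\\\\&", s);   print s   # sees \\&  : \ then the match
    s = "abc"; sub(/b/, "\\\\\\&", s); print s   # sees \\\& : literal \&
    s = "abc"; sub(/b/, "\\q", s);     print s   # sees \q   : literal \q
}')
printf '%s\n' "$out"
```

The four lines printed should be 'a&c', 'a\bc', 'a\&c', and 'a\qc', in
that order.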

'gawk' 3.0 and 3.1 follow these rules for 'sub()' and 'gsub()'. The
POSIX standard took much longer to be revised than was expected. In
addition, the 'gawk' maintainer's proposal was lost during the
standardization process. The final rules are somewhat simpler. The
results are similar except for one case.

The POSIX rules state that '\&' in the replacement string produces a
literal '&', '\\' produces a literal '\', and '\' followed by anything
else is not special; the '\' is placed straight into the output. These
rules are presented in ⇒Table 9.3.

You type      'sub()' sees   'sub()' generates
--------      ------------   -----------------
'\\\\\\&'     '\\\&'         A literal '\&'
'\\\\&'       '\\&'          A literal '\', followed by the matched text
'\\&'         '\&'           A literal '&'
'\\q'         '\q'           A literal '\q'
'\\\\'        '\\'           '\'

Table 9.3: POSIX rules for 'sub()' and 'gsub()'

The only case where the difference is noticeable is the last one:
'\\\\' is seen as '\\' and produces '\' instead of '\\'.
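
That one difference can be observed by running the same program text
with and without '--posix'. The following is a sketch only, assuming a
POSIX shell, 'gawk' version 4.0.1 or later, and an environment in which
'POSIXLY_CORRECT' is not set; the shell variables 'prog', 'default', and
'posix' are illustrative names:

```shell
# Same program text run twice; only the --posix option differs.
prog='BEGIN { s = "abc"; sub(/b/, "\\\\", s); print s }'
default=$(gawk "$prog")
posix=$(gawk --posix "$prog")
printf 'default: %s\nposix:   %s\n' "$default" "$posix"
```

Under these assumptions the default run should print 'a\\c' and the
'--posix' run should print 'a\c'.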

Starting with version 3.1.4, 'gawk' followed the POSIX rules when
'--posix' was specified (⇒Options). Otherwise, it continued to
follow the proposed rules, as that had been its behavior for many years.

When version 4.0.0 was released, the 'gawk' maintainer made the POSIX
rules the default, breaking well over a decade's worth of backward
compatibility.(1) Needless to say, this was a bad idea, and as of
version 4.0.1, 'gawk' resumed its earlier behavior (the proposed rules
of ⇒Table 9.2), and follows the POSIX rules only when '--posix' is
given.

The rules for 'gensub()' are considerably simpler. At the runtime
level, whenever 'gawk' sees a '\', if the following character is a
digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise, no matter
what character follows the '\', it appears in the generated text and the
'\' does not, as shown in ⇒Table 9.4.

You type      'gensub()' sees   'gensub()' generates
--------      ---------------   --------------------
'&'           '&'               The matched text
'\\&'         '\&'              A literal '&'
'\\\\'        '\\'              A literal '\'
'\\\\&'       '\\&'             A literal '\', then the matched text
'\\\\\\&'     '\\\&'            A literal '\&'
'\\q'         '\q'              A literal 'q'

Table 9.4: Escape sequence processing for 'gensub()'
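
The 'gensub()' rules, including the digit case described in the text,
can be exercised in the same way. This is a minimal sketch, assuming a
POSIX shell with 'gawk' on the search path; the sample string "abc",
the patterns, and the shell variable 'out' are chosen only for
illustration:

```shell
# gensub() returns the new string and leaves its target unchanged,
# so each result can be printed directly.
out=$(gawk 'BEGIN {
    print gensub(/b/, "[&]", 1, "abc")          # sees &    : matched text
    print gensub(/b/, "\\&", 1, "abc")          # sees \&   : literal &
    print gensub(/b/, "\\\\", 1, "abc")         # sees \\   : one literal backslash
    print gensub(/b/, "\\\\&", 1, "abc")        # sees \\&  : backslash, then the match
    print gensub(/b/, "\\q", 1, "abc")          # sees \q   : plain q
    print gensub(/(a)(b)/, "\\2\\1", 1, "abc")  # sees \2\1 : groups swapped
}')
printf '%s\n' "$out"
```

The six lines printed should be 'a[b]c', 'a&c', 'a\c', 'a\bc', 'aqc',
and 'bac'. Unlike 'sub()', a '\' before an ordinary character such as
'q' disappears from the output.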

Because of the complexity of the lexical- and runtime-level
processing and the special cases for 'sub()' and 'gsub()', we recommend
the use of 'gawk' and 'gensub()' when you have to do substitutions.

---------- Footnotes ----------

(1) This was rather naive of him, despite there being a note in this
minor node indicating that the next major version would move to the
POSIX rules.