gawk: Regexp Field Splitting

4.5.2 Using Regular Expressions to Separate Fields
--------------------------------------------------

The previous node discussed the use of single characters or simple
strings as the value of 'FS'. More generally, the value of 'FS' may be
a string containing any regular expression. In this case, each match in
the record for the regular expression separates fields. For example,
the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a
space and a TAB into a field separator. ('\t' is an "escape sequence"
that stands for a TAB; ⇒Escape Sequences, for the complete list
of similar escape sequences.)
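As a quick check of this assignment, the following sketch splits a
made-up record whose fields are separated by a comma, a space, and a
TAB. The sample data and the shell variable name 'second' are
illustrative assumptions, not part of the original example.

```shell
# Hypothetical sample record: three fields separated by comma, space, TAB.
# printf expands '\t' in its format string into a real TAB character.
second=$(printf 'alpha, \tbeta, \tgamma\n' |
         awk 'BEGIN { FS = ", \t" } { print $2 }')
echo "$second"
```

Because the whole three-character sequence is the separator, a comma
that is not followed by a space and a TAB would stay inside its field.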

For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used. 'FS' can be
set to '"[ ]"' (left bracket, space, right bracket). This regular
expression matches a single space and nothing else (⇒Regexp).
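A minimal sketch of the difference this makes, assuming a made-up
record with two adjacent spaces; the variable names 'count_bracket' and
'count_default' are only for illustration:

```shell
# Hypothetical record with two spaces between 'a' and 'b'.
# With FS = "[ ]" every single space is a separator, so the two adjacent
# spaces delimit an empty field between them.
count_bracket=$(printf 'a  b\n' | awk 'BEGIN { FS = "[ ]" } { print NF }')

# With the default FS, the run of spaces counts as one separator.
count_default=$(printf 'a  b\n' | awk '{ print NF }')

echo "FS = \"[ ]\": NF = $count_bracket"
echo "default FS: NF = $count_default"
```

The first command should report three fields ('a', an empty field, and
'b'), the second only two.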

There is an important difference between the two cases of 'FS = " "'
(a single space) and 'FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, TABs, or newlines). For both values of 'FS', fields
are separated by "runs" (multiple adjacent occurrences) of spaces, TABs,
and/or newlines. However, when the value of 'FS' is '" "', 'awk' first
strips leading and trailing whitespace from the record and then decides
where the fields are. For example, the following pipeline prints 'b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints 'a' (note the extra spaces around each
letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
     >                                  { print $2 }'
     -| a

In this case, the first field is null, or empty.
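Counting the fields makes the empty ones visible. The sketch below
feeds the same record to both settings of 'FS'; the shell variables
'record', 'nf_default', and 'nf_regexp' are names chosen here for
illustration.

```shell
# Same record for both commands: one leading and one trailing space.
record=' a b c d '

# Default FS (a single space): leading and trailing whitespace is
# stripped before the record is split.
nf_default=$(printf '%s\n' "$record" | awk '{ print NF }')

# Regexp FS: the leading and trailing spaces each delimit an empty field.
nf_regexp=$(printf '%s\n' "$record" |
            awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')

echo "default FS: NF = $nf_default"
echo "regexp FS:  NF = $nf_regexp"
```

With the default 'FS' there are four fields; with the regexp there are
six, because both '$1' and '$NF' are empty.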

The stripping of leading and trailing whitespace also comes into play
whenever '$0' is recomputed. For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first 'print' statement prints the record as it was read, with
leading whitespace intact. The assignment to '$2' rebuilds '$0' by
concatenating '$1' through '$NF' together, separated by the value of
'OFS' (which is a space by default). Because the leading whitespace was
ignored when finding '$1', it is not part of the new '$0'. Finally, the
last 'print' statement prints the new '$0'.
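The role of 'OFS' in the rebuilt record is easier to see with a
non-default value. This is a small sketch assuming a made-up record
and '-' as the output separator; 'rebuilt' is just an illustrative
variable name.

```shell
# Hypothetical record with three leading spaces.
# Assigning to any field (here $1 = $1) makes awk rebuild $0 from the
# fields, joined by OFS. The leading whitespace belongs to no field,
# so it does not appear in the rebuilt record.
rebuilt=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$rebuilt"
```

The result is 'a-b-c-d': the original separators and the leading
whitespace are both gone.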

There is an additional subtlety to be aware of when using regular
expressions for field splitting. It is not well specified in the POSIX
standard, or anywhere else, what '^' means when splitting fields. Does
the '^' match only at the beginning of the entire record? Or is each
field separator a new string? It turns out that different 'awk'
versions answer this question differently, and you should not rely on
any specific behavior in your programs. (d.c.)

As a point of information, BWK 'awk' allows '^' to match only at the
beginning of the record. 'gawk' also works this way. For example:

     $ echo 'xxAA xxBxx C' |
     > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
     >                             printf "-->%s<--\n", $i }'
     -| --><--
     -| -->AA<--
     -| -->xxBxx<--
     -| -->C<--

Here the 'xx' at the start of the record matches '^x+' and so delimits
an empty first field, whereas the 'xx' at the start of 'xxBxx' is not
treated as a separator and remains part of the third field.
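One way to avoid depending on how a given 'awk' treats '^' in 'FS' is
to keep the anchor out of 'FS' altogether. The following is only a
sketch of that idea, not a technique taken from this node: it removes
the leading run of 'x' with 'sub()' and then splits on runs of spaces
with 'split()'. The array name 'parts' and the shell variable 'fields'
are illustrative.

```shell
# Hypothetical workaround: strip the leading run of 'x' with sub(), where
# '^' anchors at the start of the record, then split the remainder on
# runs of spaces only, so no anchor is needed in the separator.
fields=$(echo 'xxAA xxBxx C' |
         awk '{ sub(/^x+/, "")
                n = split($0, parts, / +/)
                for (i = 1; i <= n; i++)
                    printf "-->%s<--\n", parts[i] }')
echo "$fields"
```

Unlike the 'gawk' example, this produces no empty first field, because
the leading 'xx' is deleted instead of being counted as a separator.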