gawk: Regexp Field Splitting

4.5.2 Using Regular Expressions to Separate Fields
--------------------------------------------------

The previous node discussed the use of single characters or simple
strings as the value of 'FS'.  More generally, the value of 'FS' may be
a string containing any regular expression.  In this case, each match in
the record for the regular expression separates fields.  For example,
the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a
space and a TAB into a field separator.  ('\t' is an "escape sequence"
that stands for a TAB; see ⇒Escape Sequences for the complete list
of similar escape sequences.)
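
   As a quick check of the assignment above, the following sketch feeds
'awk' one made-up record whose fields are joined by a comma, a space,
and a TAB, and then prints the field count and the second field.  The
sample data and the shell variable 'out' are illustrative only.

```shell
# Hypothetical sample record: three fields joined by comma, space, TAB.
# printf turns each \t in its format string into a real TAB character.
out=$(printf 'a, \tb, \tc\n' |
      awk 'BEGIN { FS = ", \t" } { print NF; print $2 }')
echo "$out"
```

This should print '3' and then 'b', because each comma-space-TAB
sequence is treated as one separator.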

   For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used.  'FS' can be
set to '"[ ]"' (left bracket, space, right bracket).  This regular
expression matches a single space and nothing else (⇒Regexp).
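
   To see what "single spaces" means in practice, here is a small
sketch using a made-up record with two adjacent spaces between 'a' and
'b'.  With 'FS' set to '"[ ]"', each space is its own separator, so an
empty field appears between the two spaces.  The variable 'out' is
illustrative only.

```shell
# Two spaces between "a" and "b", one space between "b" and "c".
# Each single space separates fields, so $2 is the empty string.
out=$(echo 'a  b c' |
      awk 'BEGIN { FS = "[ ]" } { print NF; printf "[%s]\n", $2 }')
echo "$out"
```

This should print '4' and then '[]': the fields are 'a', the empty
string, 'b', and 'c'.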

   There is an important difference between the two cases of 'FS = " "'
(a single space) and 'FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, TABs, or newlines).  For both values of 'FS', fields
are separated by "runs" (multiple adjacent occurrences) of spaces, TABs,
and/or newlines.  However, when the value of 'FS' is '" "', 'awk' first
strips leading and trailing whitespace from the record and then decides
where the fields are.  For example, the following pipeline prints 'b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints 'a' (note the extra spaces around each
letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
     >                                  { print $2 }'
     -| a

In this case, the first field is null, or empty.
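
   Counting fields makes the difference easier to see.  The sketch below
runs the same record through both settings and reports 'NF' for each.
The shell variable names are illustrative only.

```shell
# Same record as in the second pipeline above, kept in a shell variable.
line=' a  b  c  d '

# Default FS (a single space): leading and trailing whitespace is stripped.
default_nf=$(echo "$line" | awk '{ print NF }')

# Regexp FS: the leading and trailing runs of spaces act as separators too.
regexp_nf=$(echo "$line" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')

echo "default FS: $default_nf fields; regexp FS: $regexp_nf fields"
```

With the default 'FS' the record has four fields.  With the regexp
there are six, because the leading and trailing runs of spaces each
delimit an empty field.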

   The stripping of leading and trailing whitespace also comes into play
whenever '$0' is recomputed.  For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first 'print' statement prints the record as it was read, with
leading whitespace intact.  The assignment to '$2' rebuilds '$0' by
concatenating '$1' through '$NF', separated by the value of 'OFS'
(which is a space by default).  Because the leading whitespace was
ignored when finding '$1', it is not part of the new '$0'.  Finally, the
last 'print' statement prints the new '$0'.
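
   The role of 'OFS' in rebuilding the record is easier to see when it
is not a space.  The following sketch sets 'OFS' to a hyphen (an
arbitrary choice for illustration) and forces '$0' to be rebuilt by
assigning '$1' to itself.

```shell
# Assigning to any field rebuilds $0 from $1..$NF joined by OFS.
# The leading whitespace was never part of $1, so it does not come back.
out=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$out"
```

This should print 'a-b-c-d'.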

   There is an additional subtlety to be aware of when using regular
expressions for field splitting.  It is not well specified in the POSIX
standard, or anywhere else, what '^' means when splitting fields.  Does
the '^' match only at the beginning of the entire record?  Or is each
field separator a new string?  It turns out that different 'awk'
versions answer this question differently, and you should not rely on
any specific behavior in your programs.  (d.c.)

   As a point of information, BWK 'awk' allows '^' to match only at the
beginning of the record.  'gawk' also works this way.  For example:

     $ echo 'xxAA  xxBxx  C' |
     > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
     >                             printf "-->%s<--\n", $i }'
     -| --><--
     -| -->AA<--
     -| -->xxBxx<--
     -| -->C<--
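
   One way to avoid depending on how '^' behaves in 'FS' is to remove
the leading text with 'sub()' before looking at the fields, and to keep
the anchor out of 'FS' entirely.  The following is only a sketch of that
idea, not an equivalent of the 'gawk' example above: it does not produce
an empty first field.  The variable 'out' is illustrative only.

```shell
# FS is a run of spaces only; the "^x+" anchor lives in sub() instead,
# where "^" unambiguously means the start of the string being edited.
out=$(echo 'xxAA  xxBxx  C' |
      awk -F ' +' '{ sub(/^x+/, "")      # changes $0, which is then re-split
                     for (i = 1; i <= NF; i++)
                         printf "-->%s<--\n", $i }')
echo "$out"
```

This should print '-->AA<--', '-->xxBxx<--', and '-->C<--', one per
line.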