gawk: gawk split records

4.1.2 Record Splitting with 'gawk'
----------------------------------

When using 'gawk', the value of 'RS' is not limited to a one-character
string.  It can be any regular expression (⇒Regexp).  (c.e.)  In
general, each record ends at the next string that matches the regular
expression; the next record starts at the end of the matching string.
This general rule is actually at work in the usual case, where 'RS'
contains just a newline: a record ends at the beginning of the next
matching string (the next newline in the input), and the following
record starts just after the end of this string (at the first character
of the following line).  The newline, because it matches 'RS', is not
part of either record.

   When 'RS' is a single character, 'RT' contains the same single
character.  However, when 'RS' is a regular expression, 'RT' contains
the actual input text that matched the regular expression.

   If the input file ends without any text matching 'RS', 'gawk' sets
'RT' to the null string.

   The following example illustrates both of these features.  It sets
'RS' equal to a regular expression that matches either a newline or a
series of one or more uppercase letters with optional leading and/or
trailing whitespace:

     $ echo record 1 AAAA record 2 BBBB record 3 |
     > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
     >             { print "Record =", $0,"and RT = [" RT "]" }'
     -| Record = record 1 and RT = [ AAAA ]
     -| Record = record 2 and RT = [ BBBB ]
     -| Record = record 3 and RT = [
     -| ]

The square brackets delineate the contents of 'RT', letting you see the
leading and trailing whitespace.  The final value of 'RT' is a newline.
See ⇒Simple Sed for a more useful example of 'RS' as a regexp and
'RT'.

   If you set 'RS' to a regular expression that allows optional trailing
text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation
constraints, that 'gawk' may match the leading part of the regular
expression, but not the trailing part, particularly if the input text
that could match the trailing part is fairly long.  'gawk' attempts to
avoid this problem, but currently, there's no guarantee that this will
never happen.

     NOTE: Remember that in 'awk', the '^' and '$' anchor metacharacters
     match the beginning and end of a _string_, and not the beginning
     and end of a _line_.  As a result, something like 'RS =
     "^[[:upper:]]"' can only match at the beginning of a file.  This is
     because 'gawk' views the input file as one long string that happens
     to contain newline characters.  It is thus best to avoid anchor
     metacharacters in the value of 'RS'.

   The use of 'RS' as a regular expression and the 'RT' variable are
'gawk' extensions; they are not available in compatibility mode
(⇒Options).  In compatibility mode, only the first character of the
value of 'RS' determines the end of the record.

                      'RS = "\0"' Is Not Portable

   There are times when you might want to treat an entire data file as a
single record.  The only way to make this happen is to give 'RS' a value
that you know doesn't occur in the input file.  This is hard to do in a
general way, such that a program always works for arbitrary input files.

   You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good value to
use for 'RS' in this case:

     BEGIN { RS = "\0" }  # whole file becomes one record?

   'gawk' in fact accepts this, and uses the NUL character for the
record separator.  This works for certain special files, such as
'/proc/PID/environ' on GNU/Linux systems (for example,
'/proc/self/environ'), where the NUL character is in fact the record
separator.  However, this usage is _not_ portable to most other 'awk'
implementations.

   Almost all other 'awk' implementations(1) store strings internally as
C-style strings.  C strings use the NUL character as the string
terminator.  In effect, this means that 'RS = "\0"' is the same as 'RS =
""'.  (d.c.)

   It happens that recent versions of 'mawk' can use the NUL character
as a record separator.  However, this is a special case: 'mawk' does not
allow embedded NUL characters in strings.  (This may change in a future
version of 'mawk'.)

   See ⇒Readfile Function for an interesting way to read whole files.
If you are using 'gawk', see ⇒Extension Sample Readfile for
another option.

   ---------- Footnotes ----------

   (1) At least that we know about.