gawk: gawk split records
1
1 4.1.2 Record Splitting with 'gawk'
1 ----------------------------------
1
1 When using 'gawk', the value of 'RS' is not limited to a one-character
1 string. It can be any regular expression (⇒Regexp). (c.e.) In
1 general, each record ends at the next string that matches the regular
1 expression; the next record starts at the end of the matching string.
1 This general rule is actually at work in the usual case, where 'RS'
1 contains just a newline: a record ends at the beginning of the next
1 matching string (the next newline in the input), and the following
1 record starts just after the end of this string (at the first character
1 of the following line). The newline, because it matches 'RS', is not
1 part of either record.
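As a minimal sketch of this general rule (the sample input and the separator pattern are invented here for illustration), the following sets 'RS' to a regexp that matches either a run of hyphens or a newline, so separators of different lengths all end a record:

```shell
# Illustrative input only: three words separated by runs of hyphens.
# RS matches one or more hyphens, or the trailing newline.
out=$(printf 'alpha--beta---gamma\n' |
      gawk 'BEGIN { RS = "-+|\n" }
            { print NR ": " $0 }')
printf '%s\n' "$out"
```

Each record is the text between two matches; none of the hyphens and not the final newline end up in '$0'.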
1
1 When 'RS' is a single character, 'RT' contains the same single
1 character. However, when 'RS' is a regular expression, 'RT' contains
1 the actual input text that matched the regular expression.
1
1 If the input file ends without any text matching 'RS', 'gawk' sets
1 'RT' to the null string.
1
1 The following example illustrates both of these features. It sets
1 'RS' equal to a regular expression that matches either a newline or a
1 series of one or more uppercase letters with optional leading and/or
1 trailing whitespace:
1
1      $ echo record 1 AAAA record 2 BBBB record 3 |
1      > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
1      >             { print "Record =", $0,"and RT = [" RT "]" }'
1      -| Record = record 1 and RT = [ AAAA ]
1      -| Record = record 2 and RT = [ BBBB ]
1      -| Record = record 3 and RT = [
1      -| ]
1
1 The square brackets delineate the contents of 'RT', letting you see the
1 leading and trailing whitespace. The final value of 'RT' is a newline.
1 See ⇒Simple Sed for a more useful example that uses 'RS' as a regexp
1 together with 'RT'.
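Because 'RT' holds exactly the text that ended each record, printing '$0' followed by 'RT' reproduces the input with its original separators. The sketch below is only an illustration under assumed input (the data and the separator pattern are made up): it uppercases every record while leaving the separators untouched. The input has no final separator, so 'RT' is the null string for the last record.

```shell
# Hypothetical input: items separated by "," or ";" with optional spaces.
# printf (not print) is used so that no extra newline is added;
# RT supplies whatever separator text was actually matched.
out=$(printf 'a, b;c ,d' |
      gawk 'BEGIN { RS = " *[,;] *" }
            { printf "%s%s", toupper($0), RT }')
printf '%s\n' "$out"
```

The result keeps ', ', ';' and ' ,' exactly as they appeared in the input, which a fixed 'ORS' could not do.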
1
1 If you set 'RS' to a regular expression that allows optional trailing
1 text, such as 'RS = "abc(XYZ)?"', then, due to implementation
1 constraints, 'gawk' may match the leading part of the regular
1 expression but not the trailing part, particularly if the input text
1 that could match the trailing part is fairly long. 'gawk' attempts to
1 avoid this problem, but there is currently no guarantee that it will
1 never happen.
1
1 NOTE: Remember that in 'awk', the '^' and '$' anchor metacharacters
1 match the beginning and end of a _string_, and not the beginning
1 and end of a _line_. As a result, something like 'RS =
1 "^[[:upper:]]"' can only match at the beginning of a file. This is
1 because 'gawk' views the input file as one long string that happens
1 to contain newline characters. It is thus best to avoid anchor
1 metacharacters in the value of 'RS'.
1
1 The use of 'RS' as a regular expression and the 'RT' variable are
1 'gawk' extensions; they are not available in compatibility mode
1 (⇒Options). In compatibility mode, only the first character of the
1 value of 'RS' determines the end of the record.
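The difference can be seen by running the same program with and without compatibility mode. This is a small sketch with made-up input; it assumes 'gawk' is invoked with its '--traditional' option to select compatibility mode.

```shell
# Hypothetical input with no trailing newline.
prog='BEGIN { RS = "12" } { print NR ": " $0 }'

# Default mode: the two-character RS is treated as the regexp "12".
regex=$(printf 'a1b12c' | gawk "$prog")

# Compatibility mode: only the first character, "1", ends a record.
compat=$(printf 'a1b12c' | gawk --traditional "$prog")

printf 'default:\n%s\ncompat:\n%s\n' "$regex" "$compat"
```

In the default mode the input splits into 'a1b' and 'c'; in compatibility mode it splits into 'a', 'b', and '2c'.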
1
1 'RS = "\0"' Is Not Portable
1
1 There are times when you might want to treat an entire data file as a
1 single record. The only way to make this happen is to give 'RS' a value
1 that you know doesn't occur in the input file. This is hard to do in a
1 general way, such that a program always works for arbitrary input files.
1
1 You might think that for text files, the NUL character, whose bits
1 are all zero, is a good value to use for 'RS' in this case:
1
1      BEGIN { RS = "\0" } # whole file becomes one record?
1
1 'gawk' accepts this, and uses the NUL character as the record
1 separator. This works for certain special files, such as
1 '/proc/self/environ' on GNU/Linux systems, where the NUL character
1 really is the record separator. However, this usage is _not_ portable
1 to most other 'awk' implementations.
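A minimal 'gawk'-only sketch of NUL-separated input follows. The data is simulated with 'printf' rather than read from a real special file, and the variable names in it are purely illustrative.

```shell
# Simulated NUL-terminated entries, in the style of an environment dump.
# \000 in the printf format emits a NUL byte; "\0" in the awk string
# is the NUL character used as the record separator.
out=$(printf 'HOME=/root\000SHELL=/bin/sh\000' |
      gawk 'BEGIN { RS = "\0" }
            { print NR ": " $0 }')
printf '%s\n' "$out"
```

Each NUL-terminated entry becomes one record. As described here, other 'awk' implementations are likely to behave differently on the same program.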
1
1 Almost all other 'awk' implementations(1) store strings internally as
1 C-style strings. C strings use the NUL character as the string
1 terminator. In effect, this means that 'RS = "\0"' is the same as 'RS =
1 ""'. (d.c.)
1
1 It happens that recent versions of 'mawk' can use the NUL character
1 as a record separator. However, this is a special case: 'mawk' does not
1 allow embedded NUL characters in strings. (This may change in a future
1 version of 'mawk'.)
1
1 See ⇒Readfile Function for an interesting way to read whole files.
1 If you are using 'gawk', see ⇒Extension Sample Readfile for
1 another option.
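One commonly used 'gawk'-only idiom for this, shown as a sketch and not necessarily the code in the sections referenced above, sets 'RS' to the regexp '^$'. That pattern can match only when the input is completely empty, so for any nonempty file no separator is ever found and the whole file arrives as a single record. This is a deliberate exception to the advice about avoiding anchors in 'RS'. The input below is made up for illustration.

```shell
# Two-line sample input; with RS = "^$" it is read as one record.
out=$(printf 'line1\nline2\n' |
      gawk 'BEGIN { RS = "^$" }
            { print NR, length($0) }')
printf '%s\n' "$out"
```

The single record is 12 characters long because both newlines are kept inside '$0'.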
1
1 ---------- Footnotes ----------
1
1 (1) At least that we know about.
1