gawk: Multiple Line


4.9 Multiple-Line Records
=========================

In some databases, a single line cannot conveniently hold all the
information in one entry.  In such cases, you can use multiline records.
The first step in doing this is to choose your data format.

   One technique is to use an unusual character or string to separate
records.  For example, you could use the formfeed character (written
'\f' in 'awk', as in C) to separate them, making each record a page of
the file.  To do this, just set the variable 'RS' to '"\f"' (a string
containing the formfeed character).  Any other character could equally
well be used, as long as it won't be part of the data in a record.

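   As a minimal sketch of the formfeed technique, the following shell
command pipes two made-up "pages" into 'awk'.  The sample text and the
output format are assumptions chosen for illustration:

```shell
# Two pages separated by a formfeed character (sample data is made up).
printf 'page one\nline two\fpage two\n' |
awk 'BEGIN { RS = "\f" }    # a formfeed ends each record
     { print NR ": " NF " fields" }'
```

   Each page becomes one record, so this should print '1: 4 fields'
followed by '2: 2 fields'; the newline inside a page still separates
fields under the default 'FS'.
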
   Another technique is to have blank lines separate records.  By a
special dispensation, an empty string as the value of 'RS' indicates
that records are separated by one or more blank lines.  When 'RS' is set
to the empty string, each record always ends at the first blank line
encountered.  The next record doesn't start until the first nonblank
line that follows.  No matter how many blank lines appear in a row, they
all act as one record separator.  (Blank lines must be completely empty;
lines that contain only whitespace do not count.)

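   The following sketch illustrates these rules with made-up input:
three empty lines in a row act as one separator, while a line holding a
single space does not end a record.  The array name 'lines' is
hypothetical:

```shell
# Input: "a", "b", three empty lines, "c", a line with one space, "d".
printf 'a\nb\n\n\n\nc\n \nd\n' |
awk 'BEGIN { RS = "" }
     {
         n = split($0, lines, "\n")    # count the lines in this record
         print "record " NR " has " n " line(s)"
     }'
```

   This should report two records: the first with 2 lines, and the
second with 3 lines, because the whitespace-only line stays inside the
record.
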
   You can achieve nearly the same effect as 'RS = ""' by assigning the
string '"\n\n+"' to 'RS'.  This regexp matches the newline at the end of
the record and one or more blank lines after the record.  In addition, a
regular expression always matches the longest possible sequence when
there is a choice (⇒Leftmost Longest).  So, the next record doesn't
start until the first nonblank line that follows--no matter how many
blank lines appear in a row, they are considered one record separator.

   However, there is an important difference between 'RS = ""' and 'RS =
"\n\n+"'.  In the first case, leading newlines in the input data file
are ignored, and if a file ends without extra blank lines after the last
record, the final newline is removed from the record.  In the second
case, this special processing is not done.  (d.c.)

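   The difference can be seen with a small experiment.  This is only a
sketch: it assumes 'gawk' is installed under that name, and the input is
made up.  The input begins with two empty lines and ends with a single
newline:

```shell
# RS = "": leading newlines are skipped, final newline is removed.
printf '\n\na\n\nb\n' |
gawk 'BEGIN { RS = "" }      { print NR ": length " length($0) }'

# RS = "\n\n+": no special processing at the start or end of the input.
printf '\n\na\n\nb\n' |
gawk 'BEGIN { RS = "\n\n+" } { print NR ": length " length($0) }'
```

   With 'RS = ""', this should report two records, each of length 1.
With the regexp, the leading newlines delimit an empty first record and
the last record keeps its final newline, giving lengths 0, 1, and 2.
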
   Now that the input is separated into records, the second step is to
separate the fields in the records.  One way to do this is to divide
each of the lines into fields in the normal manner.  This happens by
default as the result of a special feature.  When 'RS' is set to the
empty string _and_ 'FS' is set to a single character, the newline
character _always_ acts as a field separator.  This is in addition to
whatever field separations result from 'FS'.(1)

   The original motivation for this special exception was probably to
provide useful behavior in the default case (i.e., 'FS' is equal to
'" "').  This feature can be a problem if you really don't want the
newline character to separate fields, because there is no way to prevent
it.  However, you can work around this by using the 'split()' function
to break up the record manually (⇒String Functions).  If you have
a single-character field separator, you can work around the special
feature in a different way, by making 'FS' into a regexp for that single
character.  For example, if the field separator is a percent character,
instead of 'FS = "%"', use 'FS = "[%]"'.

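   The percent-character workaround can be sketched as follows.  The
two-line sample record is made up, and the example assumes 'gawk' is
installed under that name and is not running in POSIX mode:

```shell
# FS is a single character: newline also separates fields.
printf '%s\n' 'a%b' 'c%d' |
gawk 'BEGIN { RS = ""; FS = "%" }   { print NF }'

# FS is a regexp: only the percent character separates fields.
printf '%s\n' 'a%b' 'c%d' |
gawk 'BEGIN { RS = ""; FS = "[%]" } { print NF }'
```

   The first command should print 4 (fields 'a', 'b', 'c', and 'd').
The second should print 3, because the middle field now contains the
embedded newline.
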
   Another way to separate fields is to put each field on a separate
line: to do this, just set the variable 'FS' to the string '"\n"'.
(This single-character separator matches a single newline.)  A practical
example of a data file organized this way might be a mailing list, where
blank lines separate the entries.  Consider a mailing list in a file
named 'addresses', which looks like this:

     Jane Doe
     123 Main Street
     Anywhere, SE 12345-6789

     John Smith
     456 Tree-lined Avenue
     Smallville, MW 98765-4321
     ...

A simple program to process this file is as follows:

     # addrs.awk --- simple mailing list program

     # Records are separated by blank lines.
     # Each line is one field.
     BEGIN { RS = "" ; FS = "\n" }

     {
         print "Name is:", $1
         print "Address is:", $2
         print "City and State are:", $3
         print ""
     }

   Running the program produces the following output:

     $ awk -f addrs.awk addresses
     -| Name is: Jane Doe
     -| Address is: 123 Main Street
     -| City and State are: Anywhere, SE 12345-6789
     -|
     -| Name is: John Smith
     -| Address is: 456 Tree-lined Avenue
     -| City and State are: Smallville, MW 98765-4321
     -|
     ...

   See ⇒Labels Program for a more realistic program dealing with
address lists.  The following list summarizes how records are split,
based on the value of 'RS'.  ('==' means "is equal to.")

'RS == "\n"'
     Records are separated by the newline character ('\n').  In effect,
     every line in the data file is a separate record, including blank
     lines.  This is the default.

'RS == ANY SINGLE CHARACTER'
     Records are separated by each occurrence of the character.
     Multiple successive occurrences delimit empty records.

'RS == ""'
     Records are separated by runs of blank lines.  When 'FS' is a
     single character, then the newline character always serves as a
     field separator, in addition to whatever value 'FS' may have.
     Leading and trailing newlines in a file are ignored.

'RS == REGEXP'
     Records are separated by occurrences of characters that match
     REGEXP.  Leading and trailing matches of REGEXP delimit empty
     records.  (This is a 'gawk' extension; it is not specified by the
     POSIX standard.)

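   As a small illustration of the single-character case in the list
above, the following sketch uses a comma as 'RS' on made-up input that
contains two commas in a row (and no trailing newline):

```shell
# Two successive commas delimit an empty record between "a" and "b".
printf 'a,,b' |
awk 'BEGIN { RS = "," } { print NR ": [" $0 "]" }'
```

   This should print '1: [a]', '2: []', and '3: [b]', showing that the
second record is empty.
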
   If not in compatibility mode (⇒Options), 'gawk' sets 'RT' to
the input text that matched the value specified by 'RS'.  But if the
input file ended without any text that matches 'RS', then 'gawk' sets
'RT' to the null string.

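   The behavior of 'RT' can be observed with a short sketch.  It assumes
'gawk' is installed under that name and is not in compatibility mode;
the input is made up and deliberately ends without a newline:

```shell
# Print each record together with the length of the text that ended it.
printf 'a\n\n\nb' |
gawk 'BEGIN { RS = "\n\n+" }
      { print NR ": record [" $0 "], RT length " length(RT) }'
```

   The first record is terminated by three newlines, so its 'RT' length
should be 3.  The second record runs to the end of the input with no
match for 'RS', so its 'RT' length should be 0.
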
   ---------- Footnotes ----------

   (1) When 'FS' is the null string ('""') or a regexp, this special
feature of 'RS' does not apply.  It does apply to the default field
separator of a single space: 'FS = " "'.
