gawk: Splitting By Content

1 
1 4.7 Defining Fields by Content
1 ==============================
1 
1 This minor node discusses an advanced feature of 'gawk'.  If you are a
1 novice 'awk' user, you might want to skip it on the first reading.
1 
1    Normally, when using 'FS', 'gawk' defines the fields as the parts of
1 the record that occur in between each field separator.  In other words,
1 'FS' defines what a field _is not_, instead of what a field _is_.
1 However, there are times when you really want to define the fields by
1 what they are, and not by what they are not.
1 
1    The most notorious such case is so-called "comma-separated values"
1 (CSV) data.  Many spreadsheet programs, for example, can export their
1 data into text files, where each record is terminated with a newline,
1 and fields are separated by commas.  If commas only separated the data,
1 there wouldn't be an issue.  The problem comes when one of the fields
1 contains an _embedded_ comma.  In such cases, most programs embed the
1 field in double quotes.(1)  So, we might have data like this:
1 
1      Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
1 
1    The 'FPAT' variable offers a solution for cases like this.  The value
1 of 'FPAT' should be a string that provides a regular expression.  This
1 regular expression describes the contents of each field.
1 
1    In the case of CSV data as presented here, each field is either
1 "anything that is not a comma," or "a double quote, anything that is not
1 a double quote, and a closing double quote."  If written as a regular
1 expression constant (⇒Regexp), we would have
1 '/([^,]+)|("[^"]+")/'.  Writing this as a string requires us to escape
1 the double quotes, leading to:
1 
1      FPAT = "([^,]+)|(\"[^\"]+\")"
1 
1    Putting this to use, here is a simple program to parse the data:
1 
1      BEGIN {
1          FPAT = "([^,]+)|(\"[^\"]+\")"
1      }
1 
1      {
1          print "NF = ", NF
1          for (i = 1; i <= NF; i++) {
1              printf("$%d = <%s>\n", i, $i)
1          }
1      }
1 
1    When run, we get the following:
1 
1      $ gawk -f simple-csv.awk addresses.csv
1      NF =  7
1      $1 = <Robbins>
1      $2 = <Arnold>
1      $3 = <"1234 A Pretty Street, NE">
1      $4 = <MyTown>
1      $5 = <MyState>
1      $6 = <12345-6789>
1      $7 = <USA>
1 
1    Note the embedded comma in the value of '$3'.
1 
1    A straightforward improvement when processing CSV data of this sort
1 would be to remove the quotes when they occur, with something like this:
1 
1      if (substr($i, 1, 1) == "\"") {
1          len = length($i)
1          $i = substr($i, 2, len - 2)    # Get text within the two quotes
1      }
1 
1    As with 'FS', the 'IGNORECASE' variable (⇒User-modified)
1 affects field splitting with 'FPAT'.
1 
1    Assigning a value to 'FPAT' overrides field splitting with 'FS' and
1 with 'FIELDWIDTHS'.
1 
1      NOTE: Some programs export CSV data that contains embedded newlines
1      between the double quotes.  'gawk' provides no way to deal with
1      this.  Even though a formal specification for CSV data exists,
1      there isn't much more to be done; the 'FPAT' mechanism provides an
1      elegant solution for the majority of cases, and the 'gawk'
1      developers are satisfied with that.
1 
1    As written, the regexp used for 'FPAT' requires that each field
1 contain at least one character.  A straightforward modification
1 (changing the first '+' to '*') allows fields to be empty:
1 
1      FPAT = "([^,]*)|(\"[^\"]+\")"
1 
1    Finally, the 'patsplit()' function makes the same functionality
1 available for splitting regular strings (⇒String Functions).
1 
1    ---------- Footnotes ----------
1 
1    (1) The CSV format lacked a formal standard definition for many
1 years.  RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt) standardizes the
1 most common practices.
1