gawk: Splitting By Content
1
1 4.7 Defining Fields by Content
1 ==============================
1
1 This minor node discusses an advanced feature of 'gawk'. If you are a
1 novice 'awk' user, you might want to skip it on the first reading.
1
1 Normally, when using 'FS', 'gawk' defines the fields as the parts of
1 the record that occur in between each field separator. In other words,
1 'FS' defines what a field _is not_, instead of what a field _is_.
1 However, there are times when you really want to define the fields by
1 what they are, and not by what they are not.
1
1 The most notorious such case is so-called "comma-separated values"
1 (CSV) data. Many spreadsheet programs, for example, can export their
1 data into text files, where each record is terminated with a newline,
1 and fields are separated by commas. If commas only separated the data,
1 there wouldn't be an issue. The problem comes when one of the fields
1 contains an _embedded_ comma. In such cases, most programs embed the
1 field in double quotes.(1) So, we might have data like this:
1
1 Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
1
1 The 'FPAT' variable offers a solution for cases like this. The value
1 of 'FPAT' should be a string that provides a regular expression. This
1 regular expression describes the contents of each field.
1
1 In the case of CSV data as presented here, each field is either
1 "anything that is not a comma," or "a double quote, anything that is not
1 a double quote, and a closing double quote." If written as a regular
1 expression constant (⇒Regexp), we would have
1 '/([^,]+)|("[^"]+")/'. Writing this as a string requires us to escape
1 the double quotes, leading to:
1
1 FPAT = "([^,]+)|(\"[^\"]+\")"
1
1 Putting this to use, here is a simple program to parse the data:
1
1 BEGIN {
1 FPAT = "([^,]+)|(\"[^\"]+\")"
1 }
1
1 {
1 print "NF = ", NF
1 for (i = 1; i <= NF; i++) {
1 printf("$%d = <%s>\n", i, $i)
1 }
1 }
1
1 When run, we get the following:
1
1 $ gawk -f simple-csv.awk addresses.csv
1 NF = 7
1 $1 = <Robbins>
1 $2 = <Arnold>
1 $3 = <"1234 A Pretty Street, NE">
1 $4 = <MyTown>
1 $5 = <MyState>
1 $6 = <12345-6789>
1 $7 = <USA>
1
1 Note the embedded comma in the value of '$3'.
1
1 A straightforward improvement when processing CSV data of this sort
1 would be to remove the quotes when they occur, with something like this:
1
1 if (substr($i, 1, 1) == "\"") {
1 len = length($i)
1 $i = substr($i, 2, len - 2) # Get text within the two quotes
1 }
1
1 As with 'FS', the 'IGNORECASE' variable (⇒User-modified)
1 affects field splitting with 'FPAT'.
1
1 Assigning a value to 'FPAT' overrides field splitting with 'FS' and
1 with 'FIELDWIDTHS'.
1
1 NOTE: Some programs export CSV data that contains embedded newlines
1 between the double quotes. 'gawk' provides no way to deal with
1 this. Even though a formal specification for CSV data exists,
1 there isn't much more to be done; the 'FPAT' mechanism provides an
1 elegant solution for the majority of cases, and the 'gawk'
1 developers are satisfied with that.
1
1 As written, the regexp used for 'FPAT' requires that each field
1 contain at least one character. A straightforward modification
1 (changing the first '+' to '*') allows fields to be empty:
1
1 FPAT = "([^,]*)|(\"[^\"]+\")"
1
1 Finally, the 'patsplit()' function makes the same functionality
1 available for splitting regular strings (⇒String Functions).
1
1 ---------- Footnotes ----------
1
1 (1) The CSV format lacked a formal standard definition for many
1 years. RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt) standardizes the
1 most common practices.
1