gawk: awk split records

4.1.1 Record Splitting with Standard 'awk'
------------------------------------------

Records are separated by a character called the "record separator".  By
default, the record separator is the newline character.  This is why
records are, by default, single lines.  To use a different character for
the record separator, simply assign that character to the predefined
variable 'RS'.

   Like any other variable, the value of 'RS' can be changed in the
'awk' program with the assignment operator, '=' (⇒Assignment Ops).  The
new record-separator character should be enclosed in quotation marks,
which indicate a string constant.  Often, the right time to do this is
at the beginning of execution, before any input is processed, so that
the very first record is read with the proper separator.  To do this,
use the special 'BEGIN' pattern (⇒BEGIN/END).  For example:

     awk 'BEGIN { RS = "u" }
          { print $0 }' mail-list

changes the value of 'RS' to 'u', before reading any input.  The new
value is a string whose first character is the letter "u"; as a result,
records are separated by the letter "u".  Then the input file is read,
and the second rule in the 'awk' program (the action with no pattern)
prints each record.  Because each 'print' statement adds a newline at
the end of its output, this 'awk' program copies the input with each 'u'
changed to a newline.  Here are the results of running the program on
'mail-list':

     $ awk 'BEGIN { RS = "u" }
     >      { print $0 }' mail-list
     -| Amelia       555-5553     amelia.zodiac
     -| sq
     -| e@gmail.com    F
     -| Anthony      555-3412     anthony.assert
     -| ro@hotmail.com   A
     -| Becky        555-7685     becky.algebrar
     -| m@gmail.com      A
     -| Bill         555-1675     bill.drowning@hotmail.com       A
     -| Broderick    555-0542     broderick.aliq
     -| otiens@yahoo.com R
     -| Camilla      555-2912     camilla.inf
     -| sar
     -| m@skynet.be     R
     -| Fabi
     -| s       555-1234     fabi
     -| s.
     -| ndevicesim
     -| s@
     -| cb.ed
     -|     F
     -| J
     -| lie        555-6699     j
     -| lie.perscr
     -| tabor@skeeve.com   F
     -| Martin       555-6480     martin.codicib
     -| s@hotmail.com    A
     -| Sam
     -| el       555-3430     sam
     -| el.lanceolis@sh
     -| .ed
     -|         A
     -| Jean-Pa
     -| l    555-2127     jeanpa
     -| l.campanor
     -| m@ny
     -| .ed
     -|      R
     -|

Note that the entry for the name 'Bill' is not split.  In the original
data file (⇒Sample Data Files), the line looks like this:

     Bill         555-1675     bill.drowning@hotmail.com       A

It contains no 'u', so there is no reason to split the record, unlike
the others, which each have one or more occurrences of the 'u'.  In
fact, this record is treated as part of the previous record; the newline
separating them in the output is the original newline in the data file,
not the one added by 'awk' when it printed the record!

   Another way to change the record separator is on the command line,
using the variable-assignment feature (⇒Other Arguments):

     awk '{ print $0 }' RS="u" mail-list

This sets 'RS' to 'u' before processing 'mail-list'.
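As a rough illustration of when such a command-line assignment takes
effect, the following sketch prints the state of 'RS' from a 'BEGIN'
rule and then reads a file that follows an 'RS=":"' argument.  The
temporary file, its contents, and the ':' separator are assumptions
made up for this example; they are not part of 'mail-list'.

```shell
# Hypothetical sample data: three ':'-separated records, no trailing newline.
tmp=$(mktemp)
printf 'a:b:c' > "$tmp"

# The RS=":" argument is processed when awk reaches it in the argument
# list, so the BEGIN rule still sees the default newline separator.
out=$(awk 'BEGIN { print "default RS in BEGIN: " (RS == "\n") }
           { print NR ": " $0 }' RS=":" "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

The 'BEGIN' rule should report '1' (true), because the assignment has
not yet been made at that point, and the file is then read as three
records.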

   Using an alphabetic character such as 'u' for the record separator is
highly likely to produce strange results.  Using an unusual character
such as '/' is more likely to produce correct behavior in the majority
of cases, but there are no guarantees.  The moral is: Know Your Data.

   When using regular characters as the record separator, there is one
unusual case that occurs when 'gawk' is being fully POSIX-compliant
(⇒Options).  Then, the following (extreme) pipeline prints a
surprising '1':

     $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
     -| 1

   There is one field, consisting of a newline.  The value of the
built-in variable 'NF' is the number of fields in the current record.
(In the normal case, 'gawk' treats the newline as whitespace, printing
'0' as the result.  Most other versions of 'awk' also act this way.)

   Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
'RS'.  (d.c.)
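A minimal sketch of this end-of-file behavior, assuming a made-up
three-record input in which the last record is not followed by the
separator:

```shell
# Hypothetical input: the final record "c" is not followed by ':'.
out=$(printf 'a:b:c' | awk 'BEGIN { RS = ":" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

The third record is still delivered to the rule, even though the input
ends without a closing ':'.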

   The empty string '""' (a string without any characters) has a special
meaning as the value of 'RS'.  It means that records are separated by
one or more blank lines and nothing else.  See ⇒Multiple Line for
more details.
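The following sketch shows the effect of an empty 'RS' on a small,
invented input.  The two address blocks are purely illustrative data,
and the default field separator is assumed.

```shell
# Hypothetical input: two address blocks separated by two blank lines.
out=$(printf 'Jane Doe\n123 Main St\n\n\nJohn Smith\n456 Elm St\n' |
      awk 'BEGIN { RS = "" }
           { print NR ": starts with " $1 ", " NF " fields" }')
printf '%s\n' "$out"
```

Each block of consecutive nonblank lines arrives as a single record, so
the rule runs twice rather than once per line.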

   If you change the value of 'RS' in the middle of an 'awk' run, the
new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.
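As a small sketch of changing 'RS' partway through a run, the program
below switches the separator to ';' while processing the first record.
The input text is an assumption chosen only to make the change visible.

```shell
# Hypothetical input: one newline-terminated line, then ';'-separated text.
# Record 1 is read with the default RS; the rule then changes RS, so the
# remaining input is split on ';' instead of on newlines.
out=$(printf 'a b\nc;d;e' |
      awk 'NR == 1 { RS = ";" }
           { print NR ": " $0 }')
printf '%s\n' "$out"
```

The first record is left intact as 'a b'; only the records read after
the assignment are delimited by ';'.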

   After the end of the record has been determined, 'gawk' sets the
variable 'RT' to the text in the input that matched 'RS'.
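Because 'RT' is specific to 'gawk', the sketch below runs only when a
'gawk' executable is found; the input string and the fallback message
are assumptions made for the sake of the example.

```shell
# Hypothetical input in which every record is followed by ':'.
if command -v gawk >/dev/null 2>&1; then
    # Print each record along with the text that terminated it.
    out=$(printf 'a:b:c:' |
          gawk 'BEGIN { RS = ":" } { printf "%s [%s]\n", $0, RT }')
else
    out="gawk not installed"
fi
printf '%s\n' "$out"
```

With 'gawk' available, each record is shown next to the ':' that ended
it.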