gawk: awk split records

4.1.1 Record Splitting with Standard 'awk'
------------------------------------------

Records are separated by a character called the "record separator". By
default, the record separator is the newline character. This is why
records are, by default, single lines. To use a different character for
the record separator, simply assign that character to the predefined
variable 'RS'.

Like any other variable, the value of 'RS' can be changed in the
'awk' program with the assignment operator, '=' (⇒Assignment Ops).
The new record-separator character should be enclosed in
quotation marks, which indicate a string constant. Often, the right
time to do this is at the beginning of execution, before any input is
processed, so that the very first record is read with the proper
separator. To do this, use the special 'BEGIN' pattern (⇒BEGIN/END).
For example:

     awk 'BEGIN { RS = "u" }
          { print $0 }' mail-list

changes the value of 'RS' to 'u', before reading any input. The new
value is a string whose first character is the letter "u"; as a result,
records are separated by the letter "u". Then the input file is read,
and the second rule in the 'awk' program (the action with no pattern)
prints each record. Because each 'print' statement adds a newline at
the end of its output, this 'awk' program copies the input with each 'u'
changed to a newline. Here are the results of running the program on
'mail-list':

     $ awk 'BEGIN { RS = "u" }
     >      { print $0 }' mail-list
     -| Amelia 555-5553 amelia.zodiac
     -| sq
     -| e@gmail.com F
     -| Anthony 555-3412 anthony.assert
     -| ro@hotmail.com A
     -| Becky 555-7685 becky.algebrar
     -| m@gmail.com A
     -| Bill 555-1675 bill.drowning@hotmail.com A
     -| Broderick 555-0542 broderick.aliq
     -| otiens@yahoo.com R
     -| Camilla 555-2912 camilla.inf
     -| sar
     -| m@skynet.be R
     -| Fabi
     -| s 555-1234 fabi
     -| s.
     -| ndevicesim
     -| s@
     -| cb.ed
     -| F
     -| J
     -| lie 555-6699 j
     -| lie.perscr
     -| tabor@skeeve.com F
     -| Martin 555-6480 martin.codicib
     -| s@hotmail.com A
     -| Sam
     -| el 555-3430 sam
     -| el.lanceolis@sh
     -| .ed
     -| A
     -| Jean-Pa
     -| l 555-2127 jeanpa
     -| l.campanor
     -| m@ny
     -| .ed
     -| R
     -|

Note that the entry for the name 'Bill' is not split. In the original
data file (⇒Sample Data Files), the line looks like this:

     Bill 555-1675 bill.drowning@hotmail.com A

It contains no 'u', so there is no reason to split the record, unlike
the others, which each have one or more occurrences of the 'u'. In
fact, this record is treated as part of the previous record; the newline
separating them in the output is the original newline in the data file,
not the one added by 'awk' when it printed the record!

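The effect is easier to see when each record is numbered and bracketed.
The following is a minimal sketch, not part of the original example: it
uses three made-up input lines instead of 'mail-list', and the shell
variable name 'out' is arbitrary. The middle line contains no 'u', so it
ends up embedded, newlines and all, inside the second record:

```shell
# Hypothetical three-line input; only the first and last lines contain 'u'.
# Each record is printed between brackets so embedded newlines are visible.
out=$(printf 'blue\nsky\nsun\n' |
      awk 'BEGIN { RS = "u" }
           { print NR ": [" $0 "]" }')
printf '%s\n' "$out"
```

With a POSIX-compatible 'awk', this should print:

     -| 1: [bl]
     -| 2: [e
     -| sky
     -| s]
     -| 3: [n
     -| ]
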
Another way to change the record separator is on the command line,
using the variable-assignment feature (⇒Other Arguments):

     awk '{ print $0 }' RS="u" mail-list

This sets 'RS' to 'u' before processing 'mail-list'.

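As a rough check that the two approaches behave alike, the sketch below
runs both forms on the same made-up, colon-separated input read from
standard input. The data and the shell variable names ('data',
'from_begin', 'from_args') are illustrative assumptions:

```shell
# Hypothetical colon-separated input, used instead of 'mail-list'.
data='one:two:three'

# RS assigned in a BEGIN rule.
from_begin=$(printf '%s' "$data" | awk 'BEGIN { RS = ":" } { print $0 }')

# RS assigned on the command line; with no file operand, awk reads
# standard input after processing the assignment.
from_args=$(printf '%s' "$data" | awk '{ print $0 }' RS=":")

printf '%s\n' "$from_begin"
[ "$from_begin" = "$from_args" ] && echo "same output"
```

Both command substitutions should yield the three records 'one', 'two',
and 'three', one per line.
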
Using an alphabetic character such as 'u' for the record separator is
highly likely to produce strange results. Using an unusual character
such as '/' is more likely to produce correct behavior in the majority
of cases, but there are no guarantees. The moral is: Know Your Data.

When using regular characters as the record separator, there is one
unusual case that occurs when 'gawk' is being fully POSIX-compliant
(⇒Options). Then, the following (extreme) pipeline prints a
surprising '1':

     $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
     -| 1

There is one field, consisting of a newline. The value of the
built-in variable 'NF' is the number of fields in the current record.
(In the normal case, 'gawk' treats the newline as whitespace, printing
'0' as the result. Most other versions of 'awk' also act this way.)

Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
'RS'. (d.c.)

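A small sketch of this behavior, using made-up input that ends without
a trailing separator (and without a trailing newline); the variable
name 'out' is arbitrary:

```shell
# The input ends with "c": no trailing ':' and no trailing newline.
# The final record is still delivered when end of input is reached.
out=$(printf 'a:b:c' | awk 'BEGIN { RS = ":" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

This should print three records, the last one terminated only by the
end of the input:

     -| 1: a
     -| 2: b
     -| 3: c
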
The empty string '""' (a string without any characters) has a special
meaning as the value of 'RS'. It means that records are separated by
one or more blank lines and nothing else. See ⇒Multiple Line for
more details.

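As an illustration only, the following sketch feeds two made-up
"paragraphs", separated by two blank lines, to a program that sets 'RS'
to the empty string. The 'gsub' call is just a display aid that joins
each multiline record onto one output line:

```shell
# Two hypothetical paragraphs separated by two blank lines.
# gsub() replaces the newlines inside each record with spaces so that
# every record prints on a single line.
out=$(printf 'a\nb\n\n\nc\nd\n' |
      awk 'BEGIN { RS = "" }
           { gsub(/\n/, " "); print NR ": " $0 }')
printf '%s\n' "$out"
```

The expected result is two records, not four:

     -| 1: a b
     -| 2: c d
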
If you change the value of 'RS' in the middle of an 'awk' run, the
new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.

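The following sketch shows one way to observe this. The input is made
up: a first line ending in a newline, followed by colon-separated text.
The program switches 'RS' while processing the first record, which was
already read using the default separator:

```shell
# Hypothetical input: a newline-terminated first line, then
# colon-separated data with no trailing newline.
out=$(printf 'header\nb:c:d' |
      awk 'NR == 1 { RS = ":" }   # applies to records read after this one
           { print NR ": " $0 }')
printf '%s\n' "$out"
```

The first record is unaffected by the change; only the later ones are
split on ':':

     -| 1: header
     -| 2: b
     -| 3: c
     -| 4: d
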
After the end of the record has been determined, 'gawk' sets the
variable 'RT' to the text in the input that matched 'RS'.

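A minimal sketch of inspecting 'RT', assuming 'gawk' is installed (the
block does nothing otherwise, because 'RT' is specific to 'gawk'). The
input is made up and ends without a separator, so 'RT' should be empty
for the last record:

```shell
# RT is set by gawk, so call gawk explicitly and skip the demonstration
# when it is not available.
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'a:b:c' |
          gawk 'BEGIN { RS = ":" }
                { printf "%s [RT=\"%s\"]\n", $0, RT }')
    printf '%s\n' "$out"
fi
```

With 'gawk', this should print:

     -| a [RT=":"]
     -| b [RT=":"]
     -| c [RT=""]
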