gawk: Cut Program

11.2.1 Cutting Out Fields and Columns
-------------------------------------

The 'cut' utility selects, or "cuts," characters or fields from its
standard input and sends them to its standard output.  Fields are
separated by TABs by default, but you may supply a command-line option
to change the field "delimiter" (i.e., the field-separator character).
'cut''s definition of fields is less general than 'awk''s.

A common use of 'cut' might be to pull out just the login names of
logged-on users from the output of 'who'.  For example, the following
pipeline generates a sorted, unique list of the logged-on users:

     who | cut -c1-8 | sort | uniq

The options for 'cut' are:

'-c LIST'
     Use LIST as the list of characters to cut out.  Items within the
     list may be separated by commas, and ranges of characters can be
     separated with dashes.  The list '1-8,15,22-35' specifies
     characters 1 through 8, 15, and 22 through 35.

'-f LIST'
     Use LIST as the list of fields to cut out.

'-d DELIM'
     Use DELIM as the field-separator character instead of the TAB
     character.

'-s'
     Suppress printing of lines that do not contain the field delimiter.
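
As a quick illustration of these options, the following sketch runs the
standard 'cut' utility on a few made-up input lines.  The sample data
and the shell variable names are assumptions chosen for this example;
they do not come from the program described here:

```shell
# Sample input (hypothetical): one line with ':' delimiters, one without.
data=$(printf 'root:x:0\nno delimiter here\n')

# -d and -f: select fields 1 and 3; lines lacking the delimiter pass through.
fields=$(printf '%s\n' "$data" | cut -d: -f1,3)

# -s: additionally suppress lines that lack the delimiter.
suppressed=$(printf '%s\n' "$data" | cut -s -d: -f1,3)

# -c: select characters 1 through 3 and character 5.
chars=$(printf 'abcdefghij\n' | cut -c1-3,5)

printf '%s\n' "$fields" "$suppressed" "$chars"
```

Without '-s', the line that has no delimiter is printed unchanged; with
'-s', it is dropped.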

The 'awk' implementation of 'cut' uses the 'getopt()' library
function (⇒Getopt Function) and the 'join()' library function
(⇒Join Function).

The program begins with a comment describing the options, the library
functions needed, and a 'usage()' function that prints out a usage
message and exits.  'usage()' is called if invalid arguments are
supplied:

     # cut.awk --- implement cut in awk

     # Options:
     #    -f list     Cut fields
     #    -d c        Field delimiter character
     #    -c list     Cut characters
     #
     #    -s          Suppress lines without the delimiter
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr"
         print("usage: cut [-c list] [files...]") > "/dev/stderr"
         exit 1
     }

Next comes a 'BEGIN' rule that parses the command-line options.  It
sets 'FS' to a single TAB character, because that is 'cut''s default
field separator.  The rule then sets the output field separator to be
the same as the input field separator.  A loop using 'getopt()' steps
through the command-line options.  Exactly one of the variables
'by_fields' or 'by_chars' is set to true, to indicate that processing
should be done by fields or by characters, respectively.  When cutting
by characters, the output field separator is set to the null string:

     BEGIN {
         FS = "\t"    # default
         OFS = FS
         while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
             if (c == "f") {
                 by_fields = 1
                 fieldlist = Optarg
             } else if (c == "c") {
                 by_chars = 1
                 fieldlist = Optarg
                 OFS = ""
             } else if (c == "d") {
                 if (length(Optarg) > 1) {
                     printf("cut: using first character of %s" \
                            " for delimiter\n", Optarg) > "/dev/stderr"
                     Optarg = substr(Optarg, 1, 1)
                 }
                 fs = FS = Optarg
                 OFS = FS
                 if (FS == " ")    # defeat awk semantics
                     FS = "[ ]"
             } else if (c == "s")
                 suppress = 1
             else
                 usage()
         }

         # Clear out options
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

The code must take special care when the field delimiter is a space.
Using a single space ('" "') for the value of 'FS' is incorrect--'awk'
would separate fields with runs of spaces, TABs, and/or newlines, and we
want them to be separated with individual spaces.  To this end, we save
the original space character in the variable 'fs' for later use; after
setting 'FS' to '"[ ]"' we can't use it directly to see if the field
delimiter character is in the string.
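
The difference between the two settings of 'FS' can be seen in a
small experiment.  This is only a sketch; the sample line and the
variable names are made up for illustration:

```shell
# Two spaces between "a" and "b" (made-up sample input).
line='a  b'

# FS = " " is the default splitting: runs of whitespace form one separator.
default_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = " " } { print NF }')

# FS = "[ ]" is a regexp matching exactly one space, so the two spaces
# delimit an empty field between "a" and "b".
single_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ ]" } { print NF }')

echo "FS=\" \": $default_nf fields; FS=\"[ ]\": $single_nf fields"
```

The first command counts two fields and the second counts three, which
is the behavior 'cut' needs when the delimiter is a single space.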

Also remember that after 'getopt()' is through (as described in
⇒Getopt Function), we have to clear out all the elements of 'ARGV' from
1 up to (but not including) 'Optind', so that 'awk' does not try to
process the command-line options as file names.
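
The effect of clearing an 'ARGV' element can be checked in isolation.
In this sketch, the argument 'fake-option' and the temporary file are
hypothetical stand-ins for a parsed option and a real data file:

```shell
# Hypothetical demo: a temporary data file stands in for real input.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"

# "fake-option" plays the role of an already-parsed option.  Setting its
# ARGV element to "" makes awk skip it instead of opening it as a file.
result=$(awk 'BEGIN { ARGV[1] = "" } { print }' fake-option "$tmp")

rm -f "$tmp"
echo "$result"
```

Without the assignment in the 'BEGIN' rule, 'awk' would try to open
'fake-option' as an input file and complain that it cannot.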

After dealing with the command-line options, the program verifies
that the options make sense.  Only one or the other of '-c' and '-f'
should be used, and both require a field list.  Then the program calls
either 'set_fieldlist()' or 'set_charlist()' to pull apart the list of
fields or characters:

         if (by_fields && by_chars)
             usage()

         if (by_fields == 0 && by_chars == 0)
             by_fields = 1    # default

         if (fieldlist == "") {
             print "cut: needs list for -c or -f" > "/dev/stderr"
             exit 1
         }

         if (by_fields)
             set_fieldlist()
         else
             set_charlist()
     }

'set_fieldlist()' splits the field list apart at the commas into an
array.  Then, for each element of the array, it looks to see if the
element is actually a range, and if so, splits it apart.  The function
checks the range to make sure that the first number is smaller than the
second.  Each number in the list is added to the 'flist' array, which
simply lists the fields that will be printed.  Normal field splitting is
used: the program lets 'awk' do the job of splitting each record into
fields:

     function set_fieldlist(        n, m, i, j, k, f, g)
     {
         n = split(fieldlist, f, ",")
         j = 1    # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # a range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("cut: bad field list: %s\n",
                                       f[i]) > "/dev/stderr"
                     exit 1
                 }
                 for (k = g[1]; k <= g[2]; k++)
                     flist[j++] = k
             } else
                 flist[j++] = f[i]
         }
         nfields = j - 1
     }
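
The list expansion can be tried on its own.  The following is a
minimal sketch, not the program's code: 'expand()' is a hypothetical
helper that returns the expanded list as a string and, unlike
'set_fieldlist()', does no error checking:

```shell
# Expand a list such as "1-3,5" into individual numbers.
# expand() is a hypothetical helper written for this illustration.
expanded=$(awk 'function expand(list,    n, i, k, m, f, g, sep, out) {
    n = split(list, f, ",")
    for (i = 1; i <= n; i++) {
        m = split(f[i], g, "-")
        if (m == 2) {                    # a range such as "1-3"
            for (k = g[1] + 0; k <= g[2] + 0; k++) {
                sep = (out == "" ? "" : " ")
                out = out sep k
            }
        } else {                         # a single number
            sep = (out == "" ? "" : " ")
            out = out sep f[i]
        }
    }
    return out
}
BEGIN { print expand("1-3,5") }')

echo "$expanded"
```

For the list '1-3,5' this prints '1 2 3 5', the same sequence of
numbers that 'set_fieldlist()' would store in 'flist'.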

The 'set_charlist()' function is more complicated than
'set_fieldlist()'.  The idea here is to use 'gawk''s 'FIELDWIDTHS'
variable (⇒Constant Size), which describes constant-width input.
When using a character list, that is exactly what we have.

Setting up 'FIELDWIDTHS' is more complicated than simply listing the
fields that need to be printed.  We have to keep track of the fields to
print and also the intervening characters that have to be skipped.  For
example, suppose you wanted characters 1 through 8, 15, and 22 through
35.  You would use '-c 1-8,15,22-35'.  The necessary value for
'FIELDWIDTHS' is '"8 6 1 6 14"'.  This yields five fields, and the
fields to print are '$1', '$3', and '$5'.  The intermediate fields are
"filler": the characters that lie between the desired data.  'flist'
lists the fields to print, and 't' tracks the complete field list,
including filler fields:

     function set_charlist(    field, i, j, f, g, n, m, t,
                               filler, last, len)
     {
         field = 1   # count total fields
         n = split(fieldlist, f, ",")
         j = 1       # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("cut: bad character list: %s\n",
                                    f[i]) > "/dev/stderr"
                     exit 1
                 }
                 len = g[2] - g[1] + 1
                 if (g[1] > 1)  # compute length of filler
                     filler = g[1] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = len  # length of field
                 last = g[2]
                 flist[j++] = field - 1
             } else {
                 if (f[i] > 1)
                     filler = f[i] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = 1
                 last = f[i]
                 flist[j++] = field - 1
             }
         }
         FIELDWIDTHS = join(t, 1, field - 1)
         nfields = j - 1
     }
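
The width arithmetic for the example list can be reproduced with a
short, portable sketch.  Here 'widths()' is a hypothetical, simplified
stand-in for 'set_charlist()': it assumes an ascending list, does no
error checking, and only builds the string that would be assigned to
'FIELDWIDTHS':

```shell
# Compute field widths for the character list "1-8,15,22-35".
# widths() is a hypothetical helper; it assumes an ascending list.
widths=$(awk 'function widths(list,    n, i, f, g, first, stop, last, gap, len, out) {
    n = split(list, f, ",")
    last = 0                             # last character position consumed
    for (i = 1; i <= n; i++) {
        if (split(f[i], g, "-") == 2) {  # a range such as "22-35"
            first = g[1] + 0
            stop = g[2] + 0
        } else {                         # a single character position
            first = stop = f[i] + 0
        }
        gap = first - last - 1           # filler characters to skip
        if (gap > 0)
            out = out gap " "
        len = stop - first + 1           # width of the wanted field
        out = out len " "
        last = stop
    }
    sub(/ $/, "", out)                   # drop the trailing space
    return out
}
BEGIN { print widths("1-8,15,22-35") }')

echo "$widths"
```

The result is '8 6 1 6 14': widths 8, 1, and 14 are the wanted
fields, and the two 6s are the filler between them.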

Next is the rule that processes the data.  If the '-s' option is
given, then 'suppress' is true.  The first 'if' statement makes sure
that the input record does have the field separator.  If 'cut' is
processing fields, 'suppress' is true, and the field separator character
is not in the record, then the record is skipped.

If the record is valid, then 'gawk' has split the data into fields,
either using the character in 'FS' or using fixed-length fields and
'FIELDWIDTHS'.  The loop goes through the list of fields that should be
printed.  The corresponding field is printed if it contains data.  If
the next field also has data, then the separator character is written
out between the fields:

     {
         if (by_fields && suppress && index($0, fs) == 0)
             next

         for (i = 1; i <= nfields; i++) {
             if ($flist[i] != "") {
                 printf "%s", $flist[i]
                 if (i < nfields && $flist[i+1] != "")
                     printf "%s", OFS
             }
         }
         print ""
     }
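
To watch this rule work without the option parsing, the variables it
depends on can be set by hand.  The following sketch hard-codes a
hypothetical setup (fields 1 and 3, ':' as the delimiter, '-s' in
effect) and made-up input lines:

```shell
# Hypothetical, hard-coded setup: fields 1 and 3, ":" delimiter, -s in effect.
selected=$(printf 'root:x:0\nno delimiter here\na::c\n' | awk '
BEGIN {
    FS = OFS = fs = ":"
    flist[1] = 1; flist[2] = 3; nfields = 2
    suppress = 1
}
{
    if (suppress && index($0, fs) == 0)  # no delimiter: skip the record
        next
    for (i = 1; i <= nfields; i++) {
        if ($flist[i] != "") {
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        }
    }
    print ""
}')

echo "$selected"
```

The record without a delimiter is skipped, and the other two records
produce 'root:0' and 'a:c'.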

This version of 'cut' relies on 'gawk''s 'FIELDWIDTHS' variable to do
the character-based cutting.  It is possible in other 'awk'
implementations to use 'substr()' (⇒String Functions), but it is
also extremely painful.  The 'FIELDWIDTHS' variable supplies an elegant
solution to the problem of picking the input line apart by characters.
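
For a single fixed list, the 'substr()' alternative is short.  The
sketch below hard-codes the list '1-8,15,22-35' and uses a made-up
sample line.  Handling an arbitrary list supplied at runtime would
need the same range bookkeeping as 'set_charlist()', which is where
the pain comes in:

```shell
# Made-up 36-character sample line.
line='abcdefghijklmnopqrstuvwxyz0123456789'

# Equivalent of -c 1-8,15,22-35 using substr(string, start, length).
picked=$(printf '%s\n' "$line" |
    awk '{ print substr($0, 1, 8) substr($0, 15, 1) substr($0, 22, 14) }')

echo "$picked"
```

This prints 'abcdefghovwxyz012345678': characters 1 through 8,
character 15, and characters 22 through 35 of the sample line.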