gawk: Cut Program

11.2.1 Cutting Out Fields and Columns
-------------------------------------

The 'cut' utility selects, or "cuts," characters or fields from its
standard input and sends them to its standard output.  Fields are
separated by TABs by default, but you may supply a command-line option
to change the field "delimiter" (i.e., the field-separator character).
'cut''s definition of fields is less general than 'awk''s.

   A common use of 'cut' might be to pull out just the login names of
logged-on users from the output of 'who'.  For example, the following
pipeline generates a sorted, unique list of the logged-on users:

     who | cut -c1-8 | sort | uniq

   The options for 'cut' are:

'-c LIST'
     Use LIST as the list of characters to cut out.  Items within the
     list may be separated by commas, and ranges of characters can be
     separated with dashes.  The list '1-8,15,22-35' specifies
     characters 1 through 8, 15, and 22 through 35.

'-f LIST'
     Use LIST as the list of fields to cut out.

'-d DELIM'
     Use DELIM as the field-separator character instead of the TAB
     character.

'-s'
     Suppress printing of lines that do not contain the field delimiter.
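
   As a rough illustration of how these options combine, the following
shell sketch runs the system 'cut' on two made-up records.  The helper
names ('sample', 'fields_1_and_3', and so on) are hypothetical and exist
only for this example:

```shell
# Hypothetical helpers; the sample records are invented for illustration.
sample() {
    printf 'root:x:0:0\n'          # contains the ':' delimiter
    printf 'no delimiter here\n'   # does not
}

# -d and -f: fields 1 and 3; lines lacking the delimiter pass through whole
fields_1_and_3() { sample | cut -d: -f1,3; }

# -s: the same selection, but lines lacking the delimiter are dropped
fields_suppressed() { sample | cut -d: -f1,3 -s; }

# -c: characters 1 through 4 of every line
chars_1_to_4() { sample | cut -c1-4; }
```

   Calling 'fields_1_and_3' should print 'root:0' followed by the
unchanged second record, whereas 'fields_suppressed' should print only
'root:0'.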

   The 'awk' implementation of 'cut' uses the 'getopt()' library
function (⇒Getopt Function) and the 'join()' library function
(⇒Join Function).

   The program begins with a comment describing the options, the library
functions needed, and a 'usage()' function that prints out a usage
message and exits.  'usage()' is called if invalid arguments are
supplied:

     # cut.awk --- implement cut in awk

     # Options:
     #    -f list     Cut fields
     #    -d c        Field delimiter character
     #    -c list     Cut characters
     #
     #    -s          Suppress lines without the delimiter
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr"
         print("usage: cut [-c list] [files...]") > "/dev/stderr"
         exit 1
     }

   Next comes a 'BEGIN' rule that parses the command-line options.  It
sets 'FS' to a single TAB character, because that is 'cut''s default
field separator.  The rule then sets the output field separator to be
the same as the input field separator.  A loop using 'getopt()' steps
through the command-line options.  Exactly one of the variables
'by_fields' or 'by_chars' is set to true, to indicate that processing
should be done by fields or by characters, respectively.  When cutting
by characters, the output field separator is set to the null string:

     BEGIN {
         FS = "\t"    # default
         OFS = FS
         while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
             if (c == "f") {
                 by_fields = 1
                 fieldlist = Optarg
             } else if (c == "c") {
                 by_chars = 1
                 fieldlist = Optarg
                 OFS = ""
             } else if (c == "d") {
                 if (length(Optarg) > 1) {
                     printf("cut: using first character of %s" \
                            " for delimiter\n", Optarg) > "/dev/stderr"
                     Optarg = substr(Optarg, 1, 1)
                 }
                 fs = FS = Optarg
                 OFS = FS
                 if (FS == " ")    # defeat awk semantics
                     FS = "[ ]"
             } else if (c == "s")
                 suppress = 1
             else
                 usage()
         }

         # Clear out options
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

   The code must take special care when the field delimiter is a space.
Using a single space ('" "') for the value of 'FS' is incorrect--'awk'
would separate fields with runs of spaces, TABs, and/or newlines, and we
want them to be separated with individual spaces.  To this end, we save
the original space character in the variable 'fs' for later use; after
setting 'FS' to '"[ ]"' we can't use it directly to see if the field
delimiter character is in the string.
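
   The difference between the two settings is easy to observe.  The
following sketch (the function name 'count_fields' is hypothetical)
counts the fields that 'awk' finds in a line holding 'a', two spaces,
and 'b':

```shell
# Hypothetical helper: $1 is the value to assign to FS.
count_fields() {
    # The input line is "a", two spaces, then "b".
    printf 'a  b\n' | awk -v FS="$1" '{ print NF }'
}
```

   With this definition, 'count_fields " "' should report 2 fields,
because the run of spaces acts as a single separator, while
'count_fields "[ ]"' should report 3, because each space separates on
its own and leaves an empty field in between.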

   Also remember that after 'getopt()' is through (as described in
⇒Getopt Function), we have to clear out all the elements of 'ARGV'
from 1 up to (but not including) 'Optind', so that 'awk' does not try to
process the command-line options as file names.

   After dealing with the command-line options, the program verifies
that the options make sense.  Only one or the other of '-c' and '-f'
should be used, and both require a field list.  Then the program calls
either 'set_fieldlist()' or 'set_charlist()' to pull apart the list of
fields or characters:

         if (by_fields && by_chars)
             usage()

         if (by_fields == 0 && by_chars == 0)
             by_fields = 1    # default

         if (fieldlist == "") {
             print "cut: needs list for -c or -f" > "/dev/stderr"
             exit 1
         }

         if (by_fields)
             set_fieldlist()
         else
             set_charlist()
     }

   'set_fieldlist()' splits the field list apart at the commas into an
array.  Then, for each element of the array, it looks to see if the
element is actually a range, and if so, splits it apart.  The function
checks the range to make sure that the first number is smaller than the
second.  Each number in the list is added to the 'flist' array, which
simply lists the fields that will be printed.  Because normal field
splitting is used, the program lets 'awk' handle the job of splitting
each record into fields:

     function set_fieldlist(        n, m, i, j, k, f, g)
     {
         n = split(fieldlist, f, ",")
         j = 1    # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # a range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("cut: bad field list: %s\n",
                                       f[i]) > "/dev/stderr"
                     exit 1
                 }
                 for (k = g[1]; k <= g[2]; k++)
                     flist[j++] = k
             } else
                 flist[j++] = f[i]
         }
         nfields = j - 1
     }
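
   To see the list expansion in isolation, here is a minimal sketch
that uses the same 'split()' technique.  The function name 'expand_list'
is hypothetical; unlike 'set_fieldlist()', it performs no validation
and prints the expanded numbers instead of storing them in 'flist':

```shell
# Hypothetical helper: $1 is a list such as 1,3-5,9 (not validated here).
expand_list() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")              # items between commas
        for (i = 1; i <= n; i++) {
            m = split(f[i], g, "-")          # m is 2 for a range like 3-5
            first = g[1] + 0
            last = (m == 2) ? g[2] + 0 : first   # lone number: range of one
            for (k = first; k <= last; k++)
                printf "%s%d", (count++ ? " " : ""), k
        }
        print ""
    }'
}
```

   For instance, 'expand_list 1,3-5,9' should print '1 3 4 5 9'.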

   The 'set_charlist()' function is more complicated than
'set_fieldlist()'.  The idea here is to use 'gawk''s 'FIELDWIDTHS'
variable (⇒Constant Size), which describes constant-width input.
When using a character list, that is exactly what we have.

   Setting up 'FIELDWIDTHS' is more complicated than simply listing the
fields that need to be printed.  We have to keep track of the fields to
print and also the intervening characters that have to be skipped.  For
example, suppose you wanted characters 1 through 8, 15, and 22 through
35.  You would use '-c 1-8,15,22-35'.  The necessary value for
'FIELDWIDTHS' is '"8 6 1 6 14"'.  This yields five fields, and the
fields to print are '$1', '$3', and '$5'.  The intermediate fields are
"filler", which is stuff in between the desired data.  'flist' lists the
fields to print, and 't' tracks the complete field list, including
filler fields:

     function set_charlist(    field, i, j, f, g, n, m, t,
                               filler, last, len)
     {
         field = 1   # count total fields
         n = split(fieldlist, f, ",")
         j = 1       # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("cut: bad character list: %s\n",
                                    f[i]) > "/dev/stderr"
                     exit 1
                 }
                 len = g[2] - g[1] + 1
                 if (g[1] > 1)  # compute length of filler
                     filler = g[1] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = len  # length of field
                 last = g[2]
                 flist[j++] = field - 1
             } else {
                 if (f[i] > 1)
                     filler = f[i] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = 1
                 last = f[i]
                 flist[j++] = field - 1
             }
         }
         FIELDWIDTHS = join(t, 1, field - 1)
         nfields = j - 1
     }
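
   The worked example from the text can be checked directly.  The
following sketch assumes that 'gawk' is installed ('FIELDWIDTHS' is a
'gawk' extension) and uses a made-up 35-character input line; the
function name 'cut_chars_demo' is hypothetical:

```shell
# Hypothetical helper; requires gawk, because FIELDWIDTHS is a gawk extension.
cut_chars_demo() {
    # 35 sample characters: a-z occupy columns 1-26, 0-8 occupy columns 27-35.
    printf 'abcdefghijklmnopqrstuvwxyz012345678\n' |
    gawk '
        # Widths: 8 wanted, 6 filler, 1 wanted, 6 filler, 14 wanted.
        BEGIN { FIELDWIDTHS = "8 6 1 6 14"; OFS = "" }
        { print $1, $3, $5 }    # fields 2 and 4 are filler
    '
}
```

   The result should match what 'cut -c1-8,15,22-35' prints for the same
line: columns 1 through 8 ('abcdefgh'), column 15 ('o'), and columns 22
through 35 ('vwxyz012345678'), joined with nothing in between.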

   Next is the rule that processes the data.  If the '-s' option is
given, then 'suppress' is true.  The first 'if' statement makes sure
that the input record does have the field separator.  If 'cut' is
processing fields, 'suppress' is true, and the field separator character
is not in the record, then the record is skipped.

   If the record is valid, then 'gawk' has split the data into fields,
either using the character in 'FS' or using fixed-length fields and
'FIELDWIDTHS'.  The loop goes through the list of fields that should be
printed.  The corresponding field is printed if it contains data.  If
the next field also has data, then the separator character is written
out between the fields:

     {
         if (by_fields && suppress && index($0, fs) == 0)
             next

         for (i = 1; i <= nfields; i++) {
             if ($flist[i] != "") {
                 printf "%s", $flist[i]
                 if (i < nfields && $flist[i+1] != "")
                     printf "%s", OFS
             }
         }
         print ""
     }

   This version of 'cut' relies on 'gawk''s 'FIELDWIDTHS' variable to do
the character-based cutting.  It is possible in other 'awk'
implementations to use 'substr()' (⇒String Functions), but it is
also extremely painful.  The 'FIELDWIDTHS' variable supplies an elegant
solution to the problem of picking the input line apart by characters.
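
   For a sense of what the 'substr()' route involves, here is a minimal
sketch that handles only the simplest case.  It is not part of the
program above: the function name 'cut_chars_substr' is hypothetical, the
list is assumed to be well formed and in ascending order, and none of
the validation done by 'set_charlist()' is attempted:

```shell
# Hypothetical helper: $1 is a character list such as 1-8,15,22-35.
# Assumes well-formed, ascending, non-overlapping items; nothing is validated.
cut_chars_substr() {
    awk -v list="$1" '
    BEGIN { n = split(list, f, ",") }       # split the list once, up front
    {
        out = ""
        for (i = 1; i <= n; i++) {
            m = split(f[i], g, "-")         # m is 2 for a range
            len = (m == 2) ? g[2] - g[1] + 1 : 1
            out = out substr($0, g[1] + 0, len)
        }
        print out
    }'
}
```

   Feeding a line to 'cut_chars_substr 1-8,15,22-35' should yield the
same characters that 'cut -c1-8,15,22-35' selects.  The sketch stays
short only because it ignores malformed lists and overlapping ranges.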
1