gawk: Uniq Program

1 
1 11.2.6 Printing Nonduplicated Lines of Text
1 -------------------------------------------
1 
1 The 'uniq' utility reads sorted lines of data on its standard input, and
1 by default removes duplicate lines.  In other words, it only prints
1 unique lines--hence the name.  'uniq' has a number of options.  The
1 usage is as follows:
1 
1      'uniq' ['-udc' ['-N']] ['+N'] [INPUTFILE [OUTPUTFILE]]
1 
1    The options for 'uniq' are:
1 
1 '-d'
1      Print only repeated (duplicated) lines.
1 
1 '-u'
1      Print only nonrepeated (unique) lines.
1 
1 '-c'
1      Count lines.  This option overrides '-d' and '-u'.  Both repeated
1      and nonrepeated lines are counted.
1 
1 '-N'
1      Skip N fields before comparing lines.  The definition of fields is
1      similar to 'awk''s default: nonwhitespace characters separated by
1      runs of spaces and/or TABs.
1 
1 '+N'
1      Skip N characters before comparing lines.  Any fields specified
1      with '-N' are skipped first.
1 
1 'INPUTFILE'
1      Data is read from the input file named on the command line, instead
1      of from the standard input.
1 
1 'OUTPUTFILE'
1      The generated output is sent to the named output file, instead of
1      to the standard output.
1 
1    Normally 'uniq' behaves as if both the '-d' and '-u' options are
1 provided.
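
The effect of the selection options can be seen in a short shell session.  This is only a sketch: it assumes a POSIX 'uniq' is on the search path (the system utility, not the awk version developed below), and the 'sample' function and its data are made up for illustration.

```shell
# Hypothetical sample data; duplicates are adjacent, as in sorted input.
sample() {
    printf '%s\n' apple apple banana cherry cherry cherry
}

# Default: one copy of every line, as if both -d and -u were given.
sample | uniq

# -d: only lines that were repeated in the input.
sample | uniq -d

# -u: only lines that appeared exactly once.
sample | uniq -u
```

The default run prints all three words once, '-d' prints only 'apple' and 'cherry', and '-u' prints only 'banana'.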
1 
1    'uniq' uses the 'getopt()' library function (⇒Getopt Function)
1 and the 'join()' library function (⇒Join Function).
1 
1    The program begins with a 'usage()' function and then a brief outline
1 of the options and their meanings in comments.  The 'BEGIN' rule deals
1 with the command-line arguments and options.  It uses a trick to get
1 'getopt()' to handle options of the form '-25', treating such an option
1 as the option letter '2' with an argument of '5'.  If indeed two or more
1 digits are supplied ('Optarg' looks like a number), 'Optarg' is
1 concatenated with the option digit and then the result is added to zero
1 to make it into a number.  If there is only one digit in the option,
1 then 'Optarg' is not needed.  In this case, 'Optind' must be decremented
1 so that 'getopt()' processes it next time.  This code is admittedly a
1 bit tricky.
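
The digit handling can be tried in isolation.  The following is only a sketch: the shell function 'field_count' and its two parameters are hypothetical, standing in for the option letter 'c' and for 'Optarg' as 'getopt()' would set them.  It does not call 'getopt()' itself.

```shell
# Sketch of the digit-option trick, outside of getopt().
# $1 plays the role of the option "letter" (a digit);
# $2 plays the role of Optarg.
field_count() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        # [0-9] is used instead of [[:digit:]] for portability
        if (optarg ~ /^[0-9]+$/) {
            # e.g. -25: c is "2", optarg is "5"; glue them together,
            # then add zero to turn the string into a number
            n = (c optarg) + 0
        } else {
            # e.g. -5 data.txt: optarg is really the next argument,
            # so only the digit counts (the real code also does Optind--)
            n = c + 0
        }
        print n
    }'
}

field_count 2 5          # as for -25
field_count 5 data.txt   # as for -5 data.txt
```

The first call prints 25 and the second prints 5.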
1 
1    If no options are supplied, then the default is taken, to print both
1 repeated and nonrepeated lines.  The output file, if provided, is
1 assigned to 'outputfile'.  Early on, 'outputfile' is initialized to the
1 standard output, '/dev/stdout':
1 
1      # uniq.awk --- do uniq in awk
1      #
1      # Requires getopt() and join() library functions
1 
1      function usage()
1      {
1          print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr"
1          exit 1
1      }
1 
1      # -c    count lines. overrides -d and -u
1      # -d    only repeated lines
1      # -u    only nonrepeated lines
1      # -n    skip n fields
1      # +n    skip n characters, skip fields first
1 
1      BEGIN {
1          count = 1
1          outputfile = "/dev/stdout"
1          opts = "udc0:1:2:3:4:5:6:7:8:9:"
1          while ((c = getopt(ARGC, ARGV, opts)) != -1) {
1              if (c == "u")
1                  non_repeated_only++
1              else if (c == "d")
1                  repeated_only++
1              else if (c == "c")
1                  do_count++
1              else if (index("0123456789", c) != 0) {
1                  # getopt() requires args to options
1                  # this messes us up for things like -5
1                  if (Optarg ~ /^[[:digit:]]+$/)
1                      fcount = (c Optarg) + 0
1                  else {
1                      fcount = c + 0
1                      Optind--
1                  }
1              } else
1                  usage()
1          }
1 
1          if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
1              charcount = substr(ARGV[Optind], 2) + 0
1              Optind++
1          }
1 
1          for (i = 1; i < Optind; i++)
1              ARGV[i] = ""
1 
1          if (repeated_only == 0 && non_repeated_only == 0)
1              repeated_only = non_repeated_only = 1
1 
1          if (ARGC - Optind == 2) {
1              outputfile = ARGV[ARGC - 1]
1              ARGV[ARGC - 1] = ""
1          }
1      }
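
The handling of '+N' and the clearing of 'ARGV' entries can be sketched on their own.  The 'skip_chars' function below is hypothetical and much simpler than the 'BEGIN' rule above: it looks only at the first argument, and it prints what is left of each line instead of comparing lines.

```shell
# Sketch: pick a +N argument off the command line and blank it out so
# that awk does not later try to open it as an input file.
skip_chars() {
    awk 'BEGIN {
        charcount = 0
        if (ARGV[1] ~ /^\+[0-9]+$/) {
            charcount = substr(ARGV[1], 2) + 0   # drop the leading "+"
            ARGV[1] = ""    # empty ARGV entries are skipped as files
        }
    }
    # Show the part of the line that remains after skipping.
    { print substr($0, charcount + 1) }' "$@"
}

printf 'abcdef\n' | skip_chars +2
```

Because the only argument has been set to the empty string, awk reads the standard input, and the example prints 'cdef'.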
1 
1    The following function, 'are_equal()', compares the current line,
1 '$0', to the previous line, 'last'.  It handles skipping fields and
1 characters.  If no field count and no character count are specified,
1 'are_equal()' returns one or zero depending upon the result of a simple
1 string comparison of 'last' and '$0'.
1 
1    Otherwise, things get more complicated.  If fields have to be
1 skipped, each line is broken into an array using 'split()' (⇒String
1 Functions); the desired fields are then joined back into a line using
1 'join()'.  The joined lines are stored in 'clast' and 'cline'.  If no
1 fields are skipped, 'clast' and 'cline' are set to 'last' and '$0',
1 respectively.  Finally, if characters are skipped, 'substr()' is used to
1 strip off the leading 'charcount' characters in 'clast' and 'cline'.
1 The two strings are then compared and 'are_equal()' returns the result:
1 
1      function are_equal(    n, m, clast, cline, alast, aline)
1      {
1          if (fcount == 0 && charcount == 0)
1              return (last == $0)
1 
1          if (fcount > 0) {
1              n = split(last, alast)
1              m = split($0, aline)
1              clast = join(alast, fcount+1, n)
1              cline = join(aline, fcount+1, m)
1          } else {
1              clast = last
1              cline = $0
1          }
1          if (charcount) {
1              clast = substr(clast, charcount + 1)
1              cline = substr(cline, charcount + 1)
1          }
1 
1          return (clast == cline)
1      }
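
To see which part of a line takes part in the comparison, the skipping steps can be run by themselves.  This is a sketch under stated assumptions: 'compare_key' is a hypothetical shell function, and 'join_fields()' is a minimal stand-in for the library 'join()' function that always uses a single space as the separator.

```shell
# Sketch: print the portion of each input line that would be compared,
# given a field count ($1) and a character count ($2).
compare_key() {
    awk -v fcount="$1" -v charcount="$2" '
    # Stand-in for the library join(): glue a[start..end] with spaces.
    function join_fields(a, start, end,    i, s) {
        for (i = start; i <= end; i++)
            s = s (i > start ? " " : "") a[i]
        return s
    }
    {
        key = $0
        if (fcount > 0) {
            n = split($0, parts)                  # break line into fields
            key = join_fields(parts, fcount + 1, n)
        }
        if (charcount > 0)
            key = substr(key, charcount + 1)      # then skip characters
        print key
    }'
}

printf '10:01 web01 GET /index\n' | compare_key 1 3
```

Skipping one field leaves 'web01 GET /index', and skipping three more characters leaves '01 GET /index', which is what the example prints.  Note that splitting and rejoining collapses runs of whitespace between the remaining fields.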
1 
1    The following two rules are the body of the program.  The first one
1 is executed only for the very first line of data.  It sets 'last' equal
1 to '$0', so that subsequent lines of text have something to be compared
1 to.
1 
1    The second rule does the work.  The variable 'equal' is one or zero,
1 depending upon the results of 'are_equal()''s comparison.  If 'uniq' is
1 counting lines ('-c') and the lines are equal, then it increments the
1 'count' variable.  Otherwise, it prints the count and the previous line,
1 and resets 'count', because the two lines are not equal.
1 
1    If 'uniq' is not counting, and if the lines are equal, 'count' is
1 incremented.  Nothing is printed, as the point is to remove duplicates.
1 Otherwise, the previous line is printed if 'uniq' is printing repeated
1 lines and more than one copy was seen, or if 'uniq' is printing
1 nonrepeated lines and only one copy was seen.  Either way, 'last' and
1 'count' are then reset for the new line.
1 
1    Finally, similar logic is used in the 'END' rule to print the final
1 line of input data:
1 
1      NR == 1 {
1          last = $0
1          next
1      }
1 
1      {
1          equal = are_equal()
1 
1          if (do_count) {    # overrides -d and -u
1              if (equal)
1                  count++
1              else {
1                  printf("%4d %s\n", count, last) > outputfile
1                  last = $0
1                  count = 1    # reset
1              }
1              next
1          }
1 
1          if (equal)
1              count++
1          else {
1              if ((repeated_only && count > 1) ||
1                  (non_repeated_only && count == 1))
1                      print last > outputfile
1              last = $0
1              count = 1
1          }
1      }
1 
1      END {
1          if (NR > 0) {    # no input lines at all: nothing to print
1              if (do_count)
1                  printf("%4d %s\n", count, last) > outputfile
1              else if ((repeated_only && count > 1) ||
1                      (non_repeated_only && count == 1))
1                  print last > outputfile
1          }
1          close(outputfile)
1      }
1
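The counting logic of the two rules and the 'END' rule can be reduced to a few lines when no options need to be parsed.  The sketch below covers only the '-c' case: 'count_runs' is a hypothetical shell function, and it compares whole lines instead of calling 'are_equal()'.

```shell
# Sketch of the -c behavior alone: count runs of identical adjacent lines.
count_runs() {
    awk '
    # The very first line only primes the comparison.
    NR == 1 { last = $0; count = 1; next }

    # Appending "" forces a string comparison, even when both lines
    # happen to look like numbers.
    ($0 "") == (last "") { count++; next }

    # A different line: report the finished run and start a new one.
    {
        printf("%4d %s\n", count, last)
        last = $0
        count = 1
    }

    # The last run is still pending when the input ends.
    END {
        if (NR > 0)
            printf("%4d %s\n", count, last)
    }'
}

printf '%s\n' a a b c c c | count_runs
```

With the input shown, the runs are reported as 2 for 'a', 1 for 'b', and 3 for 'c', each count right-justified in a four-character field.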