gawk: Uniq Program

11.2.6 Printing Nonduplicated Lines of Text
-------------------------------------------

The 'uniq' utility reads sorted lines of data on its standard input, and
by default removes duplicate lines. In other words, it only prints
unique lines--hence the name. 'uniq' has a number of options. The
usage is as follows:

     'uniq' ['-udc' ['-N']] ['+N'] [INPUTFILE [OUTPUTFILE]]

The options for 'uniq' are:

'-d'
     Print only repeated (duplicated) lines.

'-u'
     Print only nonrepeated (unique) lines.

'-c'
     Count lines. This option overrides '-d' and '-u'. Both repeated
     and nonrepeated lines are counted.

'-N'
     Skip N fields before comparing lines. The definition of fields is
     similar to 'awk''s default: nonwhitespace characters separated by
     runs of spaces and/or TABs.

'+N'
     Skip N characters before comparing lines. Any fields specified
     with '-N' are skipped first.

'INPUTFILE'
     Data is read from the input file named on the command line, instead
     of from the standard input.

'OUTPUTFILE'
     The generated output is sent to the named output file, instead of
     to the standard output.

Normally 'uniq' behaves as if both the '-d' and '-u' options are
provided; that is, it prints one copy of each line, whether or not the
line is repeated.

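As a quick illustration of these options, the following sketch runs the system 'uniq' on three sorted lines. It assumes a POSIX-conforming 'uniq' is on the search path; the 'sample' helper is a hypothetical name used only to supply the input.

```shell
# sample: hypothetical helper that prints sorted input with "apple" repeated.
sample() { printf 'apple\napple\nbanana\n'; }

sample | uniq       # default: one copy of each line (apple, banana)
sample | uniq -d    # only the line that was repeated (apple)
sample | uniq -u    # only the line that was not repeated (banana)
```

The '-c' option is left out here because the width of the count column can differ between implementations.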
'uniq' uses the 'getopt()' library function (⇒Getopt Function)
and the 'join()' library function (⇒Join Function).

The program begins with a 'usage()' function and then a brief outline
of the options and their meanings in comments. The 'BEGIN' rule deals
with the command-line arguments and options. It uses a trick to get
'getopt()' to handle options of the form '-25', treating such an option
as the option letter '2' with an argument of '5'. If indeed two or more
digits are supplied ('Optarg' looks like a number), 'Optarg' is
concatenated with the option digit and then the result is added to zero
to make it into a number. If there is only one digit in the option,
then 'Optarg' is not needed. In this case, 'Optind' must be decremented
so that 'getopt()' processes it next time. This code is admittedly a
bit tricky.

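The digit handling just described can be tried in isolation. The following is a minimal sketch, not part of 'uniq.awk': 'field_count' is a hypothetical shell wrapper, and its two arguments stand in for the option character returned by 'getopt()' and the value left in 'Optarg'. It uses '[0-9]' instead of '[[:digit:]]' only so that it also runs on 'awk' versions without POSIX character classes.

```shell
# field_count: hypothetical helper mirroring the digit branch of the BEGIN rule.
# $1 = option character from getopt(), $2 = value found in Optarg.
field_count() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/)    # -25: c is "2" and Optarg is "5"
            n = (c optarg) + 0      # concatenate, then force a number
        else                        # -5 FILE: Optarg is really the next argument
            n = c + 0               # (the real program also decrements Optind)
        print n
    }'
}

field_count 2 5           # prints 25
field_count 5 data.txt    # prints 5
```

In the second call, 'data.txt' is an illustrative file name; it shows why 'Optind' has to be backed up in the real program, because 'getopt()' has consumed an argument that was never meant for the option.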
If no options are supplied, then the default is taken, to print both
repeated and nonrepeated lines. The output file, if provided, is
assigned to 'outputfile'. Early on, 'outputfile' is initialized to the
standard output, '/dev/stdout':

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only nonrepeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first

     BEGIN {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt() requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[[:digit:]]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }

         if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

The following function, 'are_equal()', compares the current line,
'$0', to the previous line, 'last'. It handles skipping fields and
characters. If no field count and no character count are specified,
'are_equal()' returns one or zero depending upon the result of a simple
string comparison of 'last' and '$0'.

Otherwise, things get more complicated. If fields have to be skipped,
each line is broken into an array using 'split()' (⇒String
Functions); the desired fields are then joined back into a line using
'join()'. The joined lines are stored in 'clast' and 'cline'. If no
fields are skipped, 'clast' and 'cline' are set to 'last' and '$0',
respectively. Finally, if characters are skipped, 'substr()' is used to
strip off the leading 'charcount' characters in 'clast' and 'cline'.
The two strings are then compared and 'are_equal()' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }

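To see the effect of skipping fields and characters, the comparison can be exercised on two literal lines. This is only a sketch under stated assumptions: 'same_after_skip' is a hypothetical wrapper, and the small 'join()' inside it is a stand-in for the library function that always uses a single space as the separator.

```shell
# same_after_skip: hypothetical wrapper around the comparison logic.
# $1 = fields to skip, $2 = characters to skip, $3 and $4 = lines to compare.
# Prints 1 if the lines compare equal after skipping, 0 otherwise.
same_after_skip() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # Stand-in for the library join(): glue arr[start..end] with spaces.
    function join(arr, start, end,    i, s) {
        s = arr[start]
        for (i = start + 1; i <= end; i++)
            s = s " " arr[i]
        return s
    }
    BEGIN {
        if (fcount > 0) {               # drop the leading fields
            n = split(a, fa); a = join(fa, fcount + 1, n)
            m = split(b, fb); b = join(fb, fcount + 1, m)
        }
        if (charcount > 0) {            # then drop the leading characters
            a = substr(a, charcount + 1)
            b = substr(b, charcount + 1)
        }
        if ((a "") == (b ""))           # force a string comparison
            print 1
        else
            print 0
    }'
}

same_after_skip 1 0 '10:01 login ok' '10:02 login ok'   # prints 1
same_after_skip 0 0 '10:01 login ok' '10:02 login ok'   # prints 0
same_after_skip 0 4 'id1-abc' 'id2-abc'                 # prints 1
```

Note that rejoining the fields collapses each run of whitespace into a single space, so lines that differ only in spacing after the skipped fields compare equal.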
The following two rules are the body of the program. The first one
is executed only for the very first line of data. It sets 'last' equal
to '$0', so that subsequent lines of text have something to be compared
to.

The second rule does the work. The variable 'equal' is one or zero,
depending upon the results of 'are_equal()''s comparison. If 'uniq' is
counting lines ('-c') and the lines are equal, then it increments the
'count' variable. Otherwise, the two lines are not equal, so it prints
the previous line with its count and resets 'count'.

If 'uniq' is not counting, and if the lines are equal, 'count' is
incremented. Nothing is printed, as the point is to remove duplicates.
If the lines are not equal, the previous line is printed when 'uniq' is
printing repeated lines and more than one copy was seen, or when 'uniq'
is printing nonrepeated lines and only one copy was seen. In either
case, 'last' is then set to the current line and 'count' is reset.

Finally, similar logic is used in the 'END' rule to print the final
line of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
         close(outputfile)
     }
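The decision logic in these rules can be tried without the 'getopt()' and 'join()' libraries. The following cut-down sketch is an illustration, not the program above: 'mini_uniq' and 'emit()' are hypothetical names, the mode is passed as a plain argument instead of being parsed from options, field and character skipping are left out, and the 'NR > 0' guard for empty input is an addition.

```shell
# mini_uniq: hypothetical cut-down version of the main rules.
# $1 is "c" (count), "d" (repeated only), "u" (nonrepeated only),
# or empty for the default of printing one copy of every line.
mini_uniq() {
    awk -v mode="$1" '
    # Print the pending line according to the selected mode.
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if ((mode != "u" && count > 1) || (mode != "d" && count == 1))
            print last
    }
    NR == 1          { last = $0; count = 1; next }
    ($0 "") == last  { count++; next }            # string comparison
                     { emit(); last = $0; count = 1 }
    END              { if (NR > 0) emit() }       # guard is an addition
    '
}

printf 'a\na\nb\n' | mini_uniq c    # prints "   2 a" then "   1 b"
printf 'a\na\nb\n' | mini_uniq d    # prints "a"
printf 'a\na\nb\n' | mini_uniq u    # prints "b"
```

As in the full program, a line is printed only when the next different line arrives, which is why the 'END' rule has to flush the last pending line.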