gawk: Wc Program

11.2.7 Counting Things
----------------------

The 'wc' (word count) utility counts lines, words, and characters in one
or more input files.  Its usage is as follows:

     'wc' ['-lwc'] [FILES ...]

   If no files are specified on the command line, 'wc' reads its
standard input.  If there are multiple files, it also prints total
counts for all the files.  The options and their meanings are as
follows:

'-l'
     Count only lines.

'-w'
     Count only words.  A "word" is a contiguous sequence of
     nonwhitespace characters, separated by spaces and/or TABs.
     Luckily, this is the normal way 'awk' separates fields in its input
     data.

'-c'
     Count only characters.
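
   As a quick illustration of the three options, the following shell
sketch runs the system 'wc' on a small sample.  The sample text and the
variable names are made up for this example, and 'tr' strips the padding
that some 'wc' implementations place around the counts.  Many 'wc'
implementations count bytes for '-c'; for this ASCII sample, bytes and
characters coincide:

```shell
# Hypothetical sample: two lines, three words, six characters
# (each line's newline is counted as a character).
sample='a b
c'

lines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')
chars=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')

echo "$lines $words $chars"
```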

   Implementing 'wc' in 'awk' is particularly elegant, because 'awk'
does a lot of the work for us: it splits lines into words (i.e., fields)
and counts them, it counts lines (i.e., records), and it can easily tell
us how long a line is.

   This program uses the 'getopt()' library function (⇒Getopt
Function) and the file-transition functions (⇒Filetrans Function).

   This version has one notable difference from traditional versions of
'wc': it always prints the counts in the order lines, words, and
characters.  Traditional versions note the order of the '-l', '-w', and
'-c' options on the command line, and print the counts in that order.

   The 'BEGIN' rule does the argument processing.  The variable
'print_total' is true if more than one file is named on the command
line:

     # wc.awk --- count lines, words, characters

     # Options:
     #    -l    only count lines
     #    -w    only count words
     #    -c    only count characters
     #
     # Default is to count lines, words, characters
     #
     # Requires getopt() and file transition library functions

     BEGIN {
         # let getopt() print a message about
         # invalid options. we ignore them
         while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
             if (c == "l")
                 do_lines = 1
             else if (c == "w")
                 do_words = 1
             else if (c == "c")
                 do_chars = 1
         }
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         # if no options, do all
         if (! do_lines && ! do_words && ! do_chars)
             do_lines = do_words = do_chars = 1

         print_total = (ARGC - i > 1)
     }
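
   The loop that clears the 'ARGV' entries relies on 'awk' skipping
command-line arguments whose value is the empty string.  The following
is a minimal sketch of that behavior without 'getopt()'.  The argument
'placeholder', the hardcoded option index of 2, and the temporary file
are assumptions made up for the illustration:

```shell
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

# "placeholder" stands in for an already-processed option.  It names no
# file, so awk would fail to open it if ARGV[1] were left untouched.
# The print_total test is the same expression that wc.awk uses.
result=$(awk 'BEGIN {
    for (i = 1; i < 2; i++)     # pretend Optind is 2
        ARGV[i] = ""
    print_total = (ARGC - i > 1)
}
{ n++ }
END { print n, print_total }' placeholder "$tmp")

echo "$result"
rm -f "$tmp"
```

   Here 'ARGC' is 3 and 'i' is 2 after the loop, so only one file
remains and 'print_total' is false.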

   The 'beginfile()' function is simple; it just resets the counts of
lines, words, and characters to zero, and saves the current file name in
'fname':

     function beginfile(file)
     {
         lines = words = chars = 0
         fname = FILENAME
     }

   The 'endfile()' function adds the current file's numbers to the
running totals of lines, words, and characters.  It then prints out
those numbers for the file that was just read.  It relies on
'beginfile()' to reset the numbers for the following data file:

     function endfile(file)
     {
         tlines += lines
         twords += words
         tchars += chars
         if (do_lines)
             printf "\t%d", lines
         if (do_words)
             printf "\t%d", words
         if (do_chars)
             printf "\t%d", chars
         printf "\t%s\n", fname
     }
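
   Both functions are called by the file-transition library rather than
by 'wc.awk' itself.  The following is not that library's code, only a
simplified sketch of the idea: it detects a new file when 'FNR' returns
to 1, reports the file that just ended, and resets the count.  The file
names and contents are invented for the example, and unlike a complete
solution this sketch silently skips empty files, because they never
produce a record:

```shell
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n' > "$dir/two.txt"

per_file=$(cd "$dir" && awk '
    # First record of a later file: report the previous file, then reset.
    FNR == 1 && NR > 1 { printf "%d %s\n", lines, fname; lines = 0 }
    # Every record: count it and remember which file it came from.
    { lines++; fname = FILENAME }
    # End of input: report the last file, if anything was read at all.
    END { if (NR > 0) printf "%d %s\n", lines, fname }
' one.txt two.txt)

printf '%s\n' "$per_file"
rm -rf "$dir"
```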

   There is one rule that is executed for each line.  It adds the length
of the record, plus one, to 'chars'.(1)  The extra one is needed because
the newline character separating records (the value of 'RS') is not part
of the record itself, and thus is not included in its length.  Next,
'lines' is incremented for each line read, and 'words' is incremented by
the value of 'NF', which is the number of "words" on this line:

     # do per line
     {
         chars += length($0) + 1    # get newline
         lines++
         words += NF
     }
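
   To see the arithmetic at work, this sketch applies the same three
statements to a two-line sample and compares the character count with
'wc -c'.  The sample text and the shell variable names are assumptions
made up for the example; the text is plain ASCII, so bytes and
characters agree:

```shell
text='one two
three four five'

counts=$(printf '%s\n' "$text" | awk '{
    chars += length($0) + 1    # record length plus its newline
    lines++
    words += NF
}
END { print lines, words, chars }')

# wc -c counts bytes, which equal characters for this ASCII sample.
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')

echo "$counts (wc -c: $bytes)"
```

   The records are 7 and 15 characters long, so the total is (7 + 1) +
(15 + 1) = 24.  If the last line of the input lacks a trailing newline,
the rule still adds one for it, and the result is one higher than what
'wc -c' reports.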

   Finally, the 'END' rule simply prints the totals for all the files:

     END {
         if (print_total) {
             if (do_lines)
                 printf "\t%d", tlines
             if (do_words)
                 printf "\t%d", twords
             if (do_chars)
                 printf "\t%d", tchars
             print "\ttotal"
         }
     }

   ---------- Footnotes ----------

   (1) Because 'gawk' understands multibyte locales, this code counts
characters, not bytes.