gawk: Wc Program

11.2.7 Counting Things
----------------------

The 'wc' (word count) utility counts lines, words, and characters in one
or more input files.  Its usage is as follows:

     'wc' ['-lwc'] [FILES ...]

If no files are specified on the command line, 'wc' reads its
standard input.  If there are multiple files, it also prints total
counts for all the files.  The options and their meanings are as
follows:

'-l'
     Count only lines.

'-w'
     Count only words.  A "word" is a contiguous sequence of
     nonwhitespace characters, separated by spaces and/or TABs.
     Luckily, this is the normal way 'awk' separates fields in its input
     data.

'-c'
     Count only characters.

Implementing 'wc' in 'awk' is particularly elegant, because 'awk'
does a lot of the work for us; it splits lines into words (i.e., fields)
and counts them, it counts lines (i.e., records), and it can easily tell
us how long a line is.

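As a small illustration of those built-in facilities (a sketch only, not
part of 'wc.awk'; the shell function name 'show_counts' and the sample
input are assumptions made for this example), the following prints the
record number, the field count, and the record length for each line:

```shell
# show_counts is a hypothetical helper name used only for this sketch.
show_counts() {
    awk '{
        # NR: number of records (lines) read so far
        # NF: number of fields (words) in the current record
        # length($0): length of the record, excluding the newline
        printf "line %d: %d words, %d chars\n", NR, NF, length($0)
    }'
}

printf 'the quick brown fox\njumps over\n' | show_counts
```

With that input, the first line is reported as 4 words and 19
characters, and the second as 2 words and 10 characters.
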
This program uses the 'getopt()' library function (⇒Getopt Function)
and the file-transition functions (⇒Filetrans Function).

This version has one notable difference from traditional versions of
'wc': it always prints the counts in the order lines, words, and
characters.  Traditional versions note the order of the '-l', '-w', and
'-c' options on the command line, and print the counts in that order.

The 'BEGIN' rule does the argument processing.  The variable
'print_total' is true if more than one file is named on the command
line:

     # wc.awk --- count lines, words, characters

     # Options:
     #    -l    only count lines
     #    -w    only count words
     #    -c    only count characters
     #
     # Default is to count lines, words, characters
     #
     # Requires getopt() and file transition library functions

     BEGIN {
         # let getopt() print a message about
         # invalid options. we ignore them
         while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
             if (c == "l")
                 do_lines = 1
             else if (c == "w")
                 do_words = 1
             else if (c == "c")
                 do_chars = 1
         }
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         # if no options, do all
         if (! do_lines && ! do_words && ! do_chars)
             do_lines = do_words = do_chars = 1

         print_total = (ARGC - i > 1)
     }

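The decision logic in that rule can be tried out in isolation.  The
following is a minimal sketch under stated assumptions: the shell
function name 'parse_wc_args' is hypothetical, the arguments are passed
in through an 'awk' variable rather than 'ARGV', and the hand-written
loop is only a simplified stand-in for 'getopt()' (it does not handle
bundled options such as '-lw').  It reports which counts would be
printed and whether a total line is wanted:

```shell
# parse_wc_args is a hypothetical name; the loop below is a simplified
# stand-in for getopt(), not the library function itself.
parse_wc_args() {
    awk -v args="$*" 'BEGIN {
        n = split(args, argv, " ")
        # consume leading single-letter options such as -l or -w
        for (i = 1; i <= n && argv[i] ~ /^-[lwc]$/; i++) {
            c = substr(argv[i], 2, 1)
            if (c == "l")
                do_lines = 1
            else if (c == "w")
                do_words = 1
            else if (c == "c")
                do_chars = 1
        }
        # if no options, do all
        if (! do_lines && ! do_words && ! do_chars)
            do_lines = do_words = do_chars = 1
        # i now indexes the first file name; more than one
        # remaining file means a total line is wanted
        print_total = (n - i + 1 > 1)
        printf "lines=%d words=%d chars=%d total=%d\n",
               do_lines, do_words, do_chars, print_total
    }'
}

parse_wc_args -l a.txt b.txt
parse_wc_args notes.txt
```

The first call selects only the line count and asks for a total, because
two files follow the option.  The second call selects all three counts
and no total.
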
The 'beginfile()' function is simple; it just resets the counts of
lines, words, and characters to zero, and saves the current file name in
'fname':

     function beginfile(file)
     {
         lines = words = chars = 0
         fname = FILENAME
     }

The 'endfile()' function adds the current file's numbers to the
running totals of lines, words, and characters.  It then prints out
those numbers for the file that was just read.  It relies on
'beginfile()' to reset the numbers for the following data file:

     function endfile(file)
     {
         tlines += lines
         twords += words
         tchars += chars
         if (do_lines)
             printf "\t%d", lines
         if (do_words)
             printf "\t%d", words
         if (do_chars)
             printf "\t%d", chars
         printf "\t%s\n", fname
     }

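'awk' does not call 'beginfile()' and 'endfile()' on its own; the
file-transition library supplies the rules that do.  The library code is
not shown in this section, so the following is only a sketch of how such
rules might look.  The variable name 'oldfile', the shell function name
'per_file_lines', and the stripped-down function bodies are assumptions
made for this example:

```shell
# per_file_lines is a hypothetical name used only for this sketch.
per_file_lines() {
    awk '
    function beginfile(file) { lines = 0 }
    function endfile(file)   { printf "%d %s\n", lines, file }

    # a change in FILENAME marks the start of a new data file
    FILENAME != oldfile {
        if (oldfile != "")
            endfile(oldfile)
        oldfile = FILENAME
        beginfile(FILENAME)
    }

    { lines++ }

    # the last file has no successor, so close it out here
    END { endfile(FILENAME) }
    ' "$@"
}
```

Given two files of two and three lines, this prints one line per file,
each with that file's own count.  A sketch this simple never notices an
empty file, because an empty file produces no records.
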
There is one rule that is executed for each line.  It adds the length
of the record, plus one, to 'chars'.(1)  The extra one is needed because
the newline character separating records (the value of 'RS') is not part
of the record itself, and thus is not included in its length.  Next,
'lines' is incremented for each line read, and 'words' is incremented by
the value of 'NF', which is the number of "words" on this line:

     # do per line
     {
         chars += length($0) + 1    # get newline
         lines++
         words += NF
     }

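The arithmetic of that rule can be checked on a small input.  This is a
sketch only: the shell function name 'count_all' and the 'END' rule that
prints the three numbers are assumptions made for this example, standing
in for the 'endfile()' machinery:

```shell
# count_all is a hypothetical name used only for this sketch.
count_all() {
    awk '{
        chars += length($0) + 1   # record length plus the newline
        lines++
        words += NF
    }
    END { printf "%d %d %d\n", lines, words, chars }'
}

printf 'hello world\nfoo\n' | count_all
```

Here the two records have lengths 11 and 3, so 'chars' ends up as
(11 + 1) + (3 + 1) = 16, alongside 2 lines and 3 words.  Note that the
rule adds one for every record, so an input whose last line lacks a
trailing newline is counted one character too high.
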
Finally, the 'END' rule simply prints the totals for all the files:

     END {
         if (print_total) {
             if (do_lines)
                 printf "\t%d", tlines
             if (do_words)
                 printf "\t%d", twords
             if (do_chars)
                 printf "\t%d", tchars
             print "\ttotal"
         }
     }

---------- Footnotes ----------

(1) Because 'gawk' understands multibyte locales, this code counts
characters, not bytes.
