gawk: Word Sorting

11.3.5 Generating Word-Usage Counts
-----------------------------------

When working with large amounts of text, it can be interesting to know
how often different words appear.  For example, an author may overuse
certain words, in which case he or she might wish to find synonyms to
substitute for words that appear too often.  This node develops a
program for counting words and presenting the frequency information in
a useful format.

At first glance, a program like this would seem to do the job:

     # wordfreq-first-try.awk --- print list of word frequencies

     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

The program relies on 'awk''s default field-splitting mechanism to
break each line up into "words" and uses an associative array named
'freq', indexed by each word, to count the number of times the word
occurs.  In the 'END' rule, it prints the counts.

This program has several problems that prevent it from being useful on
real text files:

   * The 'awk' language considers upper- and lowercase characters to be
     distinct.  Therefore, "bartender" and "Bartender" are not treated
     as the same word.  This is undesirable, because words are
     capitalized if they begin sentences in normal text, and a
     frequency analyzer should not be sensitive to capitalization.

   * Words are detected using the 'awk' convention that fields are
     separated just by whitespace.  Other characters in the input
     (except newlines) don't have any special meaning to 'awk'.  This
     means that punctuation characters count as part of words.

   * The output does not come out in any useful order.  You're more
     likely to be interested in which words occur most frequently or in
     having an alphabetized table of how frequently each word occurs.

The first problem can be solved by using 'tolower()' to remove case
distinctions.  The second problem can be solved by using 'gsub()' to
remove punctuation characters.  Finally, we solve the third problem by
using the system 'sort' utility to process the output of the 'awk'
script.  Here is the new version of the program:

     # wordfreq.awk --- print list of word frequencies

     {
         $0 = tolower($0)    # remove case distinctions
         # remove punctuation
         gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

The regexp '/[^[:alnum:]_[:blank:]]/' might have been written
'/[[:punct:]]/', but then underscores would also be removed, and we
want to keep them.
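The difference between the two regexps is easy to check.  This sketch
(again assuming a POSIX 'awk'; the sample line is invented) applies
each 'gsub()' to a copy of the same input; only '[[:punct:]]' strips
the underscore out of 'word_count':

```shell
# Compare the two candidate regexps on a line with an underscore.
echo 'see word_count, above.' |
awk '{ s = t = $0
       gsub(/[^[:alnum:]_[:blank:]]/, "", s)   # used by wordfreq.awk
       gsub(/[[:punct:]]/, "", t)              # also deletes "_"
       print s
       print t }'
```

The first line printed keeps 'word_count' intact; the second collapses
it to 'wordcount'.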

Assuming we have saved this program in a file named 'wordfreq.awk', and
that the data is in 'file1', the following pipeline:

     awk -f wordfreq.awk file1 | sort -k 2nr

produces a table of the words appearing in 'file1' in order of
decreasing frequency.

The 'awk' program suitably massages the data and produces a word
frequency table, which is not ordered.  The 'awk' script's output is
then sorted by the 'sort' utility and printed on the screen.

The options given to 'sort' specify a sort that uses the second field
of each input line (skipping one field), that the sort keys should be
treated as numeric quantities (otherwise '15' would come before '5'),
and that the sorting should be done in descending (reverse) order.
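The effect of '-k 2nr' can be seen on a hand-made frequency table (the
words and counts below are invented for illustration).  The numeric
flag is what puts 15 ahead of 5; a plain textual comparison of the
second field would rank "15" below "5":

```shell
# Sort a small frequency table on field 2, numerically, descending.
printf 'cat\t5\nthe\t15\nsaw\t2\n' | sort -k 2nr
```

The rows come out as 'the 15', 'cat 5', 'saw 2', i.e., most frequent
word first.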

The 'sort' could even be done from within the program, by changing the
'END' action to:

     END {
         sort = "sort -k 2nr"
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)
     }

This way of sorting must be used on systems that do not have true pipes
at the command-line (or batch-file) level.  See the general operating
system documentation for more information on how to use the 'sort'
program.