gawk: Word Sorting

11.3.5 Generating Word-Usage Counts
-----------------------------------

When working with large amounts of text, it can be interesting to know
how often different words appear.  For example, an author may overuse
certain words, in which case he or she might wish to find synonyms to
substitute for words that appear too often.  This node develops a
program for counting words and presenting the frequency information in
a useful format.

At first glance, a program like this would seem to do the job:

     # wordfreq-first-try.awk --- print list of word frequencies

     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

The program relies on 'awk''s default field-splitting mechanism to
break each line up into "words" and uses an associative array named
'freq', indexed by each word, to count the number of times the word
occurs.  In the 'END' rule, it prints the counts.

This program has several problems that prevent it from being useful on
real text files:

   * The 'awk' language considers upper- and lowercase characters to be
     distinct.  Therefore, "bartender" and "Bartender" are not treated
     as the same word.  This is undesirable, because words are
     capitalized if they begin sentences in normal text, and a
     frequency analyzer should not be sensitive to capitalization.

   * Words are detected using the 'awk' convention that fields are
     separated just by whitespace.  Other characters in the input
     (except newlines) don't have any special meaning to 'awk'.  This
     means that punctuation characters count as part of words.

   * The output does not come out in any useful order.  You're more
     likely to be interested in which words occur most frequently or in
     having an alphabetized table of how frequently each word occurs.

The first problem can be solved by using 'tolower()' to remove case
distinctions.  The second problem can be solved by using 'gsub()' to
remove punctuation characters.  Finally, we solve the third problem by
using the system 'sort' utility to process the output of the 'awk'
script.  Here is the new version of the program:

     # wordfreq.awk --- print list of word frequencies

     {
         $0 = tolower($0)    # remove case distinctions
         # remove punctuation
         gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

The regexp '/[^[:alnum:]_[:blank:]]/' might have been written
'/[[:punct:]]/', but then underscores would also be removed, and we
want to keep them.
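The difference between the two regexps is easy to check.  This sketch
(again assuming a POSIX 'awk'; the sample line is invented) applies
each 'gsub()' to a copy of the same input; only '[[:punct:]]' strips
the underscore out of 'word_count':

```shell
# Compare the two candidate regexps on a line with an underscore.
echo 'see word_count, above.' |
awk '{ s = t = $0
       gsub(/[^[:alnum:]_[:blank:]]/, "", s)   # used by wordfreq.awk
       gsub(/[[:punct:]]/, "", t)              # also deletes "_"
       print s
       print t }'
```

The first line printed keeps 'word_count' intact; the second collapses
it to 'wordcount'.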

Assuming we have saved this program in a file named 'wordfreq.awk', and
that the data is in 'file1', the following pipeline:

     awk -f wordfreq.awk file1 | sort -k 2nr

produces a table of the words appearing in 'file1' in order of
decreasing frequency.

The 'awk' program suitably massages the data and produces a word
frequency table, which is not ordered.  The 'awk' script's output is
then sorted by the 'sort' utility and printed on the screen.

The options given to 'sort' specify a sort that uses the second field
of each input line (skipping one field), that the sort keys should be
treated as numeric quantities (otherwise '15' would come before '5'),
and that the sorting should be done in descending (reverse) order.
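The effect of '-k 2nr' can be seen on a hand-made frequency table (the
words and counts below are invented for illustration).  The numeric
flag is what puts 15 ahead of 5; a plain textual comparison of the
second field would rank "15" below "5":

```shell
# Sort a small frequency table on field 2, numerically, descending.
printf 'cat\t5\nthe\t15\nsaw\t2\n' | sort -k 2nr
```

The rows come out as 'the 15', 'cat 5', 'saw 2', i.e., most frequent
word first.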

The 'sort' could even be done from within the program, by changing the
'END' action to:

     END {
         sort = "sort -k 2nr"
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)
     }

This way of sorting must be used on systems that do not have true pipes
at the command-line (or batch-file) level.  See the general operating
system documentation for more information on how to use the 'sort'
program.