gawk: Anagram Program

1 
1 11.3.10 Finding Anagrams from a Dictionary
1 ------------------------------------------
1 
1 An interesting programming challenge is to search for "anagrams" in a
1 word list (such as '/usr/share/dict/words' on many GNU/Linux systems).
1 One word is an anagram of another if both words contain the same letters
1 (e.g., "babbling" and "blabbing").
1 
1    Column 2, Problem C, of Jon Bentley's 'Programming Pearls', Second
1 Edition, presents an elegant algorithm.  The idea is to give words that
1 are anagrams a common signature, sort all the words together by their
1 signatures, and then print them.  Dr. Bentley observes that taking the
1 letters in each word and sorting them produces those common signatures.
1 
1    The following program uses arrays of arrays to bring together words
1 with the same signature and array sorting to print the words in sorted
1 order:
1 
1      # anagram.awk --- An implementation of the anagram-finding algorithm
1      #                 from Jon Bentley's "Programming Pearls," 2nd edition.
1      #                 Addison Wesley, 2000, ISBN 0-201-65788-0.
1      #                 Column 2, Problem C, section 2.8, pp 18-20.
1 
1      /'s$/   { next }        # Skip possessives
1 
1    The program starts with a header, and then a rule to skip possessives
1 in the dictionary file.  The next rule builds up the data structure.
1 The first dimension of the array is indexed by the signature; the second
1 dimension is the word itself:
1 
1      {
1          key = word2key($1)  # Build signature
1          data[key][$1] = $1  # Store word with signature
1      }
1 
1    The 'word2key()' function creates the signature.  It splits the word
1 apart into individual letters, sorts the letters, and then joins them
1 back together:
1 
1      # word2key --- split word apart into letters, sort, and join back together
1 
1      function word2key(word,     a, i, n, result)
1      {
1          n = split(word, a, "")
1          asort(a)
1 
1          for (i = 1; i <= n; i++)
1              result = result a[i]
1 
1          return result
1      }
1 
1    Finally, the 'END' rule traverses the array and prints out the
1 anagram lists.  It sends the output to the system 'sort' command because
1 otherwise the anagrams would appear in arbitrary order:
1 
1      END {
1          sort = "sort"
1          for (key in data) {
1              # Sort words with same key
1              nwords = asorti(data[key], words)
1              if (nwords == 1)
1                  continue
1 
1              # And print. Minor glitch: trailing space at end of each line
1              for (j = 1; j <= nwords; j++)
1                  printf("%s ", words[j]) | sort
1              print "" | sort
1          }
1          close(sort)
1      }
1 
1    Here is some partial output when the program is run:
1 
1      $ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
1      ...
1      babbled blabbed
1      babbler blabber brabble
1      babblers blabbers brabbles
1      babbling blabbing
1      babbly blabby
1      babel bable
1      babels beslab
1      babery yabber
1      ...
1