gawk: Anagram Program
1
1 11.3.10 Finding Anagrams from a Dictionary
1 ------------------------------------------
1
1 An interesting programming challenge is to search for "anagrams" in a
1 word list (such as '/usr/share/dict/words' on many GNU/Linux systems).
1 One word is an anagram of another if both words contain the same letters
1 (e.g., "babbling" and "blabbing").
1
1 Column 2, Problem C, of Jon Bentley's 'Programming Pearls', Second
1 Edition, presents an elegant algorithm. The idea is to give words that
1 are anagrams a common signature, sort all the words together by their
1 signatures, and then print them. Dr. Bentley observes that taking the
1 letters in each word and sorting them produces those common signatures.
1
1 The following program uses arrays of arrays to bring together words
1 with the same signature and array sorting to print the words in sorted
1 order:
1
1 # anagram.awk --- An implementation of the anagram-finding algorithm
1 # from Jon Bentley's "Programming Pearls," 2nd edition.
1 # Addison Wesley, 2000, ISBN 0-201-65788-0.
1 # Column 2, Problem C, section 2.8, pp 18-20.
1
1 /'s$/ { next } # Skip possessives
1
1 The program starts with a header, and then a rule to skip possessives
1 in the dictionary file. The next rule builds up the data structure.
1 The first dimension of the array is indexed by the signature; the second
1 dimension is the word itself:
1
1 {
1 key = word2key($1) # Build signature
1 data[key][$1] = $1 # Store word with signature
1 }
1
1 The 'word2key()' function creates the signature. It splits the word
1 apart into individual letters, sorts the letters, and then joins them
1 back together:
1
1 # word2key --- split word apart into letters, sort, and join back together
1
1 function word2key(word, a, i, n, result)
1 {
1 n = split(word, a, "")
1 asort(a)
1
1 for (i = 1; i <= n; i++)
1 result = result a[i]
1
1 return result
1 }
1
1 Finally, the 'END' rule traverses the array and prints out the
1 anagram lists. It sends the output to the system 'sort' command because
1 otherwise the anagrams would appear in arbitrary order:
1
1 END {
1 sort = "sort"
1 for (key in data) {
1 # Sort words with same key
1 nwords = asorti(data[key], words)
1 if (nwords == 1)
1 continue
1
1 # And print. Minor glitch: trailing space at end of each line
1 for (j = 1; j <= nwords; j++)
1 printf("%s ", words[j]) | sort
1 print "" | sort
1 }
1 close(sort)
1 }
1
1 Here is some partial output when the program is run:
1
1 $ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
1 ...
1 babbled blabbed
1 babbler blabber brabble
1 babblers blabbers brabbles
1 babbling blabbing
1 babbly blabby
1 babel bable
1 babels beslab
1 babery yabber
1 ...
1