gawk: Dupword Program

11.3.1 Finding Duplicated Words in a Document
---------------------------------------------

A common error when writing large amounts of prose is to accidentally
duplicate words.  Typically you will see this in text as something like
"the the program does the following..."  When the text is online, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.

This program, 'dupword.awk', scans through a file one line at a time
and looks for adjacent occurrences of the same word.  It also saves the
last word on a line (in the variable 'prev') for comparison with the
first word on the next line.

The first statement makes sure that the line is all lowercase, so
that, for example, "The" and "the" compare equal to each other.  The
next statement replaces nonalphanumeric and nonwhitespace characters
with spaces, so that punctuation does not affect the comparison either.
The characters are replaced with spaces so that formatting controls
don't create nonsense words (e.g., the Texinfo '@code{NF}' becomes
'codeNF' if punctuation is simply deleted).  The record is then resplit
into fields, yielding just the actual words on the line, and ensuring
that there are no empty fields.
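
To see what these normalization steps leave behind, here is a small
sketch that applies the same three statements to one line and prints
the resulting fields separated by '|'.  The shell function name
'normalize' and the sample sentence are made up for this illustration;
they are not part of 'dupword.awk'.

```shell
# normalize: hypothetical helper, for illustration only.
# Applies the same lowercase / punctuation-to-space / re-split steps
# and prints the resulting fields joined by "|".
normalize() {
    awk '{
        $0 = tolower($0)                      # fold everything to lowercase
        gsub(/[^[:alnum:][:blank:]]/, " ")    # punctuation becomes spaces
        $0 = $0                               # re-split into fields
        for (i = 1; i <= NF; i++)
            printf("%s%s", $i, (i < NF ? "|" : "\n"))
    }'
}

printf '%s\n' 'The value of @code{NF}, the field count.' | normalize
# prints: the|value|of|code|nf|the|field|count
```

The '@', the braces, the comma, and the period all turn into spaces, so
'code' and 'nf' come out as two separate words instead of being fused
into one.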

If there are no fields left after removing all the punctuation, the
current record is skipped.  Otherwise, the program loops through each
word, comparing it to the previous one:

     # dupword.awk --- find duplicate words in text
     {
         $0 = tolower($0)
         gsub(/[^[:alnum:][:blank:]]/, " ");
         $0 = $0         # re-split
         if (NF == 0)
             next
         if ($1 == prev)
             printf("%s:%d: duplicate %s\n",
                    FILENAME, FNR, $1)
         for (i = 2; i <= NF; i++)
             if ($i == $(i-1))
                 printf("%s:%d: duplicate %s\n",
                        FILENAME, FNR, $i)
         prev = $NF
     }
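
As a quick check, the following shell sketch saves the program to a
scratch directory, creates a small sample file, and runs the program on
it.  The file name 'sample.txt' and its contents are invented for this
illustration, and 'awk' here stands for whatever POSIX-compatible awk
is installed (substitute 'gawk' if you prefer).

```shell
# Work in a scratch directory so that no existing file is overwritten.
dir=$(mktemp -d)

# The same program as in the listing, saved to a file.
cat > "$dir/dupword.awk" <<'EOF'
# dupword.awk --- find duplicate words in text
{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ");
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
               FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                   FILENAME, FNR, $i)
    prev = $NF
}
EOF

# sample.txt (made-up input): one duplicate inside a line and one
# that is split across a line break.
printf '%s\n' \
    'This is the the first line, and it ends with' \
    'with a duplicate. Nothing wrong here.' > "$dir/sample.txt"

(cd "$dir" && awk -f dupword.awk sample.txt)
# prints:
#   sample.txt:1: duplicate the
#   sample.txt:2: duplicate with
```

The second message is the case that is hard to spot by eye: 'with' ends
the first line and begins the second, and it is caught only because
'prev' carries the last word over to the next record.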