Info: (gawk) Dupword Program

11.3.1 Finding Duplicated Words in a Document
---------------------------------------------

A common error when writing large amounts of prose is to accidentally
duplicate words.  Typically you will see this in text as something like
"the the program does the following..."  When the text is online, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.

   This program, 'dupword.awk', scans through a file one line at a time
and looks for adjacent occurrences of the same word.  It also saves the
last word on a line (in the variable 'prev') for comparison with the
first word on the next line.

   The first statement makes sure that the line is all lowercase, so
that, for example, "The" and "the" compare equal to each other.  The
next statement replaces every character that is neither alphanumeric
nor a blank (space or TAB) with a space, so that punctuation does not
affect the comparison either.  The characters are replaced with spaces
so that formatting controls don't create nonsense words (e.g., the
Texinfo '@code{NF}' becomes 'codeNF' if punctuation is simply deleted).
The record is then resplit into fields, yielding just the actual words
on the line, and ensuring that there are no empty fields.
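
   The effect of these normalizing statements can be seen in isolation
with a small sketch.  It assumes a POSIX shell and an 'awk' that
understands bracket classes such as '[:alnum:]' (as 'gawk' does).  The
helper name 'show_words', its replacement-text argument, and the sample
line are hypothetical and exist only for this illustration:

```shell
# show_words is a hypothetical helper, not part of dupword.awk.
# $1 is the replacement text for punctuation: " " mirrors the program,
# "" shows what happens if punctuation is simply deleted.
show_words() {
    awk -v rep="$1" '{
        $0 = tolower($0)                        # fold case
        gsub(/[^[:alnum:][:blank:]]/, rep)      # replace punctuation
        $0 = $0                                 # re-split into fields
        for (i = 1; i <= NF; i++)               # illustration only
            printf("%d:%s\n", i, $i)
    }'
}

line='The @code{NF} variable, explained.'
printf '%s\n' "$line" | show_words " "
printf '%s\n' "$line" | show_words ""
```

   With a space as the replacement, '@code{NF}' yields the two words
'code' and 'nf'.  With an empty replacement it collapses into the single
nonsense word 'codenf' (lowercase here, because the case folding has
already been applied).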

   If there are no fields left after removing all the punctuation, the
current record is skipped.  Otherwise, the program loops through each
word, comparing it to the previous one:

     # dupword.awk --- find duplicate words in text
     {
         $0 = tolower($0)
         gsub(/[^[:alnum:][:blank:]]/, " ");
         $0 = $0         # re-split
         if (NF == 0)
             next
         if ($1 == prev)
             printf("%s:%d: duplicate %s\n",
                 FILENAME, FNR, $1)
         for (i = 2; i <= NF; i++)
             if ($i == $(i-1))
                 printf("%s:%d: duplicate %s\n",
                     FILENAME, FNR, $i)
         prev = $NF
     }
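
   As a rough usage sketch, the program can be run on a small sample
file.  The wrapper function 'run_dupword_demo', the file name
'sample.txt', and the sample text are all made up for this
illustration.  Only the 'awk' program itself comes from the listing
above, and a POSIX shell with 'mktemp' is assumed:

```shell
# Hypothetical demo wrapper: the function name, file names, and sample
# text are invented for illustration.
run_dupword_demo() {
    dir=$(mktemp -d) || return 1

    # The program exactly as listed in the text.
    cat > "$dir/dupword.awk" <<'EOF'
{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ");
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
}
EOF

    # Line 1 has a duplicate within the line and ends with "the";
    # line 2 starts with "The" and has a second duplicate of its own.
    printf '%s\n' 'The the program does the' \
                  'The following, and and so on.' > "$dir/sample.txt"

    # Run from inside the directory so FILENAME is just "sample.txt".
    (cd "$dir" && awk -f dupword.awk sample.txt)
    rm -rf "$dir"
}

run_dupword_demo
```

   For this sample the program reports 'the' on line 1, then 'the' and
'and' on line 2.  The duplicate that spans the line break is attributed
to the second line, because the comparison with 'prev' happens when the
following record is read.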