gawk: History Sorting

11.3.6 Removing Duplicates from Unsorted Text
---------------------------------------------

The 'uniq' program (⇒Uniq Program) removes duplicate lines from
_sorted_ data.

Suppose, however, that you need to remove duplicate lines from a data
file while preserving the order in which the lines appear.  A good
example of this might be a shell history file.  The history file keeps
a copy of all the commands you have entered, and it is not unusual to
repeat a command several times in a row.  Occasionally you might want
to compact the history by removing duplicate entries, yet it is
desirable to maintain the order of the original commands.

This simple program does the job.  It uses two arrays.  The 'data'
array is indexed by the text of each line.  For each line, 'data[$0]'
is incremented.  If a particular line has not been seen before, then
'data[$0]' is zero.  In this case, the text of the line is stored in
'lines[count]'.  Each element of 'lines' is a unique command, and the
indices of 'lines' indicate the order in which those lines are
encountered.  The 'END' rule simply prints out the lines, in order:

     # histsort.awk --- compact a shell history file
     # Thanks to Byron Rakitzis for the general idea

     {
         if (data[$0]++ == 0)
             lines[++count] = $0
     }

     END {
         for (i = 1; i <= count; i++)
             print lines[i]
     }
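
As a quick check of the behavior, the whole program can also be run
inline from the shell (shown here with plain 'awk'; 'gawk' behaves the
same).  The sample command history below is made up for illustration:

```shell
# Made-up history with repeated commands; the output keeps one
# copy of each line, in order of first appearance.
printf 'ls\ncd /tmp\nls\nmake\nls\n' | awk '
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
END {
    for (i = 1; i <= count; i++)
        print lines[i]
}'
# prints: ls, cd /tmp, make (each on its own line)
```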

This program also provides a foundation for generating other useful
information.  For example, using the following 'print' statement in the
'END' rule indicates how often a particular command is used:

     print data[lines[i]], lines[i]

This works because 'data[$0]' is incremented each time a line is seen.
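
A minimal sketch of that variant, again run inline with plain 'awk' on
the same made-up history, so each unique command is printed with its
use count:

```shell
# Same program, but the END rule prints each line's count first.
printf 'ls\ncd /tmp\nls\nmake\nls\n' | awk '
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
END {
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}'
# prints: "3 ls", "1 cd /tmp", "1 make" (each on its own line)
```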