gawk: History Sorting

11.3.6 Removing Duplicates from Unsorted Text
---------------------------------------------

The 'uniq' program (⇒Uniq Program) removes duplicate lines from
_sorted_ data.

Suppose, however, that you need to remove duplicate lines from a data
file while preserving the order in which the lines appear.  A good
example of this might be a shell history file.  The history file keeps
a copy of all the commands you have entered, and it is not unusual to
repeat a command several times in a row.  Occasionally you might want
to compact the history by removing duplicate entries, yet it is
desirable to maintain the order of the original commands.

This simple program does the job.  It uses two arrays.  The 'data'
array is indexed by the text of each line.  For each line, 'data[$0]'
is incremented.  If a particular line has not been seen before, then
'data[$0]' is zero.  In this case, the text of the line is stored in
'lines[count]'.  Each element of 'lines' is a unique command, and the
indices of 'lines' indicate the order in which those lines are
encountered.  The 'END' rule simply prints out the lines, in order:

     # histsort.awk --- compact a shell history file
     # Thanks to Byron Rakitzis for the general idea

     {
         if (data[$0]++ == 0)
             lines[++count] = $0
     }

     END {
         for (i = 1; i <= count; i++)
             print lines[i]
     }
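
As a quick check of the behavior, the whole program can also be run
inline from the shell (shown here with plain 'awk'; 'gawk' behaves the
same).  The sample command history below is made up for illustration:

```shell
# Made-up history with repeated commands; the output keeps one
# copy of each line, in order of first appearance.
printf 'ls\ncd /tmp\nls\nmake\nls\n' | awk '
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
END {
    for (i = 1; i <= count; i++)
        print lines[i]
}'
# prints: ls, cd /tmp, make (each on its own line)
```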

This program also provides a foundation for generating other useful
information.  For example, using the following 'print' statement in the
'END' rule indicates how often a particular command is used:

     print data[lines[i]], lines[i]

This works because 'data[$0]' is incremented each time a line is seen.
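
A minimal sketch of that variant, again run inline with plain 'awk' on
the same made-up history, so each unique command is printed with its
use count:

```shell
# Same program, but the END rule prints each line's count first.
printf 'ls\ncd /tmp\nls\nmake\nls\n' | awk '
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
END {
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}'
# prints: "3 ls", "1 cd /tmp", "1 make" (each on its own line)
```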