diffutils: diff Performance

6 'diff' Performance Tradeoffs
******************************

GNU 'diff' runs quite efficiently; however, in some circumstances you
can cause it to run faster or produce a more compact set of changes.

   One way to improve 'diff' performance is to use hard or symbolic
links to files instead of copies.  This improves performance because
'diff' normally does not need to read two hard or symbolic links to the
same file, since their contents must be identical.  For example, suppose
you copy a large directory hierarchy, make a few changes to the copy,
and then often use 'diff -r' to compare the original to the copy.  If
the original files are read-only, you can greatly improve performance by
creating the copy using hard or symbolic links (e.g., with GNU 'cp -lR'
or 'cp -sR').  Before editing a file in the copy for the first time, you
should break the link and replace it with a regular copy.
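
   The workflow above can be sketched as follows.  This is only an
illustration under stated assumptions: the scratch directory and the
file names 'a.txt' and 'sub/b.txt' are hypothetical, GNU 'cp' is assumed
for the '-l' option, and the copy-then-rename step is just one possible
way to break a hard link.

```shell
# Hypothetical scratch tree; every path below is illustrative only.
work=$(mktemp -d)
mkdir -p "$work/orig/sub"
printf 'alpha\n' > "$work/orig/a.txt"
printf 'beta\n' > "$work/orig/sub/b.txt"

# Copy the hierarchy with hard links instead of duplicating contents.
cp -lR "$work/orig" "$work/copy"

# Before the first edit, break the link: make a regular copy and
# rename it over the link, so the original file is left untouched.
cp -p "$work/copy/a.txt" "$work/copy/a.txt.new"
mv "$work/copy/a.txt.new" "$work/copy/a.txt"
printf 'gamma\n' >> "$work/copy/a.txt"

# diff exits with status 1 when it finds differences; accept that
# status but still fail on real trouble (status 2).
diff -r "$work/orig" "$work/copy" > "$work/report.txt" || [ $? -eq 1 ]
cat "$work/report.txt"
```

   In this sketch only 'a.txt' should appear in the report, while
'sub/b.txt' remains a link to the original file.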

   You can also affect the performance of GNU 'diff' by giving it
options that change the way it compares files.  Performance has more
than one dimension.  These options improve one aspect of performance at
the cost of another, or they improve performance in some cases while
hurting it in others.

   The way that GNU 'diff' determines which lines have changed always
comes up with a near-minimal set of differences.  Usually it is good
enough for practical purposes.  If the 'diff' output is large, you might
want 'diff' to use a modified algorithm that sometimes produces a
smaller set of differences.  The '--minimal' ('-d') option does this;
however, it can also cause 'diff' to run more slowly than usual, so it
is not the default behavior.
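
   One way to see whether '--minimal' helps for a given pair of files
is to count the changed lines in both outputs.  The sketch below assumes
two small generated files, 'old.txt' and 'new.txt' (hypothetical names);
on inputs this small the two counts will most likely be the same, and
the point is only to show the comparison.

```shell
# Hypothetical inputs: a numbered file and an edited copy of it.
work=$(mktemp -d)
seq 1 200 > "$work/old.txt"
sed -e '50d' -e '120s/$/ changed/' "$work/old.txt" > "$work/new.txt"

# diff exits with status 1 when the files differ, so accept that status.
diff "$work/old.txt" "$work/new.txt" \
    > "$work/default.diff" || [ $? -eq 1 ]
diff --minimal "$work/old.txt" "$work/new.txt" \
    > "$work/minimal.diff" || [ $? -eq 1 ]

# Count the '<' and '>' lines of each normal-format output.
default_count=$(grep -c '^[<>]' "$work/default.diff")
minimal_count=$(grep -c '^[<>]' "$work/minimal.diff")
echo "default: $default_count  --minimal: $minimal_count"
```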

   When the files you are comparing are large and have small groups of
changes scattered throughout them, you can use the '--speed-large-files'
option to make a different modification to the algorithm that 'diff'
uses.  If the input files have a constant small density of changes, this
option speeds up the comparisons without changing the output.  If not,
'diff' might produce a larger set of differences; however, the output
will still be correct.
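
   A rough way to check the effect of '--speed-large-files' on your own
data is to compare its output with the default output.  The following
sketch assumes a generated 100000-line file with every 2500th line
edited; the file names and sizes are hypothetical, and any speedup on
such an input may be too small to notice.

```shell
# Hypothetical large input with a small, evenly spread set of edits.
work=$(mktemp -d)
seq 1 100000 > "$work/old.txt"
awk 'NR % 2500 == 0 { print $0 " edited"; next } { print }' \
    "$work/old.txt" > "$work/new.txt"

# diff exits with status 1 when the files differ, so accept that status.
diff "$work/old.txt" "$work/new.txt" \
    > "$work/default.diff" || [ $? -eq 1 ]
diff --speed-large-files "$work/old.txt" "$work/new.txt" \
    > "$work/fast.diff" || [ $? -eq 1 ]

# Report whether the option changed the output for this input.
if cmp -s "$work/default.diff" "$work/fast.diff"; then
    echo "outputs are identical"
else
    echo "outputs differ"
fi
```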

   Normally 'diff' discards the prefix and suffix that are common to
both files before it attempts to find a minimal set of differences.
This makes 'diff' run faster, but occasionally it may produce
non-minimal output.  The '--horizon-lines=LINES' option prevents 'diff'
from discarding the last LINES lines of the prefix and the first LINES
lines of the suffix.  This gives 'diff' further opportunities to find a
minimal output.
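
   As an illustration of the option's usage, the sketch below compares
two generated files that share a long common prefix and suffix.  The
file names and the value 50 are arbitrary assumptions; for an input this
simple the output is unlikely to differ from the default.

```shell
# Hypothetical inputs: lines 1-499 form a common prefix and lines
# 701-1000 a common suffix.
work=$(mktemp -d)
seq 1 1000 > "$work/old.txt"
awk 'NR == 500 { next } { print } NR == 700 { print "extra" }' \
    "$work/old.txt" > "$work/new.txt"

# Keep 50 lines of the common prefix and suffix available to the
# comparison; accept exit status 1, which only means "files differ".
diff --horizon-lines=50 "$work/old.txt" "$work/new.txt" \
    > "$work/horizon.diff" || [ $? -eq 1 ]
cat "$work/horizon.diff"
```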

   Suppose a run of changed lines includes a sequence of lines at one
end and there is an identical sequence of lines just outside the other
end.  The 'diff' command is free to choose which identical sequence is
included in the hunk.  In this case, 'diff' normally shifts the hunk's
boundaries when this merges adjacent hunks, or shifts a hunk's lines
towards the end of the file.  Merging hunks can make the output look
nicer in some cases.
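
   A tiny case shows the ambiguity.  In the sketch below (hypothetical
file names), the new file gains one 'x' line next to an identical 'x'
line, so reporting the addition after line 1 or after line 2 would be
equally correct; which one appears depends on how 'diff' shifts the
hunk.

```shell
# Hypothetical inputs: the added 'x' is adjacent to an identical line.
work=$(mktemp -d)
printf '%s\n' start x end > "$work/old.txt"
printf '%s\n' start x x end > "$work/new.txt"

# Accept exit status 1, which only means "files differ".
diff "$work/old.txt" "$work/new.txt" > "$work/shift.diff" || [ $? -eq 1 ]
cat "$work/shift.diff"
```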