grep: Performance

1 
1 5 Performance
1 *************
1 
1 Typically ‘grep’ is an efficient way to search text.  However, it can be
1 quite slow in some cases, and it can search large files where even minor
1 performance tweaking can help significantly.  Although the algorithm
1 used by ‘grep’ is an implementation detail that can change from release
1 to release, understanding its basic strengths and weaknesses can help
1 you improve its performance.
1 
1    The ‘grep’ command operates partly via a set of automata that are
1 designed for efficiency, and partly via a slower matcher that takes over
1 when the fast matchers run into unusual features like back-references.
1 When feasible, the Boyer–Moore fast string searching algorithm is used
1 to match a single fixed pattern, and the Aho–Corasick algorithm is used
1 to match multiple fixed patterns.
1 
1    Generally speaking ‘grep’ operates more efficiently in single-byte
1 locales, since it can avoid the special processing needed for multi-byte
1 characters.  If your pattern will work just as well that way, setting
1 ‘LC_ALL’ to a single-byte locale can help performance considerably.
1 Setting ‘LC_ALL='C'’ can be particularly efficient, as ‘grep’ is tuned
1 for that locale.
1 
1    Outside the ‘C’ locale, case-insensitive search, and search for
1 bracket expressions like ‘[a-z]’ and ‘[[=a=]b]’, can be surprisingly
1 inefficient due to difficulties in fast portable access to concepts like
1 multi-character collating elements.
1 
1    A back-reference such as ‘\1’ can hurt performance significantly in
1 some cases, since back-references cannot in general be implemented via a
1 finite state automaton, and instead trigger a backtracking algorithm
1 that can be quite inefficient.  For example, although the pattern
1 ‘^(.*)\1{14}(.*)\2{13}$’ matches only lines whose lengths can be written
1 as a sum 15x + 14y for nonnegative integers x and y, the pattern matcher
1 does not perform linear Diophantine analysis and instead backtracks
1 through all possible matching strings, using an algorithm that is
1 exponential in the worst case.
1 
1    On some operating systems that support files with holes—large regions
1 of zeros that are not physically present on secondary storage—‘grep’ can
1 skip over the holes efficiently without needing to read the zeros.  This
1 optimization is not available if the ‘-a’ (‘--text’) option is used
1 (⇒File and Directory Selection), unless the ‘-z’ (‘--null-data’)
1 option is also used (⇒Other Options).
1 
1    For more about the algorithms used by ‘grep’ and about related string
1 matching algorithms, see:
1 
1    • Aho AV. Algorithms for finding patterns in strings. In: van Leeuwen
1      J. _Handbook of Theoretical Computer Science_, vol. A. New York:
1      Elsevier; 1990. p. 255–300. This surveys classic string matching
1      algorithms, some of which are used by ‘grep’.
1 
1    • Aho AV, Corasick MJ. Efficient string matching: an aid to
1      bibliographic search. _CACM_. 1975;18(6):333–40.
1      <http://dx.doi.org/10.1145/360825.360855>. This introduces the
1      Aho–Corasick algorithm.
1 
1    • Boyer RS, Moore JS. A fast string searching algorithm. _CACM_.
1      1977;20(10):762–72. <http://dx.doi.org/10.1145/359842.359859>. This
1      introduces the Boyer–Moore algorithm.
1 
1    • Faro S, Lecroq T. The exact online string matching problem: a
1      review of the most recent results. _ACM Comput Surv_.
1      2013;45(2):13. <http://dx.doi.org/10.1145/2431211.2431212>. This
1      surveys string matching algorithms that might help improve the
1      performance of ‘grep’ in the future.
1