sed: Text search across multiple lines

1 
1 7.7 Text search across multiple lines
1 =====================================
1 
1 This section uses 'N' and 'D' commands to search for consecutive words
1 spanning multiple lines.  ⇒Multiline techniques.
1 
1    These examples deal with finding doubled occurrences of words in a
1 document.
1 
1    Finding doubled words in a single line is easy using GNU 'grep' and
1 similarly with GNU 'sed':
1 
1      $ cat two-cities-dup1.txt
1      It was the best of times,
1      it was the worst of times,
1      it was the the age of wisdom,
1      it was the age of foolishness,
1 
1      $ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
1      it was the the age of wisdom,
1 
1      $ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
1      3:it was the the age of wisdom,
1 
1      $ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
1      it was the the age of wisdom,
1 
1      $ sed -En '/\b(\w+)\s+\1\b/{=;p}' two-cities-dup1.txt
1      3
1      it was the the age of wisdom,
1 
1    * The regular expression '\b\w+\s+' matches a word boundary
1      ('\b'), followed by one or more word characters ('\w+'), followed
1      by whitespace ('\s+').  ⇒regexp extensions.
1 
1    * Adding parentheses around the '(\w+)' expression creates a
1      subexpression.  The regular expression pattern '(PATTERN)\s+\1'
1      defines a subexpression (in the parentheses) followed by a
1      back-reference, separated by whitespace.  A successful match means
1      the PATTERN appears twice in succession.  ⇒Back-references and
1      Subexpressions.
1 
1    * The word-boundary expression ('\b') at both ends ensures partial
1      words are not matched (e.g.  'the then' is not a desired match).
1 
1    * The '-E' option enables extended regular expression syntax,
1      alleviating the need to add backslashes before the parentheses.
1      ⇒ERE syntax.
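
The effect of the word-boundary anchors described above can be checked with a small experiment. This is a minimal sketch assuming GNU 'sed' (for '\b', '\w' and '\s'); the two input lines and the variable names are made up for illustration and do not come from the sample files.

```shell
# Hypothetical sample input: only the second line holds a true doubled word.
sample=$(printf '%s\n' 'the then came' 'it was the the age')

# Without \b, the back-reference also matches the "the" inside "then".
no_boundary=$(printf '%s\n' "$sample" | sed -En '/(\w+)\s+\1/p')

# With \b at both ends, only whole-word repetitions match.
with_boundary=$(printf '%s\n' "$sample" | sed -En '/\b(\w+)\s+\1\b/p')

echo "without word boundaries:"
printf '%s\n' "$no_boundary"
echo "with word boundaries:"
printf '%s\n' "$with_boundary"
```

Without the anchors both lines are reported; with them only 'it was the the age' is.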
1 
1    When the doubled word spans two lines, the above regular expression
1 will not find it, as 'grep' and 'sed' operate line by line.
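
The limitation is easy to reproduce. The sketch below assumes GNU 'sed' and uses a made-up two-line input in which 'the' is repeated across the line break; the variable names are illustrative only.

```shell
# Hypothetical input: "the" ends the first line and starts the second.
split_pair=$(printf '%s\n' 'it was the' 'the age of wisdom,')

# Each line is matched on its own, so the repetition is never seen.
line_by_line=$(printf '%s\n' "$split_pair" | sed -En '/\b(\w+)\s+\1\b/p')

if [ -z "$line_by_line" ]; then
    echo "no doubled word found when matching line by line"
fi
```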
1 
1    By using 'N' and 'D' commands, 'sed' can apply regular expressions on
1 multiple lines (that is, multiple lines are stored in the pattern space,
1 and the regular expression works on it):
1 
1      $ cat two-cities-dup2.txt
1      It was the best of times, it was the
1      worst of times, it was the
1      the age of wisdom,
1      it was the age of foolishness,
1 
1      $ sed -En '{N; /\b(\w+)\s+\1\b/{=;p} ; D}'  two-cities-dup2.txt
1      3
1      worst of times, it was the
1      the age of wisdom,
1 
1    * The 'N' command appends the next line to the pattern space (thus
1      ensuring it contains two consecutive lines in every cycle).
1 
1    * The regular expression uses '\s+' as the word separator, which
1      matches both spaces and newlines.
1 
1    * If the regular expression matches, '=' prints the current line
1      number (that of the last line read) and 'p' prints the entire
1      pattern space.  No lines are printed by default due to the '-n'
1      option.
1 
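1    * The 'D' command removes the first line from the pattern space (up
1      to and including the first newline) and starts the next cycle
1      without reading new input, so the remaining line is paired with
1      the one that follows it.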
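
The sliding two-line window produced by 'N' and 'D' can be made visible with the 'l' command, which prints the pattern space unambiguously (an embedded newline is shown as '\n' and the end of the pattern space as '$'). This is a minimal sketch with a made-up three-line input; the variable name is illustrative only.

```shell
# 'N' appends the next line, 'l' shows the pattern space, 'D' drops the
# first line and restarts the cycle with what is left.
window_trace=$(printf '%s\n' one two three | sed -n 'N;l;D')

printf '%s\n' "$window_trace"
# Prints:
#   one\ntwo$
#   two\nthree$
```

Each input line appears in two consecutive windows, once as the second line and once as the first, which is why a word repeated across any line break is seen together in some cycle.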
1 
1    See the GNU 'coreutils' manual for an alternative solution using 'tr
1 -s' and 'uniq' at
1 <https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html>.
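
A sketch of that kind of pipeline is shown below; consult the 'coreutils' manual for the authoritative version. It assumes the text of 'two-cities-dup2.txt' is held in a shell variable (an illustrative choice, as are the variable names). The idea is to put each word on its own line, fold case, and let 'uniq -d' report adjacent repeats.

```shell
# Contents of two-cities-dup2.txt, inlined to keep the example
# self-contained.
dup2_text='It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,'

# Turn punctuation and blanks into newlines (squeezing repeats), fold to
# lower case, then print each word that is repeated on adjacent lines.
doubled=$(printf '%s\n' "$dup2_text" \
  | tr -s '[:punct:][:blank:]' '[\n*]' \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d)

echo "$doubled"
```

Unlike the 'sed' solution, this reports only the repeated word ('the' here), not the line on which it occurs.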
1