sed: Text search across multiple lines

7.7 Text search across multiple lines
=====================================

This section uses the 'N' and 'D' commands to search for consecutive
words spanning multiple lines. See "Multiline techniques".

These examples deal with finding doubled occurrences of words in a
document.

Finding doubled words in a single line is easy using GNU 'grep' and
similarly with GNU 'sed':

     $ cat two-cities-dup1.txt
     It was the best of times,
     it was the worst of times,
     it was the the age of wisdom,
     it was the age of foolishness,

     $ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
     it was the the age of wisdom,

     $ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
     3:it was the the age of wisdom,

     $ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
     it was the the age of wisdom,

     $ sed -En '/\b(\w+)\s+\1\b/{=;p}' two-cities-dup1.txt
     3
     it was the the age of wisdom,

* The regular expression '\b\w+\s+' matches a word boundary ('\b'),
  followed by one or more word characters ('\w+'), followed by
  whitespace ('\s+'). See "regexp extensions".

* Adding parentheses around the '\w+' expression creates a
  subexpression. The regular expression pattern '(PATTERN)\s+\1'
  defines a subexpression (in the parentheses) followed by a
  back-reference, separated by whitespace. A successful match means
  the text matched by PATTERN occurred twice in succession. See
  "Back-references and Subexpressions".

* The word-boundary expression ('\b') at both ends ensures partial
  words are not matched (e.g. 'the then' is not a desired match).

* The '-E' option enables extended regular expression syntax,
  removing the need to add backslashes before the parentheses. See
  "ERE syntax".
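
The effect of the word-boundary anchors and of '-E' described above can
be checked with a small experiment. The following is a minimal sketch,
assuming GNU 'sed' (the '\b', '\w' and '\s' escapes are GNU
extensions); the two sample lines and the variable names are made up
for illustration.

```shell
# Sample input (hypothetical): one partial repeat, one true doubled word.
input='the then
the the'

# With '\b' on both ends, only whole doubled words match.
with_b=$(printf '%s\n' "$input" | sed -En '/\b(\w+)\s+\1\b/p')

# Without '\b', the back-reference also matches the start of "then".
without_b=$(printf '%s\n' "$input" | sed -En '/(\w+)\s+\1/p')

# The same anchored expression in basic syntax (no -E): the grouping
# parentheses and the '+' operator must be backslash-escaped.
bre=$(printf '%s\n' "$input" | sed -n '/\b\(\w\+\)\s\+\1\b/p')

printf 'anchored (ERE):\n%s\n' "$with_b"
printf 'unanchored (ERE):\n%s\n' "$without_b"
printf 'anchored (BRE):\n%s\n' "$bre"
```

Only the unanchored expression reports 'the then'; both anchored forms
report just 'the the'.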

When the doubled word spans two lines, the above regular expression
will not find it, because 'grep' and 'sed' operate line-by-line.
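
To illustrate the limitation, the sketch below runs the single-line
commands on a four-line sample in which 'the' ends one line and starts
the next. It assumes GNU 'sed' and GNU 'grep'; the scratch directory
and the variable names are illustrative only.

```shell
# Create the sample file in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/two-cities-dup2.txt" <<'EOF'
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
EOF

# Line-by-line matching: neither command prints anything, because no
# single line contains the doubled word.
sed_out=$(sed -En '/\b(\w+)\s+\1\b/p' "$dir/two-cities-dup2.txt")
# grep exits with a non-zero status when nothing matches; ignore it.
grep_out=$(grep -E '\b(\w+)\s+\1\b' "$dir/two-cities-dup2.txt" || true)

printf 'sed matches:  [%s]\n' "$sed_out"
printf 'grep matches: [%s]\n' "$grep_out"

rm -r "$dir"
```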

By using the 'N' and 'D' commands, 'sed' can apply regular expressions
to multiple lines (that is, multiple lines are stored in the pattern
space, and the regular expression is matched against all of them):

     $ cat two-cities-dup2.txt
     It was the best of times, it was the
     worst of times, it was the
     the age of wisdom,
     it was the age of foolishness,

     $ sed -En '{N; /\b(\w+)\s+\1\b/{=;p} ; D}' two-cities-dup2.txt
     3
     worst of times, it was the
     the age of wisdom,

* The 'N' command appends the next line to the pattern space (thus
  ensuring it contains two consecutive lines in every cycle).

* The regular expression uses '\s+' as the word separator, which
  matches both spaces and newlines.

* When the regular expression matches, '=' prints the current line
  number (that of the second line in the pattern space) and 'p'
  prints the entire pattern space. No lines are printed by default
  due to the '-n' option.

* The 'D' command removes the first line from the pattern space (up
  to and including the first newline), readying it for the next cycle.
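
The sliding two-line window produced by 'N' and 'D' can be made visible
with the 'l' command, which prints the pattern space unambiguously
(an embedded newline appears as '\n' and the end of the pattern space
as '$'). This is a small sketch using four made-up input lines, kept
short so that 'l' does not wrap its output.

```shell
# Print the pattern space once per cycle: N appends the next line,
# l shows the two-line window, D drops the first line of the window.
trace=$(printf 'one\ntwo\nthree\nfour\n' | sed -n 'N; l; D')
printf '%s\n' "$trace"
# Output:
#   one\ntwo$
#   two\nthree$
#   three\nfour$
```

Every pair of adjacent lines appears in the pattern space exactly once,
which is why a doubled word split across any line break is seen by the
regular expression.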

See the GNU 'coreutils' manual for an alternative solution using
'tr -s' and 'uniq' at
<https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html>.
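
As a rough sketch of that alternative (modeled on the 'tr -s' and
'uniq' approach; the exact commands in the coreutils manual may
differ), the pipeline below puts each word on its own line, folds case,
and lets 'uniq -d' report words that repeat on adjacent lines. It
assumes GNU 'tr'; the scratch directory and variable names are
illustrative only.

```shell
# Create the sample file in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/two-cities-dup2.txt" <<'EOF'
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
EOF

# 1. Turn every run of punctuation and blanks into a single newline,
#    so each word sits on its own line.
# 2. Fold upper case to lower case.
# 3. Print one copy of each word that repeats on adjacent lines.
dups=$(tr -s '[:punct:][:blank:]' '[\n*]' < "$dir/two-cities-dup2.txt" \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d)
printf '%s\n' "$dups"

rm -r "$dir"
```

Unlike the 'sed' command, this reports only the doubled word itself
('the' for this sample), not the line number or the surrounding text.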