gawk: GNU Regexp Operators

3.7 'gawk'-Specific Regexp Operators
====================================

GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
minor node and are specific to 'gawk'; they are not available in other
'awk' implementations. Most of the additional operators deal with word
matching. For our purposes, a "word" is a sequence of one or more
letters, digits, or underscores ('_'):

'\s'
     Matches any whitespace character. Think of it as shorthand for
     '[[:space:]]'.

'\S'
     Matches any character that is not whitespace. Think of it as
     shorthand for '[^[:space:]]'.

'\w'
     Matches any word-constituent character--that is, it matches any
     letter, digit, or underscore. Think of it as shorthand for
     '[[:alnum:]_]'.

'\W'
     Matches any character that is not word-constituent. Think of it as
     shorthand for '[^[:alnum:]_]'.

'\<'
     Matches the empty string at the beginning of a word. For example,
     '/\<away/' matches 'away' but not 'stowaway'.

'\>'
     Matches the empty string at the end of a word. For example,
     '/stow\>/' matches 'stow' but not 'stowaway'.

'\y'
     Matches the empty string at either the beginning or the end of a
     word (i.e., the word boundar*y*). For example, '\yballs?\y'
     matches either 'ball' or 'balls', as a separate word.

'\B'
     Matches the empty string that occurs between two word-constituent
     characters. For example, '/\Brat\B/' matches 'crate', but it does
     not match 'dirty rat'. '\B' is essentially the opposite of '\y'.
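
As a rough illustration of the operators listed above, the following
shell session sketches how several of them behave. It assumes that
'gawk' is installed and on the search path; the input strings and the
shell variable names are invented for this sketch and are not part of
'gawk' itself.

```shell
# Assumes gawk is on PATH; the input strings are invented for illustration.

# \< anchors at the start of a word: 'away' matches, 'stowaway' does not.
word_start=$(printf 'away\nstowaway\n' | gawk '/\<away/')

# \> anchors at the end of a word: 'stow' matches, 'stowaway' does not.
word_end=$(printf 'stow\nstowaway\n' | gawk '/stow\>/')

# \y anchors at either word boundary: 'ball' and 'balls' match,
# 'ballroom' does not.
boundary=$(printf 'ball\nballs\nballroom\n' | gawk '/\yballs?\y/')

# \B matches between two word-constituent characters: only 'crate' matches.
inside=$(printf 'crate\ndirty rat\n' | gawk '/\Brat\B/')

# \w and \s: replace each run of word characters with "W", then squeeze
# each run of whitespace down to a single space.
classes=$(echo 'foo_1   bar-baz' |
          gawk '{ gsub(/\w+/, "W"); gsub(/\s+/, " "); print }')

printf '%s\n' "$word_start" "$word_end" "$boundary" "$inside" "$classes"
```

Run under another 'awk' implementation, the same patterns are not
expected to behave this way, because the operators are 'gawk'
extensions.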

There are two other operators that work on buffers. In Emacs, a
"buffer" is, naturally, an Emacs buffer. Other GNU programs, including
'gawk', consider the entire string to match as the buffer. The
operators are:

'\`'
     Matches the empty string at the beginning of a buffer (string).

'\''
     Matches the empty string at the end of a buffer (string).

Because '^' and '$' always work in terms of the beginning and end of
strings, these operators don't add any new capabilities for 'awk'. They
are provided for compatibility with other GNU software.
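
The equivalence described above can be sketched as follows. This is
only an illustration: it assumes 'gawk' is on the search path, and the
test strings and variable names are made up. The 'awk' program is
written to a temporary file so that the '\'' operator does not collide
with the shell's own quoting.

```shell
# Assumes gawk is on PATH; strings and variable names are illustrative.
prog=$(mktemp)
cat > "$prog" <<'EOF'
BEGIN {
    a = ("abc" ~ /\`abc\'/)      # buffer operators: whole string matches
    b = ("abc" ~ /^abc$/)        # same test written with ^ and $
    c = ("xabc" ~ /\`abc/)       # "abc" is not at the start of the string
    d = ("xabc" ~ /^abc/)        # same test written with ^
    e = ("one\ntwo" ~ /^two/)    # ^ does not match after an embedded newline
    f = ("one\ntwo" ~ /one$/)    # $ does not match before an embedded newline
    print a, b, c, d, e, f
}
EOF
anchors=$(gawk -f "$prog")
rm -f "$prog"
echo "$anchors"
```

The first four values pair each buffer operator with its '^' or '$'
counterpart; the last two show that '^' and '$' anchor only at the ends
of the whole string, not at embedded newlines.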

In other GNU software, the word-boundary operator is '\b'. However,
that conflicts with the 'awk' language's definition of '\b' as
backspace, so 'gawk' uses a different letter. An alternative method
would have been to require two backslashes in the GNU operators, but
this was deemed too confusing. The current method of using '\y' for the
GNU '\b' appears to be the lesser of two evils.
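
A small sketch of the difference between '\y' and '\b', assuming 'gawk'
is on the search path (the sample strings and variable names are
invented):

```shell
# Assumes gawk is on PATH; the sample strings are invented.

# \y is the word-boundary operator, so the separate word "rat" matches.
with_y=$(gawk 'BEGIN { print ("dirty rat" ~ /\yrat\y/) }')

# \b means backspace in awk, so the same pattern written with \b does
# not match ordinary text ...
with_b=$(gawk 'BEGIN { print ("dirty rat" ~ /\brat\b/) }')

# ... and matches only when literal backspace characters surround "rat".
with_bs=$(gawk 'BEGIN { print ("a\brat\bz" ~ /\brat\b/) }')

echo "$with_y $with_b $with_bs"
```

The middle result is the trap for anyone carrying '\b' over from other
GNU tools: the pattern is accepted but quietly looks for backspace
characters.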

The various command-line options (⇒Options) control how 'gawk'
interprets characters in regexps:

No options
     In the default case, 'gawk' provides all the facilities of POSIX
     regexps and the GNU regexp operators described in ⇒Regexp Operators.

'--posix'
     Match only POSIX regexps; the GNU operators are not special (e.g.,
     '\w' matches a literal 'w'). Interval expressions are allowed.

'--traditional'
     Match traditional Unix 'awk' regexps. The GNU operators are not
     special, and interval expressions are not available. Because BWK
     'awk' supports them, the POSIX character classes ('[[:alnum:]]',
     etc.) are available. Characters described by octal and
     hexadecimal escape sequences are treated literally, even if they
     represent regexp metacharacters.

'--re-interval'
     Allow interval expressions in regexps, if '--traditional' has been
     provided. Otherwise, interval expressions are available by
     default.
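
The effect of '--posix' on a GNU operator can be sketched as follows.
This is a minimal illustration that assumes 'gawk' is on the search
path; the input lines and variable names are invented, and any warning
that a particular 'gawk' version might print is discarded. The
'--traditional' case is left out because its results could not be
checked here.

```shell
# Assumes gawk is on PATH; diagnostics, if any, are discarded.

# Default mode: \w is a GNU operator, so both one-character lines match.
default_mode=$(printf 'a\nw\n' | gawk '/^\w$/' 2>/dev/null)

# --posix: \w is not special and matches only a literal "w".
posix_mode=$(printf 'a\nw\n' | gawk --posix '/^\w$/' 2>/dev/null)

# Default mode: interval expressions are available without any option.
intervals=$(echo 'aaa' | gawk '/^a{3}$/')

printf '%s\n' "$default_mode" "$posix_mode" "$intervals"
```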