gawk: Bracket Expressions

3.4 Using Bracket Expressions
=============================

As mentioned earlier, a bracket expression matches any character among
those listed between the opening and closing square brackets.

   Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, inclusive, based upon the system's
native character set.  For example, '[0-9]' is equivalent to
'[0123456789]'.  (See ⇒Ranges and Locales for an explanation of how the
POSIX standard and 'gawk' have changed over time.  This is mainly of
historical interest.)

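   As a minimal sketch of a range expression in use, the following
shell commands run 'awk' over three made-up input lines and count those
that contain a digit.  The sample text and the shell variable names are
illustrative assumptions, not part of the original text.

```shell
# Sample input is invented for illustration.
input='abc
room 42
x9'

# Count lines containing at least one digit, using a range expression.
by_range=$(printf '%s\n' "$input" | awk '/[0-9]/ { n++ } END { print n + 0 }')

# The same count, with every digit listed explicitly.
by_list=$(printf '%s\n' "$input" | awk '/[0123456789]/ { n++ } END { print n + 0 }')

echo "$by_range $by_list"
```

   Both counts should agree, because the range and the explicit list
describe the same set of characters.
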
   With the increasing popularity of the Unicode character standard
(http://www.unicode.org), there is an additional wrinkle to consider.
Octal and hexadecimal escape sequences inside bracket expressions are
taken to represent only single-byte characters (characters whose values
fit within the range 0-255).  To match a range of characters where the
endpoints of the range are larger than 255, enter the multibyte
encodings of the characters directly.

   To include one of the characters '\', ']', '-', or '^' in a bracket
expression, put a '\' in front of it.  For example:

     [d\]]

matches either 'd' or ']'.  Additionally, if you place ']' right after
the opening '[', the closing bracket is treated as one of the characters
to be matched.

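   The two ways of including ']' described above can be sketched as
follows.  The input string and the shell variable names are
hypothetical, chosen only to show that both spellings replace the same
characters.

```shell
# Made-up input containing both 'd' and ']'.
input='a]bd'

# Backslash-escaped ']' inside the bracket expression.
escaped=$(printf '%s\n' "$input" | awk '{ gsub(/[d\]]/, "X"); print }')

# ']' placed immediately after the opening '['.
leading=$(printf '%s\n' "$input" | awk '{ gsub(/[]d]/, "X"); print }')

echo "$escaped $leading"
```

   In both cases the 'd' and the ']' in the input are replaced, while
'a' and 'b' are left alone.
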
   The treatment of '\' in bracket expressions is compatible with other
'awk' implementations and is also mandated by POSIX.  The regular
expressions in 'awk' are a superset of the POSIX specification for
Extended Regular Expressions (EREs).  POSIX EREs are based on the
regular expressions accepted by the traditional 'egrep' utility.

   "Character classes" are a feature introduced in the POSIX standard.
A character class is a special notation for describing lists of
characters that have a specific attribute, but the actual characters can
vary from country to country and/or from character set to character set.
For example, the notion of what is an alphabetic character differs
between the United States and France.

   A character class is only valid in a regexp _inside_ the brackets of
a bracket expression.  Character classes consist of '[:', a keyword
denoting the class, and ':]'.  ⇒Table 3.1 lists the character classes
defined by the POSIX standard.

Class        Meaning
--------------------------------------------------------------------------
'[:alnum:]'  Alphanumeric characters
'[:alpha:]'  Alphabetic characters
'[:blank:]'  Space and TAB characters
'[:cntrl:]'  Control characters
'[:digit:]'  Numeric characters
'[:graph:]'  Characters that are both printable and visible (a space is
             printable but not visible, whereas an 'a' is both)
'[:lower:]'  Lowercase alphabetic characters
'[:print:]'  Printable characters (characters that are not control
             characters)
'[:punct:]'  Punctuation characters (characters that are not letters,
             digits, control characters, or space characters)
'[:space:]'  Space characters (such as space, TAB, and formfeed, to name
             a few)
'[:upper:]'  Uppercase alphabetic characters
'[:xdigit:]' Characters that are hexadecimal digits

Table 3.1: POSIX character classes

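   Because a character class is just one item inside a bracket
expression, several classes and ordinary characters can share the same
brackets.  The following is a small sketch under assumed input: the
identifiers and the variable name are invented, and 'LC_ALL=C' is set
only to keep the result independent of the caller's locale.

```shell
# Made-up identifiers; only those built from uppercase letters,
# digits, and underscores should be counted.
input='AB_12
ab-12
ZZ9'

# Two character classes and a literal '_' share one bracket expression.
# LC_ALL=C pins the locale so the classes cover only ASCII characters.
matches=$(printf '%s\n' "$input" |
    LC_ALL=C awk '/^[[:upper:][:digit:]_]+$/ { n++ } END { print n + 0 }')

echo "$matches"
```

   Note the doubled brackets: the outer pair delimits the bracket
expression, and each inner '[:' ... ':]' pair names a class.
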
   For example, before the POSIX standard, you had to write
'/[A-Za-z0-9]/' to match alphanumeric characters.  If your character set
had other alphabetic characters in it, this would not match them.  With
the POSIX character classes, you can write '/[[:alnum:]]/' to match the
alphabetic and numeric characters in your character set.

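   To make the comparison concrete, here is a hedged sketch that strips
every non-alphanumeric character from a sample line, once with the
explicit ranges and once with '[:alnum:]'.  The sample line and variable
names are assumptions.  The locale is forced to "C" here, where the two
forms are expected to agree; in other locales the class may match
additional letters that the explicit ranges do not.

```shell
# Invented sample line with letters, a digit, and punctuation.
line='id_7: Ok!'

# Pre-POSIX style: explicit ranges, complemented with '^'.
explicit=$(printf '%s\n' "$line" |
    LC_ALL=C awk '{ gsub(/[^A-Za-z0-9]/, ""); print }')

# POSIX style: the [:alnum:] class, complemented with '^'.
by_class=$(printf '%s\n' "$line" |
    LC_ALL=C awk '{ gsub(/[^[:alnum:]]/, ""); print }')

echo "$explicit $by_class"
```
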
   Some utilities that match regular expressions provide a nonstandard
'[:ascii:]' character class; 'awk' does not.  However, you can simulate
such a construct using '[\x00-\x7F]'.  This matches all values
numerically between zero and 127, which is the defined range of the
ASCII character set.  Use a complemented character list ('[^\x00-\x7F]')
to match any single-byte characters that are not in the ASCII range.

   Two additional special sequences can appear in bracket expressions.
These apply to non-ASCII character sets, which can have single symbols
(called "collating elements") that are represented with more than one
character.  They can also have several characters that are equivalent
for "collating", or sorting, purposes.  (For example, in French, a plain
"e" and a grave-accented "è" are equivalent.)  These sequences are:

Collating symbols
     Multicharacter collating elements enclosed between '[.' and '.]'.
     For example, if 'ch' is a collating element, then '[[.ch.]]' is a
     regexp that matches this collating element, whereas '[ch]' is a
     regexp that matches either 'c' or 'h'.

Equivalence classes
     Locale-specific names for a list of characters that are equal.  The
     name is enclosed between '[=' and '=]'.  For example, the name 'e'
     might be used to represent all of "e," "ê," "è," and "é."  In
     this case, '[[=e=]]' is a regexp that matches any of 'e', 'ê',
     'é', or 'è'.

   These features are very valuable in non-English-speaking locales.

     CAUTION: The library functions that 'gawk' uses for regular
     expression matching currently recognize only POSIX character
     classes; they do not recognize collating symbols or equivalence
     classes.

   Inside a bracket expression, an opening bracket ('[') that does not
start a character class, collating element, or equivalence class is
taken literally.  This is also true of '.' and '*'.

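   The literal treatment of '[', '.', and '*' inside brackets can be
sketched as below.  The input string and variable name are made up for
the example.  The '[' is placed last in the list so that it is not
followed by ':', '.', or '=', which would start one of the special
sequences.

```shell
# Invented input containing '[', ']', '*', and '.'.
input='a[1]*b.c'

# Inside the brackets, '*', '.', and '[' are ordinary characters.
masked=$(printf '%s\n' "$input" | awk '{ gsub(/[*.[]/, "_"); print }')

echo "$masked"
```

   Only the '[', '*', and '.' characters are replaced; the ']' in the
input is untouched because it is not in the list.
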