gawk: Bracket Expressions
1
1 3.4 Using Bracket Expressions
1 =============================
1
1 As mentioned earlier, a bracket expression matches any character among
1 those listed between the opening and closing square brackets.
1
1 Within a bracket expression, a "range expression" consists of two
1 characters separated by a hyphen. It matches any single character that
1 sorts between the two characters, based upon the system's native
1 character set. For example, '[0-9]' is equivalent to '[0123456789]'.
1 (See ⇒Ranges and Locales for an explanation of how the POSIX
1 standard and 'gawk' have changed over time. This is mainly of
1 historical interest.)
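
The equivalence of the two spellings can be checked with a minimal sketch like the one below. The sample input and the variable names are illustrative only, and 'LC_ALL=C' is an assumption made here so that the range follows plain ASCII ordering regardless of the user's locale.

```shell
# Sample input: two lines contain a digit, two do not.
input='abc
a1c
xyz
7'

# Select lines containing a digit, once with a range and once with an
# explicit list. LC_ALL=C pins the range to ASCII ordering.
by_range=$(printf '%s\n' "$input" | LC_ALL=C awk '/[0-9]/')
by_list=$(printf '%s\n' "$input" | LC_ALL=C awk '/[0123456789]/')

printf 'range:\n%s\n' "$by_range"
printf 'list:\n%s\n' "$by_list"
```

Both commands should select the same two lines, 'a1c' and '7'.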
1
1 With the increasing popularity of the Unicode character standard
1 (http://www.unicode.org), there is an additional wrinkle to consider.
1 Octal and hexadecimal escape sequences inside bracket expressions are
1 taken to represent only single-byte characters (characters whose values
1 fit within the range 0-255). To match a range of characters where the
1 endpoints of the range are larger than 255, enter the multibyte
1 encodings of the characters directly.
1
1 To include one of the characters '\', ']', '-', or '^' in a bracket
1 expression, put a '\' in front of it. For example:
1
1 [d\]]
1
1 matches either 'd' or ']'. Additionally, if you place ']' right after
1 the opening '[', it is treated as one of the characters to be matched
1 rather than as the end of the bracket expression.
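
Both ways of including a literal ']' can be tried side by side. This is a sketch under the assumption that 'awk' resolves to 'gawk' or another POSIX-conforming 'awk' that honors backslash escapes inside bracket expressions; the input lines and variable names are made up for illustration.

```shell
# Sample input: only the second and third lines contain 'd' or ']'.
input='abc
a]c
dog
xyz'

# Form 1: escape the closing bracket with a backslash.
escaped=$(printf '%s\n' "$input" | awk '/[d\]]/')

# Form 2: put ']' immediately after the opening '['.
leading=$(printf '%s\n' "$input" | awk '/[]d]/')

printf 'escaped form:\n%s\n' "$escaped"
printf 'leading form:\n%s\n' "$leading"
```

With such an 'awk', both forms should select the lines 'a]c' and 'dog'.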
1
1 The treatment of '\' in bracket expressions is compatible with other
1 'awk' implementations and is also mandated by POSIX. The regular
1 expressions in 'awk' are a superset of the POSIX specification for
1 Extended Regular Expressions (EREs). POSIX EREs are based on the
1 regular expressions accepted by the traditional 'egrep' utility.
1
1 "Character classes" are a feature introduced in the POSIX standard.
1 A character class is a special notation for describing lists of
1 characters that have a specific attribute, but the actual characters can
1 vary from country to country and/or from character set to character set.
1 For example, the notion of what is an alphabetic character differs
1 between the United States and France.
1
1 A character class is only valid in a regexp _inside_ the brackets of
1 a bracket expression. Character classes consist of '[:', a keyword
1 denoting the class, and ':]'. Table 3.1 lists the character classes
1 defined by the POSIX standard.
1
1 Class         Meaning
1 --------------------------------------------------------------------------
1 '[:alnum:]'   Alphanumeric characters
1 '[:alpha:]'   Alphabetic characters
1 '[:blank:]'   Space and TAB characters
1 '[:cntrl:]'   Control characters
1 '[:digit:]'   Numeric characters
1 '[:graph:]'   Characters that are both printable and visible (a space is
1               printable but not visible, whereas an 'a' is both)
1 '[:lower:]'   Lowercase alphabetic characters
1 '[:print:]'   Printable characters (characters that are not control
1               characters)
1 '[:punct:]'   Punctuation characters (characters that are not letters,
1               digits, control characters, or space characters)
1 '[:space:]'   Space characters (such as space, TAB, and formfeed, to name
1               a few)
1 '[:upper:]'   Uppercase alphabetic characters
1 '[:xdigit:]'  Characters that are hexadecimal digits
1
1 Table 3.1: POSIX character classes
1
1 For example, before the POSIX standard, you had to write
1 '/[A-Za-z0-9]/' to match alphanumeric characters. If your character set
1 had other alphabetic characters in it, this would not match them. With
1 the POSIX character classes, you can write '/[[:alnum:]]/' to match the
1 alphabetic and numeric characters in your character set.
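
As a small sketch of character classes in use, the commands below select lines containing an alphanumeric character and strip punctuation from a string. The inputs and variable names are illustrative, 'LC_ALL=C' is an assumption made here so that the class membership is predictable, and the 'awk' in use is assumed to support POSIX character classes (as 'gawk' does).

```shell
# Sample input: the first and third lines contain alphanumeric characters.
input='abc
---
42
...'

# The class name goes inside the brackets of a bracket expression.
alnum_lines=$(printf '%s\n' "$input" | LC_ALL=C awk '/[[:alnum:]]/')

# Remove every punctuation character from a line of text.
no_punct=$(printf 'Hello, world!\n' |
           LC_ALL=C awk '{ gsub(/[[:punct:]]/, ""); print }')

printf '%s\n' "$alnum_lines"
printf '%s\n' "$no_punct"
```

The first command should keep 'abc' and '42'; the second should print 'Hello world'.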
1
1 Some utilities that match regular expressions provide a nonstandard
1 '[:ascii:]' character class; 'awk' does not. However, you can simulate
1 such a construct using '[\x00-\x7F]'. This matches all values
1 numerically between zero and 127, which is the defined range of the
1 ASCII character set. Use a complemented character list ('[^\x00-\x7F]')
1 to match any single-byte characters that are not in the ASCII range.
1
1 Two additional special sequences can appear in bracket expressions.
1 These apply to non-ASCII character sets, which can have single symbols
1 (called "collating elements") that are represented with more than one
1 character. They can also have several characters that are equivalent
1 for "collating", or sorting, purposes. (For example, in French, a plain
1 "e" and a grave-accented "è" are equivalent.) These sequences are:
1
1 Collating symbols
1 Multicharacter collating elements enclosed between '[.' and '.]'.
1 For example, if 'ch' is a collating element, then '[[.ch.]]' is a
1 regexp that matches this collating element, whereas '[ch]' is a
1 regexp that matches either 'c' or 'h'.
1
1 Equivalence classes
1 Locale-specific names for a list of characters that are equal. The
1 name is enclosed between '[=' and '=]'. For example, the name 'e'
1 might be used to represent all of "e," "ê," "è," and "é." In
1 this case, '[[=e=]]' is a regexp that matches any of 'e', 'ê',
1 'é', or 'è'.
1
1 These features are very valuable in non-English-speaking locales.
1
1 CAUTION: The library functions that 'gawk' uses for regular
1 expression matching currently recognize only POSIX character
1 classes; they do not recognize collating symbols or equivalence
1 classes.
1
1 Inside a bracket expression, an opening bracket ('[') that does not
1 start a character class, collating symbol, or equivalence class is taken
1 literally. This is also true of '.' and '*'.
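
The literal treatment of '[', '.', and '*' can be seen in a sketch like the following. The input and variable name are illustrative. Note that the '[' is deliberately placed last in the list so that it is not followed by '.', ':', or '=', which would make it look like the start of one of the special sequences.

```shell
# Sample input: the first three lines have a literal '[', '.', or '*'
# between 'a' and 'b'; the last line has an ordinary letter there.
input='a[b
a.b
a*b
axb'

# Inside the brackets, '.' and '*' lose their special meaning and the
# trailing '[' is an ordinary character.
literal=$(printf '%s\n' "$input" | awk '/a[.*[]b/')

printf '%s\n' "$literal"
```

Only the first three lines should be printed; 'axb' is not matched because '.' inside the brackets does not mean "any character."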
1