coreutils: Character sets

9.1.1 Specifying sets of characters
-----------------------------------

The format of the SET1 and SET2 arguments resembles the format of
regular expressions; however, they are not regular expressions, only
lists of characters.  Most characters simply represent themselves in
these strings, but the strings can contain the shorthands listed below,
for convenience.  Some of them can be used only in SET1 or only in SET2,
as noted below.

Backslash escapes

     The following backslash escape sequences are recognized:

     ‘\a’
          Control-G.
     ‘\b’
          Control-H.
     ‘\f’
          Control-L.
     ‘\n’
          Control-J.
     ‘\r’
          Control-M.
     ‘\t’
          Control-I.
     ‘\v’
          Control-K.
     ‘\OOO’
          The 8-bit character with the value given by OOO, which is 1 to
          3 octal digits.  Note that ‘\400’ is interpreted as the
          two-byte sequence ‘\040’ ‘0’, because the value of all three
          digits would not fit in 8 bits.
     ‘\\’
          A backslash.

     A backslash followed by a character not listed above is
     interpreted as that character.  The backslash also removes any
     special significance the character would otherwise have, so it is
     useful for escaping ‘[’, ‘]’, ‘*’, and ‘-’.
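
     The escapes above can be tried out with a few short pipelines.
     This is a minimal sketch, assuming GNU ‘tr’ and a POSIX shell;
     the variable names are illustrative only.

```shell
# Replace each tab (\t, Control-I) with a space.
tabs=$(printf 'a\tb\n' | tr '\t' ' ')
printf '%s\n' "$tabs"        # a b

# \040 is the octal code for a space; turn spaces into underscores.
octal=$(printf 'a b c\n' | tr '\040' '_')
printf '%s\n' "$octal"       # a_b_c

# Escape '-' and '*' so that they are taken literally
# rather than as range or repeat syntax.
literal=$(printf '2*3-1\n' | tr '\-\*' '+x')
printf '%s\n' "$literal"     # 2x3+1
```

     The sets are single-quoted so that the shell passes the
     backslashes through to ‘tr’ unchanged.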

Ranges

     The notation ‘M-N’ expands to all of the characters from M through
     N, in ascending order.  M should collate before N; if it doesn’t,
     an error results.  As an example, ‘0-9’ is the same as
     ‘0123456789’.

     GNU ‘tr’ does not support the System V syntax that uses square
     brackets to enclose ranges.  Translations specified in that format
     sometimes work as expected, since the brackets are often
     transliterated to themselves.  However, this format should be
     avoided because it sometimes behaves unexpectedly.  For example,
     ‘tr -d '[0-9]'’ deletes brackets as well as digits.
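
     The difference between a plain range and the bracketed System V
     form can be sketched as follows, assuming GNU ‘tr’ and a POSIX
     shell (the variable names are illustrative):

```shell
# A plain range deletes only the digits.
plain=$(printf 'item[42]\n' | tr -d '0-9')
printf '%s\n' "$plain"       # item[]

# With brackets, '[' and ']' are ordinary members of the set,
# so they are deleted along with the digits.
bracketed=$(printf 'item[42]\n' | tr -d '[0-9]')
printf '%s\n' "$bracketed"   # item

# In a translation the brackets happen to map to themselves,
# which is why the bracketed form often appears to work.
translated=$(printf '[abc]\n' | tr '[a-c]' '[A-C]')
printf '%s\n' "$translated"  # [ABC]
```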

     Many historically common and even accepted uses of ranges are not
     portable.  For example, on EBCDIC hosts the ‘A-Z’ range does not
     do what most people expect, because the letters ‘A’ through ‘Z’
     are not contiguous in EBCDIC as they are in ASCII.  If you can
     rely on a POSIX-compliant version of ‘tr’, then the best way to
     work around this is to use character classes (see below).
     Otherwise, it is most portable (and ugliest) to enumerate the
     members of the ranges.
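
     Enumerating the members of a range might look like the sketch
     below.  It assumes a POSIX shell; the variables ‘lower’ and
     ‘upper’ are illustrative helpers, not part of ‘tr’.

```shell
# Spell out both alphabets instead of relying on 'a-z' and 'A-Z'
# being contiguous in the host character set.
lower=abcdefghijklmnopqrstuvwxyz
upper=ABCDEFGHIJKLMNOPQRSTUVWXYZ

enumerated=$(printf 'Hello, World\n' | tr "$lower" "$upper")
printf '%s\n' "$enumerated"   # HELLO, WORLD
```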

Repeated characters

     The notation ‘[C*N]’ in SET2 expands to N copies of character C.
     Thus, ‘[y*6]’ is the same as ‘yyyyyy’.  The notation ‘[C*]’ in
     SET2 expands to as many copies of C as are needed to make SET2
     as long as SET1.  If N begins with ‘0’, it is interpreted in octal,
     otherwise in decimal.
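
     Both repeat forms can be illustrated with a short sketch,
     assuming GNU ‘tr’ and a POSIX shell (the variable names are
     illustrative):

```shell
# '[x*3]' expands to 'xxx', so a, b and c all map to x.
fixed=$(printf 'abc\n' | tr 'abc' '[x*3]')
printf '%s\n' "$fixed"        # xxx

# '[#*]' pads SET2 with '#' to the length of SET1 (ten digits),
# so every digit is masked.
padded=$(printf 'id 4217\n' | tr '0-9' '[#*]')
printf '%s\n' "$padded"       # id ####

# A count with a leading 0 is octal: 010 means 8 copies.
octal_count=$(printf 'abcdefgh\n' | tr 'a-h' '[y*010]')
printf '%s\n' "$octal_count"  # yyyyyyyy
```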

Character classes

     The notation ‘[:CLASS:]’ expands to all of the characters in the
     (predefined) class CLASS.  The characters expand in no particular
     order, except for the ‘upper’ and ‘lower’ classes, which expand in
     ascending order.  When the ‘--delete’ (‘-d’) and
     ‘--squeeze-repeats’ (‘-s’) options are both given, any character
     class can be used in SET2.  Otherwise, only the character classes
     ‘lower’ and ‘upper’ are accepted in SET2, and then only if the
     corresponding character class (‘upper’ and ‘lower’, respectively)
     is specified in the same relative position in SET1.  Doing this
     specifies case conversion.  The class names are given below; an
     error results when an invalid class name is given.

     ‘alnum’
          Letters and digits.
     ‘alpha’
          Letters.
     ‘blank’
          Horizontal whitespace.
     ‘cntrl’
          Control characters.
     ‘digit’
          Digits.
     ‘graph’
          Printable characters, not including space.
     ‘lower’
          Lowercase letters.
     ‘print’
          Printable characters, including space.
     ‘punct’
          Punctuation characters.
     ‘space’
          Horizontal or vertical whitespace.
     ‘upper’
          Uppercase letters.
     ‘xdigit’
          Hexadecimal digits.
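
     The rules for character classes can be sketched with three
     pipelines, assuming GNU ‘tr’ and a POSIX shell (the variable
     names are illustrative):

```shell
# Case conversion: 'upper' in SET1 paired with 'lower' in SET2.
lowered=$(printf 'Hello World\n' | tr '[:upper:]' '[:lower:]')
printf '%s\n' "$lowered"     # hello world

# Any class may appear in SET1, here with --delete.
no_digits=$(printf 'a1b2c3\n' | tr -d '[:digit:]')
printf '%s\n' "$no_digits"   # abc

# With both -d and -s, SET2 may be any class: digits (SET1) are
# deleted, then runs of blanks (SET2) are squeezed to one.
squeezed=$(printf 'a1   b2\n' | tr -ds '[:digit:]' '[:blank:]')
printf '%s\n' "$squeezed"    # a b
```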

Equivalence classes

     The syntax ‘[=C=]’ expands to all of the characters that are
     equivalent to C, in no particular order.  Equivalence classes are a
     relatively recent invention intended to support non-English
     alphabets.  However, there seems to be no standard way to define
     them or determine their contents.  Therefore, they are not fully
     implemented in GNU ‘tr’; each character’s equivalence class
     consists only of that character, which is of no particular use.
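
     Because each equivalence class contains only its own character,
     ‘[=e=]’ should behave like a plain ‘e’.  A minimal sketch,
     assuming GNU ‘tr’ and a POSIX shell (the variable names are
     illustrative):

```shell
# With GNU tr, '[=e=]' expands to just 'e'.
equiv=$(printf 'resume\n' | tr '[=e=]' 'E')
plain_e=$(printf 'resume\n' | tr 'e' 'E')

printf '%s\n' "$equiv"       # rEsumE
printf '%s\n' "$plain_e"     # rEsumE
```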