sed: Locale Considerations

1 
1 5.9 Multibyte characters and Locale Considerations
1 ==================================================
1 
1 GNU 'sed' processes valid multibyte characters in multibyte locales
1 (e.g.  'UTF-8').  (1)
1 
1 The following example uses the Greek letter Capital Sigma (Σ, Unicode
1 code point '0x03A3').  In a 'UTF-8' locale, 'sed' correctly processes
1 the Sigma as one character despite it being 2 octets (bytes):
1 
1      $ locale | grep LANG
1      LANG=en_US.UTF-8
1 
1      $ printf 'a\u03A3b'
1      aΣb
1 
1      $ printf 'a\u03A3b' | sed 's/./X/g'
1      XXX
1 
1      $ printf 'a\u03A3b' | od -tx1 -An
1       61 ce a3 62
1 
1 To force 'sed' to process octets separately, use the 'C' locale (also
1 known as the 'POSIX' locale):
1 
1      $ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
1      XXXX
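
As a minimal sketch of the difference between the two locales, the
hypothetical helper below ('count_sed_chars' is not part of 'sed') counts
how many units 'sed' matches with '.' under a given locale.  It uses octal
'printf' escapes, which tend to be more portable than '\u' or '\x'.  The
expected output is given only for the 'C' locale.

```shell
# Hypothetical helper: count the units that 'sed' matches with '.'
# on standard input, under the locale named by the first argument.
count_sed_chars() {
    LC_ALL=$1 sed 's/./X/g' | tr -d '\n' | wc -c | tr -d ' '
}

# 'a', Capital Sigma (octets \316 \243), 'b'
printf 'a\316\243b\n' | count_sed_chars C    # prints 4
```

Passing 'en_US.UTF-8' instead of 'C' should print 3, assuming that locale
is installed; if it is not, the C library usually falls back to the 'C'
locale and the count stays at 4.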
1 
1 5.9.1 Invalid multibyte characters
1 ----------------------------------
1 
1 'sed''s regular expressions _do not_ match invalid multibyte sequences
1 in a multibyte locale.
1 
1 In the following examples, the ascii value '0xCE' is an incomplete
1 multibyte character (shown here as U+FFFD). The regular expression '.'
1 does not match it:
1 
1      $ printf 'a\xCEb\n'
1      aU+FFFDe
1 
1      $ printf 'a\xCEb\n' | sed 's/./X/g'
1      XU+FFFDX
1 
1      $ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
1        58  ce  58  0a
1         X      X   \n
1 
1 Similarly, the 'catch-all' regular expression '.*' does not match the
1 entire line:
1 
1      $ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
1        ce  63  0a
1             c  \n
1 
1 GNU 'sed' offers the special 'z' command to clear the current pattern
1 space regardless of invalid multibyte characters (i.e.  it works like
1 's/.*//' but also removes invalid multibyte characters):
1 
1      $ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
1         0a
1         \n
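
Because 'z' is an ordinary command, it can be combined with an address.
The following is a sketch only: the helper name 'blank_comment_lines', the
'/^#/' address, and the sample input are assumptions chosen for
illustration.

```shell
# Hypothetical helper: empty every line that starts with '#',
# even if the rest of the line contains invalid multibyte octets.
# 'z' is a GNU extension.
blank_comment_lines() {
    sed '/^#/z'
}

# \316 is an incomplete multibyte character in a UTF-8 locale.
printf '#a\316c\nkeep\n' | blank_comment_lines | od -An -tx1
```

The output should consist of the octets 0a 6b 65 65 70 0a: an empty first
line followed by 'keep'.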
1 
1 Alternatively, force the 'C' locale to process each octet separately
1 (every octet is a valid character in the 'C' locale):
1 
1      $ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
1        0a
1        \n
1 
1    'sed''s inability to process invalid multibyte characters can be used
1 to detect such invalid sequences in a file.  In the following examples,
1 the '\xCE\xCE' is an invalid multibyte sequence, while '\xCE\xA3' is a
1 valid multibyte sequence (of the Greek Sigma character).
1 
1 The following 'sed' program removes all valid characters using 's/.//g'.
1 Any content left in the pattern space (the invalid characters) is added
1 to the hold space using the 'H' command.  On the last line ('$'), the
1 hold space is retrieved ('x'), newlines are removed ('s/\n//g'), and any
1 remaining octets are printed unambiguously ('l').  Thus, any invalid
1 multibyte sequences are printed as octal values:
1 
1      $ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
1 
1      $ cat invalid.txt
1      ab
1      c
1      U+FFFDU+FFFDde
1      U+03A3f
1 
1      $ sed -n 's/.//g ; H ; ${x;s/\n//g;l}' invalid.txt
1      \316\316$
1 
1 With a few more commands, 'sed' can print the exact line number
1 corresponding to each invalid character (line 3).  These characters can
1 then be removed by forcing the 'C' locale and using octal escape
1 sequences:
1 
1      $ sed -n 's/.//g;=;l' invalid.txt | paste - -  | awk '$2!="$"'
1      3       \316\316$
1 
1      $ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
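
The removal step can be parameterized and checked with 'od'.  This is a
sketch under stated assumptions: 'remove_octets_on_line' is a hypothetical
helper, and its arguments (a line number and a sequence written with GNU
'sed''s '\o' octal escapes) are inserted into the script unescaped, so
they must come from a trusted source.

```shell
# Hypothetical helper.
#   $1: line number reported by the detection step
#   $2: octets to delete, as GNU sed octal escapes, e.g. '\o316\o316'
remove_octets_on_line() {
    LC_ALL=C sed "${1}s/${2}//"
}

# Same content as invalid.txt, written with portable octal escapes.
printf 'ab\nc\n\316\316de\n\316\243f\n' |
    remove_octets_on_line 3 '\o316\o316' | od -An -tx1
```

The two '\316' octets on line 3 should be gone, while the valid sequence
'\316\243' (Sigma) on line 4 is left untouched.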
1 
1 5.9.2 Upper/Lower case conversion
1 ---------------------------------
1 
1 GNU 'sed''s substitute command ('s') supports upper/lower case
1 conversions using the '\U' and '\L' codes.  These conversions support
1 multibyte characters:
1 
1      $ printf 'ABC\u03a3\n'
1      ABCΣ
1 
1      $ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
1      abcσ
1 
See The "s" Command.
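
A conversion can also be limited to part of the replacement by ending it
with '\E'.  The sketch below uses ASCII input only, so it does not depend
on the locale; the helper name 'upcase_keys' and the 'key=value' input
format are assumptions for illustration.

```shell
# Hypothetical helper: upper-case the key in 'key=value' lines.
# '\U' starts the conversion and '\E' ends it (both GNU extensions).
upcase_keys() {
    sed 's/^\([^=]*\)=/\U\1\E=/'
}

printf 'path=/usr/bin\n' | upcase_keys    # prints PATH=/usr/bin
```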
1 
1 5.9.3 Multibyte regexp character classes
1 ----------------------------------------
1 
1 In locales other than 'C' (also known as 'POSIX'), the sorting sequence
1 used by range expressions is not specified: '[a-d]' might be equivalent
1 to '[abcd]' or to '[aBbCcDd]', or it might fail to match any character,
1 or the set of characters that it matches might even be erratic.  To
1 obtain the traditional interpretation of bracket expressions, you can
1 use the 'C' locale by setting the 'LC_ALL' environment variable to the
1 value 'C'.
1 
1      # TODO: is there any real-world system/locale where 'A'
1      #       is replaced by '-' ?
1      $ echo A | sed 's/[a-z]/-/'
1      A
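
As a small sketch of the traditional interpretation, forcing the 'C'
locale makes a range follow the native octet order, so '[a-d]' matches
only the four lower-case letters.  The helper name 'mask_a_to_d' is
hypothetical.

```shell
# Hypothetical helper: replace 'a' through 'd' with 'X', using the
# native octet order of the 'C' locale for the range.
mask_a_to_d() {
    LC_ALL=C sed 's/[a-d]/X/g'
}

printf 'aBbCcDd\n' | mask_a_to_d    # prints XBXCXDX
```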
1 
1    The interpretation of named character classes depends on the
1 'LC_CTYPE' locale; for example, '[[:alnum:]]' means the character class
1 of numbers and letters in the current locale.
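
The following sketch shows the 'C' locale side of this dependency.  It
assumes, as is the case with glibc, that the 'C' locale classifies only
ASCII letters and digits as alphanumeric, so the two octets of a UTF-8
encoded 'é' are left alone.  The helper name 'mask_alnum_c' is
hypothetical.

```shell
# Hypothetical helper: replace every character that the 'C' locale
# considers alphanumeric with 'X'.
mask_alnum_c() {
    LC_ALL=C sed 's/[[:alnum:]]/X/g'
}

# 'café 9', with 'é' written as the UTF-8 octets \303 \251
printf 'caf\303\251 9\n' | mask_alnum_c | od -An -tx1
```

Under that assumption the output octets are 58 58 58 c3 a9 20 58 0a: the
ASCII letters and the digit become 'X', while '\303\251' and the space
pass through unchanged.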
1 
1    TODO: show example of collation
1 
1      # TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
1      $ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
1      clichX
1 
1    ---------- Footnotes ----------
1 
1    (1) Some regexp edge cases depend on the operating system and libc
1 implementation.  The examples shown are known to work as expected on
1 GNU/Linux systems using glibc.
1