sed: Locale Considerations
1
1 5.9 Multibyte characters and Locale Considerations
1 ==================================================
1
1 GNU 'sed' processes valid multibyte characters in multibyte locales
1 (e.g. 'UTF-8'). (1)
1
1 The following example uses the Greek letter Capital Sigma (Σ, Unicode
1 code point '0x03A3'). In a 'UTF-8' locale, 'sed' correctly processes
1 the Sigma as one character despite it being 2 octets (bytes):
1
1 $ locale | grep LANG
1 LANG=en_US.UTF-8
1
1 $ printf 'a\u03A3b'
1 aΣb
1
1 $ printf 'a\u03A3b' | sed 's/./X/g'
1 XXX
1
1 $ printf 'a\u03A3b' | od -tx1 -An
1 61 ce a3 62
1
1 To force 'sed' to process octets separately, use the 'C' locale (also
1 known as the 'POSIX' locale):
1
1 $ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
1 XXXX
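
The difference between the two locales can also be checked from a script.
The following is a minimal sketch rather than part of the original
examples: it writes the Sigma as the octal escapes '\316\243' so that it
does not depend on the shell's '\u' support, it assumes GNU 'sed', and it
only runs the UTF-8 case if 'locale -a' lists such a locale. The variable
names are illustrative.

```shell
#!/bin/sh
# Assumption: 'locale -a' may or may not list a UTF-8 locale on this system.
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1)

# Input: 'a', Capital Sigma (octets CE A3), 'b'.
# In the 'C' locale every octet is replaced separately.
as_octets=$(printf 'a\316\243b' | LC_ALL=C sed 's/./X/g')
echo "C locale: $as_octets"

# In a UTF-8 locale the two octets of the Sigma count as one character.
if [ -n "$utf8_locale" ]; then
  as_chars=$(printf 'a\316\243b' | LC_ALL="$utf8_locale" sed 's/./X/g')
  echo "$utf8_locale: $as_chars"
fi
```

The 'C' locale result is 'XXXX'; with a working UTF-8 locale the second
result should be 'XXX', as in the examples above.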
1
1 5.9.1 Invalid multibyte characters
1 ----------------------------------
1
1 'sed''s regular expressions _do not_ match invalid multibyte sequences
1 in a multibyte locale.
1
1 In the following examples, the ascii value '0xCE' is an incomplete
1 multibyte character (shown here as U+FFFD). The regular expression '.'
1 does not match it:
1
1 $ printf 'a\xCEb\n'
1 aU+FFFDe
1
1 $ printf 'a\xCEb\n' | sed 's/./X/g'
1 XU+FFFDX
1
1 $ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
1 58 ce 58 0a
1 X X \n
1
1 Similarly, the 'catch-all' regular expression '.*' does not match the
1 entire line:
1
1 $ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
1 ce 63 0a
1 c \n
1
1 GNU 'sed' offers the special 'z' command to clear the current pattern
1 space regardless of invalid multibyte characters (i.e. it works like
1 's/.*//' but also removes invalid multibyte characters):
1
1 $ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
1 0a
1 \n
1
1 Alternatively, force the 'C' locale to process each octet separately
1 (every octet is a valid character in the 'C' locale):
1
1 $ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
1 0a
1 \n
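
Because 'z' does not rely on regular-expression matching, it can be
combined with an address to empty selected lines even when they contain
invalid octets. A small sketch, assuming GNU 'sed'; the 'DROP' marker and
the sample input are made up for illustration:

```shell
#!/bin/sh
# Lines containing the (hypothetical) marker DROP are emptied with 'z'.
# The stray octet 0xCE on the second line does not prevent this, because
# the address only needs to find the ASCII text 'DROP'.
blanked=$(printf 'keep\nDROP \316 x\nalso keep\n' | sed '/DROP/z')

# Dump the result octet by octet to confirm that no 0xCE remains.
printf '%s\n' "$blanked" | od -tx1c -An
```

The emptied line is kept as an empty line; 'z' clears the pattern space
but does not delete it.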
1
1 'sed''s inability to process invalid multibyte characters can be used
1 to detect such invalid sequences in a file. In the following examples,
1 the '\xCE\xCE' is an invalid multibyte sequence, while '\xCE\xA3' is a
1 valid multibyte sequence (of the Greek Sigma character).
1
1 The following 'sed' program removes all valid characters using 's/.//g'.
1 Any content left in the pattern space (the invalid characters) is added
1 to the hold space using the 'H' command. On the last line ('$'), the
1 hold space is retrieved ('x'), newlines are removed ('s/\n//g'), and any
1 remaining octets are printed unambiguously ('l'). Thus, any invalid
1 multibyte sequences are printed as octal values:
1
1 $ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
1
1 $ cat invalid.txt
1 ab
1 c
1 U+FFFDU+FFFDde
1 U+03A3f
1
1 $ sed -n 's/.//g ; H ; ${x;s/\n//g;l}' invalid.txt
1 \316\316$
1
1 With a few more commands, 'sed' can print the exact line number
1 of each invalid sequence (line 3 in this file). The invalid octets can
1 then be removed by forcing the 'C' locale and using octal escape
1 sequences:
1
1 $ sed -n 's/.//g;=;l' invalid.txt | paste - - | awk '$2!="$"'
1 3 \316\316$
1
1 $ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
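
The locate-then-remove steps above can be wrapped in a small script. This
is a sketch under assumptions: 'find_invalid' is a hypothetical helper
name, the locale to validate against is passed explicitly, GNU 'sed' is
required for '\o316', and lines are assumed to be short enough that 'l'
does not wrap its output.

```shell
#!/bin/sh
# Hypothetical helper: print the numbers of the lines that still contain
# octets after all valid characters have been removed.
# $1 = locale to validate against, $2 = file
find_invalid() {
  LC_ALL="$1" sed -n 's/.//g;=;l' "$2" | paste - - |
    awk '$2 != "$" { print $1 }'
}

workdir=$(mktemp -d)
printf 'ab\nc\n\316\316de\n\316\243f\n' > "$workdir/invalid.txt"

# Remove the two stray octets on line 3, octet by octet, in the 'C' locale.
LC_ALL=C sed '3s/\o316\o316//' "$workdir/invalid.txt" > "$workdir/fixed.txt"

# Every octet is a valid character in the 'C' locale, so nothing is
# reported there, even for the unrepaired file.
c_report=$(find_invalid C "$workdir/invalid.txt")

# Hex dump of the repaired file, for inspection.
fixed_hex=$(od -An -tx1 "$workdir/fixed.txt" | tr -d ' \n')
echo "$fixed_hex"

rm -rf "$workdir"
```

Called with the name of an installed UTF-8 locale (for example
'en_US.UTF-8') as the first argument, 'find_invalid' is expected to report
line 3 for 'invalid.txt' and nothing for 'fixed.txt'.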
1
1 5.9.2 Upper/Lower case conversion
1 ---------------------------------
1
1 GNU 'sed''s substitute command ('s') supports upper/lower case
1 conversions using the '\U' and '\L' codes. These conversions support
1 multibyte characters:
1
1 $ printf 'ABC\u03a3\n'
1 ABCΣ
1
1 $ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
1 abcσ
1
1 See The "s" Command.
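
The same codes work on plain ASCII text in any locale and can be combined
within one replacement. A brief sketch with made-up input, assuming GNU
'sed'; the '\E' code, which ends a conversion, is likewise a GNU extension:

```shell
#!/bin/sh
# Upper-case the whole line.
upper=$(printf 'hello World\n' | sed 's/.*/\U&/')

# Upper-case the first character and lower-case the rest.
capitalized=$(printf 'hELLO wORLD\n' | sed 's/\(.\)\(.*\)/\U\1\L\2/')

# '\E' stops the conversion, so only the first word is upper-cased.
first_word=$(printf 'hello world\n' | sed 's/\([^ ]*\)\(.*\)/\U\1\E\2/')

printf '%s\n' "$upper" "$capitalized" "$first_word"
```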
1
1 5.9.3 Multibyte regexp character classes
1 ----------------------------------------
1
1 In the 'C' locale, the sorting sequence is the native character order;
1 for example, '[a-d]' is equivalent to '[abcd]'. In other locales, the
1 sorting sequence is not specified, and '[a-d]' might be equivalent to
1 '[abcd]' or to '[aBbCcDd]', or it might fail to match any character, or
1 the set of characters that it matches might even be erratic. To obtain
1 the traditional interpretation of bracket expressions, you can use the
1 'C' locale by setting the 'LC_ALL' environment variable to the value
1 'C'.
1
1 # TODO: is there any real-world system/locale where 'A'
1 # is replaced by '-' ?
1 $ echo A | sed 's/[a-z]/-/'
1 A
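
For scripts that rely on the traditional meaning of ranges, the locale can
be pinned for the 'sed' invocation alone, leaving the rest of the
environment untouched. A minimal sketch with arbitrary sample input:

```shell
#!/bin/sh
# In the 'C' locale, '[a-d]' matches exactly a, b, c and d.
ranged=$(printf 'aBcDe\n' | LC_ALL=C sed 's/[a-d]/-/g')
printf '%s\n' "$ranged"

# An upper-case letter is not part of '[a-z]' in the 'C' locale,
# so the input passes through unchanged.
unchanged=$(echo A | LC_ALL=C sed 's/[a-z]/-/')
printf '%s\n' "$unchanged"
```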
1
1 The interpretation of named character classes depends on the 'LC_CTYPE'
1 locale; for example, '[[:alnum:]]' means the character class of numbers
1 and letters in the current locale.
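
As a baseline for comparison, in the 'C' locale '[[:alnum:]]' covers only
the ASCII letters and digits. A short sketch with made-up input:

```shell
#!/bin/sh
# Replace every letter or digit; punctuation and spaces are kept.
masked=$(printf 'ab-12 c_3\n' | LC_ALL=C sed 's/[[:alnum:]]/X/g')
printf '%s\n' "$masked"
```

In a UTF-8 locale, accented letters such as 'é' are expected to match
'[[:alnum:]]' as well, subject to the libc caveats mentioned in this
section.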
1
1 TODO: show example of collation
1
1 # TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
1 $ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
1 clichX
1
1 ---------- Footnotes ----------
1
1 (1) Some regexp edge cases depend on the operating system and libc
1 implementation. The examples shown are known to work as expected on
1 GNU/Linux systems using glibc.
1