gawk: Ranges and Locales

A.8 Regexp Ranges and Locales: A Long Sad Story
===============================================

This minor node describes the confusing history of ranges within regular
expressions and their interactions with locales, and how this affected
different versions of 'gawk'.

The original Unix tools that worked with regular expressions defined
character ranges (such as '[a-z]') to match any character between the
first character in the range and the last character in the range,
inclusive. Ordering was based on the numeric value of each character in
the machine's native character set. Thus, on ASCII-based systems,
'[a-z]' matched all the lowercase letters, and only the lowercase
letters, as the numeric values for the letters from 'a' through 'z' were
contiguous. (On an EBCDIC system, the range '[a-z]' includes additional
nonalphabetic characters as well.)

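As a minimal sketch of this traditional interpretation, the following
shell commands force the '"C"' locale, where ranges follow numeric
character values. The use of 'LC_ALL=C', of plain 'awk' rather than a
particular 'gawk' version, and the sample input are assumptions made
for illustration, and an ASCII-based system is assumed:

```shell
# Numeric values of the range endpoints on an ASCII-based system.
printf '%d %d\n' "'a" "'z"

# In the "C" locale, [a-z] matches only characters whose numeric values
# lie between those two endpoints, so 'M' and 'A' are filtered out.
lower=$(printf 'a\nM\nz\nA\n' | LC_ALL=C awk '/^[a-z]$/')
echo "$lower"
```

Only the lines 'a' and 'z' should survive the filter, because the
uppercase letters have smaller numeric values than 'a'.
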
Almost all introductory Unix literature explained range expressions
as working in this fashion, and in particular, would teach that the
"correct" way to match lowercase letters was with '[a-z]', and that
'[A-Z]' was the "correct" way to match uppercase letters. And indeed,
this was true.(1)

The 1992 POSIX standard introduced the idea of locales (⇒Locales).
Because many locales include other letters besides the plain 26 letters
of the English alphabet, the POSIX standard added character classes
(⇒Bracket Expressions) as a way to match different kinds of characters
besides the traditional ones in the ASCII character set.

However, the standard _changed_ the interpretation of range
expressions. In the '"C"' and '"POSIX"' locales, a range expression
like '[a-dx-z]' is still equivalent to '[abcdxyz]', as in ASCII. But
outside those locales, the ordering was defined to be based on
"collation order".

What does that mean? In many locales, 'A' and 'a' are both less than
'B'. In other words, these locales sort characters in dictionary order,
and '[a-dx-z]' is typically not equivalent to '[abcdxyz]'; instead, it
might be equivalent to '[ABCXYabcdxyz]', for example.

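The difference between the two orderings can be observed with the
'sort' utility, which follows the same locale rules. This is only an
illustrative sketch: the sample letters are hypothetical, and the
second result depends entirely on which locale is active on your
system.

```shell
# Hypothetical sample data: upper- and lowercase letters, one per line.
letters=$(printf 'b\nA\na\nB\n')

# Numeric order ("C" locale): every uppercase letter sorts before
# every lowercase letter.
c_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')
echo "C locale:       $c_order"

# Order under whatever locale is currently in effect. In a
# dictionary-order locale such as en_US.UTF-8, the two cases are
# typically interleaved instead.
cur_order=$(printf '%s\n' "$letters" | sort | tr -d '\n')
echo "current locale: $cur_order"
```

If both lines print the same ordering, the current locale is probably
'"C"' or '"POSIX"', and ranges keep their traditional meaning there.
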
This point needs to be emphasized: much literature teaches that you
should use '[a-z]' to match a lowercase character. But on systems with
non-ASCII locales, this also matches all of the uppercase characters
except 'A' or 'Z'! This was a continual source of confusion, even well
into the twenty-first century.

To demonstrate these issues, the following example uses the 'sub()'
function, which does text replacement (⇒String Functions). Here,
the intent is to remove trailing uppercase characters:

     $ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'
     -| something1234a

This output is unexpected, as the 'bc' at the end of 'something1234abc'
should not normally match '[A-Z]*'. This result is due to the locale
setting (and thus you may not see it on your system).

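Two locale-independent ways of expressing the same intent are sketched
below: forcing the '"C"' locale, or using the '[[:upper:]]' character
class in place of the range. The 'AWK' variable and the second sample
input are assumptions for illustration; any reasonably recent
POSIX-style 'awk' should behave the same way here.

```shell
# Prefer gawk when it is installed; otherwise fall back to the
# system awk (an assumption made for portability of this sketch).
AWK=$(command -v gawk || command -v awk)

# Option 1: force the "C" locale so that [A-Z] covers only 'A'
# through 'Z'. The input has no trailing uppercase letters, so it
# is printed unchanged.
by_locale=$(echo something1234abc |
            LC_ALL=C "$AWK" '{ sub("[A-Z]*$", ""); print }')

# Option 2: name the uppercase letters with a character class
# instead of relying on a range. The trailing 'ABC' is removed.
by_class=$(echo something1234ABC |
           "$AWK" '{ sub("[[:upper:]]*$", ""); print }')

echo "$by_locale"
echo "$by_class"
```

The character class is usually the better choice when the program must
also handle non-English letters, as it does not depend on how the
locale orders characters.
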
Similar considerations apply to other ranges. For example, '["-/]'
is perfectly valid in ASCII, but is not valid in many Unicode locales,
such as 'en_US.UTF-8'.

Early versions of 'gawk' used regexp matching code that was not
locale-aware, so ranges had their traditional interpretation.

When 'gawk' switched to using locale-aware regexp matchers, the
problems began, especially as both GNU/Linux and commercial Unix vendors
started implementing non-ASCII locales, _and making them the default_.
Perhaps the most frequently asked question became something like, "Why
does '[A-Z]' match lowercase letters?!?"

This situation existed for close to 10 years, if not more, and the
'gawk' maintainer grew weary of trying to explain that 'gawk' was being
nicely standards-compliant, and that the issue was in the user's locale.
During the development of version 4.0, he modified 'gawk' to always
treat ranges in the original, pre-POSIX fashion, unless '--posix' was
used (⇒Options).(2)

Fortunately, shortly before the final release of 'gawk' 4.0, the
maintainer learned that the 2008 standard had changed the definition of
ranges, such that outside the '"C"' and '"POSIX"' locales, the meaning
of range expressions was _undefined_.(3)

By using this lovely technical term, the standard gives license to
implementers to implement ranges in whatever way they choose. The
'gawk' maintainer chose to apply the pre-POSIX meaning both with the
default regexp matching and when '--traditional' or '--posix' is used.
In all cases 'gawk' remains POSIX-compliant.

---------- Footnotes ----------

(1) And Life was good.

(2) And thus was born the Campaign for Rational Range Interpretation
(or RRI). A number of GNU tools have already implemented this change, or
will soon. Thanks to Karl Berry for coining the phrase "Rational Range
Interpretation."

(3) See the standard
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05)
and its rationale
(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05).