gawk: Ranges and Locales

A.8 Regexp Ranges and Locales: A Long Sad Story
===============================================

This minor node describes the confusing history of ranges within regular
expressions and their interactions with locales, and how this affected
different versions of 'gawk'.

   The original Unix tools that worked with regular expressions defined
character ranges (such as '[a-z]') to match any character between the
first character in the range and the last character in the range,
inclusive.  Ordering was based on the numeric value of each character in
the machine's native character set.  Thus, on ASCII-based systems,
'[a-z]' matched all the lowercase letters, and only the lowercase
letters, as the numeric values for the letters from 'a' through 'z' were
contiguous.  (On an EBCDIC system, the range '[a-z]' includes additional
nonalphabetic characters as well.)

   Almost all introductory Unix literature explained range expressions
as working in this fashion, and in particular, would teach that the
"correct" way to match lowercase letters was with '[a-z]', and that
'[A-Z]' was the "correct" way to match uppercase letters.  And indeed,
this was true.(1)

   The 1992 POSIX standard introduced the idea of locales (⇒Locales).
Because many locales include other letters besides the plain 26 letters
of the English alphabet, the POSIX standard added character classes
(⇒Bracket Expressions) as a way to match different kinds of characters
besides the traditional ones in the ASCII character set.

   However, the standard _changed_ the interpretation of range
expressions.  In the '"C"' and '"POSIX"' locales, a range expression
like '[a-dx-z]' is still equivalent to '[abcdxyz]', as in ASCII.  But
outside those locales, the ordering was defined to be based on
"collation order".

   What does that mean?  In many locales, 'A' and 'a' are both less than
'B'.  In other words, these locales sort characters in dictionary order,
and '[a-dx-z]' is typically not equivalent to '[abcdxyz]'; instead, it
might be equivalent to '[ABCXYabcdxyz]', for example.
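
   The traditional interpretation can be observed in isolation by
forcing the '"C"' locale for a single command.  The following is only a
sketch: it assumes a POSIX shell and some POSIX-conforming 'awk' on the
search path (not necessarily 'gawk'), and the variable name 'matched' is
purely illustrative:

```shell
# Feed one character per line and keep the lines that the range matches.
# LC_ALL=C pins the range to the traditional, codeset-based ordering.
matched=$(printf '%s\n' a A b B d D x X z Z |
          LC_ALL=C awk '/^[a-dx-z]$/' | tr -d '\n')
echo "$matched"
```

With 'LC_ALL=C', only the lowercase letters 'a', 'b', 'd', 'x', and 'z'
get through.  Under a locale that collates in dictionary order, a
matcher that follows the 1992 rules could let some of the uppercase
letters through as well; the exact result there depends on the system.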

   This point needs to be emphasized: much literature teaches that you
should use '[a-z]' to match a lowercase character.  But on systems with
non-ASCII locales, this also matches all of the uppercase characters
except 'A' or 'Z'!  (Which of the two is excluded depends on whether the
locale collates uppercase before or after lowercase.)  This was a
continual source of confusion, even well into the twenty-first century.

   To demonstrate these issues, the following example uses the 'sub()'
function, which does text replacement (⇒String Functions).  Here,
the intent is to remove trailing uppercase characters:

     $ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'
     -| something1234a

This output is unexpected, as the 'bc' at the end of 'something1234abc'
should not normally match '[A-Z]*'.  This result is due to the locale
setting (and thus you may not see it on your system).
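
   One locale-independent way to express the intent of this example is
to use a character class, as mentioned earlier, instead of a range.  The
following is a minimal sketch, assuming a POSIX shell and an 'awk' that
supports POSIX character classes (some older implementations do not);
'strip_upper' is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: strip trailing uppercase letters from its argument.
# [[:upper:]] names the class directly, so no range ordering is involved.
strip_upper() {
    printf '%s\n' "$1" | awk '{ sub(/[[:upper:]]*$/, ""); print }'
}

kept=$(strip_upper something1234abc)      # no trailing uppercase: unchanged
stripped=$(strip_upper something1234ABC)  # trailing ABC is removed
printf '%s\n%s\n' "$kept" "$stripped"
```

Because the class is defined by the locale's notion of "uppercase"
rather than by an ordering, the lowercase 'abc' is left alone.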

   Similar considerations apply to other ranges.  For example, '["-/]'
is perfectly valid in ASCII, but is not valid in many Unicode locales,
such as 'en_US.UTF-8'.
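
   In the '"C"' locale, the range '["-/]' covers the ASCII characters
from '"' (code 0x22) through '/' (code 0x2F).  The sketch below assumes
a POSIX shell and some POSIX-conforming 'awk'; it passes the regexp as a
string constant, as the 'sub()' example above does, so that the '/'
needs no special quoting.  The variable name 'punct' is illustrative:

```shell
# '!' (0x21) and '0' (0x30) lie just outside the range;
# '#', '+', and '/' lie inside it.
punct=$(printf '%s\n' '!' '#' '+' '/' '0' |
        LC_ALL=C awk '$0 ~ "^[\"-/]$"' | tr -d '\n')
echo "$punct"
```

Without 'LC_ALL=C', whether the same range is accepted at all depends on
the locale and on the regexp matcher in use.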

   Early versions of 'gawk' used regexp matching code that was not
locale-aware, so ranges had their traditional interpretation.

   When 'gawk' switched to using locale-aware regexp matchers, the
problems began, especially as both GNU/Linux and commercial Unix vendors
started implementing non-ASCII locales, _and making them the default_.
Perhaps the most frequently asked question became something like, "Why
does '[A-Z]' match lowercase letters?!?"

   This situation existed for close to 10 years, if not more, and the
'gawk' maintainer grew weary of trying to explain that 'gawk' was being
nicely standards-compliant, and that the issue was in the user's locale.
During the development of version 4.0, he modified 'gawk' to always
treat ranges in the original, pre-POSIX fashion, unless '--posix' was
used (⇒Options).(2)

   Fortunately, shortly before the final release of 'gawk' 4.0, the
maintainer learned that the 2008 standard had changed the definition of
ranges, such that outside the '"C"' and '"POSIX"' locales, the meaning
of range expressions was _undefined_.(3)

   By using this lovely technical term, the standard gives license to
implementers to implement ranges in whatever way they choose.  The
'gawk' maintainer chose to apply the pre-POSIX meaning both with the
default regexp matching and when '--traditional' or '--posix' is used.
In all cases 'gawk' remains POSIX-compliant.

   ---------- Footnotes ----------

   (1) And Life was good.

   (2) And thus was born the Campaign for Rational Range Interpretation
(or RRI).  A number of GNU tools have already implemented this change, or
will soon.  Thanks to Karl Berry for coining the phrase "Rational Range
Interpretation."

   (3) See the standard
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05)
and its rationale
(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05).