cpp: Character sets

1.1 Character sets
==================

Source code character set processing in C and related languages is
rather complicated.  The C standard discusses two character sets, but
there are really at least four.

   The files input to CPP might be in any character set at all.  CPP's
very first action, before it even looks for line boundaries, is to
convert the file into the character set it uses for internal processing.
That set is what the C standard calls the "source" character set.  It
must be isomorphic with ISO 10646, also known as Unicode.  CPP uses the
UTF-8 encoding of Unicode.

   The character sets of the input files are specified using the
'-finput-charset=' option.

   All preprocessing work (the subject of the rest of this manual) is
carried out in the source character set.  If you request textual output
from the preprocessor with the '-E' option, it will be in UTF-8.

   After preprocessing is complete, string and character constants are
converted again, into the "execution" character set.  This character set
is under control of the user, through the '-fexec-charset=' option; the
default is UTF-8, matching the source character set.  Wide string and
character constants have their own character set, which is not called
out specifically in the standard.  Again, it is under control of the
user, through the '-fwide-exec-charset=' option.  The default is UTF-16
or UTF-32, whichever fits in the target's 'wchar_t' type, in the target
machine's byte order.(1)  Octal and hexadecimal escape sequences do not
undergo conversion; '\x12' has the value 0x12 regardless of the
currently selected execution character set.  All other escapes are
replaced by the character in the source character set that they
represent, then converted to the execution character set, just like
unescaped characters.
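
   The difference between numeric escapes and converted characters can
be checked with a small sketch such as the following.  The function
names are hypothetical and chosen only for illustration, and the second
function assumes the default UTF-8 execution character set; with a
different execution character set it would be expected to return 0.

```c
#include <string.h>

/* Hex and octal escapes are never converted: each one yields exactly
   the numeric value written.  Returns 1 when that holds.  */
int
numeric_escapes_are_literal (void)
{
  return '\x12' == 0x12          /* hexadecimal escape */
         && '\022' == 0x12       /* octal 22 is decimal 18, i.e. 0x12 */
         && "\x12"[0] == 0x12;   /* same rule inside a string constant */
}

/* A universal character name is first mapped to a source character and
   then converted to the execution character set.  ASSUMPTION: the
   execution character set is the default, UTF-8, in which U+00E9 is
   encoded as the two bytes 0xC3 0xA9.  Returns 1 when that is what the
   compiler produced, 0 otherwise.  */
int
e_acute_is_utf8 (void)
{
  static const char s[] = "\u00e9";

  return strlen (s) == 2
         && (unsigned char) s[0] == 0xC3
         && (unsigned char) s[1] == 0xA9;
}
```

   Calling both functions from a test program and checking that they
return 1 is one way to confirm which execution character set a given
build is using.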

   In identifiers, characters outside the ASCII range can only be
specified with the '\u' and '\U' escapes, not used directly.  If strict
ISO C90 conformance is specified with an option such as '-std=c90', or
'-fno-extended-identifiers' is used, then those escapes are not
permitted in identifiers.
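
   As an illustration of the escape form in identifiers, here is a
minimal sketch.  The function and variable names are hypothetical, and
the code assumes a compiler mode in which extended identifiers are
enabled (that is, neither strict C90 nor '-fno-extended-identifiers').

```c
/* \u00e9 is LATIN SMALL LETTER E WITH ACUTE, so the local variable
   below is spelled "e-acute, t, e-acute" without any non-ASCII byte
   appearing in the source file.  */
int
sum_with_ucn_identifier (void)
{
  int \u00e9t\u00e9 = 3;

  \u00e9t\u00e9 += 4;
  return \u00e9t\u00e9;
}
```

   Compiling the same function with an option such as '-std=c90' would
be expected to fail, since the escapes are then not accepted in
identifiers.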

   ---------- Footnotes ----------

   (1) UTF-16 does not meet the requirements of the C standard for a
wide character set, but the choice of 16-bit 'wchar_t' is enshrined in
some system ABIs so we cannot fix this.
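
   The practical effect described in this footnote can be observed with
a sketch like the one below.  The function names are hypothetical, and
the code assumes the default wide execution character set: a character
outside the Basic Multilingual Plane should occupy one 'wchar_t' where
UTF-32 is used and two (a surrogate pair) where UTF-16 is used.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on this target.  */
int
wchar_bits (void)
{
  return (int) (sizeof (wchar_t) * CHAR_BIT);
}

/* Number of wchar_t units the compiler emitted for U+1F600, a single
   character that does not fit in 16 bits.  */
size_t
units_for_non_bmp (void)
{
  static const wchar_t s[] = L"\U0001F600";

  /* Subtract one for the terminating null wide character.  */
  return sizeof s / sizeof s[0] - 1;
}
```

   A result of 2 from 'units_for_non_bmp' means that one character is
spread over two array elements, which is why UTF-16 does not satisfy
the standard's model of a wide character set.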