cpp: Tokenization

1 
1 1.3 Tokenization
1 ================
1 
1 After the textual transformations are finished, the input file is
1 converted into a sequence of "preprocessing tokens".  These mostly
1 correspond to the syntactic tokens used by the C compiler, but there are
1 a few differences.  White space separates tokens; it is not itself a
1 token of any kind.  Tokens do not have to be separated by white space,
1 but white space is often necessary to avoid ambiguities.
1 
1    When faced with a sequence of characters that has more than one
1 possible tokenization, the preprocessor is greedy.  It always makes each
1 token, starting from the left, as big as possible before moving on to
1 the next token.  For instance, 'a+++++b' is interpreted as
1 'a ++ ++ + b', not as 'a ++ + ++ b', even though the latter tokenization
1 could be part of a valid C program and the former could not.
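
The same greedy rule can be seen in a shorter expression that does compile. The sketch below is illustrative only; the function name `greedy_demo` and its parameters are hypothetical and do not come from the manual.

```c
/* Returns the value of the expression a+++b and stores the final
   value of a in *a_after. */
int greedy_demo(int a, int b, int *a_after)
{
    /* Greedy tokenization yields  a ++ + b,  that is (a++) + b.
       It is never read as  a + ++b. */
    int r = a+++b;
    *a_after = a;
    return r;
}
```

Called as `greedy_demo(1, 10, &after)`, this should return 11 and leave `after` equal to 2, which is what `(a++) + b` produces; `a + (++b)` would have given 12 and left `a` unchanged.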
1 
1    Once the input file is broken into tokens, the token boundaries never
1 change, except when the '##' preprocessing operator is used to paste
1 tokens together.  ⇒Concatenation.  For example,
1 
1      #define foo() bar
1      foo()baz
1           ==> bar baz
1      _not_
1           ==> barbaz
1 
1    The compiler does not re-tokenize the preprocessor's output.  Each
1 preprocessing token becomes one compiler token.
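
A compilable illustration of fixed token boundaries, offered as a sketch: the macro names `MINUS_ONE` and `PASTE` and the function `boundaries_demo` are made up for this example.

```c
#define MINUS_ONE -1
#define PASTE(a, b) a##b

int boundaries_demo(void)
{
    int value42 = 42;

    /* The tokens here are  -  -  1.  The two minus signs stay separate
       tokens and are never merged into a decrement operator, so x is 1. */
    int x = -MINUS_ONE;

    /* Only ## creates a new token: value and 42 become value42. */
    return x + PASTE(value, 42);
}
```

If the compiler re-tokenized the preprocessor's output, `-MINUS_ONE` would be a decrement applied to a constant and the function would not compile. As written it should return 43.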
1 
1    Preprocessing tokens fall into five broad classes: identifiers,
1 preprocessing numbers, string literals, punctuators, and other.  An
1 "identifier" is the same as an identifier in C: any sequence of letters,
1 digits, or underscores, which begins with a letter or underscore.
1 Keywords of C have no significance to the preprocessor; they are
1 ordinary identifiers.  You can define a macro whose name is a keyword,
1 for instance.  The only identifier which can be considered a
1 preprocessing keyword is 'defined'.  ⇒Defined.
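
A small sketch of 'defined' in use. The macro names `HAVE_FAST_PATH`, `HAVE_SLOW_PATH`, and `SELECTED_PATH` are hypothetical; both the parenthesized and unparenthesized spellings of the operator are shown.

```c
#define HAVE_FAST_PATH 1

/* defined(NAME) and defined NAME are equivalent spellings. */
#if defined(HAVE_FAST_PATH) && !defined HAVE_SLOW_PATH
#define SELECTED_PATH 1
#else
#define SELECTED_PATH 0
#endif

int selected_path(void)
{
    return SELECTED_PATH;
}
```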
1 
1    The same is mostly true of the keywords of other languages which use
1 the C preprocessor.  However, a few of the keywords of C++ are
1 significant even in the preprocessor.  ⇒C++ Named Operators.
1 
1    In the 1999 C standard, identifiers may contain letters which are not
1 part of the "basic source character set", at the implementation's
1 discretion (such as accented Latin letters, Greek letters, or Chinese
1 ideograms).  This may be done with an extended character set, or the
1 '\u' and '\U' escape sequences.  GCC only accepts such characters in the
1 '\u' and '\U' forms.
1 
1    As an extension, GCC treats '$' as a letter.  This is for
1 compatibility with some systems, such as VMS, where '$' is commonly used
1 in system-defined function and object names.  '$' is not a letter in
1 strictly conforming mode, or if you specify the '-$' option.  ⇒Invocation.
1 
1    A "preprocessing number" has a rather bizarre definition.  The
1 category includes all the normal integer and floating point constants
1 one expects of C, but also a number of other things one might not
1 initially recognize as a number.  Formally, preprocessing numbers begin
1 with an optional period, a required decimal digit, and then continue
1 with any sequence of letters, digits, underscores, periods, and
1 exponents.  Exponents are the two-character sequences 'e+', 'e-', 'E+',
1 'E-', 'p+', 'p-', 'P+', and 'P-'.  (The exponents that begin with 'p' or
1 'P' are used for hexadecimal floating-point constants.)
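
The formal definition can be written down as a short scanner. This is a minimal sketch of the rule as stated above, not GCC's actual lexer; the function name `pp_number_length` is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the preprocessing number that starts at s,
   or 0 if no preprocessing number starts there. */
size_t pp_number_length(const char *s)
{
    size_t i = 0;

    if (s[i] == '.')                      /* optional period */
        i++;
    if (!isdigit((unsigned char)s[i]))    /* required decimal digit */
        return 0;
    i++;

    for (;;) {
        unsigned char c = (unsigned char)s[i];

        /* Exponents: e+ e- E+ E- p+ p- P+ P- */
        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P')
            && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;
        else if (isalnum(c) || c == '_' || c == '.')
            i++;
        else
            break;
    }
    return i;
}
```

Under this sketch, `pp_number_length("0xE+12")` is 6 because the whole string is one preprocessing number, while `pp_number_length("1+2")` is 1 because a sign is only absorbed directly after 'e', 'E', 'p', or 'P'.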
1 
1    The purpose of this unusual definition is to isolate the preprocessor
1 from the full complexity of numeric constants.  It does not have to
1 distinguish between lexically valid and invalid floating-point numbers,
1 which is complicated.  The definition also permits you to split an
1 identifier at any position and get exactly two tokens, which can then be
1 pasted back together with the '##' operator.
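
As an illustration of splitting an identifier and pasting it back together, consider the sketch below. The macro `JOIN` and the variable `buf2x` are hypothetical names chosen for the example.

```c
#define JOIN(a, b) a##b

int join_demo(void)
{
    int buf2x = 7;

    /* buf is an identifier and 2x is a single preprocessing number,
       so the identifier buf2x splits into exactly two tokens.
       Pasting them with ## yields buf2x again. */
    return JOIN(buf, 2x);
}
```

Here `2x` is not a valid C numeric constant, but it is a valid preprocessing number, which is what allows it to be passed as a single macro argument. The function should return 7.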
1 
1    It's possible for preprocessing numbers to cause programs to be
1 misinterpreted.  For example, '0xE+12' is a preprocessing number which
1 does not translate to any valid numeric constant, and is therefore a
1 syntax error.  It does not mean '0xE + 12', which is what you might have
1 intended.
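
The usual remedy is to separate the operator with white space. A brief sketch, with the hypothetical function name `hex_plus`:

```c
int hex_plus(void)
{
    /* return 0xE+12;    rejected: 0xE+12 is one preprocessing number */
    return 0xE + 12;  /* three tokens: 0xE, +, 12 */
}
```

With the spaces in place the expression is an ordinary addition, 14 + 12, so the function should return 26.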
1 
1    "String literals" are string constants, character constants, and
1 header file names (the argument of '#include').(1)  String constants and
1 character constants are straightforward: "..." or '...'.  In either case
1 embedded quotes should be escaped with a backslash: '\'' is the
1 character constant for '''.  There is no limit on the length of a
1 character constant, but the value of a character constant that contains
1 more than one character is implementation-defined.  ⇒Implementation Details.
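
The escaping rule for embedded quotes looks like this in practice. The names `quote_char`, `quoted`, and `quote_demo` are illustrative only.

```c
#include <string.h>

/* A character constant for the single quote itself. */
static const char quote_char = '\'';

/* A string constant containing two escaped double quotes. */
static const char quoted[] = "say \"hi\"";

int quote_demo(void)
{
    /* Each escaped quote is one character in the resulting string. */
    return quote_char == "'"[0]
        && quoted[4] == '"'
        && strlen(quoted) == 8;
}
```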
1 
1    Header file names either look like string constants, "...", or are
1 written with angle brackets instead, <...>.  In either case, backslash
1 is an ordinary character.  There is no way to escape the closing quote
1 or angle bracket.  The preprocessor looks for the header file in
1 different places depending on which form you use.  ⇒Include Operation.
1 
1    No string literal may extend past the end of a line.  To write a
1 long string, use continued lines or string constant concatenation.
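
Both ways of writing a long string can be compared side by side. This is a sketch with hypothetical names (`joined`, `continued`, `same_string`).

```c
#include <string.h>

/* Adjacent string constants are concatenated by the compiler. */
static const char joined[] = "preprocessing "
                             "tokens";

/* A backslash-newline continues the line.  The continuation starts in
   column one so that no extra spaces end up inside the string. */
static const char continued[] = "preprocessing \
tokens";

int same_string(void)
{
    return strcmp(joined, continued) == 0;
}
```

Concatenation is usually easier to keep tidy, because a continued line carries any leading indentation into the string.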
1 
1    "Punctuators" are all the usual bits of punctuation which are
1 meaningful to C and C++.  All but three of the punctuation characters in
1 ASCII are C punctuators.  The exceptions are '@', '$', and '`'.  In
1 addition, all the two- and three-character operators are punctuators.
1 There are also six "digraphs", which the C++ standard calls "alternative
1 tokens", which are merely alternate ways to spell other punctuators.
1 This is a second attempt to work around missing punctuation in obsolete
1 systems.  It has no negative side effects, unlike trigraphs, but does
1 not cover as much ground.  The digraphs and their corresponding normal
1 punctuators are:
1 
1      Digraph:        <%  %>  <:  :>  %:  %:%:
1      Punctuator:      {   }   [   ]   #    ##
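
To show the digraphs in context, here is a small function written entirely with them. It is a sketch that assumes a compiler with digraph support enabled; the names `CAT` and `digraph_sum` are hypothetical.

```c
%:define CAT(a, b) a %:%: b      /* same as: #define CAT(a, b) a ## b */

int digraph_sum(void) <%
    int values<:3:> = <% 1, 2, 3 %>;
    int CAT(to, tal) = 0;         /* declares the variable total */

    for (int i = 0; i < 3; i++)
        total += values<:i:>;
    return total;
%>
```

Each digraph behaves exactly like the punctuator it stands for, so this should compile to the same code as the version written with braces, brackets, '#', and '##'.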
1 
1    Any other single character is considered "other".  It is passed on to
1 the preprocessor's output unmolested.  The C compiler will almost
1 certainly reject source code containing "other" tokens.  In ASCII, the
1 only other characters are '@', '$', '`', and control characters other
1 than NUL (all bits zero).  (Note that '$' is normally considered a
1 letter.)  All characters with the high bit set (numeric range 0x80-0xFF)
1 are also "other" in the present implementation.  This will change when
1 proper support for international character sets is added to GCC.
1 
1    NUL is a special case because of the high probability that its
1 appearance is accidental, and because it may be invisible to the user
1 (many terminals do not display NUL at all).  Within comments, NULs are
1 silently ignored, just as any other character would be.  In running
1 text, NUL is considered white space.  For example, these two directives
1 have the same meaning.
1 
1      #define X^@1
1      #define X 1
1 
1 (where '^@' is ASCII NUL).  Within string or character constants, NULs
1 are preserved.  In the latter two cases (running text and string or
1 character constants) the preprocessor emits a warning message.
1 
1    ---------- Footnotes ----------
1 
1    (1) The C standard uses the term "string literal" to refer only to
1 what we are calling "string constants".
1