cpp: Tokenization
1
1 1.3 Tokenization
1 ================
1
1 After the textual transformations are finished, the input file is
1 converted into a sequence of "preprocessing tokens". These mostly
1 correspond to the syntactic tokens used by the C compiler, but there are
1 a few differences. White space separates tokens; it is not itself a
1 token of any kind. Tokens do not have to be separated by white space,
1 but white space is often needed to avoid ambiguities.
1
1 When faced with a sequence of characters that has more than one
1 possible tokenization, the preprocessor is greedy. It always makes each
1 token, starting from the left, as big as possible before moving on to
1 the next token. For instance, 'a+++++b' is interpreted as
1 'a ++ ++ + b', not as 'a ++ + ++ b', even though the latter tokenization
1 could be part of a valid C program and the former could not.
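The greedy rule can be sketched as a tiny scanner. This is only an illustration, not how GCC is implemented: 'munch' and 'tokenize' are hypothetical helpers, and they assume a toy language whose only tokens are identifiers and the punctuators '+' and '++'.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the longest token starting at s.
   Only identifiers, "+" and "++" are recognized; any other single
   character counts as a one-character token. */
size_t munch(const char *s)
{
    size_t n = 0;

    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            n++;
        return n;
    }
    if (s[0] == '+')
        return s[1] == '+' ? 2 : 1;   /* prefer "++" over "+" */
    return s[0] != '\0';
}

/* Hypothetical helper: copy the tokens of src into out, separated by
   single spaces.  Returns the number of characters written, not
   counting the terminating NUL.  Stops early if out is too small. */
size_t tokenize(const char *src, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) {
            src++;                    /* white space only separates */
            continue;
        }
        size_t len = munch(src);
        size_t need = len + (used > 0 ? 1 : 0);
        if (used + need + 1 > out_size)
            break;
        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, src, len);
        used += len;
        src += len;
    }
    out[used] = '\0';
    return used;
}
```

Given the input 'a+++++b', 'tokenize' fills the buffer with 'a ++ ++ + b', because each call to 'munch' takes the longest token it can before moving on.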
1
1 Once the input file is broken into tokens, the token boundaries never
1 change, except when the '##' preprocessing operator is used to paste
1 tokens together. ⇒Concatenation. For example,
1
1      #define foo() bar
1      foo()baz
1           ==> bar baz
1      _not_
1           ==> barbaz
1
1 The compiler does not re-tokenize the preprocessor's output. Each
1 preprocessing token becomes one compiler token.
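One way to observe that token boundaries survive macro expansion is to let a macro produce a token that would mean something different if it merged with its neighbor. The sketch below assumes a hypothetical macro 'NEG' that expands to a lone '-'.

```c
/* NEG is a hypothetical function-like macro that expands to a
   single '-' token. */
#define NEG() -

int negate_twice(int x)
{
    /* The tokens are '-', then the '-' produced by NEG(), then 'x'.
       They remain three separate tokens, so this is -(-x) and never
       the decrement operator '--'. */
    return -NEG()x;
}
```

If the compiler re-tokenized the output, the two minus signs would fuse into '--' and the function would try to decrement 'x'. Instead, 'negate_twice(5)' evaluates '-(-5)'.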
1
1 Preprocessing tokens fall into five broad classes: identifiers,
1 preprocessing numbers, string literals, punctuators, and other. An
1 "identifier" is the same as an identifier in C: any sequence of letters,
1 digits, or underscores, which begins with a letter or underscore.
1 Keywords of C have no significance to the preprocessor; they are
1 ordinary identifiers. You can define a macro whose name is a keyword,
1 for instance. The only identifier which can be considered a
1 preprocessing keyword is 'defined'. ⇒Defined.
1
1 The same is mostly true of other languages which use the C preprocessor.
1 However, a few of the keywords of C++ are significant even in the
1 preprocessor. ⇒C++ Named Operators.
1
1 In the 1999 C standard, identifiers may contain letters which are not
1 part of the "basic source character set", at the implementation's
1 discretion (such as accented Latin letters, Greek letters, or Chinese
1 ideograms). This may be done with an extended character set, or the
1 '\u' and '\U' escape sequences. GCC only accepts such characters in the
1 '\u' and '\U' forms.
1
1 As an extension, GCC treats '$' as a letter. This is for
1 compatibility with some systems, such as VMS, where '$' is commonly used
1 in system-defined function and object names. '$' is not a letter in
1 strictly conforming mode, or if you specify the '-$' option.
1 ⇒Invocation.
1
1 A "preprocessing number" has a rather bizarre definition. The
1 category includes all the normal integer and floating point constants
1 one expects of C, but also a number of other things one might not
1 initially recognize as a number. Formally, preprocessing numbers begin
1 with an optional period, a required decimal digit, and then continue
1 with any sequence of letters, digits, underscores, periods, and
1 exponents. Exponents are the two-character sequences 'e+', 'e-', 'E+',
1 'E-', 'p+', 'p-', 'P+', and 'P-'. (The exponents that begin with 'p' or
1 'P' are used for hexadecimal floating-point constants.)
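The formal definition above is short enough to sketch as a scanner. 'pp_number_length' is a hypothetical name used only for this illustration; it returns how many characters at the start of a string form a preprocessing number, or zero if none do.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length of the preprocessing number at the
   start of s, or 0 if s does not begin with one. */
size_t pp_number_length(const char *s)
{
    size_t i = 0;

    if (s[i] == '.')                      /* optional period */
        i++;
    if (!isdigit((unsigned char)s[i]))    /* required decimal digit */
        return 0;
    i++;

    for (;;) {
        char c = s[i];

        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P')
            && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;                       /* two-character exponent */
        else if (isalnum((unsigned char)c) || c == '_' || c == '.')
            i++;                          /* letter, digit, '_' or '.' */
        else
            break;
    }
    return i;
}
```

Note that a sign is absorbed only when it directly follows 'e', 'E', 'p', or 'P'. The scanner therefore stops after '12' in '12+3', but consumes all six characters of '0xE+12'.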
1
1 The purpose of this unusual definition is to isolate the preprocessor
1 from the full complexity of numeric constants. It does not have to
1 distinguish between lexically valid and invalid floating-point numbers,
1 which is complicated. The definition also permits you to split an
1 identifier at any position and get exactly two tokens, which can then be
1 pasted back together with the '##' operator.
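As a small illustration of the last point, the identifier 'count2x' can be split into the identifier 'count' and the preprocessing number '2x', and the two halves pasted back together. 'PASTE' and the other names here are hypothetical.

```c
/* PASTE is a hypothetical helper macro, not part of any header. */
#define PASTE(left, right) left##right

/* The identifier count2x, declared directly. */
static int count2x = 7;

int read_pasted(void)
{
    /* "count" is an identifier and "2x" is a preprocessing number.
       Pasting them yields the single identifier count2x. */
    return PASTE(count, 2x);
}
```

Here '2x' is not a valid numeric constant on its own, but that does not matter: it only needs to be a single preprocessing token at the time it is pasted.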
1
1 It's possible for preprocessing numbers to cause programs to be
1 misinterpreted. For example, '0xE+12' is a preprocessing number which
1 does not translate to any valid numeric constant, and is therefore a
1 syntax error. It does not mean '0xE + 12', which is what you might have
1 intended.
1
1 "String literals" are string constants, character constants, and
1 header file names (the argument of '#include').(1) String constants and
1 character constants are straightforward: "..." or '...'. In either case
1 embedded quotes should be escaped with a backslash: '\'' is the
1 character constant for '''. There is no limit on the length of a
1 character constant, but the value of a character constant that contains
1 more than one character is implementation-defined.
1 ⇒Implementation Details.
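A brief sketch of the escaping rule follows. The function names are hypothetical, and the numeric value mentioned afterwards assumes an ASCII execution character set.

```c
#include <stddef.h>
#include <string.h>

/* The character constant for a single quote needs a backslash. */
char single_quote(void)
{
    return '\'';
}

/* Inside a string constant, embedded double quotes are escaped the
   same way.  Each \" contributes one character to the string. */
size_t quoted_length(void)
{
    return strlen("say \"hi\"");
}
```

On an ASCII system 'single_quote' returns the value 39, and 'quoted_length' counts eight characters: the backslashes are part of the spelling of the token, not of the resulting string.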
1
1 Header file names either look like string constants, "...", or are
1 written with angle brackets instead, <...>. In either case, backslash
1 is an ordinary character. There is no way to escape the closing quote
1 or angle bracket. The preprocessor looks for the header file in
1 different places depending on which form you use.
1 ⇒Include Operation.
1
1 No string literal may extend past the end of a line. You may use
1 continued lines instead, or string constant concatenation.
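The two alternatives look like this in practice. The function names are hypothetical; both functions are meant to return the same text.

```c
/* Adjacent string constants are concatenated by the compiler. */
const char *by_concatenation(void)
{
    return "first part, "
           "second part";
}

/* A backslash-newline continues the line inside the string constant.
   The continuation must start in the first column, because any
   indentation there would become part of the string. */
const char *by_continuation(void)
{
    return "first part, \
second part";
}
```

Concatenation is usually easier to read, since it lets you keep the continuation lines indented.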
1
1 "Punctuators" are all the usual bits of punctuation which are
1 meaningful to C and C++. All but three of the punctuation characters in
1 ASCII are C punctuators. The exceptions are '@', '$', and '`'. In
1 addition, all the two- and three-character operators are punctuators.
1 There are also six "digraphs", which the C++ standard calls "alternative
1 tokens". They are merely alternate ways to spell other punctuators.
1 Digraphs are a second attempt to work around missing punctuation in
1 obsolete systems. They have no negative side effects, unlike trigraphs,
1 but do not cover as much ground. The digraphs and their corresponding
1 normal punctuators are:
1
1      Digraph:     <%  %>  <:  :>  %:  %:%:
1      Punctuator:  {   }   [   ]   #   ##
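For illustration, here is a small function written with digraphs in place of braces, brackets, and the '#' of a directive. It assumes a compiler mode that recognizes digraphs; the names 'COUNT' and 'sum_first_three' are hypothetical.

```c
%:define COUNT 3

int sum_first_three(void)
<%
    int values<:COUNT:> = <% 1, 2, 3 %>;

    return values<:0:> + values<:1:> + values<:2:>;
%>
```

Once tokenized, this is indistinguishable from the same function spelled with '#', '{', '}', '[', and ']'.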
1
1 Any other single character is considered "other". It is passed on to
1 the preprocessor's output unmolested. The C compiler will almost
1 certainly reject source code containing "other" tokens. In ASCII, the
1 only other characters are '@', '$', '`', and control characters other
1 than NUL (all bits zero). (Note that '$' is normally considered a
1 letter.) All characters with the high bit set (numeric range 0x80-0xFF)
1 are also "other" in the present implementation. This will change when
1 proper support for international character sets is added to GCC.
1
1 NUL is a special case because of the high probability that its
1 appearance is accidental, and because it may be invisible to the user
1 (many terminals do not display NUL at all). Within comments, NULs are
1 silently ignored, just as any other character would be. In running
1 text, NUL is considered white space. For example, these two directives
1 have the same meaning.
1
1      #define X^@1
1      #define X 1
1
1 (where '^@' is ASCII NUL). Within string or character constants, NULs
1 are preserved. In the latter two cases (running text and string or
1 character constants) the preprocessor emits a warning message.
1
1 ---------- Footnotes ----------
1
1 (1) The C standard uses the term "string literal" to refer only to
1 what we are calling "string constants".
1