cppinternals: Token Spacing
1
1 Token Spacing
1 *************
1
1 First, consider an issue that only concerns the stand-alone
1 preprocessor: there needs to be a guarantee that re-reading its
1 preprocessed output results in an identical token stream. Without
1 taking special measures, this might not be the case because of macro
1 substitution. For example:
1
1 #define PLUS +
1 #define EMPTY
1 #define f(x) =x=
1 +PLUS -EMPTY- PLUS+ f(=)
1 ==> + + - - + + = = =
1 _not_
1 ==> ++ -- ++ ===
1
1 One solution would be to simply insert a space between all adjacent
1 tokens. However, we would like to keep space insertion to a minimum,
1 both for aesthetic reasons and because it causes problems for people who
1 still try to abuse the preprocessor for things like Fortran source and
1 Makefiles.
1
1 For now, just notice that when tokens are added to (or removed from, as
1 shown by the 'EMPTY' example) the original lexed token stream, we need
1 to check for accidental token pasting. We call this "paste avoidance".
1 Token addition and removal can only occur because of macro expansion,
1 but accidental pasting can occur in many places: both before and after
1 each macro replacement, each argument replacement, and additionally each
1 token created by the '#' and '##' operators.
1
1 First, look at how the preprocessor normally gets whitespace right in
1 its output. The 'cpp_token' structure contains a flags byte, and one of
1 those flags is 'PREV_WHITE'. The lexer sets this flag to indicate that
1 the token was preceded by whitespace of some form other than a newline.
1 The stand-alone preprocessor can use this flag to decide whether to
1 insert a space between tokens in the output.
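The use of this flag can be sketched as follows. This is a minimal
illustration, not cpplib code: 'struct tok', 'print_tokens' and the
numeric value given to 'PREV_WHITE' are assumptions made for the
example; only the flag's name and meaning come from the text above.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed value for this sketch; only the name comes from cpplib.  */
#define PREV_WHITE 0x01

/* Hypothetical, stripped-down stand-in for cpp_token.  */
struct tok {
    const char *spelling;
    unsigned char flags;
};

/* Write the tokens into out, emitting one space before every token
   that has PREV_WHITE set.  Output that does not fit is truncated.
   Returns out.  */
static char *print_tokens(const struct tok *toks, size_t n,
                          char *out, size_t outsz)
{
    size_t len = 0;

    if (outsz == 0)
        return out;
    out[0] = '\0';
    for (size_t i = 0; i < n && len < outsz - 1; i++) {
        int w = snprintf(out + len, outsz - len, "%s%s",
                         (toks[i].flags & PREV_WHITE) ? " " : "",
                         toks[i].spelling);
        if (w < 0)
            break;
        len += (size_t)w;
    }
    return out;
}
```

Note that the flag records only whether whitespace was present, not how
much, so any run of blanks in the input comes out as a single space.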
1
1 Now consider the result of the following macro expansion:
1
1 #define add(x, y, z) x + y +z;
1 sum = add (1,2, 3);
1 ==> sum = 1 + 2 +3;
1
1 The interesting thing here is that the tokens '1' and '2' are output
1 with a preceding space, and '3' is output without a preceding space, but
1 when lexed none of these tokens had that property: '1' and '2' were
1 lexed without a preceding space, and '3' with one. Careful
1 consideration reveals that '1' gets its preceding whitespace from the
1 space preceding 'add' in the macro invocation, _not_ from the
1 replacement list.
1 '2' gets its whitespace from the space preceding the parameter 'y' in
1 the macro replacement list, and '3' has no preceding space because
1 parameter 'z' has none in the replacement list.
1
1 Once lexed, tokens are effectively fixed and cannot be altered, since
1 pointers to them might be held in many places, in particular by
1 in-progress macro expansions. So instead of modifying the tokens above,
1 the preprocessor inserts a special token, which I call a "padding token",
1 into the token stream to indicate that the spacing of the subsequent
1 token is special. The preprocessor inserts padding tokens in front of
1 every macro expansion and expanded macro argument. These point to a
1 "source token" from which the subsequent real token should inherit its
1 spacing. In the above example, the source tokens are 'add' in the macro
1 invocation, and 'y' and 'z' in the macro replacement list, respectively.
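The idea of a padding token that points at a source token can be
sketched like this. The types and names ('struct tok', 'TOK_PADDING',
'print_with_padding') are hypothetical and much simpler than cpplib's.
The sketch assumes at most one padding token in front of each real
token, and it leaves out paste avoidance entirely.

```c
#include <stdio.h>
#include <stddef.h>

#define PREV_WHITE 0x01   /* assumed value for this sketch */

enum tok_kind { TOK_REAL, TOK_PADDING };

struct tok {
    enum tok_kind kind;
    const char *spelling;      /* unused for padding tokens */
    unsigned char flags;       /* PREV_WHITE as set by the lexer */
    const struct tok *source;  /* padding only: token to take spacing from */
};

/* Print a token stream.  A real token that follows a padding token
   takes its spacing from the padding token's source; any other real
   token uses its own PREV_WHITE flag.  Returns out.  */
static char *print_with_padding(const struct tok *toks, size_t n,
                                char *out, size_t outsz)
{
    const struct tok *pending = NULL;  /* source of the padding just seen */
    size_t len = 0;

    if (outsz == 0)
        return out;
    out[0] = '\0';
    for (size_t i = 0; i < n && len < outsz - 1; i++) {
        const struct tok *t = &toks[i];

        if (t->kind == TOK_PADDING) {
            pending = t->source;
            continue;
        }

        /* The token itself is never modified; only the decision about
           the space in front of it changes.  */
        const struct tok *from = pending ? pending : t;
        int w = snprintf(out + len, outsz - len, "%s%s",
                         (from->flags & PREV_WHITE) ? " " : "",
                         t->spelling);
        if (w < 0)
            break;
        len += (size_t)w;
        pending = NULL;
    }
    return out;
}
```

Feeding it the expansion of 'add (1,2, 3)', with padding tokens whose
sources are 'add', 'y' and 'z' placed before '1', '2' and '3', gives the
spacing shown in the example above.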
1
1 It is quite easy to get multiple padding tokens in a row, for example
1 if a macro's first replacement token expands straight into another
1 macro.
1
1 #define foo bar
1 #define bar baz
1 [foo]
1 ==> [baz]
1
1 Here, two padding tokens are generated, whose sources are the 'foo'
1 token between the brackets and the 'bar' token from foo's replacement
1 list, respectively. Clearly the first padding token is the one to use,
1 so the output code should contain a rule that the first padding token in
1 a sequence is the one that matters.
1
1 But what happens when we leave a macro expansion? Adjusting the above
1 example slightly:
1
1 #define foo bar
1 #define bar EMPTY baz
1 #define EMPTY
1 [foo] EMPTY;
1 ==> [ baz] ;
1
1 As shown, now there should be a space before 'baz' and the semicolon
1 in the output.
1
1 The rule we settled on above fails for 'baz': we generate three padding
1 tokens, one per macro invocation, before the token 'baz'. We would then
1 have it take its spacing from the first of these, which carries source
1 token 'foo' with no leading space, so 'baz' would wrongly be output
1 without one.
1
1 It is vital that cpplib get spacing correct in these examples since
1 any of these macro expansions could be stringized, where spacing
1 matters.
1
1 This demonstrates that not only entering macro and argument expansions
1 but also leaving them requires special handling. I made
1 cpplib insert a padding token with a 'NULL' source token when leaving
1 macro expansions, as well as after each replaced argument in a macro's
1 replacement list. It also inserts appropriate padding tokens on either
1 side of tokens created by the '#' and '##' operators. I expanded the
1 rule so that, if we see a padding token with a 'NULL' source token,
1 _and_ the source token recorded from the earlier padding tokens in the
1 sequence has no leading space, then we behave as if we have seen no
1 padding tokens at all: the next real token falls back on its own
1 'PREV_WHITE' flag. A quick check shows this rule will then get the
1 above example correct as well.
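The quick check can be made mechanical with a small sketch. It reads
the rule as follows, which is an interpretation rather than a quote of
cpplib: the first padding token in a run sets the source, and a later
padding token with a 'NULL' source clears it again if that source has
no leading space. All names here ('struct tok', 'print_stream') are
hypothetical, the token streams are written by hand to mirror the
description above, and paste avoidance is left out.

```c
#include <stdio.h>
#include <stddef.h>

#define PREV_WHITE 0x01   /* assumed value for this sketch */

enum tok_kind { TOK_REAL, TOK_PADDING };

struct tok {
    enum tok_kind kind;
    const char *spelling;      /* unused for padding tokens */
    unsigned char flags;       /* PREV_WHITE as set by the lexer */
    const struct tok *source;  /* padding only; NULL when leaving an expansion */
};

/* Print a token stream, choosing the space before each real token
   from the run of padding tokens in front of it.  Returns out.  */
static char *print_stream(const struct tok *toks, size_t n,
                          char *out, size_t outsz)
{
    const struct tok *source = NULL;  /* source currently in effect */
    size_t len = 0;

    if (outsz == 0)
        return out;
    out[0] = '\0';
    for (size_t i = 0; i < n && len < outsz - 1; i++) {
        const struct tok *t = &toks[i];

        if (t->kind == TOK_PADDING) {
            /* The first padding token in a run wins, except that a
               NULL-source padding token cancels a source that has no
               leading space.  */
            if (source == NULL
                || (!(source->flags & PREV_WHITE) && t->source == NULL))
                source = t->source;
            continue;
        }

        /* With no source in effect the token uses its own flag.  */
        const struct tok *from = source ? source : t;
        int w = snprintf(out + len, outsz - len, "%s%s",
                         (from->flags & PREV_WHITE) ? " " : "",
                         t->spelling);
        if (w < 0)
            break;
        len += (size_t)w;
        source = NULL;
    }
    return out;
}
```

For '[foo] EMPTY;' the run in front of 'baz' has sources 'foo', 'bar',
'EMPTY' and then 'NULL'. The 'NULL' entry cancels 'foo', so 'baz' uses
its own leading space. The run in front of the semicolon starts with the
'EMPTY' token that has a leading space, which the following 'NULL' entry
does not cancel.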
1
1 Now a relationship with paste avoidance is apparent: we have to be
1 careful about paste avoidance in exactly the same locations where we
1 have padding tokens in order to get whitespace correct. This makes
1 implementation of paste avoidance easy: wherever the stand-alone
1 preprocessor is fixing up spacing because of padding tokens, and it
1 turns out that no space is needed for spacing purposes, it has to take
1 the extra step of checking whether a space is needed after all to avoid
1 an accidental paste.
1 The function 'cpp_avoid_paste' advises whether a space is required
1 between two consecutive tokens. To avoid excessive spacing, it tries
1 hard to only require a space if one is likely to be necessary, but for
1 reasons of efficiency it is slightly conservative and might recommend a
1 space where one is not strictly needed.
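To make the idea concrete, here is a rough sketch of such a check
working on token spellings. It is not the algorithm used by
'cpp_avoid_paste', which the text above does not spell out;
'might_paste' and its table of character pairs are hypothetical and
deliberately incomplete (digraphs, for instance, are ignored).

```c
#include <ctype.h>
#include <string.h>

static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Return nonzero if writing spelling a immediately followed by
   spelling b could be re-lexed as something other than the two
   original tokens.  Errs on the side of asking for a space.  */
static int might_paste(const char *a, const char *b)
{
    /* Character pairs that form, or begin, a punctuator or a comment.  */
    static const char *const pairs[] = {
        "++", "--", "->", "<<", ">>", "<=", ">=", "==", "!=", "&&", "||",
        "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "##", "::", "..",
        "//", "/*"
    };
    size_t alen = strlen(a);
    char last, first;
    int a_is_number;

    if (alen == 0 || b[0] == '\0')
        return 0;
    last = a[alen - 1];
    first = b[0];
    a_is_number = isdigit((unsigned char)a[0])
                  || (a[0] == '.' && isdigit((unsigned char)a[1]));

    /* Identifiers and numbers run together: "foo" "bar", "x" "1".  */
    if (is_word_char(last) && is_word_char(first))
        return 1;
    /* An identifier can turn into a literal prefix: L "s" -> L"s".  */
    if (!a_is_number && is_word_char(last)
        && (first == '"' || first == '\''))
        return 1;
    /* A preprocessing number absorbs '.', and a sign after an exponent.  */
    if (a_is_number
        && (first == '.'
            || ((first == '+' || first == '-')
                && strchr("eEpP", last) != NULL)))
        return 1;
    /* "." followed by a digit becomes a number.  */
    if (last == '.' && isdigit((unsigned char)first))
        return 1;

    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        if (last == pairs[i][0] && first == pairs[i][1])
            return 1;
    return 0;
}
```

Applied to the first example in this section, the check asks for a
space between '+' and '+', '-' and '-', and '=' and '=', but not between
'+' and '-'.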
1