cppinternals: Token Spacing

1 
1 Token Spacing
1 *************
1 
1 First, consider an issue that only concerns the stand-alone
1 preprocessor: there needs to be a guarantee that re-reading its
1 preprocessed output results in an identical token stream.  Without
1 taking special measures, this might not be the case because of macro
1 substitution.  For example:
1 
1      #define PLUS +
1      #define EMPTY
1      #define f(x) =x=
1      +PLUS -EMPTY- PLUS+ f(=)
1              ==> + + - - + + = = =
1      _not_
1              ==> ++ -- ++ ===
1 
1    One solution would be to simply insert a space between all adjacent
1 tokens.  However, we would like to keep space insertion to a minimum,
1 both for aesthetic reasons and because it causes problems for people who
1 still try to abuse the preprocessor for things like Fortran source and
1 Makefiles.
1 
1    For now, just notice that when tokens are added to (or removed from,
1 as shown by the 'EMPTY' example) the original lexed token stream, we
1 need to check for accidental token pasting.  We call this "paste
1 avoidance".
1 Token addition and removal can only occur because of macro expansion,
1 but accidental pasting can occur in many places: both before and after
1 each macro replacement, each argument replacement, and additionally each
1 token created by the '#' and '##' operators.
1 
1    Consider how the preprocessor normally gets output whitespace right.
1 The 'cpp_token' structure contains a flags byte, and one of those flags
1 is 'PREV_WHITE'.  This is flagged by the lexer, and indicates that the
1 token was preceded by whitespace of some form other than a new line.
1 The stand-alone preprocessor can use this flag to decide whether to
1 insert a space between tokens in the output.
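
As a minimal sketch of that decision, assuming a much simplified token type: 'struct sk_token', 'SK_PREV_WHITE' and 'sk_print_tokens' below are hypothetical names invented for illustration, not cpplib's real definitions.  The printer emits one space before each token whose flag is set and nothing otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cpplib's PREV_WHITE flag bit. */
#define SK_PREV_WHITE 0x01u

/* Hypothetical, much simplified token: a spelling and a flags byte. */
struct sk_token {
    const char *spelling;
    unsigned char flags;
};

/* Write the tokens to OUT (capacity CAP), emitting a single space before
   every token that was preceded by whitespace when it was lexed.
   Returns the number of characters written, excluding the final NUL. */
size_t sk_print_tokens(const struct sk_token *toks, size_t n,
                       char *out, size_t cap)
{
    size_t len = 0;

    assert(out != NULL && cap > 0);
    for (size_t i = 0; i < n; i++) {
        size_t space = (toks[i].flags & SK_PREV_WHITE) ? 1 : 0;
        size_t need = strlen(toks[i].spelling);

        if (len + space + need >= cap)
            break;                      /* truncate rather than overflow */
        if (space)
            out[len++] = ' ';
        memcpy(out + len, toks[i].spelling, need);
        len += need;
    }
    out[len] = '\0';
    return len;
}
```

Feeding it the tokens of 'sum = a+b;', with the flag set only on '=' and 'a', reproduces the original spacing exactly.  This is all that is needed as long as no macro expansion is involved.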
1 
1    Now consider the result of the following macro expansion:
1 
1      #define add(x, y, z) x + y +z;
1      sum = add (1,2, 3);
1              ==> sum = 1 + 2 +3;
1 
1    The interesting thing here is that the tokens '1' and '2' are output
1 with a preceding space, and '3' is output without a preceding space, but
1 when lexed none of these tokens had that property.  Careful
1 consideration reveals that '1' gets its preceding whitespace from the
1 space preceding 'add' in the macro invocation, _not_ from the replacement
1 list.  '2' gets its whitespace from the space preceding the parameter 'y'
1 in the macro replacement list, and '3' has no preceding space because
1 parameter 'z' has none in the replacement list.
1 
1    Once lexed, tokens are effectively fixed and cannot be altered, since
1 pointers to them might be held in many places, in particular by
1 in-progress macro expansions.  So instead of modifying the tokens
1 above, the preprocessor inserts a special token, which I call a "padding
1 token", into the token stream to indicate that spacing of the subsequent
1 token is special.  The preprocessor inserts padding tokens in front of
1 every macro expansion and expanded macro argument.  These point to a
1 "source token" from which the subsequent real token should inherit its
1 spacing.  In the above example, the source tokens are 'add' in the macro
1 invocation, and 'y' and 'z' in the macro replacement list, respectively.
1 
1    It is quite easy to get multiple padding tokens in a row, for example
1 if a macro's first replacement token expands straight into another
1 macro.
1 
1      #define foo bar
1      #define bar baz
1      [foo]
1              ==> [baz]
1 
1    Here, two padding tokens are generated; their sources are the 'foo'
1 token between the brackets and the 'bar' token from foo's replacement
1 list, respectively.  Clearly the first padding token is the one to use,
1 so the output code should contain a rule that the first padding token in
1 a sequence is the one that matters.
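
The rule so far can be sketched as follows.  This is an illustration under stated assumptions, not cpplib's code: 'struct sk_token', 'SK_PREV_WHITE', 'sk_print_stream' and 'sk_demo_add' are hypothetical names, and the token stream for the 'add' example is built by hand with exactly the three padding tokens named in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PREV_WHITE 0x01u   /* hypothetical stand-in for PREV_WHITE */

enum sk_type { SK_REAL, SK_PADDING };

struct sk_token {
    enum sk_type type;
    const char *spelling;           /* real tokens only */
    unsigned char flags;            /* SK_PREV_WHITE or 0 */
    const struct sk_token *source;  /* padding tokens only */
};

/* Print a stream in which the first padding token of a run decides the
   spacing of the next real token.  Tokens themselves are never modified. */
size_t sk_print_stream(const struct sk_token *const *stream, size_t n,
                       char *out, size_t cap)
{
    const struct sk_token *source = NULL;
    size_t len = 0;

    assert(out != NULL && cap > 0);
    for (size_t i = 0; i < n; i++) {
        const struct sk_token *tok = stream[i];

        if (tok->type == SK_PADDING) {
            if (source == NULL)         /* later padding tokens are ignored */
                source = tok->source;
            continue;
        }
        /* Inherit spacing from the source token if there is one. */
        const struct sk_token *from = source != NULL ? source : tok;
        size_t space = (from->flags & SK_PREV_WHITE) ? 1 : 0;
        size_t need = strlen(tok->spelling);

        if (len + space + need >= cap)
            break;                      /* truncate rather than overflow */
        if (space)
            out[len++] = ' ';
        memcpy(out + len, tok->spelling, need);
        len += need;
        source = NULL;                  /* padding applies to one token only */
    }
    out[len] = '\0';
    return len;
}

/* Hand-built stream for:  sum = add (1,2, 3);  */
void sk_demo_add(char *out, size_t cap)
{
    static const struct sk_token sum  = { SK_REAL, "sum", 0, NULL };
    static const struct sk_token eq   = { SK_REAL, "=", SK_PREV_WHITE, NULL };
    static const struct sk_token add  = { SK_REAL, "add", SK_PREV_WHITE, NULL };
    static const struct sk_token y    = { SK_REAL, "y", SK_PREV_WHITE, NULL };
    static const struct sk_token z    = { SK_REAL, "z", 0, NULL };
    static const struct sk_token plus = { SK_REAL, "+", SK_PREV_WHITE, NULL };
    static const struct sk_token semi = { SK_REAL, ";", 0, NULL };
    static const struct sk_token one  = { SK_REAL, "1", 0, NULL };
    static const struct sk_token two  = { SK_REAL, "2", 0, NULL };
    static const struct sk_token three = { SK_REAL, "3", SK_PREV_WHITE, NULL };
    static const struct sk_token pad_add = { SK_PADDING, NULL, 0, &add };
    static const struct sk_token pad_y   = { SK_PADDING, NULL, 0, &y };
    static const struct sk_token pad_z   = { SK_PADDING, NULL, 0, &z };
    const struct sk_token *stream[] = {
        &sum, &eq, &pad_add, &one, &plus, &pad_y, &two, &plus,
        &pad_z, &three, &semi
    };

    sk_print_stream(stream, sizeof stream / sizeof stream[0], out, cap);
}
```

Calling 'sk_demo_add' with a sufficiently large buffer yields 'sum = 1 + 2 +3;'.  Note that '3' loses the space it had when lexed and '1' gains one, yet no token is ever written to.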
1 
1    But what happens when we leave a macro expansion?  Adjusting the above
1 example slightly:
1 
1      #define foo bar
1      #define bar EMPTY baz
1      #define EMPTY
1      [foo] EMPTY;
1              ==> [ baz] ;
1 
1    As shown, now there should be a space before 'baz' and the semicolon
1 in the output.
1 
1    The rules we decided above fail for 'baz': we generate three padding
1 tokens, one per macro invocation, before the token 'baz'.  We would then
1 have it take its spacing from the first of these, which carries source
1 token 'foo' with no leading space.
1 
1    It is vital that cpplib get spacing correct in these examples since
1 any of these macro expansions could be stringized, where spacing
1 matters.
1 
1    So, this demonstrates that not only entering macro and argument
1 expansions, but also leaving them, requires special handling.  I made
1 cpplib insert a padding token with a 'NULL' source token when leaving
1 macro expansions, as well as after each replaced argument in a macro's
1 replacement list.  It also inserts appropriate padding tokens on either
1 side of tokens created by the '#' and '##' operators.  I expanded the
1 rule so that, if we see a padding token with a 'NULL' source token,
1 _and_ the source token currently in effect (the one taken from the first
1 padding token of the sequence) has no leading space, then we behave as
1 if we have seen no padding tokens at all.  A quick check shows this rule
1 will then get the above example correct as well.
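
That quick check can be sketched in code.  As before, this is only an illustration: the type and function names are hypothetical, and the exact positions of the padding tokens in the hand-built stream for '[foo] EMPTY;' are my assumption, derived from the description above (one padding token on entering each expansion, one with a 'NULL' source on leaving it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PREV_WHITE 0x01u   /* hypothetical stand-in for PREV_WHITE */

enum sk_type { SK_REAL, SK_PADDING };

struct sk_token {
    enum sk_type type;
    const char *spelling;           /* real tokens only */
    unsigned char flags;            /* SK_PREV_WHITE or 0 */
    const struct sk_token *source;  /* padding tokens only; may be NULL */
};

/* Print a stream using the extended padding rule. */
size_t sk_print_stream(const struct sk_token *const *stream, size_t n,
                       char *out, size_t cap)
{
    const struct sk_token *source = NULL;
    size_t len = 0;

    assert(out != NULL && cap > 0);
    for (size_t i = 0; i < n; i++) {
        const struct sk_token *tok = stream[i];

        if (tok->type == SK_PADDING) {
            /* First padding token wins, except that a NULL-source padding
               token resets us when the remembered source has no space. */
            if (source == NULL
                || (!(source->flags & SK_PREV_WHITE) && tok->source == NULL))
                source = tok->source;
            continue;
        }
        const struct sk_token *from = source != NULL ? source : tok;
        size_t space = (from->flags & SK_PREV_WHITE) ? 1 : 0;
        size_t need = strlen(tok->spelling);

        if (len + space + need >= cap)
            break;                      /* truncate rather than overflow */
        if (space)
            out[len++] = ' ';
        memcpy(out + len, tok->spelling, need);
        len += need;
        source = NULL;
    }
    out[len] = '\0';
    return len;
}

/* Hand-built stream for:  [foo] EMPTY;  */
void sk_demo_leave(char *out, size_t cap)
{
    static const struct sk_token lb   = { SK_REAL, "[", 0, NULL };
    static const struct sk_token rb   = { SK_REAL, "]", 0, NULL };
    static const struct sk_token semi = { SK_REAL, ";", 0, NULL };
    static const struct sk_token foo  = { SK_REAL, "foo", 0, NULL };
    static const struct sk_token bar  = { SK_REAL, "bar", 0, NULL };
    static const struct sk_token empty_in_bar = { SK_REAL, "EMPTY", 0, NULL };
    static const struct sk_token baz  = { SK_REAL, "baz", SK_PREV_WHITE, NULL };
    static const struct sk_token empty_after =
        { SK_REAL, "EMPTY", SK_PREV_WHITE, NULL };
    static const struct sk_token pad_foo   = { SK_PADDING, NULL, 0, &foo };
    static const struct sk_token pad_bar   = { SK_PADDING, NULL, 0, &bar };
    static const struct sk_token pad_e1 = { SK_PADDING, NULL, 0, &empty_in_bar };
    static const struct sk_token pad_e2 = { SK_PADDING, NULL, 0, &empty_after };
    static const struct sk_token pad_null = { SK_PADDING, NULL, 0, NULL };
    const struct sk_token *stream[] = {
        &lb, &pad_foo, &pad_bar, &pad_e1, &pad_null, &baz,
        &pad_null, &pad_null, &rb, &pad_e2, &pad_null, &semi
    };

    sk_print_stream(stream, sizeof stream / sizeof stream[0], out, cap);
}
```

With this stream, 'sk_demo_leave' produces '[ baz] ;'.  The 'NULL' padding token emitted on leaving 'EMPTY' cancels the spaceless source 'foo', so 'baz' falls back on its own flag; the one after the trailing 'EMPTY' does nothing, because the remembered source there already has a leading space.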
1 
1    Now a relationship with paste avoidance is apparent: we have to be
1 careful about paste avoidance in exactly the same locations we have
1 padding tokens in order to get whitespace correct.  This makes
1 implementation of paste avoidance easy: wherever the stand-alone
1 preprocessor is fixing up spacing because of padding tokens, and it
1 turns out that no space is needed, it has to take the extra step to
1 check that a space is not needed after all to avoid an accidental paste.
1 The function 'cpp_avoid_paste' advises whether a space is required
1 between two consecutive tokens.  To avoid excessive spacing, it tries
1 hard to only require a space if one is likely to be necessary, but for
1 reasons of efficiency it is slightly conservative and might recommend a
1 space where one is not strictly needed.
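
To give a feel for the kind of check involved, here is a deliberately tiny sketch.  It is not 'cpp_avoid_paste': the function 'sk_avoid_paste' is hypothetical, works on token spellings rather than token structures, and covers only a handful of C operators.  It looks at the last character of the first token and the first character of the second, which makes it cheap but conservative.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Characters that can extend an identifier or a number. */
static int sk_is_word_char(char c)
{
    return isalnum((unsigned char) c) || c == '_';
}

/* Return nonzero if printing spelling A directly followed by spelling B
   might be re-lexed as different tokens, so that a space is advisable.
   Hypothetical and incomplete; only a few operators are handled. */
int sk_avoid_paste(const char *a, const char *b)
{
    assert(a != NULL && b != NULL && *a != '\0' && *b != '\0');

    char last = a[strlen(a) - 1];
    char first = b[0];

    /* "foo" "bar" would become "foobar"; "x" "1" would become "x1". */
    if (sk_is_word_char(last) && sk_is_word_char(first))
        return 1;

    switch (last) {
    case '+': return first == '+' || first == '=';
    case '-': return first == '-' || first == '=' || first == '>';
    /* Conservative: "==" followed by "=" re-lexes unchanged, but we
       only look at one character and so still ask for a space. */
    case '=': return first == '=';
    case '<': return first == '<' || first == '=';
    case '>': return first == '>' || first == '=';
    case '&': return first == '&' || first == '=';
    case '|': return first == '|' || first == '=';
    case '/': return first == '/' || first == '*' || first == '=';
    case '!': case '*': case '%': case '^':
        return first == '=';
    case '#': return first == '#';
    default:  return 0;
    }
}
```

Applied to the first example in this section, the check asks for a space between the two '+' tokens, the two '-' tokens and each pair of '=' tokens, but not between '[' and 'baz'.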
1