cppinternals: Macro Expansion

1 
1 Macro Expansion Algorithm
1 *************************
1 
1 Macro expansion is a tricky operation, fraught with nasty corner cases
1 and situations that render what you thought was a nifty way to optimize
1 the preprocessor's expansion algorithm wrong in quite subtle ways.
1 
1    I strongly recommend you have a good grasp of how the C and C++
1 standards require macros to be expanded before diving into this section,
1 let alone the code!.  If you don't have a clear mental picture of how
1 things like nested macro expansion, stringizing and token pasting are
1 supposed to work, damage to your sanity can quickly result.
1 
1 Internal representation of macros
1 =================================
1 
1 The preprocessor stores macro expansions in tokenized form.  This saves
1 repeated lexing passes during expansion, at the cost of a small increase
1 in memory consumption on average.  The tokens are stored contiguously in
1 memory, so a pointer to the first one and a token count is all you need
1 to get the replacement list of a macro.
1 
1    If the macro is a function-like macro the preprocessor also stores
1 its parameters, in the form of an ordered list of pointers to the hash
1 table entry of each parameter's identifier.  Further, in the macro's
1 stored expansion each occurrence of a parameter is replaced with a
1 special token of type 'CPP_MACRO_ARG'.  Each such token holds the index
1 of the parameter it represents in the parameter list, which allows rapid
1 replacement of parameters with their arguments during expansion.
1 Despite this optimization it is still necessary to store the original
1 parameters to the macro, both for dumping with e.g., '-dD', and to warn
1 about non-trivial macro redefinitions when the parameter names have
1 changed.
1 
1 Macro expansion overview
1 ========================
1 
1 The preprocessor maintains a "context stack", implemented as a linked
1 list of 'cpp_context' structures, which together represent the macro
1 expansion state at any one time.  The 'struct cpp_reader' member
1 variable 'context' points to the current top of this stack.  The top
1 normally holds the unexpanded replacement list of the innermost macro
1 under expansion, except when cpplib is about to pre-expand an argument,
1 in which case it holds that argument's unexpanded tokens.
1 
1    When there are no macros under expansion, cpplib is in "base
1 context".  All contexts other than the base context contain a contiguous
1 list of tokens delimited by a starting and ending token.  When not in
1 base context, cpplib obtains the next token from the list of the top
1 context.  If there are no tokens left in the list, it pops that context
1 off the stack, and subsequent ones if necessary, until an unexhausted
1 context is found or it returns to base context.  In base context, cpplib
1 reads tokens directly from the lexer.
1 
1    If it encounters an identifier that is both a macro and enabled for
1 expansion, cpplib prepares to push a new context for that macro on the
1 stack by calling the routine 'enter_macro_context'.  When this routine
1 returns, the new context will contain the unexpanded tokens of the
1 replacement list of that macro.  In the case of function-like macros,
1 'enter_macro_context' also replaces any parameters in the replacement
1 list, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
1 argument.  If the standard requires that the parameter be replaced with
1 its expanded argument, the argument will have been fully macro expanded
1 first.
1 
1    'enter_macro_context' also handles special macros like '__LINE__'.
1 Although these macros expand to a single token which cannot contain any
1 further macros, for reasons of token spacing (⇒Token Spacing) and
1 simplicity of implementation, cpplib handles these special macros by
1 pushing a context containing just that one token.
1 
1    The final thing that 'enter_macro_context' does before returning is
1 to mark the macro disabled for expansion (except for special macros like
1 '__TIME__').  The macro is re-enabled when its context is later popped
1 from the context stack, as described above.  This strict ordering
1 ensures that a macro is disabled whilst its expansion is being scanned,
1 but that it is _not_ disabled whilst any arguments to it are being
1 expanded.
1 
1 Scanning the replacement list for macros to expand
1 ==================================================
1 
1 The C standard states that, after any parameters have been replaced with
1 their possibly-expanded arguments, the replacement list is scanned for
1 nested macros.  Further, any identifiers in the replacement list that
1 are not expanded during this scan are never again eligible for expansion
1 in the future, if the reason they were not expanded is that the macro in
1 question was disabled.
1 
1    Clearly this latter condition can only apply to tokens resulting from
1 argument pre-expansion.  Other tokens never have an opportunity to be
1 re-tested for expansion.  It is possible for identifiers that are
1 function-like macros to not expand initially but to expand during a
1 later scan.  This occurs when the identifier is the last token of an
1 argument (and therefore originally followed by a comma or a closing
1 parenthesis in its macro's argument list), and when it replaces its
1 parameter in the macro's replacement list, the subsequent token happens
1 to be an opening parenthesis (itself possibly the first token of an
1 argument).
1 
1    It is important to note that when cpplib reads the last token of a
1 given context, that context still remains on the stack.  Only when
1 looking for the _next_ token do we pop it off the stack and drop to a
1 lower context.  This makes backing up by one token easy, but more
1 importantly ensures that the macro corresponding to the current context
1 is still disabled when we are considering the last token of its
1 replacement list for expansion (or indeed expanding it).  As an example,
1 which illustrates many of the points above, consider
1 
1      #define foo(x) bar x
1      foo(foo) (2)
1 
1 which fully expands to 'bar foo (2)'.  During pre-expansion of the
1 argument, 'foo' does not expand even though the macro is enabled, since
1 it has no following parenthesis [pre-expansion of an argument only uses
1 tokens from that argument; it cannot take tokens from whatever follows
1 the macro invocation].  This still leaves the argument token 'foo'
1 eligible for future expansion.  Then, when re-scanning after argument
1 replacement, the token 'foo' is rejected for expansion, and marked
1 ineligible for future expansion, since the macro is now disabled.  It is
1 disabled because the replacement list 'bar foo' of the macro is still on
1 the context stack.
1 
1    If instead the algorithm looked for an opening parenthesis first and
1 then tested whether the macro were disabled it would be subtly wrong.
1 In the example above, the replacement list of 'foo' would be popped in
1 the process of finding the parenthesis, re-enabling 'foo' and expanding
1 it a second time.
1 
1 Looking for a function-like macro's opening parenthesis
1 =======================================================
1 
1 Function-like macros only expand when immediately followed by a
1 parenthesis.  To do this cpplib needs to temporarily disable macros and
11 read the next token.  Unfortunately, because of spacing issues (⇒
 Token Spacing), there can be fake padding tokens in-between, and if
1 the next real token is not a parenthesis cpplib needs to be able to back
1 up that one token as well as retain the information in any intervening
1 padding tokens.
1 
1    Backing up more than one token when macros are involved is not
1 permitted by cpplib, because in general it might involve issues like
1 restoring popped contexts onto the context stack, which are too hard.
1 Instead, searching for the parenthesis is handled by a special function,
1 'funlike_invocation_p', which remembers padding information as it reads
1 tokens.  If the next real token is not an opening parenthesis, it backs
1 up that one token, and then pushes an extra context just containing the
1 padding information if necessary.
1 
1 Marking tokens ineligible for future expansion
1 ==============================================
1 
1 As discussed above, cpplib needs a way of marking tokens as
1 unexpandable.  Since the tokens cpplib handles are read-only once they
1 have been lexed, it instead makes a copy of the token and adds the flag
1 'NO_EXPAND' to the copy.
1 
1    For efficiency and to simplify memory management by avoiding having
1 to remember to free these tokens, they are allocated as temporary tokens
1 from the lexer's current token run (⇒Lexing a line) using the
1 function '_cpp_temp_token'.  The tokens are then re-used once the
1 current line of tokens has been read in.
1 
1    This might sound unsafe.  However, tokens runs are not re-used at the
1 end of a line if it happens to be in the middle of a macro argument
1 list, and cpplib only wants to back-up more than one lexer token in
1 situations where no macro expansion is involved, so the optimization is
1 safe.
1