The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.c'.  It is a hand-coded lexer,
and not implemented as a state machine.  It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'.  These functions are useful when
generating diagnostics, and for emitting the preprocessed output.
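
   As a rough illustration, a client might drive the lexer with a loop
along the following lines.  This is a sketch only: the exact signature
of 'cpp_spell_token' has varied between cpplib versions, and the
fixed-size buffer is merely for brevity.

     #include "cpplib.h"   /* cpplib's public interface */

     /* Sketch: spell every token in the translation unit.  PFILE is
        assumed to be an initialized cpp_reader.  */
     static void
     spell_all_tokens (cpp_reader *pfile)
     {
       const cpp_token *token;
       unsigned char buf[256], *end;

       while ((token = cpp_get_token (pfile))->type != CPP_EOF)
         if (cpp_token_len (token) < sizeof buf)
           {
             end = cpp_spell_token (pfile, token, buf, false);
             *end = '\0';   /* buf now holds the token's spelling */
           }
     }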

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines.  In its current form the code is quite complicated, with
read-ahead characters and such-like, since it strives not to step back
in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor role to play in memory management
of lexed lines.  I discuss these issues in a separate section (⇒Lexing
a line).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right.  If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags.  Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token.  This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.
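
   For example, a client emitting preprocessed output might use the
flag along these lines (a sketch only, using the names described
above):

     /* Sketch: reproduce horizontal whitespace before a token.  */
     if (token->flags & PREV_WHITE)
       putc (' ', stdout);
     fputs ((const char *) cpp_token_as_text (pfile, token), stdout);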

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line.  This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it.  Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set.  The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.
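
   For instance, given

     #define EMPTY
     EMPTY int x;

the first token a client obtains from the second line is 'int', which
follows the empty expansion of 'EMPTY' and so does not have the 'BOL'
flag set; only the 'EMPTY' token itself carried the flag.  This is why
clients must rely on the call-back rather than the flag.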

   New lines are treated specially; exactly how the lexer handles them
is context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive.  In
a directive a 'CPP_EOF' token never means end-of-file.  Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.
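
   For example, in

     #define f(x) x
     #if f(1
     )
     #endif

the newline after '1' terminates the '#if' directive while
'collect_args' is still collecting arguments, so 'collect_args' sees a
'CPP_EOF' token and reports an unterminated argument list for 'f'.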

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2.  It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator.  This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.
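
   A minimal sketch of the idea, as hypothetical code rather than
cpplib's actual routine:

     /* Sketch: consume one logical newline, treating \r, \n, \r\n
        and \n\r each as a single new line indicator.  */
     static const char *
     skip_one_newline (const char *p)
     {
       char first = *p++;
       /* A following \r or \n of the opposite kind belongs to the
          same logical newline.  */
       if ((*p == '\r' || *p == '\n') && *p != first)
         p++;
       return p;
     }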

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline.  It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars.  Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

   Handling newlines like this is made simpler by doing it in one place
only.  The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.
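
   In outline, the division of labor looks like this.  The sketch below
is hypothetical and assumes trigraphs have already been reduced; the
real 'skip_escaped_newlines' must also recognize '??/' as a backslash,
and the warning about intervening whitespace is omitted.

     /* Sketch: skip a run of escaped newlines starting at a '\',
        returning the position of the first character after the run.
        Relies on the skip_one_newline helper sketched above.  */
     static const char *
     skip_escaped_newlines_sketch (const char *p)
     {
       while (*p == '\\')
         {
           const char *q = p + 1;
           while (*q == ' ' || *q == '\t')  /* tolerated whitespace */
             q++;
           if (*q != '\r' && *q != '\n')
             break;                         /* not an escaped newline */
           p = skip_one_newline (q);
         }
       return p;
     }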

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed
before any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminate a comment.  Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.
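
   For example, once the escaped newline is removed, the following
contains a single '+=' token:

     a +\
     = b;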

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline.  If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.

   Similarly, code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
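
   A sketch of the column arithmetic for tabs, assuming 1-based
columns:

     /* Sketch: advance a 1-based column count past one character,
        honoring a tab stop width such as that given by -ftabstop=.  */
     static unsigned int
     advance_column (unsigned int col, char c, unsigned int tabstop)
     {
       if (c == '\t')
         return ((col - 1) / tabstop + 1) * tabstop + 1;
       return col + 1;
     }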

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic.  However, if they appear in a macro
expansion we don't want to complain with each use of the macro.  It is
therefore best to catch them during the lexing stage, in
'parse_identifier'.  In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state.  For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro.  Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate.  Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.
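
   For example, with '#pragma GCC poison':

     #pragma GCC poison printf
     #pragma GCC poison printf   /* re-poisoning: no diagnostic */
     printf ("hi");              /* error: use of poisoned identifier */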

   Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, it should be lexed as a single
token as far as the nearest '>' character.  Note that we don't allow the
terminators of header names to be escaped; the first '"' or '>'
terminates the header name.
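
   For example:

     #include <sys/stat.h>   /* one token: the header name <sys/stat.h> */
     int a = b < c;          /* here '<' is an ordinary single token */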

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error.  Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'.  The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation
unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (⇒Invalid identifiers).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run.  Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.
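
   The runs might be represented along the following lines, a sketch
loosely modelled on cpplib's internal 'tokenrun' structure; the field
details here are assumptions:

     /* Sketch of a token run: a fixed block of token slots, chained
        to a further run when a long line overflows the current one.  */
     typedef struct tokenrun tokenrun;
     struct tokenrun
     {
       tokenrun *next, *prev;   /* chain of runs */
       cpp_token *base;         /* first token slot in this run */
       cpp_token *limit;        /* one past the last slot */
     };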

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'.  So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting.  I handled this by allocating space for these tokens
from the lexer's token run chain.  This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all.  The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line.  So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro.  In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made.  '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.