cppinternals: Lexer

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.c'. It is a hand-coded lexer,
and not implemented as a state machine. It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality, such as
'cpp_spell_token' and 'cpp_token_len', so that clients of the library
can easily spell a given token. These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines. In its current form the code is quite complicated, with
read-ahead characters and the like, since it strives not to step back
in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of '_cpp_lex_direct' is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor role to play in memory management
of lexed lines. I discuss these issues in a separate section (⇒Lexing
a line).

The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

The lexer does not consider whitespace to be a token in its own
right. If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags. Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token. This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.
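
The conversion can be illustrated with a much-simplified sketch; the
structure and linear search here are assumptions for illustration, not
the real line map code. Each map records the translation-unit line at
which a stretch of a file begins and the source line that position
corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified line map entry.  */
struct line_map
{
  unsigned int tu_line;      /* First translation-unit line covered.  */
  const char *file;          /* Source file for this stretch.  */
  unsigned int source_line;  /* Source line corresponding to tu_line.  */
};

/* Convert a translation-unit line to a (file, line) pair by finding the
   last map that starts at or before it.  Returns the file name and
   stores the source line through SOURCE_LINE.  */
static const char *
tu_to_source (const struct line_map *maps, size_t n,
              unsigned int tu_line, unsigned int *source_line)
{
  size_t i = 0;
  while (i + 1 < n && maps[i + 1].tu_line <= tu_line)
    i++;
  *source_line = maps[i].source_line + (tu_line - maps[i].tu_line);
  return maps[i].file;
}
```

In the real code the maps are kept sorted by 'tu_line', so a binary
search rather than this linear scan would be used.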

The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line. This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it. Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set. The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.
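
A sketch of how such a flag might be consulted; the token layout and
the flag value are hypothetical, not cpplib's own.

```c
#include <assert.h>

#define BOL (1 << 1)  /* Hypothetical beginning-of-line flag bit.  */

struct tok
{
  char spelling;        /* Single-character spelling, for illustration.  */
  unsigned int flags;
};

/* Only a '#' that carries the BOL flag introduces a directive; a '#'
   later on the line, or one produced by macro expansion (which never
   has BOL set), does not.  */
static int
starts_directive (const struct tok *t)
{
  return t->spelling == '#' && (t->flags & BOL) != 0;
}
```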

New lines are treated specially; exactly how the lexer handles them
is context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive. In
a directive a 'CPP_EOF' token never means end-of-file. Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringized. The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2. It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.
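
The decision above can be sketched as a small predicate; the names and
the flag value are hypothetical, not cpplib's.

```c
#include <assert.h>

#define PREV_WHITE (1 << 0)  /* Hypothetical flag value.  */

/* Flags to add to the token following a newline met while collecting a
   function-like macro invocation.  Only a newline inside the argument
   list (parsing_args == 2) counts as whitespace; one met while still
   looking for the opening parenthesis (parsing_args == 1) must not, or
   a function-like macro name used without arguments would gain a
   spurious space after it.  */
static int
newline_flags (int parsing_args)
{
  return parsing_args == 2 ? PREV_WHITE : 0;
}
```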

The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator. This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.
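
A minimal sketch of this normalization (illustrative code, not lex.c's
own):

```c
#include <assert.h>

/* Consume one logical newline at P, treating '\r', '\n', '\r\n' and
   '\n\r' each as a single end-of-line indicator, and return the
   position just past it.  P must point at '\r' or '\n'.  */
static const char *
handle_newline (const char *p)
{
  char c = *p++;
  /* A two-character sequence counts only once, and only when the
     second character is the *other* newline character.  */
  if ((*p == '\r' || *p == '\n') && *p != c)
    p++;
  return p;
}
```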

We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline. It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars. Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only. The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed
before any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminate a comment. Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline. If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.

Similarly, code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.

The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
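
The tab-stop arithmetic can be sketched like this; the function name
and the 1-based column convention are assumptions for illustration.

```c
#include <assert.h>

/* Advance a 1-based column counter over character C, with a tab moving
   to one past the next multiple of TABSTOP, as '-ftabstop=' would
   specify.  */
static unsigned int
adjust_column (unsigned int col, char c, unsigned int tabstop)
{
  if (c == '\t')
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}
```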

Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic. However, if they appear in a macro
expansion we don't want to complain with each use of the macro. It is
therefore best to catch them during the lexing stage, in
'parse_identifier'. In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state. For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro. Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate. Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, it should be lexed as a single
token as far as the nearest '>' character. Note that we don't allow the
terminators of header names to be escaped; the first '"' or '>'
terminates the header name.
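
A sketch of the '<...>' case (illustrative code with a hypothetical
interface, not cpplib's):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* After '#include', lex '<...>' as one header-name token up to the
   nearest '>'.  The terminator cannot be escaped, so no '\' handling
   is needed.  Copies the name (without the delimiters) into BUF and
   returns its length, or -1 if the '>' is missing.  */
static int
lex_header_name (const char *p, char *buf, size_t bufsz)
{
  size_t n = 0;
  if (*p++ != '<')
    return -1;
  while (*p != '>')
    {
      if (*p == '\0' || *p == '\n' || n + 1 >= bufsz)
        return -1;               /* Unterminated header name.  */
      buf[n++] = *p++;
    }
  buf[n] = '\0';
  return (int) n;
}
```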

Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error. Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.
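
The '::' case can be sketched as a decision on a language flag,
analogous to the flags in 'cpp_options'; the function and its interface
are hypothetical.

```c
#include <assert.h>

/* Lexing at a ':' character: decide how many characters the token
   consumes.  In C++ a following ':' forms a single scope operator
   token; in C the two colons are lexed separately.  */
static int
lex_colon (const char *p, int cplusplus)
{
  if (cplusplus && p[1] == ':')
    return 2;   /* One '::' token.  */
  return 1;     /* One ':' token; any next ':' is lexed on its own.  */
}
```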

Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'. The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation
unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream. The stand-alone
preprocessor wants to be able to test the current token against the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (⇒Invalid identifiers).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run. Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.
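
The chaining scheme can be sketched as follows; the names and the tiny
run capacity are hypothetical, chosen only to demonstrate the overflow
behaviour.

```c
#include <assert.h>
#include <stdlib.h>

#define RUN_CAP 4   /* Deliberately small, to show chaining.  */

struct tok { int type; };

struct tokenrun
{
  struct tok tokens[RUN_CAP];
  size_t used;
  struct tokenrun *next;
};

static struct tokenrun *
new_run (void)
{
  return calloc (1, sizeof (struct tokenrun));
}

/* Return storage for the next token.  On overflow, chain a fresh run on
   to the end rather than reallocating, so pointers into earlier runs
   stay valid.  */
static struct tok *
next_token_slot (struct tokenrun **run)
{
  if ((*run)->used == RUN_CAP)
    {
      if (!(*run)->next)
        (*run)->next = new_run ();
      *run = (*run)->next;
    }
  return &(*run)->tokens[(*run)->used++];
}
```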

The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'. So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens: the built-in macros
like '__LINE__', and the '#' and '##' operators for stringizing and
token pasting. I handled this by allocating space for these tokens
from the lexer's token run chain. This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory
management issues, but not all. The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line. So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro. In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made. '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.