m4: Input processing

3.5 How 'm4' copies input to output
===================================

As 'm4' reads the input token by token, it copies each token directly to
the output.

   The exception is when it finds a word with a macro definition.  In
that case 'm4' will calculate the macro's expansion, possibly reading
more input to get the arguments.  It then inserts the expansion in front
of the remaining input.  In other words, the resulting text from a macro
call will be read and parsed into tokens again.

   'm4' expands a macro as soon as possible.  If it finds a macro call
when collecting the arguments to another, it will expand the second call
first.  This process continues until there are no more macro calls to
expand and all the input has been consumed.

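   As a minimal sketch of these two rules, consider the following
input.  The macros 'inner' and 'outer' are hypothetical names chosen for
illustration; they are not part of the examples in this section.

```m4
dnl inner and outer are illustrative names only.
define(`inner', `expanded')dnl
define(`outer', `inner')dnl
dnl The expansion of outer is the word inner, which is rescanned.
outer
dnl Unquoted, inner is expanded while the argument of len is collected.
len(inner)
dnl Quoted, the argument of len is the five-character word itself.
len(`inner')
```

   With GNU 'm4', this should produce:

     =>expanded
     =>8
     =>5

The first line shows the expansion of 'outer' being rescanned, and the
second shows a nested call being expanded before the enclosing 'len'
receives its argument.
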
   For a running example, examine how 'm4' handles this input:

     format(`Result is %d', eval(`2**15'))

First, 'm4' sees that the token 'format' is a macro name, so it collects
the tokens '(', '`Result is %d'', ',', and ' ', before encountering
another potential macro.  Sure enough, 'eval' is a macro name, so the
nested argument collection picks up '(', '`2**15'', and ')', invoking
the eval macro with the lone argument of '2**15'.  The expansion of
'eval(2**15)' is '32768', which is then rescanned as the five tokens
'3', '2', '7', '6', and '8'; and combined with the next ')', the format
macro now has all its arguments, as if the user had typed:

     format(`Result is %d', 32768)

The format macro expands to 'Result is 32768', and we have another round
of scanning for the tokens 'Result', ' ', 'is', ' ', '3', '2', '7', '6',
and '8'.  None of these are macros, so the final output is

     =>Result is 32768

   As a more complicated example, we will contrast an actual code
example from the Gnulib project(1), showing both a buggy approach and
the desired results.  The goal is to output a shell assignment statement
whose variable name is formed from the macro's argument by converting it
to uppercase and prepending a prefix.  The original attempt looks like
this:

     changequote([,])dnl
     define([gl_STRING_MODULE_INDICATOR],
       [
         dnl comment
         GNULIB_]translit([$1],[a-z],[A-Z])[=1
       ])dnl
       gl_STRING_MODULE_INDICATOR([strcase])
     =>  
     =>        GNULIB_strcase=1
     =>  

   Oops: the argument did not get capitalized.  And although the manual
is not able to easily show it, both lines that appear empty actually
contain two trailing spaces.  By stepping through the parse, it is easy
to see what happened.  First, 'm4' sees the token 'changequote', which
it recognizes as a macro, followed by '(', '[', ',', ']', and ')' to
form the argument list.  The macro expands to the empty string, but
changes the quoting characters to something more useful for generating
shell code (unbalanced '`' and ''' appear all the time in shell scripts,
but unbalanced '[]' tend to be rare).  Also in the first line, 'm4' sees
the token 'dnl', which it recognizes as a builtin macro that consumes
the rest of the line, resulting in no output for that line.

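   The effect of these two builtins can be seen in isolation with a
short sketch.  The sample text is made up for illustration and is not
taken from the Gnulib patch.

```m4
dnl Text after dnl, up to and including the newline, is discarded.
one dnl this remark never reaches the output
two
changequote([,])dnl
dnl From here on, [ and ] quote; the apostrophe is an ordinary character.
[echo "it's done"]
```

   With GNU 'm4', this should produce:

     =>one two
     =>echo "it's done"

The unbalanced apostrophe in the shell text causes no trouble, because
it is no longer a quote character once 'changequote' has been expanded.
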
   The second line starts a macro definition.  'm4' sees the token
'define', which it recognizes as a macro, followed by a '(',
'[gl_STRING_MODULE_INDICATOR]', and ','.  Because an unquoted comma was
encountered, the first argument is known to be the expansion of the
single-quoted string token, or 'gl_STRING_MODULE_INDICATOR'.  Next, 'm4'
sees '<NL>', ' ', and ' ', but this whitespace is discarded as part of
argument collection.  Then comes a rather lengthy single-quoted string
token, '[<NL>    dnl comment<NL>    GNULIB_]'.  This is followed by the
token 'translit', which 'm4' recognizes as a macro name, so a nested
macro expansion has started.

   The arguments to 'translit' are found by the tokens '(', '[$1]',
',', '[a-z]', ',', '[A-Z]', and finally ')'.  All three string arguments
are expanded (or in other words, the quotes are stripped), and since
neither '$' nor '1' needs capitalization, the result of the macro is
'$1'.  This expansion is rescanned, resulting in the two literal
characters '$' and '1'.

   Scanning of the outer macro resumes, and picks up with '[=1<NL>  ]',
and finally ')'.  The collected pieces of expanded text are
concatenated, with the end result that the macro
'gl_STRING_MODULE_INDICATOR' is now defined to be the sequence
'<NL>    dnl comment<NL>    GNULIB_$1=1<NL>  '.  Once again, 'dnl' is
recognized and avoids a newline in the output.

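   The difference between expanding 'translit' at definition time and
at invocation time can be sketched with two small macros.  The names
'early' and 'late' are hypothetical; 'defn' is used only to display what
was actually stored as each definition.

```m4
changequote([,])dnl
dnl early and late are illustrative names, not from the Gnulib patch.
dnl Unquoted: translit runs now, on the literal text $1.
define([early], translit([$1], [a-z], [A-Z]))dnl
dnl Quoted: the call is stored and runs when late is invoked.
define([late], [translit([$1], [a-z], [A-Z])])dnl
defn([early])
defn([late])
early([strcase])
late([strcase])
```

   With GNU 'm4', this should produce:

     =>$1
     =>translit([$1], [a-z], [A-Z])
     =>strcase
     =>STRCASE

The stored definition of 'early' is just '$1', which is why the buggy
macro could never capitalize its argument.
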
   The final line is then parsed, beginning with ' ' and ' ' that are
output literally.  Then 'gl_STRING_MODULE_INDICATOR' is recognized as a
macro name, with an argument list of '(', '[strcase]', and ')'.  Since
the definition of the macro contains the sequence '$1', that sequence is
replaced with the argument 'strcase' prior to starting the rescan.  The
rescan sees '<NL>' and four spaces, which are output literally, then
'dnl', which discards the text ' comment<NL>'.  Next come four more
spaces, also output literally, and the token 'GNULIB_strcase', which
resulted from the earlier parameter substitution.  Since that is not a
macro name, it is output literally, followed by the literal tokens '=',
'1', '<NL>', and two more spaces.  Finally, the original '<NL>' seen
after the macro invocation is scanned and output literally.

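   The way whitespace inside a definition reaches the output can be
sketched with the default quote characters.  The macros 'noisy' and
'quiet' are hypothetical, and the surrounding '[' and ']' are plain
text that only marks where each expansion begins and ends.

```m4
dnl noisy keeps the newlines and indentation stored in its definition.
define(`noisy', `
  value
')dnl
dnl quiet uses dnl inside the definition to drop both newlines on rescan.
define(`quiet', `dnl
value`'dnl
')dnl
[noisy]
[quiet]
```

   With GNU 'm4', this should produce:

     =>[
     =>  value
     =>]
     =>[value]

The empty quoted string after 'value' separates it from the following
'dnl', so that the two are not read as the single word 'valuednl'.
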
   Now for a corrected approach.  This rearranges the use of newlines
and whitespace so that less whitespace is output (which, although
harmless to shell scripts, can be visually unappealing), and fixes the
quoting issues so that the capitalization occurs when the macro
'gl_STRING_MODULE_INDICATOR' is invoked, rather than when it is defined.
It also adds another layer of quoting to the first argument of
'translit', to ensure that the output will be rescanned as a string
rather than a potential uppercase macro name needing further expansion.

     changequote([,])dnl
     define([gl_STRING_MODULE_INDICATOR],
       [dnl comment
       GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnl
     ])dnl
       gl_STRING_MODULE_INDICATOR([strcase])
     =>    GNULIB_STRCASE=1

   The parsing of the first line is unchanged.  The second line sees the
name of the macro to define, then sees the discarded '<NL>' and two
spaces, as before.  But this time, the next token is '[dnl
comment<NL>  GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnl<NL>]', which
includes nested quotes, followed by ')' to end the macro definition and
'dnl' to skip the newline.  No early expansion of 'translit' occurs, so
the entire string becomes the definition of the macro.

   The final line is then parsed, beginning with two spaces that are
output literally, and an invocation of 'gl_STRING_MODULE_INDICATOR' with
the argument 'strcase'.  Again, the '$1' in the macro definition is
substituted prior to rescanning.  Rescanning first encounters 'dnl', and
discards ' comment<NL>'.  Then two spaces are output literally.  Next
comes the token 'GNULIB_', but that is not a macro, so it is output
literally.  The token '[]' is an empty string, so it does not affect
output.  Then the token 'translit' is encountered.

   This time, the arguments to 'translit' are parsed as '(',
'[[strcase]]', ',', ' ', '[a-z]', ',', ' ', '[A-Z]', and ')'.  The two
spaces are discarded, and 'translit' produces the desired result
'[STRCASE]'.  This is rescanned, but since it is a quoted string, the
quotes are stripped and the only output is a literal 'STRCASE'.  Then
the scanner sees '=' and '1', which are output literally, followed by
'dnl', which discards the rest of the definition of
'gl_STRING_MODULE_INDICATOR'.  The newline at the end of output is the
literal '<NL>' that appeared after the invocation of the macro.

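   The hazard that the extra layer of quoting guards against can be
sketched by defining a macro whose name matches the capitalized result.
The definition of 'STRCASE' below is hypothetical and exists only to
make the difference visible.

```m4
changequote([,])dnl
dnl STRCASE is defined here only to show the hazard.
define([STRCASE], [oops])dnl
dnl One layer of quotes: the result STRCASE is rescanned as a macro name.
translit([strcase], [a-z], [A-Z])
dnl Two layers: the result [STRCASE] is rescanned as a quoted string.
translit([[strcase]], [a-z], [A-Z])
```

   With GNU 'm4', this should produce:

     =>oops
     =>STRCASE

In the second call, the inner quotes pass through 'translit' unchanged
and protect the result during the rescan.
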
   The order in which 'm4' expands the macros can be further explored
using the trace facilities of GNU 'm4' (⇒Trace).

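   As a sketch of how tracing might be applied to the running example,
both macros can be registered with the 'traceon' builtin.  Trace
messages are written to standard error, and the amount of detail
depends on the debug flags in effect (for instance, as set with the '-d'
option), so no trace output is shown here.

```m4
dnl Trace both macros used in the running example.
traceon(`format', `eval')dnl
format(`Result is %d', eval(`2**15'))
```

   Standard output is unaffected by tracing:

     =>Result is 32768

Because 'eval' is expanded while the arguments of 'format' are still
being collected, its trace message should be reported before the one
for 'format'.
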
   ---------- Footnotes ----------

   (1) Derived from a patch in
<http://lists.gnu.org/archive/html/bug-gnulib/2007-01/msg00389.html>,
and a followup patch in
<http://lists.gnu.org/archive/html/bug-gnulib/2007-02/msg00000.html>