gawk: Extract Program


11.3.7 Extracting Programs from Texinfo Source Files
----------------------------------------------------

The nodes ⇒Library Functions, and ⇒Sample Programs, are
the top-level nodes for a large number of 'awk' programs.  If you want
to experiment with these programs, it is tedious to type them in by
hand.  Here we present a program that can extract parts of a Texinfo
input file into separate files.

   This Info file is written in Texinfo
(https://www.gnu.org/software/texinfo/), the GNU Project's document
formatting language.  A single Texinfo source file can be used to
produce both printed documentation, with TeX, and online documentation.
(The Texinfo language is described fully in the Texinfo manual,
'Texinfo---The GNU Documentation Format'.)

   For our purposes, it is enough to know three things about Texinfo
input files:

   * The "at" symbol ('@') is special in Texinfo, much as the backslash
     ('\') is in C or 'awk'.  Literal '@' symbols are represented in
     Texinfo source files as '@@'.

   * Comments start with either '@c' or '@comment'.  The file-extraction
     program works by using special comments that start at the beginning
     of a line.

   * Lines containing '@group' and '@end group' commands bracket example
     text that should not be split across a page boundary.
     (Unfortunately, TeX isn't always smart enough to do things exactly
     right, so we have to give it some help.)

   The following program, 'extract.awk', reads through a Texinfo source
file and does two things, based on the special comments.  Upon seeing
'@c system ...', it runs a command, by extracting the command text from
the control line and passing it on to the 'system()' function (⇒I/O
Functions).  Upon seeing '@c file FILENAME', each subsequent line is
sent to the file FILENAME, until '@c endfile' is encountered.  The rules
in 'extract.awk' match either '@c' or '@comment' by letting the 'omment'
part be optional.  Lines containing '@group' and '@end group' are simply
removed.  'extract.awk' uses the 'join()' library function (⇒Join
Function).
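
   As a small illustration of the optional 'omment' part, the following
shell sketch feeds three made-up lines to 'awk' and prints the file name
from those that match the 'file' pattern.  The sample input and the
'matched' variable are assumptions made for this illustration; they are
not part of 'extract.awk':

```shell
# Illustrative only: the input lines and the variable name are invented.
# Two directives (short and long comment spelling) and one ordinary command.
matched=$(printf '%s\n' '@c file a.awk' '@comment file b.awk' '@code{file}' |
    awk '/^@c(omment)?[ \t]+file/ { print $3 }')
printf '%s\n' "$matched"
```

   Both comment spellings match, while the '@code' line does not,
because the pattern requires whitespace right after '@c' or '@comment'.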

   The example programs in the online Texinfo source for 'GAWK:
Effective AWK Programming' ('gawktexi.in') have all been bracketed
inside 'file' and 'endfile' lines.  The 'gawk' distribution uses a copy
of 'extract.awk' to extract the sample programs and install many of them
in a standard directory where 'gawk' can find them.  The Texinfo file
looks something like this:

     ...
     This program has a @code{BEGIN} rule
     that prints a nice message:

     @example
     @c file examples/messages.awk
     BEGIN @{ print "Don't panic!" @}
     @c endfile
     @end example

     It also prints some final advice:

     @example
     @c file examples/messages.awk
     END @{ print "Always avoid bored archaeologists!" @}
     @c endfile
     @end example
     ...

   'extract.awk' begins by setting 'IGNORECASE' to one, so that mixed
upper- and lowercase letters in the directives won't matter.

   The first rule handles calling 'system()', checking that a command is
given ('NF' is at least three) and also checking that the command exits
with a zero exit status, signifying OK:

     # extract.awk --- extract files and run programs from Texinfo files

     BEGIN    { IGNORECASE = 1 }

     /^@c(omment)?[ \t]+system/ {
         if (NF < 3) {
             e = ("extract: " FILENAME ":" FNR)
             e = (e  ": badly formed `system' line")
             print e > "/dev/stderr"
             next
         }
         $1 = ""
         $2 = ""
         stat = system($0)
         if (stat != 0) {
             e = ("extract: " FILENAME ":" FNR)
             e = (e ": warning: system returned " stat)
             print e > "/dev/stderr"
         }
     }

The variable 'e' is used so that the rule fits nicely on the screen.
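
   To see what the rule actually passes to 'system()', the following
shell sketch may help.  It shows that clearing '$1' and '$2' rebuilds
'$0' from the remaining fields, and that 'system()' returns the exit
status of the command it ran.  The sample directive, the 'exit 3'
command, and the variable names are assumptions chosen for this
illustration:

```shell
# Illustrative only: the input line and variable names are invented.
# Clearing the first two fields rebuilds $0 from the remaining fields,
# joined by OFS; the two leading spaces are harmless to the shell.
cmdline=$(printf '%s\n' '@c system echo hello' |
    awk '{ $1 = ""; $2 = ""; print "[" $0 "]" }')
printf '%s\n' "$cmdline"

# system() hands back the exit status of the command it ran.
status=$(awk 'BEGIN { stat = system("exit 3"); print stat }')
printf '%s\n' "$status"
```

   The brackets in the first command are there only to make the leading
spaces in the rebuilt record visible.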

   The second rule handles moving data into files.  It verifies that a
file name is given in the directive.  If the file named is not the
current file, then the current file is closed.  Keeping the current file
open until a new file is encountered allows the use of the '>'
redirection for printing the contents, keeping open-file management
simple.

   The 'for' loop does the work.  It reads lines using 'getline'
(⇒Getline).  For an unexpected end-of-file, it calls the
'unexpected_eof()' function.  If the line is an "endfile" line, then it
breaks out of the loop.  If the line is an '@group' or '@end group'
line, then it ignores it and goes on to the next line.  Similarly,
comments within examples are also ignored.
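
   The shape of that loop can be tried out in isolation.  The following
shell sketch is a simplified stand-in, not the rule from 'extract.awk':
it uses a 'while' loop that simply stops at end-of-file instead of
calling 'unexpected_eof()', and it prints the body lines with a 'body: '
prefix instead of writing them to a file.  The input lines and the
'body' variable are invented for the illustration:

```shell
# Illustrative only: invented input; a simplified version of the loop.
body=$(printf '%s\n' 'before' '@c file demo.txt' 'kept 1' '@group' \
        'kept 2' '@end group' '@c endfile' 'after' |
    awk '/^@c(omment)?[ \t]+file/ {
        # Consume lines ourselves until the endfile directive.
        while ((getline line) > 0) {
            if (line ~ /^@c(omment)?[ \t]+endfile/)
                break
            if (line ~ /^@(end[ \t]+)?group/)
                continue
            print "body: " line
        }
    }')
printf '%s\n' "$body"
```

   Only the two 'kept' lines are printed.  The lines consumed by
'getline' are never seen by the main pattern-matching loop, and the
lines outside the bracketed region match no rule.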

   Most of the work is in the following few lines.  If the line has no
'@' symbols, the program can print it directly.  Otherwise, each leading
'@' must be stripped off.  To remove the '@' symbols, the line is split
into separate elements of the array 'a', using the 'split()' function
(⇒String Functions).  The '@' symbol is used as the separator
character.  Each element of 'a' that is empty indicates two successive
'@' symbols in the original line.  For each two empty elements ('@@' in
the original file), we have to add a single '@' symbol back in.

   When the processing of the array is finished, 'join()' is called with
the value of 'SUBSEP' (⇒Multidimensional) as the separator.  'join()'
treats 'SUBSEP' specially, inserting nothing between the elements, so
the pieces are rejoined into a single line.  That line is then printed
to the output file:

     /^@c(omment)?[ \t]+file/ {
         if (NF != 3) {
             e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
             print e > "/dev/stderr"
             next
         }
         # Switch output files only when the name changes.
         if ($3 != curfile) {
             if (curfile != "")
                 close(curfile)
             curfile = $3
         }

         for (;;) {
             if ((getline line) <= 0)
                 unexpected_eof()
             if (line ~ /^@c(omment)?[ \t]+endfile/)
                 break
             else if (line ~ /^@(end[ \t]+)?group/)
                 continue
             else if (line ~ /^@c(omment)?[ \t]+/)
                 continue
             # No @ at all: nothing to unescape.
             if (index(line, "@") == 0) {
                 print line > curfile
                 continue
             }
             n = split(line, a, "@")
             # if a[1] == "", means leading @,
             # don't add one back in.
             for (i = 2; i <= n; i++) {
                 if (a[i] == "") { # was an @@
                     a[i] = "@"
                     if (a[i+1] == "")
                         i++
                 }
             }
             print join(a, 1, n, SUBSEP) > curfile
         }
     }
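
   The '@' handling can be exercised on its own.  The following shell
sketch wraps the same split-and-repair loop in a hypothetical shell
function, 'unescape_at', and replaces the call to 'join()' with a plain
concatenation loop so that it needs no library file.  The function name,
the 'unescaped' variable, and the sample line are assumptions made for
this illustration:

```shell
# Illustrative only: unescape_at is a made-up helper, not part of gawk.
unescape_at() {
    awk '{
        n = split($0, a, "@")
        for (i = 2; i <= n; i++) {
            if (a[i] == "") {        # empty piece: part of an @@ pair
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }
        out = ""
        for (i = 1; i <= n; i++)     # stand-in for join(a, 1, n, SUBSEP)
            out = out a[i]
        print out
    }'
}

unescaped=$(printf '%s\n' 'BEGIN @{ print "user@@example.org" @}' |
    unescape_at)
printf '%s\n' "$unescaped"
```

   The single '@' before each brace disappears, while the doubled '@@'
inside the string comes out as one literal '@'.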

   An important thing to note is the use of the '>' redirection.  Output
done with '>' only opens the file once; it stays open and subsequent
output is appended to the file (⇒Redirection).  This makes it
easy to mix program text and explanatory prose for the same sample
source file (as has been done here!)  without any hassle.  The file is
only closed when a new data file name is encountered or at the end of
the input file.
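
   This behavior of '>' is easy to check from the shell.  The following
sketch writes two lines to a temporary file with '>' while the file
stays open, and then repeats the experiment with a 'close()' between the
two writes.  The temporary file and the variable names are assumptions
made for this illustration:

```shell
# Illustrative only: the file and variable names are invented.
tmp=$(mktemp)

# Two prints with ">" to the same name: the file is opened (and
# truncated) once, and the second print lands after the first.
awk -v out="$tmp" 'BEGIN { print "first" > out; print "second" > out }'
kept_open=$(cat "$tmp")

# Closing and then reopening with ">" truncates the file again.
awk -v out="$tmp" 'BEGIN {
    print "first" > out
    close(out)
    print "second" > out
}'
reopened=$(cat "$tmp")

printf 'kept open:\n%s\nafter close:\n%s\n' "$kept_open" "$reopened"
rm -f "$tmp"
```

   In the first run both lines survive.  In the second run only the
line written after the 'close()' remains, which shows why the output
file has to stay open while successive fragments of the same file are
being written.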

   Finally, the function 'unexpected_eof()' prints an appropriate error
message and then exits.  The 'END' rule handles the final cleanup,
closing the open file:

     function unexpected_eof()
     {
         printf("extract: %s:%d: unexpected EOF or error\n",
                          FILENAME, FNR) > "/dev/stderr"
         exit 1
     }

     END {
         if (curfile)
             close(curfile)
     }
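
   Note that calling 'exit' from within a function does not skip the
'END' rule, so the cleanup still happens after an unexpected
end-of-file.  The following shell sketch demonstrates this with a
hypothetical 'die()' function standing in for 'unexpected_eof()'; the
input lines and variable names are likewise invented:

```shell
# Illustrative only: die() is a made-up stand-in for unexpected_eof().
# exit inside a function stops input processing, but the END rule
# still runs before awk terminates with the given status.
status=0
cleanup=$(printf '%s\n' 'one' 'two' |
    awk 'function die() { exit 1 }
         { die() }
         END { print "END rule ran" }') || status=$?
printf '%s (exit status %s)\n' "$cleanup" "$status"
```

   The message from the 'END' rule is printed, and the exit status given
to 'exit' is still the one that 'awk' returns to the shell.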