gawk: Igawk Program

1 
1 11.3.9 An Easy Way to Use Library Functions
1 -------------------------------------------
1 
1 In ⇒Include Files, we saw how 'gawk' provides a built-in
1 file-inclusion capability.  However, this is a 'gawk' extension.  This
1 minor node provides the motivation for making file inclusion available
1 for standard 'awk', and shows how to do it using a combination of shell
1 and 'awk' programming.
1 
1    Using library functions in 'awk' can be very beneficial.  It
1 encourages code reuse and the writing of general functions.  Programs
1 are smaller and therefore clearer.  However, using library functions is
1 only easy when writing 'awk' programs; it is painful when running them,
1 requiring multiple '-f' options.  If 'gawk' is unavailable, then so too
1 is the 'AWKPATH' environment variable and the ability to put 'awk'
1 functions into a library directory (⇒Options).  It would be nice
1 to be able to write programs in the following manner:
1 
1      # library functions
1      @include getopt.awk
1      @include join.awk
1      ...
1 
1      # main program
1      BEGIN {
1          while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
1              ...
1          ...
1      }
1 
1    The following program, 'igawk.sh', provides this service.  It
1 simulates 'gawk''s searching of the 'AWKPATH' variable and also allows
1 "nested" includes (i.e., a file that is included with '@include' can
1 contain further '@include' statements).  'igawk' makes an effort to only
1 include files once, so that nested includes don't accidentally include a
1 library function twice.
1 
1    'igawk' should behave just like 'gawk' externally.  This means it
1 should accept all of 'gawk''s command-line arguments, including the
1 ability to have multiple source files specified via '-f' and the ability
1 to mix command-line and library source files.
1 
1    The program is written using the POSIX Shell ('sh') command
1 language.(1)  It works as follows:
1 
1   1. Loop through the arguments, saving anything that doesn't represent
1      'awk' source code for later, when the expanded program is run.
1 
1   2. For any arguments that do represent 'awk' text, put the arguments
1      into a shell variable that will be expanded.  There are two cases:
1 
1        a. Literal text, provided with '-e' or '--source'.  This text is
1           just appended directly.
1 
1        b. Source file names, provided with '-f'.  We use a neat trick
1           and append '@include FILENAME' to the shell variable's
1           contents.  Because the file-inclusion program works the way
1           'gawk' does, this gets the text of the file included in the
1           program at the correct point.
1 
1   3. Run an 'awk' program (naturally) over the shell variable's contents
1      to expand '@include' statements.  The expanded program is placed in
1      a second shell variable.
1 
1   4. Run the expanded program with 'gawk' and any other original
1      command-line arguments that the user supplied (such as the data
1      file names).
1 
1    This program uses shell variables extensively: for storing
1 command-line arguments and the text of the 'awk' program that will
1 expand the user's program, for the user's original program, and for the
1 expanded program.  Doing so removes some potential problems that might
1 arise were we to use temporary files instead, at the cost of making the
1 script somewhat more complicated.
1 
1    The initial part of the program turns on shell tracing if the first
1 argument is 'debug'.
1 
1    The next part loops through all the command-line arguments.  There
1 are several cases of interest:
1 
1 '--'
1      This ends the arguments to 'igawk'.  Anything else should be passed
1      on to the user's 'awk' program without being evaluated.
1 
1 '-W'
1      This indicates that the next option is specific to 'gawk'.  To make
1      argument processing easier, the '-W' is prepended to the next
1      argument (so that '-W version' becomes '-Wversion') and the loop
1      continues.  (This is an 'sh' programming trick.  Don't worry about
1      it if you are not familiar with 'sh'.)
1 
1 '-v', '-F'
1      These are saved and passed on to 'gawk'.
1 
1 '-f', '--file', '--file=', '-Wfile='
1      The file name is appended to the shell variable 'program' with an
1      '@include' statement.  The 'expr' utility is used to remove the
1      leading option part of the argument (e.g., '--file=').  (Typical
1      'sh' usage would be to use the 'echo' and 'sed' utilities to do
1      this work.  Unfortunately, some versions of 'echo' evaluate escape
1      sequences in their arguments, possibly mangling the program text.
1      Using 'expr' avoids this problem.)
1 
1 '--source', '--source=', '-Wsource='
1      The source text is appended to 'program'.
1 
1 '--version', '-Wversion'
1      'igawk' prints its version number, runs 'gawk --version' to get the
1      'gawk' version information, and then exits.
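
The '-W' case above depends on how the shell expands '"$@"' next to
literal text: the literal is joined to the first positional parameter
only, and the rest are left alone.  The following is a minimal sketch of
that idiom in isolation; the sample arguments are made up for
illustration and are not taken from 'igawk' itself.

```shell
# Sketch of the -W idiom, outside the real option loop.
# The argument values below are illustrative assumptions.
set -- -W version prog.awk   # pretend these were the command-line arguments

shift                        # discard the bare -W
set -- -W"$@"                # -W is glued onto the first remaining argument

first_arg=$1                 # "-Wversion"
second_arg=$2                # "prog.awk", unchanged
arg_count=$#                 # 2
```

On the next pass through the loop, the combined '-Wversion' is matched by
the '-[W-]version)' pattern, so only one form of each option needs to be
handled.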
1 
1    If none of the '-f', '--file', '-Wfile', '--source', or '-Wsource'
1 arguments are supplied, then the first nonoption argument should be the
1 'awk' program.  If there are no command-line arguments left, 'igawk'
1 prints an error message and exits.  Otherwise, the first argument is
1 appended to 'program'.  In any case, after the arguments have been
1 processed, the shell variable 'program' contains the complete text of
1 the original 'awk' program.
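
The error message mentioned above comes from a '${1?...}' parameter
expansion rather than from an explicit test.  The following small sketch
shows how that expansion behaves in a POSIX shell; 'check' is a
hypothetical helper used only for this demonstration.

```shell
# Sketch: ${var?message} fails when var is unset, but not when it is
# set to the empty string (${var:?message} would reject that too).
check() {
    # Use a subshell so that a failed expansion does not end this script.
    ( : "${1?missing operand}" ) 2>/dev/null
}

check
unset_status=$?      # nonzero: $1 is unset inside check

check ""
empty_status=$?      # 0: $1 is set, although empty

check prog.awk
normal_status=$?     # 0: $1 has a value
```

Because the failed expansion ends a noninteractive shell, the script
needs no separate 'exit' after the diagnostic.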
1 
1    The program is as follows:
1 
1      #! /bin/sh
1      # igawk --- like gawk but do @include processing
1 
1      if [ "$1" = debug ]
1      then
1          set -x
1          shift
1      fi
1 
1      # A literal newline, so that program text is formatted correctly
1      n='
1      '
1 
1      # Initialize variables to empty
1      program=
1      opts=
1 
1      while [ $# -ne 0 ] # loop over arguments
1      do
1          case $1 in
1          --)     shift
1                  break ;;
1 
1          -W)     shift
1                  # The ${x?'message here'} construct prints a
1                  # diagnostic and exits if $x is unset
1                  set -- -W"${@?'missing operand'}"
1                  continue ;;
1 
1          -[vF])  opts="$opts $1 '${2?'missing operand'}'"
1                  shift ;;
1 
1          -[vF]*) opts="$opts '$1'" ;;
1 
1          -f)     program="$program$n@include ${2?'missing operand'}"
1                  shift ;;
1 
1          -f*)    f=$(expr "$1" : '-f\(.*\)')
1                  program="$program$n@include $f" ;;
1 
1          -[W-]file=*)
1                  f=$(expr "$1" : '-.file=\(.*\)')
1                  program="$program$n@include $f" ;;
1 
1          -[W-]file)
1                  program="$program$n@include ${2?'missing operand'}"
1                  shift ;;
1 
1          -[W-]source=*)
1                  t=$(expr "$1" : '-.source=\(.*\)')
1                  program="$program$n$t" ;;
1 
1          -[W-]source)
1                  program="$program$n${2?'missing operand'}"
1                  shift ;;
1 
1          -[W-]version)
1                  echo igawk: version 3.0 1>&2
1                  gawk --version
1                  exit 0 ;;
1 
1          -[W-]*) opts="$opts '$1'" ;;
1 
1          *)      break ;;
1          esac
1          shift
1      done
1 
1      if [ -z "$program" ]
1      then
1           program=${1?'missing program'}
1           shift
1      fi
1 
1      # At this point, `program' has the program.
1 
1    The 'awk' program to process '@include' directives is stored in the
1 shell variable 'expand_prog'.  Doing this keeps the shell script
1 readable.  The 'awk' program reads through the user's program, one line
1 at a time, using 'getline' (⇒Getline).  The input file names and
1 '@include' statements are managed using a stack.  As each '@include' is
1 encountered, the current file name is "pushed" onto the stack and the
1 file named in the '@include' directive becomes the current file name.
1 As each file is finished, the stack is "popped," and the previous input
1 file becomes the current input file again.  The process is started by
1 making the original file the first one on the stack.
1 
1    The 'pathto()' function does the work of finding the full path to a
1 file.  It simulates 'gawk''s behavior when searching the 'AWKPATH'
1 environment variable (⇒AWKPATH Variable).  If a file name has a
1 '/' in it, no path search is done.  Similarly, if the file name is
1 '"-"', then that string is used as-is.  Otherwise, the file name is
1 concatenated with the name of each directory in the path, and an attempt
1 is made to open the generated file name.  The only way to test if a file
1 can be read in 'awk' is to go ahead and try to read it with 'getline';
1 this is what 'pathto()' does.(2)  If the file can be read, it is closed
1 and the file name is returned:
1 
1      expand_prog='
1 
1      function pathto(file,    i, t, junk)
1      {
1          if (index(file, "/") != 0)
1              return file
1 
1          if (file == "-")
1              return file
1 
1          for (i = 1; i <= ndirs; i++) {
1              t = (pathlist[i] "/" file)
1              if ((getline junk < t) > 0) {
1                  # found it
1                  close(t)
1                  return t
1              }
1          }
1          return ""
1      }
1 
1    The main program is contained inside one 'BEGIN' rule.  The first
1 thing it does is set up the 'pathlist' array that 'pathto()' uses.
1 After splitting the path on ':', null elements are replaced with '"."',
1 which represents the current directory:
1 
1      BEGIN {
1          path = ENVIRON["AWKPATH"]
1          ndirs = split(path, pathlist, ":")
1          for (i = 1; i <= ndirs; i++) {
1              if (pathlist[i] == "")
1                  pathlist[i] = "."
1          }
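
To see what the path-splitting code above produces, the same logic can be
run on its own.  This is only a sketch: it assumes a POSIX 'awk' is
available as 'awk', and the 'AWKPATH' value is made up so that it has an
empty element at each end.

```shell
# Sketch: split a sample AWKPATH and replace null elements with ".".
# The path value is an illustrative assumption.
dirs=$(AWKPATH=':/usr/share/awk:' awk 'BEGIN {
    ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."       # null element means current directory
        printf "%s%s", pathlist[i], (i < ndirs ? " " : "\n")
    }
}')

echo "$dirs"    # prints: . /usr/share/awk .
```

The leading and trailing colons each yield a null element, so the current
directory is searched both first and last for this sample value.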
1 
1    The stack is initialized with 'ARGV[1]', which will be
1 '"/dev/stdin"'.  The main loop comes next.  Input lines are read in
1 succession.  Lines that do not start with '@include' are printed
1 verbatim.  If the line does start with '@include', the file name is in
1 '$2'.  'pathto()' is called to generate the full path.  If it cannot
1 find the file, the program prints an error message and continues with
1 the next line.
1 
1    The next thing to check is if the file is included already.  The
1 'processed' array is indexed by the full file name of each included file
1 and it tracks this information for us.  If the file is seen again, a
1 warning message is printed.  Otherwise, the new file name is pushed onto
1 the stack and processing continues.
1 
1    Finally, when 'getline' encounters the end of the input file, the
1 file is closed and the stack is popped.  When 'stackptr' is less than
1 zero, the program is done:
1 
1          stackptr = 0
1          input[stackptr] = ARGV[1] # ARGV[1] is first file
1 
1          for (; stackptr >= 0; stackptr--) {
1              while ((getline < input[stackptr]) > 0) {
1                  if (tolower($1) != "@include") {
1                      print
1                      continue
1                  }
1                  fpath = pathto($2)
1                  if (fpath == "") {
1                      printf("igawk: %s:%d: cannot find %s\n",
1                          input[stackptr], FNR, $2) > "/dev/stderr"
1                      continue
1                  }
1                  if (! (fpath in processed)) {
1                      processed[fpath] = input[stackptr]
1                      input[++stackptr] = fpath  # push onto stack
1                  } else
1                      print $2, "included in", input[stackptr],
1                          "already included in",
1                          processed[fpath] > "/dev/stderr"
1              }
1              close(input[stackptr])
1          }
1      }'  # close quote ends `expand_prog' variable
1 
1      processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
1      $program
1      EOF
1      )
1 
1    The shell construct 'COMMAND << MARKER' is called a "here document".
1 Everything in the shell script up to the MARKER is fed to COMMAND as
1 input.  The shell processes the contents of the here document for
1 variable and command substitution (and possibly other things as well,
1 depending upon the shell).
1 
1    The shell construct '$(...)' is called "command substitution".  The
1 output of the command inside the parentheses is substituted into the
1 command line.  Because the result is used in a variable assignment, it
1 is saved as a single string, even if the results contain whitespace.
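
The two constructs just described can be tried together without 'gawk'.
In this sketch, 'cat' stands in for the '@include'-expanding program, and
the contents of 'program' are an arbitrary example.

```shell
# Sketch: feed a variable through a here document and capture the
# output with command substitution.  cat is a stand-in for gawk.
program='BEGIN { print "a  b" }
# a second line'

captured=$(cat << EOF
$program
EOF
)

# The value comes back intact: embedded newlines and runs of blanks are
# kept.  Only trailing newlines are removed by command substitution.
```

The shell expands '$program' inside the here document, but it does not
rescan the expanded text, so quotes and dollar signs in the user's
program pass through unchanged.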
1 
1    The expanded program is saved in the variable 'processed_program'.
1 It's done in these steps:
1 
1   1. Run 'gawk' with the '@include'-processing program (the value of the
1      'expand_prog' shell variable) reading standard input.
1 
1   2. Standard input is the contents of the user's program, from the
1      shell variable 'program'.  Feed its contents to 'gawk' via a here
1      document.
1 
1   3. Save the results of this processing in the shell variable
1      'processed_program' by using command substitution.
1 
1    The last step is to call 'gawk' with the expanded program, along with
1 the original options and command-line arguments that the user supplied:
1 
1      eval gawk $opts -- '"$processed_program"' '"$@"'
1 
1    The 'eval' command is a shell construct that reruns the shell's
1 parsing process.  As a result, the quotes stored inside 'opts' are
1 treated as quoting, and each saved option stays a single argument.
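
The effect of 'eval' on the saved options can be checked with a small
experiment.  This is a sketch only: 'count_args' is a hypothetical
helper, and the option values are invented for the example.

```shell
# Sketch: options are saved with embedded single quotes, as igawk does.
opts=
for arg in -v 'msg=hello world' -F:
do
    opts="$opts '$arg'"
done

count_args() { echo $#; }

# Plain expansion only splits on blanks; the quotes stay literal.
without_eval=$(count_args $opts)      # 4 words

# eval reparses the text, so the quotes group words again.
with_eval=$(eval count_args $opts)    # 3 arguments
```

Note that this quoting scheme assumes the saved arguments contain no
single quotes of their own; such an argument would need extra escaping.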
1 
1    This is the fifth version of 'igawk'.  Four key simplifications
1 make the program work better:
1 
1    * Using '@include' even for the files named with '-f' makes building
1      the initial collected 'awk' program much simpler; all the
1      '@include' processing can be done once.
1 
1    * The 'pathto()' function only tests whether a file can be read; it
1      does not try to save the line read with 'getline' for later use by
1      the main program.  This simplifies things considerably.
1 
1    * Using a 'getline' loop in the 'BEGIN' rule does it all in one
1      place.  It is not necessary to call out to a separate loop for
1      processing nested '@include' statements.
1 
1    * Instead of saving the expanded program in a temporary file, putting
1      it in a shell variable avoids some potential security problems.
1      This has the disadvantage that the script relies upon more features
1      of the 'sh' language, making it harder to follow for those who
1      aren't familiar with 'sh'.
1 
1    Also, this program illustrates that it is often worthwhile to combine
1 'sh' and 'awk' programming.  You can usually accomplish quite a
1 lot, without having to resort to low-level programming in C or C++, and
1 it is frequently easier to do certain kinds of string and argument
1 manipulation using the shell than it is in 'awk'.
1 
1    Finally, 'igawk' shows that it is not always necessary to add new
1 features to a program; they can often be layered on top.(3)
1 
1    ---------- Footnotes ----------
1 
1    (1) Fully explaining the 'sh' language is beyond the scope of this
1 book.  We provide some minimal explanations, but see a good shell
1 programming book if you wish to understand things in more depth.
1 
1    (2) On some very old versions of 'awk', the test 'getline junk < t'
1 can loop forever if the file exists but is empty.
1 
1    (3) 'gawk' does '@include' processing itself in order to support the
1 use of 'awk' programs as Web CGI scripts.
1