gawk: Igawk Program
1
1 11.3.9 An Easy Way to Use Library Functions
1 -------------------------------------------
1
1 In ⇒Include Files, we saw how 'gawk' provides a built-in
1 file-inclusion capability. However, this is a 'gawk' extension. This
1 minor node provides the motivation for making file inclusion available
1 for standard 'awk', and shows how to do it using a combination of shell
1 and 'awk' programming.
1
1 Using library functions in 'awk' can be very beneficial. It
1 encourages code reuse and the writing of general functions. Programs
1 are smaller and therefore clearer. However, using library functions is
1 only easy when writing 'awk' programs; it is painful when running them,
1 requiring multiple '-f' options. If 'gawk' is unavailable, then so too
1 is the 'AWKPATH' environment variable and the ability to put 'awk'
1 functions into a library directory (⇒Options). It would be nice
1 to be able to write programs in the following manner:
1
1 # library functions
1 @include getopt.awk
1 @include join.awk
1 ...
1
1 # main program
1 BEGIN {
1     while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
1         ...
1     ...
1 }
1
1 The following program, 'igawk.sh', provides this service. It
1 simulates 'gawk''s searching of the 'AWKPATH' variable and also allows
1 "nested" includes (i.e., a file that is included with '@include' can
1 contain further '@include' statements). 'igawk' makes an effort to only
1 include files once, so that nested includes don't accidentally include a
1 library function twice.
1
1 'igawk' should behave just like 'gawk' externally. This means it
1 should accept all of 'gawk''s command-line arguments, including the
1 ability to have multiple source files specified via '-f' and the ability
1 to mix command-line and library source files.
1
1 The program is written using the POSIX Shell ('sh') command
1 language.(1) It works as follows:
1
1 1. Loop through the arguments, saving anything that doesn't represent
1 'awk' source code for later, when the expanded program is run.
1
1 2. For any arguments that do represent 'awk' text, put the arguments
1 into a shell variable that will be expanded. There are two cases:
1
1 a. Literal text, provided with '--source'. (The script shown here
1 has no case for the short form '-e'.) This text is just appended
1 directly.
1
1 b. Source file names, provided with '-f'. We use a neat trick
1 and append '@include FILENAME' to the shell variable's
1 contents. Because the file-inclusion program works the way
1 'gawk' does, this gets the text of the file included in the
1 program at the correct point.
1
1 3. Run an 'awk' program (naturally) over the shell variable's contents
1 to expand '@include' statements. The expanded program is placed in
1 a second shell variable.
1
1 4. Run the expanded program with 'gawk' and any other original
1 command-line arguments that the user supplied (such as the data
1 file names).
1
1 This program uses shell variables extensively: for storing
1 command-line arguments and the text of the 'awk' program that will
1 expand the user's program, for the user's original program, and for the
1 expanded program. Doing so removes some potential problems that might
1 arise were we to use temporary files instead, at the cost of making the
1 script somewhat more complicated.
1
1 The initial part of the program turns on shell tracing if the first
1 argument is 'debug'.
1
1 The next part loops through all the command-line arguments. There
1 are several cases of interest:
1
1 '--'
1 This ends the arguments to 'igawk'. Anything else should be passed
1 on to the user's 'awk' program without being evaluated.
1
1 '-W'
1 This indicates that the next option is specific to 'gawk'. To make
1 argument processing easier, the '-W' is joined to the front of the
1 next argument (so that '-W version' becomes '-Wversion') and the
1 loop continues. (This is an 'sh' programming trick. Don't worry
1 about it if you are not familiar with 'sh'.)
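The effect of the '-W' rewriting step can be seen in isolation. The following is a minimal sketch and not part of 'igawk' itself; the function name 'rewrite_w' and the variable 'rewritten' are hypothetical and exist only for illustration. The function joins '-W' to the argument that follows it and prints the resulting argument list, one argument per line:

```shell
# rewrite_w is a hypothetical helper used only for this illustration.
rewrite_w() {
    if [ "$1" = -W ] && [ $# -gt 1 ]
    then
        shift
        # -W"$@" glues -W onto the first remaining argument and
        # leaves the other arguments untouched.
        set -- -W"$@"
    fi
    printf '%s\n' "$@"
}

rewritten=$(rewrite_w -W version data.txt)
printf '%s\n' "$rewritten"
```

After the rewrite, the two arguments '-W' and 'version' have become the single argument '-Wversion', which a pattern such as '-[W-]version' can match on the next pass through the loop.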
1
1 '-v', '-F'
1 These are saved and passed on to 'gawk'.
1
1 '-f', '--file', '--file=', '-Wfile='
1 The file name is appended to the shell variable 'program' with an
1 '@include' statement. The 'expr' utility is used to remove the
1 leading option part of the argument (e.g., '--file='). (Typical
1 'sh' usage would be to use the 'echo' and 'sed' utilities to do
1 this work. Unfortunately, some versions of 'echo' evaluate escape
1 sequences in their arguments, possibly mangling the program text.
1 Using 'expr' avoids this problem.)
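A small sketch of the 'expr' technique follows. The function name 'strip_file_opt' is hypothetical, and the leading 'x' guard is a common portability idiom that is assumed here rather than taken from 'igawk'; it keeps 'expr' from treating an operand that begins with '-' as an option:

```shell
# strip_file_opt is a hypothetical helper used only for this illustration.
strip_file_opt() {
    # The colon operator matches the pattern against the start of the
    # string and prints the part captured by \( \).
    expr "x$1" : 'x-.file=\(.*\)'
}

name=$(strip_file_opt '--file=my\tlib.awk')
printf '%s\n' "$name"
```

The backslash sequence in the file name comes through unchanged, which is the property that motivates using 'expr' instead of 'echo' and 'sed'.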
1
1 '--source', '--source=', '-Wsource='
1 The source text is appended to 'program'.
1
1 '--version', '-Wversion'
1 'igawk' prints its version number, runs 'gawk --version' to get the
1 'gawk' version information, and then exits.
1
1 If none of the '-f', '--file', '-Wfile', '--source', or '-Wsource'
1 arguments are supplied, then the first nonoption argument should be the
1 'awk' program. If there are no command-line arguments left, 'igawk'
1 prints an error message and exits. Otherwise, the first argument is
1 appended to 'program'. In any case, after the arguments have been
1 processed, the shell variable 'program' contains the complete text of
1 the original 'awk' program.
1
1 The program is as follows:
1
1 #! /bin/sh
1 # igawk --- like gawk but do @include processing
1
1 if [ "$1" = debug ]
1 then
1     set -x
1     shift
1 fi
1
1 # A literal newline, so that program text is formatted correctly
1 n='
1 '
1
1 # Initialize variables to empty
1 program=
1 opts=
1
1 while [ $# -ne 0 ] # loop over arguments
1 do
1     case $1 in
1     --)     shift
1             break ;;
1
1     -W)     shift
1             # The ${x?'message here'} construct prints a
1             # diagnostic and exits if $x is unset
1             set -- -W"${@?'missing operand'}"
1             continue ;;
1
1     -[vF])  opts="$opts $1 '${2?'missing operand'}'"
1             shift ;;
1
1     -[vF]*) opts="$opts '$1'" ;;
1
1     -f)     program="$program$n@include ${2?'missing operand'}"
1             shift ;;
1
1     -f*)    f=$(expr "$1" : '-f\(.*\)')
1             program="$program$n@include $f" ;;
1
1     -[W-]file=*)
1             f=$(expr "$1" : '-.file=\(.*\)')
1             program="$program$n@include $f" ;;
1
1     -[W-]file)
1             program="$program$n@include ${2?'missing operand'}"
1             shift ;;
1
1     -[W-]source=*)
1             t=$(expr "$1" : '-.source=\(.*\)')
1             program="$program$n$t" ;;
1
1     -[W-]source)
1             program="$program$n${2?'missing operand'}"
1             shift ;;
1
1     -[W-]version)
1             echo igawk: version 3.0 1>&2
1             gawk --version
1             exit 0 ;;
1
1     -[W-]*) opts="$opts '$1'" ;;
1
1     *)      break ;;
1     esac
1     shift
1 done
1
1 if [ -z "$program" ]
1 then
1     program=${1?'missing program'}
1     shift
1 fi
1
1 # At this point, `program' has the program.
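The '${2?...}' and '${1?...}' expansions in the script stop the shell with a diagnostic when the parameter is unset. A parameter that is set but empty passes the test; the related ':?' form rejects that case as well. The sketch below is an illustration and not part of 'igawk'; the variable names 'r_unset', 'r_empty', and 'r_colon' are made up for it. Each check runs in a subshell because a failed expansion makes a noninteractive shell exit:

```shell
# Record whether each expansion succeeds.
if (unset x; : "${x?'x is unset'}") 2>/dev/null
then r_unset=ok; else r_unset=fail; fi

if (x=; : "${x?'x is unset'}") 2>/dev/null
then r_empty=ok; else r_empty=fail; fi

if (x=; : "${x:?'x is unset or empty'}") 2>/dev/null
then r_colon=ok; else r_colon=fail; fi

echo "$r_unset $r_empty $r_colon"
```

Only the first and third checks fail, so an empty operand such as the one in 'igawk -f ""' would not be caught by the plain '?' form.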
1
1 The 'awk' program to process '@include' directives is stored in the
1 shell variable 'expand_prog'. Doing this keeps the shell script
1 readable. The 'awk' program reads through the user's program, one line
1 at a time, using 'getline' (⇒Getline). The input file names and
1 '@include' statements are managed using a stack. As each '@include' is
1 encountered, the current file name is "pushed" onto the stack and the
1 file named in the '@include' directive becomes the current file name.
1 As each file is finished, the stack is "popped," and the previous input
1 file becomes the current input file again. The process is started by
1 making the original file the first one on the stack.
1
1 The 'pathto()' function does the work of finding the full path to a
1 file. It simulates 'gawk''s behavior when searching the 'AWKPATH'
1 environment variable (⇒AWKPATH Variable). If a file name has a
1 '/' in it, no path search is done. Similarly, if the file name is
1 '"-"', then that string is used as-is. Otherwise, the file name is
1 concatenated with the name of each directory in the path, and an attempt
1 is made to open the generated file name. The only way to test if a file
1 can be read in 'awk' is to go ahead and try to read it with 'getline';
1 this is what 'pathto()' does.(2) If the file can be read, it is closed
1 and the file name is returned:
1
1 expand_prog='
1
1 function pathto(file,    i, t, junk)
1 {
1     if (index(file, "/") != 0)
1         return file
1
1     if (file == "-")
1         return file
1
1     for (i = 1; i <= ndirs; i++) {
1         t = (pathlist[i] "/" file)
1         if ((getline junk < t) > 0) {
1             # found it
1             close(t)
1             return t
1         }
1     }
1     return ""
1 }
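To try the search logic on its own, the function can be wrapped in a small shell function. This is a sketch under stated assumptions: the wrapper 'find_in_path', the variables 'demo_dir', 'found', and 'missing', and the file 'hello.awk' are hypothetical, the search path is passed in with '-v' instead of being read from 'AWKPATH', and plain 'awk' is used so that the example does not depend on 'gawk':

```shell
# Build a throwaway directory holding one readable library file.
demo_dir=${TMPDIR:-/tmp}/pathto-demo.$$
mkdir -p "$demo_dir/lib"
echo 'function hello() { print "hi" }' > "$demo_dir/lib/hello.awk"

# find_in_path is a hypothetical wrapper: $1 is a colon-separated
# search path and $2 is the file name to look for.
find_in_path() {
    awk -v path="$1" -v name="$2" '
    function pathto(file,    i, t, junk)
    {
        if (index(file, "/") != 0)
            return file
        if (file == "-")
            return file
        for (i = 1; i <= ndirs; i++) {
            t = (pathlist[i] "/" file)
            if ((getline junk < t) > 0) {
                close(t)    # readable and nonempty: found it
                return t
            }
        }
        return ""
    }
    BEGIN {
        ndirs = split(path, pathlist, ":")
        print pathto(name)
    }'
}

found=$(find_in_path "/nonexistent:$demo_dir/lib" hello.awk)
missing=$(find_in_path "/nonexistent:$demo_dir/lib" nosuch.awk)
echo "found:   $found"
echo "missing: [$missing]"
rm -rf "$demo_dir"
```

Because the test is a successful 'getline', a file that exists but is empty is reported as not found.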
1
1 The main program is contained inside one 'BEGIN' rule. The first
1 thing it does is set up the 'pathlist' array that 'pathto()' uses.
1 After splitting the path on ':', null elements are replaced with '"."',
1 which represents the current directory:
1
1 BEGIN {
1     path = ENVIRON["AWKPATH"]
1     ndirs = split(path, pathlist, ":")
1     for (i = 1; i <= ndirs; i++) {
1         if (pathlist[i] == "")
1             pathlist[i] = "."
1     }
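The handling of null path elements can be checked with a short, self-contained sketch. The function 'show_path' and the variable 'dirs' are hypothetical names, and plain 'awk' is assumed in place of 'gawk'. The function sets 'AWKPATH' for a single 'awk' run and prints the resulting directory list on one line:

```shell
# show_path is a hypothetical helper: $1 is the value to use for AWKPATH.
show_path() {
    AWKPATH=$1 awk 'BEGIN {
        ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."    # null element means current directory
            printf "%s%s", pathlist[i], (i < ndirs ? " " : "\n")
        }
    }'
}

dirs=$(show_path ':/usr/share/awk::lib')
echo "$dirs"
```

The leading colon and the doubled colon each produce a null element, and both are replaced by '.'.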
1
1 The stack is initialized with 'ARGV[1]', which will be
1 '"/dev/stdin"'. The main loop comes next. Input lines are read in
1 succession. Lines that do not start with '@include' are printed
1 verbatim. If the line does start with '@include', the file name is in
1 '$2'. 'pathto()' is called to generate the full path. If it cannot,
1 then the program prints an error message and continues.
1
1 The next thing to check is if the file is included already. The
1 'processed' array is indexed by the full file name of each included file
1 and it tracks this information for us. If the file is seen again, a
1 warning message is printed. Otherwise, the new file name is pushed onto
1 the stack and processing continues.
1
1 Finally, when 'getline' encounters the end of the input file, the
1 file is closed and the stack is popped. When 'stackptr' is less than
1 zero, the program is done:
1
1     stackptr = 0
1     input[stackptr] = ARGV[1] # ARGV[1] is first file
1     lineno[stackptr] = 0      # lines read so far from this file
1
1     for (; stackptr >= 0; stackptr--) {
1         while ((getline < input[stackptr]) > 0) {
1             # A redirected getline does not update FNR,
1             # so count lines by hand
1             lineno[stackptr]++
1             if (tolower($1) != "@include") {
1                 print
1                 continue
1             }
1             fpath = pathto($2)
1             if (fpath == "") {
1                 printf("igawk: %s:%d: cannot find %s\n",
1                     input[stackptr], lineno[stackptr], $2) > "/dev/stderr"
1                 continue
1             }
1             if (! (fpath in processed)) {
1                 processed[fpath] = input[stackptr]
1                 input[++stackptr] = fpath # push onto stack
1                 lineno[stackptr] = 0
1             } else
1                 print $2, "included in", input[stackptr],
1                     "already included in",
1                     processed[fpath] > "/dev/stderr"
1         }
1         close(input[stackptr])
1     }
1 }' # close quote ends `expand_prog' variable
1
1 processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
1 $program
1 EOF
1 )
1
1 The shell construct 'COMMAND << MARKER' is called a "here document".
1 Everything in the shell script up to the MARKER is fed to COMMAND as
1 input. Because MARKER is not quoted here, the shell processes the
1 contents of the here document for variable and command substitution
1 (and possibly other things as well, depending upon the shell).
1
1 The shell construct '$(...)' is called "command substitution". The
1 output of the command inside the parentheses is substituted into the
1 command line. Because the result is used in a variable assignment, it
1 is saved as a single string, even if the results contain whitespace.
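Both constructs can be tried together in a few lines. This is a minimal sketch using 'cat' as a stand-in for the expansion step; the variable name 'copy' is hypothetical. It shows that program text passes through a here document and a command substitution intact, including internal whitespace, newlines, and 'awk' field references such as '$1':

```shell
# A two-line program with runs of spaces and an awk field reference.
program='{ print $1,   "x" }
END { print NR }'

# The here document expands $program once; the expanded text is not
# scanned again, so the $1 inside it reaches cat unchanged.
copy=$(cat << EOF
$program
EOF
)

if [ "$copy" = "$program" ]
then echo identical
else echo different
fi
```

Command substitution removes trailing newlines, so the newline that the here document adds after the program text does not end up in 'copy'.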
1
1 The expanded program is saved in the variable 'processed_program'.
1 It's done in these steps:
1
1 1. Run 'gawk' with the '@include'-processing program (the value of the
1 'expand_prog' shell variable) reading standard input.
1
1 2. Standard input is the contents of the user's program, from the
1 shell variable 'program'. Feed its contents to 'gawk' via a here
1 document.
1
1 3. Save the results of this processing in the shell variable
1 'processed_program' by using command substitution.
1
1 The last step is to call 'gawk' with the expanded program, along with
1 the original options and command-line arguments that the user supplied:
1
1 eval gawk $opts -- '"$processed_program"' '"$@"'
1
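1 The 'eval' command is a shell construct that reruns the shell's
1 parsing process. The option values saved in 'opts' are wrapped in
1 literal single quotes; the second parse treats those quotes as
1 quoting, so each saved value reaches 'gawk' as a single argument.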
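A short sketch shows the quoting at work. It assumes plain 'awk' in place of 'gawk'; the wrapper function 'run_expanded' and the variable 'result' are hypothetical, and 'opts' is filled in by hand in the same form that the argument loop produces:

```shell
# Options are stored the way the argument loop stores them: each
# value wrapped in literal single quotes inside one string.
opts="-v 'greeting=hello world'"
processed_program='BEGIN { print greeting }'

# run_expanded is a hypothetical wrapper so that "$@" is well defined.
run_expanded() {
    eval awk $opts -- '"$processed_program"' '"$@"'
}

result=$(run_expanded)
echo "$result"
```

Without 'eval', the single quotes stored in 'opts' would be passed along as ordinary characters and the option value would be split at the space.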
1
1 This version of 'igawk' represents the fifth version of this program.
1 There are four key simplifications that make the program work better:
1
1 * Using '@include' even for the files named with '-f' makes building
1 the initial collected 'awk' program much simpler; all the
1 '@include' processing can be done once.
1
1 * Not trying to save the line read with 'getline' in the 'pathto()'
1 function when testing for the file's accessibility for use with the
1 main program simplifies things considerably.
1
1 * Using a 'getline' loop in the 'BEGIN' rule does it all in one
1 place. It is not necessary to call out to a separate loop for
1 processing nested '@include' statements.
1
1 * Instead of saving the expanded program in a temporary file, putting
1 it in a shell variable avoids some potential security problems.
1 This has the disadvantage that the script relies upon more features
1 of the 'sh' language, making it harder to follow for those who
1 aren't familiar with 'sh'.
1
1 Also, this program illustrates that it is often worthwhile to combine
1 'sh' and 'awk' programming. You can usually accomplish quite a
1 lot, without having to resort to low-level programming in C or C++, and
1 it is frequently easier to do certain kinds of string and argument
1 manipulation using the shell than it is in 'awk'.
1
1 Finally, 'igawk' shows that it is not always necessary to add new
1 features to a program; they can often be layered on top.(3)
1
1 ---------- Footnotes ----------
1
1 (1) Fully explaining the 'sh' language is beyond the scope of this
1 book. We provide some minimal explanations, but see a good shell
1 programming book if you wish to understand things in more depth.
1
1 (2) On some very old versions of 'awk', the test 'getline junk < t'
1 can loop forever if the file exists but is empty.
1
1 (3) 'gawk' does '@include' processing itself in order to support the
1 use of 'awk' programs as Web CGI scripts.
1