gawk: Getopt Function

10.4 Processing Command-Line Options
====================================

Most utilities on POSIX-compatible systems take options on the command
line that can be used to change the way a program behaves.  'awk' is an
example of such a program (⇒Options).  Often, options take
"arguments" (i.e., data that the program needs to correctly obey the
command-line option).  For example, 'awk''s '-F' option requires a
string to use as the field separator.  The first occurrence on the
command line of either '--' or a string that does not begin with '-'
ends the options.

   Modern Unix systems provide a C function named 'getopt()' for
processing command-line arguments.  The programmer provides a string
describing the one-letter options.  If an option requires an argument,
it is followed in the string with a colon.  'getopt()' is also passed
the count and values of the command-line arguments and is called in a
loop.  'getopt()' processes the command-line arguments for option
letters.  Each time around the loop, it returns a single character
representing the next option letter that it finds, or '?' if it finds an
invalid option.  When it returns -1, there are no options left on the
command line.

   When using 'getopt()', options that do not take arguments can be
grouped together.  Furthermore, options that take arguments require that
the argument be present.  The argument can immediately follow the option
letter, or it can be a separate command-line argument.

   Given a hypothetical program that takes three command-line options,
'-a', '-b', and '-c', where '-b' requires an argument, all of the
following are valid ways of invoking the program:

     prog -a -b foo -c data1 data2 data3
     prog -ac -bfoo -- data1 data2 data3
     prog -acbfoo data1 data2 data3

   Notice that when the argument is grouped with its option, the rest of
the argument is considered to be the option's argument.  In this
example, '-acbfoo' indicates that all of the '-a', '-b', and '-c'
options were supplied, and that 'foo' is the argument to the '-b'
option.
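   The grouping rule can be tried out with a short, self-contained
sketch.  The shell function 'split_cluster' is hypothetical (it is not
part of 'getopt()' or of 'gawk'); it wraps a small 'awk' program that
walks one grouped argument and reports each option letter.  As soon as
it meets a letter that is followed by a colon in the option string, it
treats the remainder of the argument as that option's argument.  Unknown
letters are not diagnosed in this simplified version:

```shell
# Hypothetical helper, for illustration only.
# $1 is one grouped argument (e.g., -acbfoo); $2 is a getopt-style
# option string (e.g., "ab:c").
split_cluster() {
    awk -v arg="$1" -v options="$2" 'BEGIN {
        # position 1 is the "-", so start at position 2
        for (i = 2; i <= length(arg); i++) {
            c = substr(arg, i, 1)
            p = index(options, c)
            if (p > 0 && substr(options, p + 1, 1) == ":") {
                # the rest of the argument belongs to this option
                printf("option %s, argument %s\n", c, substr(arg, i + 1))
                break
            }
            printf("option %s\n", c)
        }
    }'
}

split_cluster -acbfoo "ab:c"
```

   With the arguments shown, the sketch reports 'a' and 'c' as plain
options and 'foo' as the argument to 'b'.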

   'getopt()' provides four external variables that the programmer can
use:

'optind'
     The index in the argument value array ('argv') where the first
     nonoption command-line argument can be found.

'optarg'
     The string value of the argument to an option.

'opterr'
     Usually 'getopt()' prints an error message when it finds an invalid
     option.  Setting 'opterr' to zero disables this feature.  (An
     application might want to print its own error message.)

'optopt'
     The letter representing the command-line option.

   The following C fragment shows how 'getopt()' might process
command-line arguments for 'awk':

     int
     main(int argc, char *argv[])
     {
         ...
         /* print our own message */
         opterr = 0;
         while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
             switch (c) {
             case 'f':    /* file */
                 ...
                 break;
             case 'F':    /* field separator */
                 ...
                 break;
             case 'v':    /* variable assignment */
                 ...
                 break;
             case 'W':    /* extension */
                 ...
                 break;
             case '?':
             default:
                 usage();
                 break;
             }
         }
         ...
     }

   As a side point, 'gawk' actually uses the GNU 'getopt_long()'
function to process both normal and GNU-style long options
(⇒Options).

   The abstraction provided by 'getopt()' is very useful and is quite
handy in 'awk' programs as well.  Following is an 'awk' version of
'getopt()'.  This function highlights one of the greatest weaknesses in
'awk', which is that it is very poor at manipulating single characters.
Repeated calls to 'substr()' are necessary for accessing individual
characters (⇒String Functions).(1)
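   As a small illustration of the point about single characters, the
following sketch visits each character of a string with repeated
'substr()' calls.  The shell function name 'each_char' and the sample
string '-abx' are made up for the example:

```shell
# Hypothetical helper: print every character of $1 on its own line.
each_char() {
    awk -v s="$1" 'BEGIN {
        # awk strings are indexed from 1
        for (i = 1; i <= length(s); i++)
            printf("%d: %s\n", i, substr(s, i, 1))
    }'
}

each_char -abx
```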

   The discussion that follows walks through the code a bit at a time:

     # getopt.awk --- Do C library getopt(3) function in awk

     # External variables:
     #    Optind -- index in ARGV of first nonoption argument
     #    Optarg -- string value of argument to current option
     #    Opterr -- if nonzero, print our own diagnostic
     #    Optopt -- current option letter

     # Returns:
     #    -1     at end of options
     #    "?"    for unrecognized option
     #    <c>    a character representing the current option

     # Private Data:
     #    _opti  -- index in multiflag option, e.g., -abc

   The function starts out with comments presenting a list of the global
variables it uses, what the return values are, what they mean, and any
global variables that are "private" to this library function.  Such
documentation is essential for any program, and particularly for library
functions.

   The 'getopt()' function first checks that it was indeed called with a
string of options (the 'options' parameter).  If 'options' has a zero
length, 'getopt()' immediately returns -1:

     function getopt(argc, argv, options,    thisopt, i)
     {
         if (length(options) == 0)    # no options given
             return -1

         if (argv[Optind] == "--") {  # all done
             Optind++
             _opti = 0
             return -1
         } else if (argv[Optind] !~ /^-[^:[:space:]]/) {
             _opti = 0
             return -1
         }

   The next thing to check for is the end of the options.  A '--' ends
the command-line options, as does any command-line argument that does
not begin with a '-'.  'Optind' is used to step through the array of
command-line arguments; it retains its value across calls to 'getopt()',
because it is a global variable.
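   The two end-of-options tests in the code above can be exercised on
their own.  In this sketch, the shell function 'classify_arg' is a
hypothetical name, and the program only classifies a single string; it
does not advance 'Optind' or touch '_opti'.  It assumes an 'awk' that
understands POSIX character classes such as '[:space:]':

```shell
# Hypothetical helper: report how the tests at the top of getopt()
# would treat the string in $1.
classify_arg() {
    awk -v arg="$1" 'BEGIN {
        if (arg == "--")
            print "end of options (skipped)"
        else if (arg !~ /^-[^:[:space:]]/)
            print "end of options (nonoption argument)"
        else
            print "option"
    }'
}

for a in -a -- data1 - -:; do
    printf "%s => %s\n" "$a" "$(classify_arg "$a")"
done
```

   Only '-a' is classified as an option here.  A lone '-' and the string
'-:' both fail the pattern, so they end option processing just as
'data1' does.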

   The regular expression that is used, '/^-[^:[:space:]]/', checks for a
'-' followed by anything that is not whitespace and not a colon.  If the
current command-line argument does not match this pattern, it is not an
option, and it ends option processing.  Continuing on:

         if (_opti == 0)
             _opti = 2
         thisopt = substr(argv[Optind], _opti, 1)
         Optopt = thisopt
         i = index(options, thisopt)
         if (i == 0) {
             if (Opterr)
                 printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
             if (_opti >= length(argv[Optind])) {
                 Optind++
                 _opti = 0
             } else
                 _opti++
             return "?"
         }

   The '_opti' variable tracks the position in the current command-line
argument ('argv[Optind]').  If multiple options are grouped together
with one '-' (e.g., '-abx'), it is necessary to return them to the user
one at a time.

   If '_opti' is equal to zero, it is set to two, which is the index in
the string of the next character to look at (we skip the '-', which is
at position one).  The variable 'thisopt' holds the character, obtained
with 'substr()'.  It is saved in 'Optopt' for the main program to use.

   If 'thisopt' is not in the 'options' string, then it is an invalid
option.  If 'Opterr' is nonzero, 'getopt()' prints an error message on
the standard error that is similar to the message from the C version of
'getopt()'.

   Because the option is invalid, it is necessary to skip it and move on
to the next option character.  If '_opti' is greater than or equal to
the length of the current command-line argument, it is necessary to move
on to the next argument, so 'Optind' is incremented and '_opti' is reset
to zero.  Otherwise, 'Optind' is left alone and '_opti' is merely
incremented.

   In any case, because the option is invalid, 'getopt()' returns '"?"'.
The main program can examine 'Optopt' if it needs to know what the
invalid option letter actually is.  Continuing on:

         if (substr(options, i + 1, 1) == ":") {
             # get option argument
             if (length(substr(argv[Optind], _opti + 1)) > 0)
                 Optarg = substr(argv[Optind], _opti + 1)
             else
                 Optarg = argv[++Optind]
             _opti = 0
         } else
             Optarg = ""

   If the option requires an argument, the option letter is followed by
a colon in the 'options' string.  If there are remaining characters in
the current command-line argument ('argv[Optind]'), then the rest of
that string is assigned to 'Optarg'.  Otherwise, the next command-line
argument is used ('-xFOO' versus '-x FOO').  In either case, '_opti' is
reset to zero, because there are no more characters left to examine in
the current command-line argument.  Continuing:

         if (_opti == 0 || _opti >= length(argv[Optind])) {
             Optind++
             _opti = 0
         } else
             _opti++
         return thisopt
     }

   Finally, if '_opti' is either zero or greater than or equal to the
length of the current command-line argument, it means this element in
'argv' is through being processed, so 'Optind' is incremented to point
to the next element in 'argv' and '_opti' is reset to zero.  If neither
condition is true, then only '_opti' is incremented, so that the next
option letter can be processed on the next call to 'getopt()'.

   The 'BEGIN' rule initializes both 'Opterr' and 'Optind' to one.
'Opterr' is set to one, because the default behavior is for 'getopt()'
to print a diagnostic message upon seeing an invalid option.  'Optind'
is set to one, because there's no reason to look at the program name,
which is in 'ARGV[0]':

     BEGIN {
         Opterr = 1    # default is to diagnose
         Optind = 1    # skip ARGV[0]

         # test program
         if (_getopt_test) {
             while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
                 printf("c = <%c>, Optarg = <%s>\n",
                                            _go_c, Optarg)
             printf("non-option arguments:\n")
             for (; Optind < ARGC; Optind++)
                 printf("\tARGV[%d] = <%s>\n",
                                         Optind, ARGV[Optind])
         }
     }

   The rest of the 'BEGIN' rule is a simple test program.  Here are the
results of two sample runs of the test program:

     $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
     -| c = <a>, Optarg = <>
     -| c = <c>, Optarg = <>
     -| c = <b>, Optarg = <ARG>
     -| non-option arguments:
     -|         ARGV[3] = <bax>
     -|         ARGV[4] = <-x>

     $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
     -| c = <a>, Optarg = <>
     error-> x -- invalid option
     -| c = <?>, Optarg = <>
     -| non-option arguments:
     -|         ARGV[4] = <xyz>
     -|         ARGV[5] = <abc>

   In both runs, the first '--' terminates the arguments to 'awk', so
that it does not try to interpret the '-a', etc., as its own options.

     NOTE: After 'getopt()' is through, user-level code must clear out
     all the elements of 'ARGV' from 1 to 'Optind - 1', so that 'awk'
     does not try to process the command-line options as file names.
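   The pieces above can be put together in a short end-to-end sketch.
Everything specific to it is an assumption made for illustration: the
options '-v' and '-n', the file names 'demo.awk' and 'data.txt', and the
shell function 'run_demo'.  The 'getopt()' function is repeated from
this section only so that the script runs on its own.  The program sets
'Opterr' to zero and prints its own message using 'Optopt', then clears
the option elements of 'ARGV' before any input is read:

```shell
#!/bin/sh
# Sketch only: option letters, file names, and run_demo are made up.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

# getopt() exactly as developed in this section
cat > "$tmp/getopt.awk" <<'EOF'
function getopt(argc, argv, options,    thisopt, i)
{
    if (length(options) == 0)    # no options given
        return -1
    if (argv[Optind] == "--") {  # all done
        Optind++
        _opti = 0
        return -1
    } else if (argv[Optind] !~ /^-[^:[:space:]]/) {
        _opti = 0
        return -1
    }
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) {
        if (Opterr)
            printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return "?"
    }
    if (substr(options, i + 1, 1) == ":") {
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    } else
        Optarg = ""
    if (_opti == 0 || _opti >= length(argv[Optind])) {
        Optind++
        _opti = 0
    } else
        _opti++
    return thisopt
}
BEGIN { Opterr = 1; Optind = 1 }
EOF

# Hypothetical program: -v prefixes lines with the file name,
# -n COUNT prints only the first COUNT lines of each file.
cat > "$tmp/demo.awk" <<'EOF'
BEGIN {
    Opterr = 0    # print our own message instead
    while ((c = getopt(ARGC, ARGV, "vn:")) != -1) {
        if (c == "v")
            verbose = 1
        else if (c == "n")
            limit = Optarg + 0
        else
            printf("demo: unknown option -%s\n", Optopt)
    }
    # clear the options so that awk does not open them as files
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
}
!limit || FNR <= limit {
    prefix = verbose ? FILENAME ": " : ""
    print prefix $0
}
EOF

printf 'one\ntwo\nthree\n' > "$tmp/data.txt"
run_demo() { awk -f "$tmp/getopt.awk" -f "$tmp/demo.awk" -- "$@"; }
run_demo -n 2 -x "$tmp/data.txt"
```

   Run as shown, the script reports '-x' as an unknown option and then
prints the first two lines of 'data.txt'.  Without the loop that clears
'ARGV', 'awk' would try to open '-n', '2', and '-x' as input files.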

   Using '#!' with the '-E' option may help avoid conflicts between your
program's options and 'gawk''s options, as '-E' causes 'gawk' to abandon
processing of further options (⇒Executable Scripts and ⇒Options).

   Several of the sample programs presented in ⇒Sample Programs,
use 'getopt()' to process their arguments.

   ---------- Footnotes ----------

   (1) This function was written before 'gawk' acquired the ability to
split strings into single characters using '""' as the separator.  We
have left it alone, as using 'substr()' is more portable.