gawk: Egrep Program

11.2.2 Searching for Regular Expressions in Files
-------------------------------------------------

The 'egrep' utility searches files for patterns.  It uses regular
expressions that are almost identical to those available in 'awk'
(⇒Regexp).  You invoke it as follows:

     egrep [OPTIONS] 'PATTERN' FILES ...

   The PATTERN is a regular expression.  In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards.  Normally, 'egrep' prints the
lines that matched.  If multiple file names are provided on the command
line, each output line is preceded by the name of the file and a colon.

   The options to 'egrep' are as follows:

'-c'
     Print out a count of the lines that matched the pattern, instead of
     the lines themselves.

'-s'
     Be silent.  No output is produced and the exit value indicates
     whether the pattern was matched.

'-v'
     Invert the sense of the test.  'egrep' prints the lines that do
     _not_ match the pattern and exits successfully if the pattern is
     not matched.

'-i'
     Ignore case distinctions in both the pattern and the input data.

'-l'
     Only print (list) the names of the files that matched, not the
     lines that matched.

'-e PATTERN'
     Use PATTERN as the regexp to match.  The purpose of the '-e' option
     is to allow patterns that start with a '-'.
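
   The options above can be tried out with a short shell session.  This
is only an illustrative sketch: it assumes a POSIX 'grep' is installed
and uses 'grep -E' in place of 'egrep', and the file names and contents
are invented for the demonstration.  The '-s' option is left out because
in current POSIX 'grep' that letter suppresses error messages, and '-q'
is the option that silences output.

```shell
# Scratch files for the demonstration (names and contents are made up).
dir=$(mktemp -d)
printf 'apple\nBanana\ncherry\n' > "$dir/a.txt"
printf 'banana\n-dash\n' > "$dir/b.txt"

# -c: print a count of matching lines instead of the lines themselves.
count=$(grep -E -c 'an' "$dir/a.txt")

# -i: ignore case.  With two files, each line is preceded by "FILE:".
both=$(cd "$dir" && grep -E -i 'banana' a.txt b.txt)

# -v: print the lines that do NOT match.
inverted=$(grep -E -v 'an' "$dir/a.txt")

# -l: list only the names of files that contain a match.
listed=$(cd "$dir" && grep -E -l 'dash' a.txt b.txt)

# -e: needed when the pattern itself starts with a '-'.
dash=$(grep -E -e '-dash' "$dir/b.txt")

printf '%s\n' "$count" "$both" "$inverted" "$listed" "$dash"
rm -rf "$dir"
```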

   This version uses the 'getopt()' library function (⇒Getopt Function)
and the file transition library program (⇒Filetrans Function).

   The program begins with a descriptive comment and then a 'BEGIN' rule
that processes the command-line arguments with 'getopt()'.  The '-i'
(ignore case) option is particularly easy with 'gawk'; we just use the
'IGNORECASE' predefined variable (⇒Built-in Variables):

     # egrep.awk --- simulate egrep in awk
     #
     # Options:
     #    -c    count of lines
     #    -s    silent - use exit value
     #    -v    invert test, success if no match
     #    -i    ignore case
     #    -l    print filenames only
     #    -e    argument is pattern
     #
     # Requires getopt and file transition library functions

     BEGIN {
         while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
             if (c == "c")
                 count_only++
             else if (c == "s")
                 no_print++
             else if (c == "v")
                 invert++
             else if (c == "i")
                 IGNORECASE = 1
             else if (c == "l")
                 filenames_only++
             else if (c == "e")
                 pattern = Optarg
             else
                 usage()
         }

   Next comes the code that handles the 'egrep'-specific behavior.  If
no pattern is supplied with '-e', the first nonoption on the command
line is used.  The 'awk' command-line arguments up to 'ARGV[Optind]' are
cleared, so that 'awk' won't try to process them as files.  If no files
are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the file names can precede
the matched lines in the output:

         if (pattern == "")
             pattern = ARGV[Optind++]

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""
         if (Optind >= ARGC) {
             ARGV[1] = "-"
             ARGC = 2
         } else if (ARGC - Optind > 1)
             do_filenames++

     #    if (IGNORECASE)
     #        pattern = tolower(pattern)
     }

   The last two lines are commented out, as they are not needed in
'gawk'.  They should be uncommented if you have to use another version
of 'awk'.
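
   The effect of clearing 'ARGV' elements can be seen in isolation.  The
following is a minimal sketch, not part of 'egrep.awk': it takes the
pattern from 'ARGV[1]' directly instead of calling 'getopt()', and the
file name 'data.txt' and its contents are invented.  It relies on the
POSIX rule that 'awk' skips empty 'ARGV' elements and treats '-' as the
standard input.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/data.txt"

# ARGV[1] holds the pattern.  Setting it to "" keeps awk from trying
# to open a file named "tw"; only data.txt is read.
from_file=$(awk 'BEGIN { pattern = ARGV[1]; ARGV[1] = "" }
                 $0 ~ pattern { print }' 'tw' "$dir/data.txt")

# With no file operands, replace the pattern slot with "-" so that
# awk reads the standard input.
from_stdin=$(printf 'one\ntwo\n' |
    awk 'BEGIN { pattern = ARGV[1]; ARGV[1] = "-"; ARGC = 2 }
         $0 ~ pattern { print }' 'on')

printf '%s\n' "$from_file" "$from_stdin"
rm -rf "$dir"
```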

   The next set of lines should be uncommented if you are not using
'gawk'.  This rule translates all the characters in the input line into
lowercase if the '-i' option is specified.(1)  The rule is commented out
as it is not necessary with 'gawk':

     #{
     #    if (IGNORECASE)
     #        $0 = tolower($0)
     #}
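
   One possible way to get case-insensitive matching in other versions
of 'awk' without overwriting '$0' is to compare lowercase copies of the
line and the pattern.  This is only a sketch of an alternative, not what
'egrep.awk' does; the variable names 'icase', 'line', and 'pat' are
invented for the example.  Note that lowercasing a pattern can change
the meaning of any uppercase escape sequences it contains.

```shell
out=$(printf 'Hello World\ngoodbye\n' |
    awk -v pattern='hello' -v icase=1 '
    {
        line = $0
        pat = pattern
        if (icase) {            # fold copies, leave $0 untouched
            line = tolower(line)
            pat = tolower(pat)
        }
        if (line ~ pat)
            print               # prints the original record
    }')

printf '%s\n' "$out"
```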

   The 'beginfile()' function is called by the rule in 'ftrans.awk' when
each new file is processed.  In this case, it is very simple; all it
does is initialize a variable 'fcount' to zero.  'fcount' tracks how
many lines in the current file matched the pattern.  Naming the
parameter 'junk' shows we know that 'beginfile()' is called with a
parameter, but that we're not interested in its value:

     function beginfile(junk)
     {
         fcount = 0
     }

   The 'endfile()' function is called after each file has been
processed.  It affects the output only when the user wants a count of
the number of lines that matched.  'no_print' is true only if the exit
status is desired.  'count_only' is true if line counts are desired.
'egrep' therefore only prints line counts if printing and counting are
enabled.  The output format must be adjusted depending upon the number
of files to process.  Finally, 'fcount' is added to 'total', so that we
know the total number of lines that matched the pattern:

     function endfile(file)
     {
         if (! no_print && count_only) {
             if (do_filenames)
                 print file ":" fcount
             else
                 print fcount
         }

         total += fcount
     }

   The 'BEGINFILE' and 'ENDFILE' special patterns (⇒BEGINFILE/ENDFILE)
could be used, but then the program would be 'gawk'-specific.
Additionally, this example was written before 'gawk' acquired
'BEGINFILE' and 'ENDFILE'.
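
   To see how 'beginfile()' and 'endfile()' cooperate to produce
per-file counts, here is a self-contained sketch.  The 'FNR == 1' rule
stands in for 'ftrans.awk' and is an approximation written for this
example, not a copy of the library file; the pattern '/foo/', the
simplified 'endfile()', and the sample files are likewise invented.
A rule of this form never fires for an empty file.

```shell
dir=$(mktemp -d)
printf 'foo\nbar\nfoo\n' > "$dir/a.txt"
printf 'bar\nfoo\n' > "$dir/b.txt"

out=$(cd "$dir" && awk '
function beginfile(junk) { fcount = 0 }
function endfile(file)   { print file ":" fcount; total += fcount }

# Stand-in for the file transition rule: the first record of each
# file closes out the previous file and starts the new one.
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

# Count matching lines; the match expression is 1 or 0.
{ fcount += ($0 ~ /foo/) }

# The last file has no following file, so finish it here.
END { endfile(_filename_); print "total:" total }
' a.txt b.txt)

printf '%s\n' "$out"
rm -rf "$dir"
```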

   The following rule does most of the work of matching lines.  The
variable 'matches' is true if the line matched the pattern.  If the user
wants lines that did not match, the sense of 'matches' is inverted using
the '!' operator.  'fcount' is incremented with the value of 'matches',
which is either one or zero, depending upon a successful or unsuccessful
match.  If the line does not match, the 'next' statement just moves on
to the next record.

   A number of additional tests are made, but they are only done if we
are not counting lines.  First, if the user only wants the exit status
('no_print' is true), then it is enough to know that _one_ line in this
file matched, and we can skip on to the next file with 'nextfile'.
Similarly, if we are only printing file names, we can print the file
name, and then skip to the next file with 'nextfile'.  Finally, each
line is printed, with a leading file name and colon if necessary:

     {
         matches = ($0 ~ pattern)
         if (invert)
             matches = ! matches

         fcount += matches    # 1 or 0

         if (! matches)
             next

         if (! count_only) {
             if (no_print)
                 nextfile

             if (filenames_only) {
                 print FILENAME
                 nextfile
             }

             if (do_filenames)
                 print FILENAME ":" $0
             else
                 print
         }
     }
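
   The way a match result doubles as a counter increment, and how '!'
flips it, can be checked with a small stand-alone sketch.  The pattern
'^b', the input words, and the counter name 'n' are made up for the
illustration, and 'invert' is set with '-v' instead of by option
parsing.

```shell
out=$(printf 'alpha\nbeta\ngamma\n' |
    awk -v pattern='^b' -v invert=1 '
    {
        matches = ($0 ~ pattern)   # 1 if the line matches, else 0
        if (invert)
            matches = ! matches    # 1 becomes 0, 0 becomes 1
        n += matches
        print $0 ": " matches
    }
    END { print "count: " n }')

printf '%s\n' "$out"
```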

   The 'END' rule takes care of producing the correct exit status.  If
there are no matches, the exit status is one; otherwise, it is zero:

     END {
         exit (total == 0)
     }
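
   A shell script can then branch on that status.  The sketch below
reproduces only the 'END' rule with a simplified counting rule in front
of it; the patterns and the shell variable names 'found' and 'missing'
are invented for the example.

```shell
# Status 0: at least one line matched.
found=0
printf 'foo\n' |
    awk '{ total += ($0 ~ /foo/) } END { exit (total == 0) }' || found=$?

# Status 1: nothing matched, so (total == 0) evaluates to 1.
missing=0
printf 'foo\n' |
    awk '{ total += ($0 ~ /bar/) } END { exit (total == 0) }' || missing=$?

printf 'found=%s missing=%s\n' "$found" "$missing"
```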

   The 'usage()' function prints a usage message in case of invalid
options, and then exits:

     function usage()
     {
         print("Usage: egrep [-csvil] [-e pat] [files ...]") > "/dev/stderr"
         print("\n\tegrep [-csvil] pat [files ...]") > "/dev/stderr"
         exit 1
     }

   ---------- Footnotes ----------

   (1) It also introduces a subtle bug; if a match happens, we output
the translated line, not the original.