gawk: Two-way I/O

1 
1 12.3 Two-Way Communications with Another Process
1 ================================================
1 
1 It is often useful to be able to send data to a separate program for
1 processing and then read the result.  This can always be done with
1 temporary files:
1 
1      # Write the data for processing
1      tempfile = ("mydata." PROCINFO["pid"])
1      while (NOT DONE WITH DATA)
1          print DATA | ("subprogram > " tempfile)
1      close("subprogram > " tempfile)
1 
1      # Read the results, remove tempfile when done
1      while ((getline newdata < tempfile) > 0)
1          PROCESS newdata APPROPRIATELY
1      close(tempfile)
1      system("rm " tempfile)
1 
1 This works, but not elegantly.  Among other things, it requires that the
1 program be run in a directory that cannot be shared among users; for
1 example, '/tmp' will not do, as another user might happen to be using a
1 temporary file with the same name.(1)
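
   One way to reduce the risk of a name clash is to let the system pick
the file name.  The following is only a sketch: it assumes a 'mktemp'
utility that creates a unique file when run with no arguments (as the
GNU Coreutils version does), and 'make_tempfile()' is a hypothetical
helper, not part of 'gawk':

```awk
# Hypothetical helper: ask 'mktemp' (assumed to be installed) for a
# fresh, uniquely named file and return its name.
function make_tempfile(    name)
{
    "mktemp" | getline name
    close("mktemp")
    return name
}

BEGIN {
    tempfile = make_tempfile()

    # Write the data for processing; 'sort' stands in for the subprogram
    sorter = "sort > " tempfile
    n = split("pear apple fig", fruit, " ")
    for (i = 1; i <= n; i++)
        print fruit[i] | sorter
    close(sorter)

    # Read the results, remove tempfile when done
    while ((getline line < tempfile) > 0)
        print "sorted:", line
    close(tempfile)
    system("rm -f " tempfile)
}
```

Even with a safe file name, the program still has to write everything,
close the pipe, and only then read the results back.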
1 
1    However, with 'gawk', it is possible to open a _two-way_ pipe to
1 another process.  The second process is termed a "coprocess", as it runs
1 in parallel with 'gawk'.  The two-way connection is created using the
1 '|&' operator (borrowed from the Korn shell, 'ksh'):(2)
1 
1      do {
1          print DATA |& "subprogram"
1          "subprogram" |& getline results
1      } while (DATA LEFT TO PROCESS)
1      close("subprogram")
1 
1    The first time an I/O operation is executed using the '|&' operator,
1 'gawk' creates a two-way pipeline to a child process that runs the other
1 program.  Output created with 'print' or 'printf' is written to the
1 program's standard input, and output from the program's standard output
1 can be read by the 'gawk' program using 'getline'.  As is the case with
1 processes started by '|', the subprogram can be any program, or pipeline
1 of programs, that can be started by the shell.
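
   The exchange just described can be tried with a small runnable
sketch.  It uses 'cat' as a stand-in coprocess, on the assumption that
'cat' writes out each chunk of input as soon as it has read it (typical
behavior, but not guaranteed everywhere).  The 'ask()' function is a
hypothetical helper introduced only for this illustration:

```awk
# Send one line to the coprocess and return the single line it sends back.
# Assumes the coprocess answers every input line with exactly one line.
function ask(cmd, request,    reply)
{
    print request |& cmd              # goes to the coprocess's stdin
    if ((cmd |& getline reply) <= 0)  # comes from the coprocess's stdout
        return ""
    return reply
}

BEGIN {
    cmd = "cat"     # stand-in coprocess that echoes its input
    n = split("alpha beta gamma", words, " ")
    for (i = 1; i <= n; i++)
        print "reply:", ask(cmd, words[i])
    close(cmd)
}
```

The coprocess is started by the first 'print ... |& cmd' and stays
running across calls until 'close(cmd)' is executed.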
1 
1    There are some cautionary items to be aware of:
1 
1    * As the code inside 'gawk' currently stands, the coprocess's
1      standard error goes to the same place that the parent 'gawk''s
1      standard error goes.  It is not possible to read the child's
1      standard error separately.
1 
1    * I/O buffering may be a problem.  'gawk' automatically flushes all
1      output down the pipe to the coprocess.  However, if the coprocess
1      does not flush its output, 'gawk' may hang when doing a 'getline'
1      in order to read the coprocess's results.  This could lead to a
1      situation known as "deadlock", where each process is waiting for
1      the other one to do something.
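
   Regarding the first item: because the coprocess is started by the
shell, one possible approach is to merge the child's standard error into
its standard output with an ordinary shell redirection.  This does not
let you read the two streams separately; it only makes the error text
arrive through 'getline'.  In this sketch, 'merge_stderr()' is a
hypothetical helper and the child command is made up for illustration:

```awk
# Hypothetical helper: wrap a shell command so that its standard error
# is sent to the same place as its standard output.
function merge_stderr(command)
{
    return "{ " command "; } 2>&1"
}

BEGIN {
    # Made-up child: one line on stdout, one line on stderr
    cmd = merge_stderr("echo result; echo warning >&2")

    while ((cmd |& getline line) > 0)
        print "from coprocess:", line
    close(cmd)
}
```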
1 
1    It is possible to close just one end of the two-way pipe to a
1 coprocess, by supplying a second argument to the 'close()' function of
1 either '"to"' or '"from"' (⇒Close Files And Pipes).  These
1 strings tell 'gawk' to close the end of the pipe that sends data to the
1 coprocess or the end that reads from it, respectively.
1 
1    This is particularly necessary in order to use the system 'sort'
1 utility as part of a coprocess; 'sort' must read _all_ of its input data
1 before it can produce any output.  The 'sort' program does not receive
1 an end-of-file indication until 'gawk' closes the write end of the pipe.
1 
1    When you have finished writing data to the 'sort' utility, you can
1 close the '"to"' end of the pipe, and then start reading sorted data via
1 'getline'.  For example:
1 
1      BEGIN {
1          command = "LC_ALL=C sort"
1          n = split("abcdefghijklmnopqrstuvwxyz", a, "")
1 
1          for (i = n; i > 0; i--)
1              print a[i] |& command
1          close(command, "to")
1 
1          while ((command |& getline line) > 0)
1              print "got", line
1          close(command)
1      }
1 
1    This program writes the letters of the alphabet in reverse order, one
1 per line, down the two-way pipe to 'sort'.  It then closes the write end
1 of the pipe, so that 'sort' receives an end-of-file indication.  This
1 causes 'sort' to sort the data and write the sorted data back to the
1 'gawk' program.  Once all of the data has been read, 'gawk' terminates
1 the coprocess and exits.
1 
1    As a side note, the assignment 'LC_ALL=C' in the 'sort' command
1 ensures traditional Unix (ASCII) sorting from 'sort'.  This is not
1 strictly necessary here, but it's good to know how to do this.
1 
1    Be careful when closing the '"from"' end of a two-way pipe; in this
1 case 'gawk' waits for the child process to exit, which may cause your
1 program to hang.  (Thus, this particular feature is of much less use in
1 practice than being able to close the '"to"' end.)
1 
1      CAUTION: Normally, it is a fatal error to write to the '"to"' end
1      of a two-way pipe which has been closed, and it is also a fatal
1      error to read from the '"from"' end of a two-way pipe that has been
1      closed.
1 
1      You may set 'PROCINFO["COMMAND", "NONFATAL"]', where COMMAND is the
1      coprocess's command string, to make such operations become
1      nonfatal.  If you do so, you then need to check
1      'ERRNO' after each 'print', 'printf', or 'getline'.  ⇒Nonfatal,
1      for more information.
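
   As a rough sketch of the nonfatal variant, the following program
deliberately writes to an already-closed '"to"' end and checks 'ERRNO'
afterward.  It assumes a 'gawk' recent enough to support the
'"NONFATAL"' element, and again uses 'cat' as a stand-in coprocess;
'try_write()' is a hypothetical helper:

```awk
# Hypothetical helper: attempt a write to the coprocess and report
# whether it succeeded (1) or failed (0), judging by ERRNO.
function try_write(cmd, text)
{
    ERRNO = ""
    print text |& cmd
    return (ERRNO == "")
}

BEGIN {
    cmd = "cat"                      # stand-in coprocess
    PROCINFO[cmd, "NONFATAL"] = 1    # make I/O errors on cmd nonfatal

    print "first" |& cmd
    close(cmd, "to")                 # write end is now closed

    if (! try_write(cmd, "second"))
        print "write failed:", ERRNO

    # The read end is still open, so earlier output can be collected
    while ((cmd |& getline line) > 0)
        print "got", line
    close(cmd)
}
```

Without the 'PROCINFO' assignment, the second write would instead stop
the program with a fatal error.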
1 
1    You may also use pseudo-ttys (ptys) for two-way communication instead
1 of pipes, if your system supports them.  This is done on a per-command
1 basis, by setting a special element in the 'PROCINFO' array
1 (⇒Auto-set), like so:
1 
1      command = "sort -nr"           # command, save in convenience variable
1      PROCINFO[command, "pty"] = 1   # update PROCINFO
1      print ... |& command           # start two-way pipe
1      ...
1 
1 If your system does not have ptys, or if all the system's ptys are in
1 use, 'gawk' automatically falls back to using regular pipes.
1 
1    Using ptys usually avoids the buffer deadlock issues described
1 earlier, at some loss in performance.  This is because the tty driver
1 buffers and sends data line-by-line.  On systems that have the 'stdbuf'
1 utility (part of the GNU Coreutils package
1 (https://www.gnu.org/software/coreutils/coreutils.html)), you can use
1 that program instead of ptys.
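
   A minimal sketch of the 'stdbuf' approach follows.  It assumes that
both 'stdbuf' and 'tr' come from GNU Coreutils and that the coprocess
writes its output through the C standard I/O library, which is what
'stdbuf' adjusts; the choice of 'tr' is only for illustration:

```awk
BEGIN {
    # '-oL' asks stdbuf to make the child's standard output line buffered,
    # so each reply is written as soon as the line is complete.
    cmd = "stdbuf -oL tr a-z A-Z"

    n = split("one two three", w, " ")
    for (i = 1; i <= n; i++) {
        print w[i] |& cmd
        if ((cmd |& getline reply) > 0)
            print w[i], "->", reply
    }
    close(cmd)
}
```

Without 'stdbuf', a coprocess that block-buffers its output would
likely leave the first 'getline' waiting indefinitely.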
1 
1    Note also that ptys are not fully transparent.  Certain binary
1 control codes, such as 'Ctrl-d' for end-of-file, are interpreted by the
1 tty driver and not passed through.
1 
1      CAUTION: Finally, coprocesses open up the possibility of "deadlock"
1      between 'gawk' and the program running in the coprocess.  This can
1      occur if you send "too much" data to the coprocess before reading
1      any back; each process is blocked writing data with no one available
1      to read what it has already written.  There is no workaround for
1      deadlock; careful programming and knowledge of the behavior of the
1      coprocess are required.
1 
1    ---------- Footnotes ----------
1 
1    (1) Michael Brennan suggests the use of 'rand()' to generate unique
1 file names.  This is a valid point; nevertheless, temporary files remain
1 more difficult to use than two-way pipes.
1 
1    (2) This is very different from the same operator in the C shell and
1 in Bash, where '|&' pipes both the standard output and the standard
1 error of one command into the next.
1