gawk: Two-way I/O

12.3 Two-Way Communications with Another Process
================================================

It is often useful to be able to send data to a separate program for
processing and then read the result. This can always be done with
temporary files:

     # Write the data for processing
     tempfile = ("mydata." PROCINFO["pid"])
     while (NOT DONE WITH DATA)
         print DATA | ("subprogram > " tempfile)
     close("subprogram > " tempfile)

     # Read the results, remove tempfile when done
     while ((getline newdata < tempfile) > 0)
         PROCESS newdata APPROPRIATELY
     close(tempfile)
     system("rm " tempfile)

This works, but not elegantly. Among other things, it requires that the
program be run in a directory that cannot be shared among users; for
example, '/tmp' will not do, as another user might happen to be using a
temporary file with the same name.(1)
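
For comparison with what follows, here is one runnable rendering of the
temporary-file approach, written as a shell function that invokes
'gawk'. It is only a sketch: the function name 'tempfile_demo', the use
of 'sort' as the subprogram, and the reliance on the 'mktemp' utility to
pick a unique file name are all assumptions made for illustration, not
part of the original example.

```sh
# Hypothetical helper: sort two words by way of a temporary file.
tempfile_demo() {
    gawk 'BEGIN {
        # Assumption: mktemp is installed and prints a fresh file name
        if (("mktemp" | getline tempfile) <= 0)
            exit 1
        close("mktemp")

        # Write the data for processing
        cmd = "sort > " tempfile
        print "pear"  | cmd
        print "apple" | cmd
        close(cmd)                  # wait until sort has written the file

        # Read the results, remove tempfile when done
        while ((getline line < tempfile) > 0)
            print "sorted:", line
        close(tempfile)
        system("rm -f " tempfile)
    }'
}
```

Calling 'tempfile_demo' prints the two words in sorted order, each
prefixed with 'sorted:'. Even in this small form, the program has to
manage the file's name, lifetime, and removal itself.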

However, with 'gawk', it is possible to open a _two-way_ pipe to
another process. The second process is termed a "coprocess", as it runs
in parallel with 'gawk'. The two-way connection is created using the
'|&' operator (borrowed from the Korn shell, 'ksh'):(2)

     do {
         print DATA |& "subprogram"
         "subprogram" |& getline results
     } while (DATA LEFT TO PROCESS)
     close("subprogram")

The first time an I/O operation is executed using the '|&' operator,
'gawk' creates a two-way pipeline to a child process that runs the other
program. Output created with 'print' or 'printf' is written to the
program's standard input, and output from the program's standard output
can be read by the 'gawk' program using 'getline'. As is the case with
processes started by '|', the subprogram can be any program, or pipeline
of programs, that can be started by the shell.
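
As a concrete illustration of the round trip just described, the
following sketch sends each input line to a coprocess and reads one line
back before continuing. The shell function name 'echo_roundtrip' and the
choice of 'cat' as the coprocess are assumptions for illustration; 'cat'
is used because common implementations copy input to output without
holding it in a buffer, so a reply should be available for every line
written.

```sh
# Hypothetical helper: echo each input line through a 'cat' coprocess.
echo_roundtrip() {
    gawk '
    {
        print $0 |& "cat"                  # goes to the standard input of cat
        if (("cat" |& getline reply) > 0)  # comes from its standard output
            print "echoed:", reply
    }
    END { close("cat") }                   # shut down both ends of the pipe
    '
}

printf 'alpha\nbeta\n' | echo_roundtrip
```

The same command string must be used on both sides of '|&', since 'gawk'
identifies the coprocess by that string.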

There are some cautionary items to be aware of:

   * As the code inside 'gawk' currently stands, the coprocess's
     standard error goes to the same place that the parent 'gawk''s
     standard error goes. It is not possible to read the child's
     standard error separately.

   * I/O buffering may be a problem. 'gawk' automatically flushes all
     output down the pipe to the coprocess. However, if the coprocess
     does not flush its output, 'gawk' may hang when doing a 'getline'
     in order to read the coprocess's results. This could lead to a
     situation known as "deadlock", where each process is waiting for
     the other one to do something.

It is possible to close just one end of the two-way pipe to a
coprocess, by supplying a second argument to the 'close()' function of
either '"to"' or '"from"' (⇒ Close Files And Pipes). These
strings tell 'gawk' to close the end of the pipe that sends data to the
coprocess or the end that reads from it, respectively.

This is particularly necessary in order to use the system 'sort'
utility as part of a coprocess; 'sort' must read _all_ of its input data
before it can produce any output. The 'sort' program does not receive
an end-of-file indication until 'gawk' closes the write end of the pipe.

When you have finished writing data to the 'sort' utility, you can
close the '"to"' end of the pipe, and then start reading sorted data via
'getline'. For example:

     BEGIN {
         command = "LC_ALL=C sort"
         n = split("abcdefghijklmnopqrstuvwxyz", a, "")

         for (i = n; i > 0; i--)
             print a[i] |& command
         close(command, "to")

         while ((command |& getline line) > 0)
             print "got", line
         close(command)
     }

This program writes the letters of the alphabet in reverse order, one
per line, down the two-way pipe to 'sort'. It then closes the write end
of the pipe, so that 'sort' receives an end-of-file indication. This
causes 'sort' to sort the data and write the sorted data back to the
'gawk' program. Once all of the data has been read, 'gawk' terminates
the coprocess and exits.

As a side note, the assignment 'LC_ALL=C' in the 'sort' command
ensures traditional Unix (ASCII) sorting from 'sort'. This is not
strictly necessary here, but it's good to know how to do this.

Be careful when closing the '"from"' end of a two-way pipe; in this
case 'gawk' waits for the child process to exit, which may cause your
program to hang. (Thus, this particular feature is of much less use in
practice than being able to close the '"to"' end.)

     CAUTION: Normally, it is a fatal error to write to the '"to"' end
     of a two-way pipe that has been closed, and it is also a fatal
     error to read from the '"from"' end of a two-way pipe that has been
     closed.

     You may set 'PROCINFO["COMMAND", "NONFATAL"]', where COMMAND is the
     command string used with '|&', to make such operations become
     nonfatal. If you do so, you then need to check 'ERRNO' after each
     'print', 'printf', or 'getline'. See ⇒ Nonfatal for more
     information.
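
The following sketch shows how the nonfatal setting might be used. It
assumes a 'gawk' version that supports nonfatal I/O; the function name
'nonfatal_demo' and the use of 'cat' as the coprocess are hypothetical
choices for illustration. 'ERRNO' is cleared before the suspect 'print'
so that a non-empty value afterward can be attributed to that statement.

```sh
# Hypothetical demo: write to an already-closed "to" end without dying.
nonfatal_demo() {
    gawk 'BEGIN {
        command = "cat"
        PROCINFO[command, "NONFATAL"] = 1   # errors on this command are nonfatal

        print "first" |& command
        close(command, "to")                # cat now sees end-of-file

        ERRNO = ""                          # clear any stale error text
        print "second" |& command           # the write end is already closed
        if (ERRNO != "")
            print "write failed: " ERRNO > "/dev/stderr"

        while ((command |& getline line) > 0)
            print "got", line
        close(command)
    }'
}
```

With the setting in place, the program should continue past the failed
'print' and still read back the line that was written before the '"to"'
end was closed. Without it, the second 'print' would be a fatal error.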

You may also use pseudo-ttys (ptys) for two-way communication instead
of pipes, if your system supports them. This is done on a per-command
basis, by setting a special element in the 'PROCINFO' array
(⇒ Auto-set), like so:

     command = "sort -nr"          # command, save in convenience variable
     PROCINFO[command, "pty"] = 1  # update PROCINFO
     print ... |& command          # start two-way pipe
     ...

If your system does not have ptys, or if all the system's ptys are in
use, 'gawk' automatically falls back to using regular pipes.

Using ptys usually avoids the buffer deadlock issues described
earlier, at some loss in performance. This is because the tty driver
buffers and sends data line by line. On systems with the 'stdbuf'
utility (part of the GNU Coreutils package
(https://www.gnu.org/software/coreutils/coreutils.html)), you can use
that program instead of ptys.
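
As a rough sketch of the 'stdbuf' alternative, the coprocess command can
be prefixed with 'stdbuf -oL' so that its standard output is line
buffered. The function name 'upcase_lines' and the choice of 'tr' as the
coprocess are assumptions for illustration, and the example presumes GNU
Coreutils versions of both 'stdbuf' and 'tr'.

```sh
# Hypothetical helper: upcase lines one at a time via a 'tr' coprocess.
upcase_lines() {
    gawk '
    BEGIN { cmd = "stdbuf -oL tr a-z A-Z" }  # -oL: line-buffer the output of tr
    {
        print $0 |& cmd                      # send one line
        if ((cmd |& getline reply) > 0)      # read its translation back
            print reply
    }
    END { close(cmd) }
    '
}

printf 'one\ntwo\n' | upcase_lines
```

Without 'stdbuf', 'tr' would typically hold its output in a block buffer
when writing to a pipe, and the first 'getline' could wait indefinitely.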

Note also that ptys are not fully transparent. Certain binary
control codes, such as 'Ctrl-d' for end-of-file, are interpreted by the
tty driver and not passed through.

     CAUTION: Finally, coprocesses open up the possibility of "deadlock"
     between 'gawk' and the program running in the coprocess. This can
     occur if you send "too much" data to the coprocess before reading
     any back; each process is blocked writing data with no one
     available to read what it has already written. There is no
     workaround for deadlock; careful programming and knowledge of the
     behavior of the coprocess are required.

---------- Footnotes ----------

(1) Michael Brennan suggests the use of 'rand()' to generate unique
file names. This is a valid point; nevertheless, temporary files remain
more difficult to use than two-way pipes.

(2) This is very different from the same operator in the C shell and
in Bash.