gawk: Filetrans Function

10.3.1 Noting Data File Boundaries
----------------------------------

The 'BEGIN' and 'END' rules are each executed exactly once, at the
beginning and end of your 'awk' program, respectively (⇒ BEGIN/END).
We (the 'gawk' authors) once had a user who mistakenly
thought that the 'BEGIN' rules were executed at the beginning of each
data file and the 'END' rules were executed at the end of each data
file.

When informed that this was not the case, the user requested that we
add new special patterns to 'gawk', named 'BEGIN_FILE' and 'END_FILE',
that would have the desired behavior. He even supplied us the code to
do so.

Adding these special patterns to 'gawk' wasn't necessary; the job can
be done cleanly in 'awk' itself, as illustrated by the following library
program. It arranges to call two user-supplied functions, 'beginfile()'
and 'endfile()', at the beginning and end of each data file. Besides
solving the problem in only nine(!) lines of code, it does so
_portably_; this works with any implementation of 'awk':

     # transfile.awk
     #
     # Give the user a hook for filename transitions
     #
     # The user must supply functions beginfile() and endfile()
     # that each take the name of the file being started or
     # finished, respectively.

     FILENAME != _oldfilename {
         if (_oldfilename != "")
             endfile(_oldfilename)
         _oldfilename = FILENAME
         beginfile(FILENAME)
     }

     END { endfile(FILENAME) }
1
This file must be loaded before the user's "main" program, so that
the rule it supplies is executed first.

This rule relies on 'awk''s 'FILENAME' variable, which automatically
changes for each new data file. The current file name is saved in a
private variable, '_oldfilename'. If 'FILENAME' does not equal
'_oldfilename', then a new data file is being processed and it is
necessary to call 'endfile()' for the old file. Because 'endfile()'
should only be called if a file has been processed, the program first
checks to make sure that '_oldfilename' is not the null string. The
program then assigns the current file name to '_oldfilename' and calls
'beginfile()' for the file. Because, like all 'awk' variables,
'_oldfilename' is initialized to the null string, this rule executes
correctly even for the first data file.

The program also supplies an 'END' rule to do the final processing
for the last file. Because this 'END' rule comes before any 'END' rules
supplied in the "main" program, 'endfile()' is called first. Once
again, the value of multiple 'BEGIN' and 'END' rules should be clear.
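As an illustration of how a "main" program might plug into this rule,
here is a minimal sketch written as a shell script so that it can be run
as is. The 'beginfile()' and 'endfile()' bodies, the 'count' variable,
and the file names 'main.awk', 'one.txt', and 'two.txt' are hypothetical
and chosen only for this example; the library rule is the one from
'transfile.awk'.

```shell
#!/bin/sh
# Sketch: count the records in each data file using the transfile.awk hooks.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n'    > "$tmp/two.txt"

# The library rule, as given in transfile.awk.
cat > "$tmp/transfile.awk" <<'EOF'
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END { endfile(FILENAME) }
EOF

# Hypothetical "main" program that supplies the two hook functions.
cat > "$tmp/main.awk" <<'EOF'
function beginfile(name) { count = 0 }                        # reset per file
function endfile(name)   { printf "%s: %d\n", name, count }   # report per file
{ count++ }                                                   # runs for every record
EOF

# The library is loaded first so that its rule runs before the main rule.
out=$(cd "$tmp" && awk -f transfile.awk -f main.awk one.txt two.txt)
printf '%s\n' "$out"
rm -rf "$tmp"
```

With these two input files the script prints 'one.txt: 2' followed by
'two.txt: 1'. The second line comes from the library's 'END' rule, not
from a file transition.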
1
If the same data file occurs twice in a row on the command line, then
'FILENAME' does not change between the two passes, so 'endfile()' and
'beginfile()' are not executed at the end of the first pass and at the
beginning of the second pass. The following version solves the problem
by testing 'FNR', which is reset to one at the start of every data file:

     # ftrans.awk --- handle datafile transitions
     #
     # user supplies beginfile() and endfile() functions

     FNR == 1 {
         if (_filename_ != "")
             endfile(_filename_)
         _filename_ = FILENAME
         beginfile(FILENAME)
     }

     END { endfile(_filename_) }

⇒ Wc Program shows how this library function can be used and
how it simplifies writing the main program.
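The difference between the two versions can be checked with a short
script. The sketch below runs each library file over the same data file
named twice and counts how often the hooks fire. The 'hooks.awk' file,
the 'starts' and 'ends' counters, and 'data.txt' are hypothetical names
used only for this demonstration.

```shell
#!/bin/sh
# Sketch: compare transfile.awk and ftrans.awk when a file is named twice.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/data.txt"

cat > "$tmp/transfile.awk" <<'EOF'
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END { endfile(FILENAME) }
EOF

cat > "$tmp/ftrans.awk" <<'EOF'
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END { endfile(_filename_) }
EOF

# Hypothetical hooks that only count how many times they are called.
cat > "$tmp/hooks.awk" <<'EOF'
function beginfile(name) { starts++ }
function endfile(name)   { ends++ }
END { printf "%d begin, %d end\n", starts, ends }
EOF

old=$(cd "$tmp" && awk -f transfile.awk -f hooks.awk data.txt data.txt)
new=$(cd "$tmp" && awk -f ftrans.awk    -f hooks.awk data.txt data.txt)
printf 'transfile.awk: %s\n' "$old"
printf 'ftrans.awk:    %s\n' "$new"
rm -rf "$tmp"
```

The first version reports one call to each hook, because it never
notices the second pass; the 'FNR == 1' version reports two.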
1
So Why Does 'gawk' Have 'BEGINFILE' and 'ENDFILE'?

You are probably wondering, if 'beginfile()' and 'endfile()'
functions can do the job, why does 'gawk' have 'BEGINFILE' and 'ENDFILE'
patterns?

Good question. Normally, if 'awk' cannot open a file, this causes an
immediate fatal error. In this case, there is no way for a user-defined
function to deal with the problem, as the mechanism for calling it
relies on the file being open and at the first record. Thus, the main
reason for 'BEGINFILE' is to give you a "hook" to catch files that
cannot be processed. 'ENDFILE' exists for symmetry, and because it
provides an easy way to do per-file cleanup processing. For more
information, refer to ⇒ BEGINFILE/ENDFILE.
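To make the "hook" idea concrete, here is a small sketch of a
'BEGINFILE' rule that skips a file that cannot be opened instead of
stopping with a fatal error. It assumes that 'gawk' is installed under
that name and relies on 'ERRNO' being nonempty inside 'BEGINFILE' when
the open failed. The file names 'missing.txt' and 'good.txt' are
hypothetical, and the message goes to standard output only to keep the
example simple.

```shell
#!/bin/sh
# Sketch: skip unreadable files with BEGINFILE (gawk only).
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/good.txt"    # missing.txt is deliberately not created

out=""
if command -v gawk >/dev/null 2>&1; then
    out=$(cd "$tmp" && gawk '
        BEGINFILE {
            # ERRNO is nonempty here if the file could not be opened.
            if (ERRNO != "") {
                print "skipping " FILENAME
                nextfile
            }
        }
        { print FILENAME ": " $0 }
    ' missing.txt good.txt)
    printf '%s\n' "$out"
else
    echo "gawk not found; this example needs gawk" >&2
fi
rm -rf "$tmp"
```

Without the 'nextfile' in the 'BEGINFILE' rule, 'gawk' would issue its
usual fatal error for the missing file.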