gawk: Input Parsers
1
1 16.4.5.4 Customized Input Parsers
1 .................................
1
1 By default, 'gawk' reads text files as its input. It uses the value of
1 'RS' to find the end of the record, and then uses 'FS' (or 'FIELDWIDTHS'
1 or 'FPAT') to split it into fields (⇒Reading Files).
1 Additionally, it sets the value of 'RT' (⇒Built-in Variables).
1
1 If you want, you can provide your own custom input parser. An input
1 parser's job is to return a record to the 'gawk' record-processing code,
1 along with indicators for the value and length of the data to be used
1 for 'RT', if any.
1
1 To provide an input parser, you must first provide two functions
1 (where XXX is a prefix name for your extension):
1
1 'awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);'
1 This function examines the information available in 'iobuf' (which
1 we discuss shortly). Based on the information there, it decides if
1 the input parser should be used for this file. If so, it should
1 return true. Otherwise, it should return false. It should not
1 change any state (variable values, etc.) within 'gawk'.
1
1 'awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);'
1 When 'gawk' decides to hand control of the file over to the input
1 parser, it calls this function. This function in turn must fill in
1 certain fields in the 'awk_input_buf_t' structure and ensure that
1 certain conditions are true. It should then return true. If an
1 error of some kind occurs, it should not fill in any fields and
1 should return false; then 'gawk' will not use the input parser.
1 The details are presented shortly.
1
1 Your extension should package these functions inside an
1 'awk_input_parser_t', which looks like this:
1
1 typedef struct awk_input_parser {
1     const char *name;   /* name of parser */
1     awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
1     awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
1     awk_const struct awk_input_parser *awk_const next;   /* for gawk */
1 } awk_input_parser_t;
1
1 The fields are:
1
1 'const char *name;'
1 The name of the input parser. This is a regular C string.
1
1 'awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);'
1 A pointer to your 'XXX_can_take_file()' function.
1
1 'awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);'
1 A pointer to your 'XXX_take_control_of()' function.
1
1 'awk_const struct awk_input_parser *awk_const next;'
1 This is for use by 'gawk'; therefore it is marked 'awk_const' so
1 that the extension cannot modify it.
1
1 The steps are as follows:
1
1 1. Create a 'static awk_input_parser_t' variable and initialize it
1 appropriately.
1
1 2. When your extension is loaded, register your input parser with
1 'gawk' using the 'register_input_parser()' API function (described
1 later in this section).
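
The two steps above can be sketched roughly as follows. This is a minimal illustration, not code from 'gawk' itself: the prefix 'demo' and both callback bodies are hypothetical placeholders, and the type definitions at the top are local stand-ins (copied from the declarations shown earlier) so that the fragment compiles without 'gawkapi.h'. A real extension would include 'gawkapi.h' instead.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so this sketch compiles alone; a real extension
 * obtains these from gawkapi.h. */
#define awk_const const
typedef enum awk_bool { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_input awk_input_buf_t;   /* only used via pointers here */

typedef struct awk_input_parser {
    const char *name;
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;
} awk_input_parser_t;

/* Hypothetical callbacks for a parser whose prefix is "demo". */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    assert(iobuf != NULL);
    return awk_false;       /* placeholder: never claims a file */
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    assert(iobuf != NULL);
    return awk_false;       /* placeholder: refuses control */
}

/* Step 1: a static, fully initialized parser descriptor. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL                    /* 'next' is reserved for gawk */
};

/* Step 2 (not compiled in this sketch): from the extension's
 * initialization code, call
 *     register_input_parser(&demo_parser);
 */
```

The descriptor is 'static' because 'gawk' keeps a pointer to it after registration, so it must outlive the function that registers it.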
1
1 An 'awk_input_buf_t' looks like this:
1
1 typedef struct awk_input {
1     const char *name;   /* filename */
1     int fd;             /* file descriptor */
1 #define INVALID_HANDLE (-1)
1     void *opaque;       /* private data for input parsers */
1     int (*get_record)(char **out, struct awk_input *iobuf,
1                       int *errcode, char **rt_start, size_t *rt_len,
1                       const awk_fieldwidth_info_t **field_width);
1     ssize_t (*read_func)();
1     void (*close_func)(struct awk_input *iobuf);
1     struct stat sbuf;   /* stat buf */
1 } awk_input_buf_t;
1
1 The fields can be divided into two categories: those for use
1 (initially, at least) by 'XXX_can_take_file()', and those for use by
1 'XXX_take_control_of()'. The first group of fields and their uses are
1 as follows:
1
1 'const char *name;'
1 The name of the file.
1
1 'int fd;'
1 A file descriptor for the file. If 'gawk' was able to open the
1 file, then 'fd' will _not_ be equal to 'INVALID_HANDLE'.
1 Otherwise, it will.
1
1 'struct stat sbuf;'
1 If the file descriptor is valid, then 'gawk' will have filled in
1 this structure via a call to the 'fstat()' system call.
1
1 The 'XXX_can_take_file()' function should examine these fields and
1 decide if the input parser should be used for the file. The decision
1 can be made based upon 'gawk' state (the value of a variable defined
1 previously by the extension and set by 'awk' code), the name of the
1 file, whether or not the file descriptor is valid, the information in
1 the 'struct stat', or any combination of these factors.
1
1 Once 'XXX_can_take_file()' has returned true, and 'gawk' has decided
1 to use your input parser, it calls 'XXX_take_control_of()'. That
1 function then fills either the 'get_record' field or the 'read_func'
1 field in the 'awk_input_buf_t'. It must also ensure that 'fd' is _not_
1 set to 'INVALID_HANDLE'. The following list describes the fields that
1 may be filled by 'XXX_take_control_of()':
1
1 'void *opaque;'
1 This is used to hold any state information needed by the input
1 parser for this file. It is "opaque" to 'gawk'. The input parser
1 is not required to use this pointer.
1
1 'int (*get_record)(char **out,'
1 ' struct awk_input *iobuf,'
1 ' int *errcode,'
1 ' char **rt_start,'
1 ' size_t *rt_len,'
1 ' const awk_fieldwidth_info_t **field_width);'
1 This function pointer should point to a function that creates the
1 input records. Said function is the core of the input parser. Its
1 behavior is described in the text following this list.
1
1 'ssize_t (*read_func)();'
1 This function pointer should point to a function that has the same
1 behavior as the standard POSIX 'read()' system call. It is an
1 alternative to the 'get_record' pointer. Its behavior is also
1 described in the text following this list.
1
1 'void (*close_func)(struct awk_input *iobuf);'
1 This function pointer should point to a function that does the
1 "teardown." It should release any resources allocated by
1 'XXX_take_control_of()'. It may also close the file. If it does
1 so, it should set the 'fd' field to 'INVALID_HANDLE'.
1
1 If 'fd' is still not 'INVALID_HANDLE' after the call to this
1 function, 'gawk' calls the regular 'close()' system call.
1
1 Having a "teardown" function is optional. If your input parser
1 does not need it, do not set this field. Then, 'gawk' calls the
1 regular 'close()' system call on the file descriptor, so it should
1 be valid.
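
A rough sketch of how 'XXX_take_control_of()' and a matching teardown function might fit together is shown here. The names 'demo_state_t', 'demo_get_record', 'demo_close', and 'demo_take_control_of' are hypothetical, the record function is only a stub, and the type definitions are local stand-ins so that the fragment compiles without 'gawkapi.h'.

```c
#include <assert.h>
#include <stdio.h>      /* EOF */
#include <stdlib.h>     /* calloc, free */
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-ins mirroring the declarations above; a real extension
 * would include gawkapi.h instead. */
typedef enum awk_bool { awk_false = 0, awk_true } awk_bool_t;
typedef struct fw_standin awk_fieldwidth_info_t;   /* only used via pointer */

typedef struct awk_input {
    const char *name;
    int fd;
#define INVALID_HANDLE (-1)
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical per-file private state. */
typedef struct demo_state { size_t records_returned; } demo_state_t;

/* Stub record function: always reports end-of-file. */
static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
        char **rt_start, size_t *rt_len,
        const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) rt_len; (void) field_width;
    return EOF;
}

/* Teardown: release what take_control_of allocated. */
static void demo_close(awk_input_buf_t *iobuf)
{
    assert(iobuf != NULL);
    free(iobuf->opaque);
    iobuf->opaque = NULL;
    /* fd is left untouched, so gawk itself calls close() on it. */
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *state;

    assert(iobuf != NULL);
    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;   /* fd must be valid if we accept the file */
    state = calloc(1, sizeof(*state));
    if (state == NULL)
        return awk_false;   /* on error, fill in nothing */
    iobuf->opaque = state;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

All fields are assigned only after every fallible step has succeeded, which keeps the "fill in nothing on error" rule easy to honor.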
1
1 The 'XXX_get_record()' function does the work of creating input
1 records. The parameters are as follows:
1
1 'char **out'
1 This is a pointer to a 'char *' variable that is set to point to
1 the record. 'gawk' makes its own copy of the data, so the
1 extension remains responsible for managing this storage.
1
1 'struct awk_input *iobuf'
1 This is the 'awk_input_buf_t' for the file. The fields should be
1 used for reading data ('fd') and for managing private state
1 ('opaque'), if any.
1
1 'int *errcode'
1 If an error occurs, '*errcode' should be set to an appropriate code
1 from '<errno.h>'.
1
1 'char **rt_start'
1 'size_t *rt_len'
1 If the concept of a "record terminator" makes sense, then
1 '*rt_start' should be set to point to the data to be used for 'RT',
1 and '*rt_len' should be set to the length of the data. Otherwise,
1 '*rt_len' should be set to zero. 'gawk' makes its own copy of this
1 data, so the extension remains responsible for managing this storage.
1
1 'const awk_fieldwidth_info_t **field_width'
1 If 'field_width' is not 'NULL', then '*field_width' will be
1 initialized to 'NULL', and the function may set it to point to a
1 structure supplying field width information to override the default
1 field parsing mechanism. Note that this structure will not be
1 copied by 'gawk'; it must persist at least until the next call to
1 'get_record' or 'close_func'. Note also that 'field_width' is
1 'NULL' when 'getline' is assigning the results to a variable, thus
1 field parsing is not needed. If the parser does set
1 '*field_width', then 'gawk' uses this layout to parse the input
1 record, and the 'PROCINFO["FS"]' value will be '"API"' while this
1 record is active in '$0'. The 'awk_fieldwidth_info_t' data
1 structure is described below.
1
1 The return value is the length of the buffer pointed to by '*out', or
1 'EOF' if end-of-file was reached or an error occurred.
1
1 It is guaranteed that 'errcode' is a valid pointer, so there is no
1 need to test for a 'NULL' value. 'gawk' sets '*errcode' to zero, so
1 there is no need to set it unless an error occurs.
1
1 If an error does occur, the function should return 'EOF' and set
1 '*errcode' to a value greater than zero. 'gawk' then automatically
1 updates the 'ERRNO' variable based on the value of '*errcode'. (In
1 general, setting '*errcode = errno' should do the right thing.)
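
The contract described above can be illustrated with a small record function. This is only a sketch under stated assumptions: the data is taken to be already loaded into a private buffer reachable through 'opaque' (the hypothetical 'demo_state_t'), records are separated by a single newline, and the type definitions are local stand-ins so that the fragment compiles without 'gawkapi.h'.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>      /* EOF */
#include <string.h>     /* memchr */
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-ins mirroring the declarations above; a real extension
 * would include gawkapi.h instead. */
typedef struct fw_standin awk_fieldwidth_info_t;   /* only used via pointer */

typedef struct awk_input {
    const char *name;
    int fd;
#define INVALID_HANDLE (-1)
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical private state: input already held in memory. */
typedef struct demo_state {
    char *data;     /* buffer owned by the extension */
    size_t len;     /* number of bytes in data */
    size_t pos;     /* offset of the next record */
} demo_state_t;

static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
        char **rt_start, size_t *rt_len,
        const awk_fieldwidth_info_t **field_width)
{
    demo_state_t *st;
    char *start, *nl;
    size_t avail;

    (void) field_width;         /* keep gawk's default field splitting */
    assert(out != NULL && iobuf != NULL && errcode != NULL);

    st = iobuf->opaque;
    if (st == NULL) {           /* error: report it through *errcode */
        *errcode = EINVAL;
        return EOF;
    }
    if (st->pos >= st->len)
        return EOF;             /* end of input; *errcode stays zero */

    start = st->data + st->pos;
    avail = st->len - st->pos;
    nl = memchr(start, '\n', avail);

    *out = start;               /* gawk copies; the buffer stays ours */
    if (nl != NULL) {
        *rt_start = nl;         /* the newline becomes RT */
        *rt_len = 1;
        st->pos += (size_t) (nl - start) + 1;
        return (int) (nl - start);
    }
    *rt_len = 0;                /* final record has no terminator */
    st->pos = st->len;
    return (int) avail;
}
```

An empty record yields a return value of zero, which remains distinguishable from 'EOF'.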
1
1 As an alternative to supplying a function that returns an input
1 record, you may instead supply a function that simply reads bytes, and
1 let 'gawk' parse the data into records. If you do so, the data should
1 be returned in the multibyte encoding of the current locale. Such a
1 function should follow the same behavior as the 'read()' system call,
1 and you fill in the 'read_func' pointer with its address in the
1 'awk_input_buf_t' structure.
1
1 By default, 'gawk' sets the 'read_func' pointer to point to the
1 'read()' system call. So your extension need not set this field
1 explicitly.
1
1 NOTE: You must choose one method or the other: either a function
1 that returns a record, or one that returns raw data. In
1 particular, if you supply a function to get a record, 'gawk' will
1 call it, and will never call the raw read function.
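
For the raw-data route, a function with the same arguments and return convention as POSIX 'read()' is what is needed. The sketch below is one hypothetical example, 'demo_read', which simply retries when the underlying call is interrupted by a signal; that retry policy is an assumption made for illustration, not something 'gawk' requires.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>     /* read, ssize_t */

/* Hypothetical read()-compatible function.  It returns the number
 * of bytes read, 0 at end-of-file, or -1 with errno set on error,
 * exactly as read() does. */
static ssize_t demo_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted */
    return n;
}

/* In XXX_take_control_of() (not compiled in this sketch), the
 * pointer would be installed with
 *     iobuf->read_func = demo_read;
 * and get_record would be left unset. */
```

With this approach 'gawk' still splits the byte stream into records using 'RS', so 'RT' and field splitting behave as for ordinary files.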
1
1 'gawk' ships with a sample extension that reads directories,
1 returning records for each entry in a directory (⇒Extension Sample
1 Readdir). You may wish to use that code as a guide for writing your
1 own input parser.
1
1 When writing an input parser, you should think about (and document)
1 how it is expected to interact with 'awk' code. You may want it to
1 always be called, and to take effect as appropriate (as the 'readdir'
1 extension does). Or you may want it to take effect based upon the value
1 of an 'awk' variable, as the XML extension from the 'gawkextlib' project
1 does (⇒gawkextlib). In the latter case, code in a 'BEGINFILE'
1 rule can look at 'FILENAME' and 'ERRNO' to decide whether or not to
1 activate an input parser (⇒BEGINFILE/ENDFILE).
1
1 You register your input parser with the following function:
1
1 'void register_input_parser(awk_input_parser_t *input_parser);'
1 Register the input parser pointed to by 'input_parser' with 'gawk'.
1
1 If you would like to override the default field parsing mechanism for
1 a given record, then you must populate an 'awk_fieldwidth_info_t'
1 structure, which looks like this:
1
1 typedef struct {
1     awk_bool_t use_chars;   /* false ==> use bytes */
1     size_t nf;              /* number of fields in record (NF) */
1     struct awk_field_info {
1         size_t skip;        /* amount to skip before field starts */
1         size_t len;         /* length of field */
1     } fields[1];            /* actual dimension should be nf */
1 } awk_fieldwidth_info_t;
1
1 The fields are:
1
1 'awk_bool_t use_chars;'
1 Set this to 'awk_true' if the field lengths are specified in terms
1 of potentially multi-byte characters, and set it to 'awk_false' if
1 the lengths are in terms of bytes. Performance will be better if
1 the values are supplied in terms of bytes.
1
1 'size_t nf;'
1 Set this to the number of fields in the input record, i.e. 'NF'.
1
1 'struct awk_field_info fields[nf];'
1 This is a variable-length array whose actual dimension should be
1 'nf'. For each field, the 'skip' element should be set to the
1 number of characters or bytes, as controlled by the 'use_chars'
1 flag, to skip before the start of this field. The 'len' element
1 provides the length of the field. The values in 'fields[0]'
1 provide the information for '$1', and so on through the
1 'fields[nf-1]' element containing the information for '$NF'.
1
1 A convenience macro 'awk_fieldwidth_info_size(numfields)' is provided
1 to calculate the appropriate size of a variable-length
1 'awk_fieldwidth_info_t' structure containing 'numfields' fields. This
1 can be used as an argument to 'malloc()' or in a union to allocate space
1 statically. Please refer to the 'readdir_test' sample extension for an
1 example.
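
As a rough illustration of allocating and filling such a structure with 'malloc()', consider the sketch below. The record layout (widths 4, 2, and 2 with one byte skipped before the second and third fields) and the function 'demo_make_layout' are hypothetical. The macro 'demo_fw_info_size' is a local stand-in written on the assumption that it computes what 'awk_fieldwidth_info_size()' does; a real extension would use the macro from 'gawkapi.h'.

```c
#include <assert.h>
#include <stdlib.h>     /* malloc, free, size_t */

/* Stand-ins mirroring the declarations above; a real extension
 * would include gawkapi.h instead. */
typedef enum awk_bool { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t use_chars;
    size_t nf;
    struct awk_field_info {
        size_t skip;
        size_t len;
    } fields[1];
} awk_fieldwidth_info_t;

/* Assumed equivalent of awk_fieldwidth_info_size(): one element is
 * already part of the struct, so add room for numfields - 1 more. */
#define demo_fw_info_size(numfields) \
    (sizeof(awk_fieldwidth_info_t) + \
     (((numfields) - 1) * sizeof(struct awk_field_info)))

/* Build a byte-based layout for hypothetical records such as
 * "2024 01 15".  Returns NULL if allocation fails. */
static awk_fieldwidth_info_t *demo_make_layout(void)
{
    static const size_t skip[] = { 0, 1, 1 };
    static const size_t len[]  = { 4, 2, 2 };
    const size_t nf = sizeof(len) / sizeof(len[0]);
    awk_fieldwidth_info_t *fw;
    size_t i;

    assert(nf >= 1);            /* the size computation needs nf >= 1 */
    fw = malloc(demo_fw_info_size(nf));
    if (fw == NULL)
        return NULL;
    fw->use_chars = awk_false;  /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = skip[i];
        fw->fields[i].len = len[i];
    }
    return fw;
}
```

Because 'gawk' does not copy the structure, a parser would typically keep the pointer in its 'opaque' state and free it in its 'close_func'.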
1