gawk: Input Parsers

1 
1 16.4.5.4 Customized Input Parsers
1 .................................
1 
1 By default, 'gawk' reads text files as its input.  It uses the value of
1 'RS' to find the end of the record, and then uses 'FS' (or 'FIELDWIDTHS'
1 or 'FPAT') to split it into fields (⇒Reading Files).
1 Additionally, it sets the value of 'RT' (⇒Built-in Variables).
1 
1    If you want, you can provide your own custom input parser.  An input
1 parser's job is to return a record to the 'gawk' record-processing code,
1 along with indicators for the value and length of the data to be used
1 for 'RT', if any.
1 
1    To provide an input parser, you must first provide two functions
1 (where XXX is a prefix name for your extension):
1 
1 'awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);'
1      This function examines the information available in 'iobuf' (which
1      we discuss shortly).  Based on the information there, it decides if
1      the input parser should be used for this file.  If so, it should
1      return true.  Otherwise, it should return false.  It should not
1      change any state (variable values, etc.) within 'gawk'.
1 
1 'awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);'
1      When 'gawk' decides to hand control of the file over to the input
1      parser, it calls this function.  This function in turn must fill in
1      certain fields in the 'awk_input_buf_t' structure and ensure that
1      certain conditions are true.  It should then return true.  If an
1      error of some kind occurs, it should not fill in any fields and
1      should return false; then 'gawk' will not use the input parser.
1      The details are presented shortly.
1 
1    Your extension should package these functions inside an
1 'awk_input_parser_t', which looks like this:
1 
1      typedef struct awk_input_parser {
1          const char *name;   /* name of parser */
1          awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
1          awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
1          awk_const struct awk_input_parser *awk_const next;   /* for gawk */
1      } awk_input_parser_t;
1 
1    The fields are:
1 
1 'const char *name;'
1      The name of the input parser.  This is a regular C string.
1 
1 'awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);'
1      A pointer to your 'XXX_can_take_file()' function.
1 
1 'awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);'
1      A pointer to your 'XXX_take_control_of()' function.
1 
1 'awk_const struct awk_input_parser *awk_const next;'
1      This is for use by 'gawk'; therefore it is marked 'awk_const' so
1      that the extension cannot modify it.
1 
1    The steps are as follows:
1 
1   1. Create a 'static awk_input_parser_t' variable and initialize it
1      appropriately.
1 
1   2. When your extension is loaded, register your input parser with
1      'gawk' using the 'register_input_parser()' API function (described
1      later in this section).
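
   The two steps above can be sketched as follows.  This is only an
illustration, not code from 'gawk': the prefix 'demo' stands in for XXX,
both callbacks are placeholders, and the declarations at the top are
local stand-ins so that the fragment compiles by itself.  A real
extension would include 'gawkapi.h' instead, and the
'register_input_parser()' shown here is a stub that merely records the
pointer it is given.

```c
#include <stddef.h>

/*
 * Stand-in declarations (assumptions for this sketch only).  A real
 * extension includes "gawkapi.h" and omits everything down to the
 * "end of stand-ins" marker.
 */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define awk_const const
typedef struct awk_input awk_input_buf_t;   /* fields not needed here */

typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;

/* Stub: only remembers the parser that was passed in. */
static awk_input_parser_t *registered_parser = NULL;

static void register_input_parser(awk_input_parser_t *input_parser)
{
    registered_parser = input_parser;
}
/* ---- end of stand-ins ---- */

/* Placeholder callback: this sketch never claims a file. */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Placeholder callback: nothing to set up in this sketch. */
static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static, fully initialized awk_input_parser_t. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL    /* next: reserved for gawk */
};

/* Step 2: call this from the extension's initialization code. */
static void demo_init(void)
{
    register_input_parser(&demo_parser);
}
```

   The parser variable is 'static' because 'gawk' keeps the pointer it
is given, so the structure must outlive the registration call.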
1 
1    An 'awk_input_buf_t' looks like this:
1 
1      typedef struct awk_input {
1          const char *name;       /* filename */
1          int fd;                 /* file descriptor */
1      #define INVALID_HANDLE (-1)
1          void *opaque;           /* private data for input parsers */
1          int (*get_record)(char **out, struct awk_input *iobuf,
1                            int *errcode, char **rt_start, size_t *rt_len,
1                            const awk_fieldwidth_info_t **field_width);
1          ssize_t (*read_func)();
1          void (*close_func)(struct awk_input *iobuf);
1          struct stat sbuf;       /* stat buf */
1      } awk_input_buf_t;
1 
1    The fields can be divided into two categories: those for use
1 (initially, at least) by 'XXX_can_take_file()', and those for use by
1 'XXX_take_control_of()'.  The first group of fields and their uses are
1 as follows:
1 
1 'const char *name;'
1      The name of the file.
1 
1 'int fd;'
1      A file descriptor for the file.  If 'gawk' was able to open the
1      file, then 'fd' will _not_ be equal to 'INVALID_HANDLE'.
1      Otherwise, it will.
1 
1 'struct stat sbuf;'
1      If the file descriptor is valid, then 'gawk' will have filled in
1      this structure via a call to the 'fstat()' system call.
1 
1    The 'XXX_can_take_file()' function should examine these fields and
1 decide if the input parser should be used for the file.  The decision
1 can be made based upon 'gawk' state (the value of a variable defined
1 previously by the extension and set by 'awk' code), the name of the
1 file, whether or not the file descriptor is valid, the information in
1 the 'struct stat', or any combination of these factors.
1 
1    Once 'XXX_can_take_file()' has returned true, and 'gawk' has decided
1 to use your input parser, it calls 'XXX_take_control_of()'.  That
1 function then fills either the 'get_record' field or the 'read_func'
1 field in the 'awk_input_buf_t'.  It must also ensure that 'fd' is _not_
1 set to 'INVALID_HANDLE'.  The following list describes the fields that
1 may be filled by 'XXX_take_control_of()':
1 
1 'void *opaque;'
1      This is used to hold any state information needed by the input
1      parser for this file.  It is "opaque" to 'gawk'.  The input parser
1      is not required to use this pointer.
1 
1 'int (*get_record)(char **out,'
1 '                  struct awk_input *iobuf,'
1 '                  int *errcode,'
1 '                  char **rt_start,'
1 '                  size_t *rt_len,'
1 '                  const awk_fieldwidth_info_t **field_width);'
1      This function pointer should point to a function that creates the
1      input records.  Said function is the core of the input parser.  Its
1      behavior is described in the text following this list.
1 
1 'ssize_t (*read_func)();'
1      This function pointer should point to a function that has the same
1      behavior as the standard POSIX 'read()' system call.  It is an
1      alternative to the 'get_record' pointer.  Its behavior is also
1      described in the text following this list.
1 
1 'void (*close_func)(struct awk_input *iobuf);'
1      This function pointer should point to a function that does the
1      "teardown."  It should release any resources allocated by
1      'XXX_take_control_of()'.  It may also close the file.  If it does
1      so, it should set the 'fd' field to 'INVALID_HANDLE'.
1 
1      If 'fd' is still not 'INVALID_HANDLE' after the call to this
1      function, 'gawk' calls the regular 'close()' system call.
1 
1      Having a "teardown" function is optional.  If your input parser
1      does not need it, do not set this field.  Then, 'gawk' calls the
1      regular 'close()' system call on the file descriptor, so it should
1      be valid.
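
   A minimal sketch of an 'XXX_take_control_of()' function and its
matching teardown function might look like this.  The 'demo' prefix,
the 'demo_state_t' structure, and the stubbed 'demo_get_record()' are
hypothetical, and the type declarations are local stand-ins copied
from the listing above so that the fragment compiles without
'gawkapi.h'.

```c
#include <stdio.h>      /* EOF */
#include <stdlib.h>     /* calloc, free */
#include <sys/types.h>  /* ssize_t */
#include <sys/stat.h>   /* struct stat */

/* Stand-ins for this sketch; normally supplied by "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t; /* opaque here */
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical per-file state kept in iobuf->opaque. */
typedef struct demo_state {
    char *line;         /* buffer for the current record */
    size_t capacity;    /* allocated size of line */
} demo_state_t;

/* Stub record reader: always reports end of file. */
static int demo_get_record(char **out, awk_input_buf_t *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) rt_len; (void) field_width;
    return EOF;
}

/* Teardown: release what demo_take_control_of() allocated. */
static void demo_close(awk_input_buf_t *iobuf)
{
    demo_state_t *state;

    if (iobuf == NULL || iobuf->opaque == NULL)
        return;
    state = iobuf->opaque;
    free(state->line);
    free(state);
    iobuf->opaque = NULL;
    /* fd is left alone, so gawk itself calls close() on it. */
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *state;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    state = calloc(1, sizeof(*state));
    if (state == NULL)
        return awk_false;   /* error: no fields were filled in */

    iobuf->opaque = state;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

   Allocation happens before any field is assigned, so the failure path
can return false while leaving the 'awk_input_buf_t' untouched.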
1 
1    The 'XXX_get_record()' function does the work of creating input
1 records.  The parameters are as follows:
1 
1 'char **out'
1      This is a pointer to a 'char *' variable that is set to point to
1      the record.  'gawk' makes its own copy of the data, so the
1      extension retains ownership of this storage and must manage it.
1 
1 'struct awk_input *iobuf'
1      This is the 'awk_input_buf_t' for the file.  The fields should be
1      used for reading data ('fd') and for managing private state
1      ('opaque'), if any.
1 
1 'int *errcode'
1      If an error occurs, '*errcode' should be set to an appropriate code
1      from '<errno.h>'.
1 
1 'char **rt_start'
1 'size_t *rt_len'
1      If the concept of a "record terminator" makes sense, then
1      '*rt_start' should be set to point to the data to be used for 'RT',
1      and '*rt_len' should be set to the length of the data.  Otherwise,
1      '*rt_len' should be set to zero.  'gawk' makes its own copy of this
1      data, so the extension must manage this storage.
1 
1 'const awk_fieldwidth_info_t **field_width'
1      If 'field_width' is not 'NULL', then '*field_width' will be
1      initialized to 'NULL', and the function may set it to point to a
1      structure supplying field width information to override the default
1      field parsing mechanism.  Note that this structure will not be
1      copied by 'gawk'; it must persist at least until the next call to
1      'get_record' or 'close_func'.  Note also that 'field_width' is
1      'NULL' when 'getline' is assigning the results to a variable, thus
1      field parsing is not needed.  If the parser does set
1      '*field_width', then 'gawk' uses this layout to parse the input
1      record, and the 'PROCINFO["FS"]' value will be '"API"' while this
1      record is active in '$0'.  The 'awk_fieldwidth_info_t' data
1      structure is described below.
1 
1    The return value is the length of the buffer pointed to by '*out', or
1 'EOF' if end-of-file was reached or an error occurred.
1 
1    It is guaranteed that 'errcode' is a valid pointer, so there is no
1 need to test for a 'NULL' value.  'gawk' sets '*errcode' to zero, so
1 there is no need to set it unless an error occurs.
1 
1    If an error does occur, the function should return 'EOF' and set
1 '*errcode' to a value greater than zero.  When 'gawk' sees 'EOF'
1 together with a nonzero '*errcode', it automatically updates the
1 'ERRNO' variable based on the value of '*errcode'.  (In general,
1 setting '*errcode = errno' should do the right thing.)
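
   The conventions described above can be illustrated with a small
'XXX_get_record()' sketch.  It assumes, purely for the example, that
'XXX_take_control_of()' has already loaded the whole input into memory
and stored a hypothetical 'demo_state_t' in 'opaque'; records are lines
ending in a newline, which also serves as the data for 'RT'.  The type
declarations are reduced local stand-ins, not the definitions from
'gawkapi.h'.

```c
#include <errno.h>
#include <stdio.h>    /* EOF */
#include <string.h>   /* memchr */

/* Stand-ins for this sketch; normally supplied by "gawkapi.h". */
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t; /* opaque here */

typedef struct awk_input {
    int fd;
    void *opaque;
} awk_input_buf_t;    /* only the fields used here */

/* Hypothetical private state: the whole input plus a read cursor. */
typedef struct demo_state {
    char *data;   /* input text, owned by the extension */
    size_t len;   /* number of bytes in data */
    size_t pos;   /* offset of the next unread byte */
} demo_state_t;

static int demo_get_record(char **out, awk_input_buf_t *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    demo_state_t *state;
    char *start, *newline;
    size_t remaining;

    (void) field_width;   /* keep gawk's default field splitting */

    if (out == NULL || iobuf == NULL || iobuf->opaque == NULL) {
        *errcode = EINVAL;    /* error: EOF plus a nonzero code */
        return EOF;
    }

    state = iobuf->opaque;
    if (state->pos >= state->len)
        return EOF;           /* plain end of file: *errcode stays 0 */

    start = state->data + state->pos;
    remaining = state->len - state->pos;
    newline = memchr(start, '\n', remaining);

    *out = start;             /* gawk copies the record from here */
    if (newline != NULL) {
        *rt_start = newline;  /* the newline becomes RT */
        *rt_len = 1;
        state->pos += (size_t) (newline - start) + 1;
        return (int) (newline - start);
    }

    /* Last record has no terminator, so RT is empty. */
    *rt_start = NULL;
    *rt_len = 0;
    state->pos = state->len;
    return (int) remaining;
}
```

   Because the record and 'RT' data both point into the extension's own
buffer, nothing is allocated per record; the buffer would be released
in the parser's 'close_func'.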
1 
1    As an alternative to supplying a function that returns an input
1 record, you may instead supply a function that simply reads bytes, and
1 let 'gawk' parse the data into records.  If you do so, the data should
1 be returned in the multibyte encoding of the current locale.  Such a
1 function should follow the same behavior as the 'read()' system call,
1 and you fill in the 'read_func' pointer with its address in the
1 'awk_input_buf_t' structure.
1 
1    By default, 'gawk' sets the 'read_func' pointer to point to the
1 'read()' system call.  So your extension need not set this field
1 explicitly.
1 
1      NOTE: You must choose one method or the other: either a function
1      that returns a record, or one that returns raw data.  In
1      particular, if you supply a function to get a record, 'gawk' will
1      call it, and will never call the raw read function.
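
   For the raw-data method, a sketch of a 'read()'-compatible function
and the 'XXX_take_control_of()' that installs it might look like the
following.  The 'demo' names are hypothetical, and the stand-in
structure declares 'read_func' with the parameter list of POSIX
'read()'; that prototype is an assumption for this example, because
the listing above shows the pointer without parameters.

```c
#include <errno.h>
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read */

/* Stand-ins for this sketch; normally supplied by "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    int fd;
    /* Assumed prototype, matching POSIX read(). */
    ssize_t (*read_func)(int fd, void *buf, size_t count);
} awk_input_buf_t;      /* only the fields used here */

/*
 * Behaves like read(), except that a call interrupted by a signal
 * (EINTR) is retried instead of being reported to the caller.
 */
static ssize_t demo_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}

/*
 * Install the raw reader.  get_record is deliberately not set, so
 * gawk splits the returned bytes into records itself.
 */
static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    iobuf->read_func = demo_read;
    return awk_true;
}
```

   A reader of this kind suits parsers that only transform or fetch
bytes and are content with the usual 'RS' and 'FS' processing.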
1 
1    'gawk' ships with a sample extension that reads directories,
1 returning records for each entry in a directory (⇒Extension Sample
1 Readdir).  You may wish to use that code as a guide for writing your
1 own input parser.
1 
1    When writing an input parser, you should think about (and document)
1 how it is expected to interact with 'awk' code.  You may want it to
1 always be called, and to take effect as appropriate (as the 'readdir'
1 extension does).  Or you may want it to take effect based upon the value
1 of an 'awk' variable, as the XML extension from the 'gawkextlib' project
1 does (⇒gawkextlib).  In the latter case, code in a 'BEGINFILE'
1 rule can look at 'FILENAME' and 'ERRNO' to decide whether or not to
1 activate an input parser (⇒BEGINFILE/ENDFILE).
1 
1    You register your input parser with the following function:
1 
1 'void register_input_parser(awk_input_parser_t *input_parser);'
1      Register the input parser pointed to by 'input_parser' with 'gawk'.
1 
1    If you would like to override the default field parsing mechanism for
1 a given record, then you must populate an 'awk_fieldwidth_info_t'
1 structure, which looks like this:
1 
1      typedef struct {
1              awk_bool_t     use_chars; /* false ==> use bytes */
1              size_t         nf;        /* number of fields in record (NF) */
1              struct awk_field_info {
1                      size_t skip;      /* amount to skip before field starts */
1                      size_t len;       /* length of field */
1              } fields[1];              /* actual dimension should be nf */
1      } awk_fieldwidth_info_t;
1 
1    The fields are:
1 
1 'awk_bool_t use_chars;'
1      Set this to 'awk_true' if the field lengths are specified in terms
1      of potentially multi-byte characters, and set it to 'awk_false' if
1      the lengths are in terms of bytes.  Performance will be better if
1      the values are supplied in terms of bytes.
1 
1 'size_t nf;'
1      Set this to the number of fields in the input record, i.e., 'NF'.
1 
1 'struct awk_field_info fields[nf];'
1      This is a variable-length array whose actual dimension should be
1      'nf'.  For each field, the 'skip' element should be set to the
1      number of characters or bytes, as controlled by the 'use_chars'
1      flag, to skip before the start of this field.  The 'len' element
1      provides the length of the field.  The values in 'fields[0]'
1      provide the information for '$1', and so on through the
1      'fields[nf-1]' element containing the information for '$NF'.
1 
1    A convenience macro 'awk_fieldwidth_info_size(numfields)' is provided
1 to calculate the appropriate size of a variable-length
1 'awk_fieldwidth_info_t' structure containing 'numfields' fields.  This
1 can be used as an argument to 'malloc()' or in a union to allocate space
1 statically.  Please refer to the 'readdir_test' sample extension for an
1 example.
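
   As a rough illustration, and separate from the 'readdir_test' code,
the following sketch allocates and fills an 'awk_fieldwidth_info_t'
for a record made of equal-width fields separated by a single byte.
The helper 'demo_make_layout()' is hypothetical.  The macro body shown
is an assumed definition supplied so that the fragment compiles by
itself; a real extension would use the one from 'gawkapi.h'.  The
sketch also assumes that 'skip' is counted from the end of the
preceding field.

```c
#include <stdlib.h>   /* malloc, free, size_t */

/* Stand-ins for this sketch; normally supplied by "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;

/* Assumed definition: base structure plus (numfields - 1) entries. */
#define awk_fieldwidth_info_size(numfields) \
    (sizeof(awk_fieldwidth_info_t) + \
     ((numfields) - 1) * sizeof(struct awk_field_info))

/*
 * Hypothetical helper: describe nf fields, each width bytes long,
 * with one separator byte between neighbors.  Returns NULL if nf is
 * zero or memory runs out.  The caller frees the result.
 */
static awk_fieldwidth_info_t *demo_make_layout(size_t nf, size_t width)
{
    awk_fieldwidth_info_t *fw;
    size_t i;

    if (nf == 0)
        return NULL;

    fw = malloc(awk_fieldwidth_info_size(nf));
    if (fw == NULL)
        return NULL;

    fw->use_chars = awk_false;   /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        /* fields[0] describes $1, fields[nf-1] describes $NF. */
        fw->fields[i].skip = (i == 0) ? 0 : 1;
        fw->fields[i].len = width;
    }
    return fw;
}
```

   A 'get_record' function would hand such a layout to 'gawk' by
storing its address in '*field_width' when 'field_width' is not
'NULL', and would keep the structure alive at least until the next
call to 'get_record' or 'close_func'.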
1