gawk: General Data Types

1 
1 16.4.2 General-Purpose Data Types
1 ---------------------------------
1 
1      I have a true love/hate relationship with unions.
1                           -- _Arnold Robbins_
1 
1      That's the thing about unions: the compiler will arrange things so
1      they can accommodate both love and hate.
1                             -- _Chet Ramey_
1 
1    The extension API defines a number of simple types and structures for
1 general-purpose use.  Additional, more specialized, data structures are
1 introduced in subsequent minor nodes, together with the functions that
1 use them.
1 
1    The general-purpose types and structures are as follows:
1 
1 'typedef void *awk_ext_id_t;'
1      A value of this type is received from 'gawk' when an extension is
1      loaded.  That value must then be passed back to 'gawk' as the first
1      parameter of each API function.
1 
1 '#define awk_const ...'
1      This macro expands to 'const' when compiling an extension, and to
1      nothing when compiling 'gawk' itself.  This makes certain fields in
1      the API data structures unwritable from extension code, while
1      allowing 'gawk' to use them as it needs to.
1 
1 'typedef enum awk_bool {'
1 '    awk_false = 0,'
1 '    awk_true'
1 '} awk_bool_t;'
1      A simple Boolean type.
1 
1 'typedef struct awk_string {'
1 '    char *str;      /* data */'
1 '    size_t len;     /* length thereof, in chars */'
1 '} awk_string_t;'
1      This represents a mutable string.  'gawk' owns the memory pointed
1      to if it supplied the value.  Otherwise, it takes ownership of the
1      memory pointed to.  _Such memory must come from calling one of the
1      'gawk_malloc()', 'gawk_calloc()', or 'gawk_realloc()' functions!_
1 
1      As mentioned earlier, strings are maintained using the current
1      multibyte encoding.
1 
1 'typedef enum {'
1 '    AWK_UNDEFINED,'
1 '    AWK_NUMBER,'
1 '    AWK_STRING,'
1 '    AWK_REGEX,'
1 '    AWK_STRNUM,'
1 '    AWK_ARRAY,'
1 '    AWK_SCALAR,         /* opaque access to a variable */'
1 '    AWK_VALUE_COOKIE    /* for updating a previously created value */'
1 '} awk_valtype_t;'
1      This 'enum' indicates the type of a value.  It is used in the
1      following 'struct'.
1 
1 'typedef struct awk_value {'
1 '    awk_valtype_t val_type;'
1 '    union {'
1 '        awk_string_t       s;'
1 '        awk_number_t       n;'
1 '        awk_array_t        a;'
1 '        awk_scalar_t       scl;'
1 '        awk_value_cookie_t vc;'
1 '    } u;'
1 '} awk_value_t;'
1      An "'awk' value."  The 'val_type' member indicates what kind of
1      value the 'union' holds, and each member is of the appropriate
1      type.
1 
1 '#define str_value      u.s'
1 '#define strnum_value   str_value'
1 '#define regex_value    str_value'
1 '#define num_value      u.n.d'
1 '#define num_type       u.n.type'
1 '#define num_ptr        u.n.ptr'
1 '#define array_cookie   u.a'
1 '#define scalar_cookie  u.scl'
1 '#define value_cookie   u.vc'
1      Using these macros makes accessing the fields of the 'awk_value_t'
1      more readable.
1 
1 'enum AWK_NUMBER_TYPE {'
1 '    AWK_NUMBER_TYPE_DOUBLE,'
1 '    AWK_NUMBER_TYPE_MPFR,'
1 '    AWK_NUMBER_TYPE_MPZ'
1 '};'
1      This 'enum' is used in the following structure for defining the
1      type of numeric value that is being worked with.  It is declared at
1      the top level of the file so that it works correctly for C++ as
1      well as for C.
1 
1 'typedef struct awk_number {'
1 '    double d;'
1 '    enum AWK_NUMBER_TYPE type;'
1 '    void *ptr;'
1 '} awk_number_t;'
1      This represents a numeric value.  Internally, 'gawk' stores every
1      number as either a C 'double', a GMP integer, or an MPFR
1      arbitrary-precision floating-point value.  In order to allow
1      extensions to also support GMP and MPFR values, numeric values are
1      passed in this structure.
1 
1      The double-precision 'd' element is always populated in data
1      received from 'gawk'.  In addition, by examining the 'type' member,
1      an extension can determine whether the 'ptr' member points to a GMP
1      integer (type 'mpz_ptr') or an MPFR floating-point value (type
1      'mpfr_ptr'), and cast it appropriately.
1 
1 'typedef void *awk_scalar_t;'
1      Scalars can be represented as an opaque type.  These values are
1      obtained from 'gawk' and then passed back into it.  This is
1      discussed in a general fashion in the text following this list, and
1      in more detail in ⇒Symbol table by cookie.
1 
1 'typedef void *awk_value_cookie_t;'
1      A "value cookie" is an opaque type representing a cached value.
1      This is also discussed in a general fashion in the text following
1      this list, and in more detail in ⇒Cached values.
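
   As a minimal sketch of how an extension might inspect the 'type'
member of an 'awk_number_t', the following code redeclares the two
numeric declarations from the list above as local stand-ins (a real
extension would include 'gawkapi.h' instead).  The helper name
'number_uses_ptr()' is hypothetical and is not part of the API.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins copied from the declarations listed above. */
enum AWK_NUMBER_TYPE {
    AWK_NUMBER_TYPE_DOUBLE,
    AWK_NUMBER_TYPE_MPFR,
    AWK_NUMBER_TYPE_MPZ
};

typedef struct awk_number {
    double d;
    enum AWK_NUMBER_TYPE type;
    void *ptr;
} awk_number_t;

/* Hypothetical helper: return 1 if 'ptr' carries a GMP or MPFR value,
   0 if only the double 'd' is meaningful. */
static int number_uses_ptr(const awk_number_t *num)
{
    assert(num != NULL);
    switch (num->type) {
    case AWK_NUMBER_TYPE_MPZ:    /* ptr refers to a GMP integer */
    case AWK_NUMBER_TYPE_MPFR:   /* ptr refers to an MPFR value */
        return 1;
    case AWK_NUMBER_TYPE_DOUBLE:
    default:
        return 0;
    }
}
```

   An extension that does not link against GMP or MPFR could use a
check like this to fall back to the 'd' member, which the text above
says is always populated in data received from 'gawk'.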
1 
1    Scalar values in 'awk' are numbers, strings, strnums, or typed
1 regexps.  The 'awk_value_t' struct represents values.  The 'val_type'
1 member indicates what is in the 'union'.
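
   The following sketch shows one way to read a value by first
checking 'val_type'.  The type declarations are simplified local
stand-ins for those listed earlier.  'awk_array_t' is assumed to be an
opaque pointer here, because its declaration is not part of this
section.  The helper 'value_text_length()' is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; a real extension includes gawkapi.h. */
typedef struct awk_string { char *str; size_t len; } awk_string_t;
typedef struct awk_number {
    double d;
    int type;                /* enum AWK_NUMBER_TYPE in the real struct */
    void *ptr;
} awk_number_t;
typedef void *awk_array_t;   /* assumption: treated as opaque here */
typedef void *awk_scalar_t;
typedef void *awk_value_cookie_t;

typedef enum {
    AWK_UNDEFINED, AWK_NUMBER, AWK_STRING, AWK_REGEX, AWK_STRNUM,
    AWK_ARRAY, AWK_SCALAR, AWK_VALUE_COOKIE
} awk_valtype_t;

typedef struct awk_value {
    awk_valtype_t val_type;
    union {
        awk_string_t       s;
        awk_number_t       n;
        awk_array_t        a;
        awk_scalar_t       scl;
        awk_value_cookie_t vc;
    } u;
} awk_value_t;

#define str_value u.s
#define num_value u.n.d

/* Hypothetical helper: length in bytes of the text held by a value,
   or 0 if the value does not use the string member of the union. */
static size_t value_text_length(const awk_value_t *val)
{
    assert(val != NULL);
    switch (val->val_type) {
    case AWK_STRING:
    case AWK_STRNUM:
    case AWK_REGEX:
        /* All three kinds share the string member. */
        return val->str_value.len;
    default:
        return 0;
    }
}
```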
1 
1    Representing numbers is easy in the common case: the API uses a C
1 'double' (the 'd' member of 'awk_number_t').  Strings require more work.
1 Because 'gawk' allows embedded NUL bytes in string values, a string must
1 be represented as a pair containing a data pointer and length.  This is
1 the 'awk_string_t' type.
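
   To illustrate why the length must travel with the pointer, here is
a small sketch that walks a string by its 'len' member rather than by
looking for a terminating NUL.  'awk_string_t' is redeclared locally
so that the code stands alone, and 'count_byte()' is a hypothetical
helper.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the declaration listed earlier. */
typedef struct awk_string {
    char *str;      /* data */
    size_t len;     /* length thereof, in chars */
} awk_string_t;

/* Hypothetical helper: count the bytes equal to 'c', trusting 'len'
   so that embedded NUL bytes are treated as ordinary data. */
static size_t count_byte(const awk_string_t *s, char c)
{
    size_t i, n = 0;

    assert(s != NULL);
    for (i = 0; i < s->len; i++)
        if (s->str[i] == c)
            n++;
    return n;
}
```

   Code that called 'strlen()' on the same data would stop at the
first NUL byte and see only part of the value.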
1 
1    A strnum (numeric string) value is represented as a string and
1 consists of user input data that appears to be numeric.  When an
1 extension creates a strnum value, the result is a string flagged as user
1 input.  Subsequent parsing by 'gawk' then determines whether it looks
1 like a number and should be treated as a strnum, or as a regular string.
1 
1    This is useful in cases where an extension function would like to do
1 something comparable to the 'split()' function, which sets the strnum
1 attribute on the array elements it creates.  For example, an extension
1 that implements CSV splitting would want to use this feature.  This is
1 also useful for a function that retrieves a data item from a database.
1 The PostgreSQL 'PQgetvalue()' function, for example, returns a string
1 that may be numeric or textual depending on the contents.
1 
1    Typed regexp values (⇒Strong Regexp Constants) are not of much
1 use to extension functions.  Extension functions can detect that they
1 have received one, and can create them as scalar values.  Beyond that,
1 they can examine the text of the regexp through 'regex_value.str' and
1 'regex_value.len'.
1 
1    Identifiers (i.e., the names of global variables) can be associated
1 with either scalar values or with arrays.  In addition, 'gawk' provides
1 true arrays of arrays, where any given array element can itself be an
1 array.  Discussion of arrays is delayed until ⇒Array Manipulation.
1 
1    The various macros listed earlier make it easier to use the elements
1 of the 'union' as if they were fields in a 'struct'; this is a common
1 coding practice in C.  Such code is easier to write and to read, but it
1 remains _your_ responsibility to make sure that the 'val_type' member
1 correctly reflects the type of the value in the 'awk_value_t' struct.
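
   One way to meet that responsibility is to set the tag and the
member together in a single helper, as sketched below.  The
declarations are reduced stand-ins (only the string and number members
of the 'union' are kept), and the names 'sketch_set_number()' and
'sketch_set_string()' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins; a real extension includes gawkapi.h. */
typedef struct awk_string { char *str; size_t len; } awk_string_t;
typedef struct awk_number {
    double d;
    int type;                /* enum AWK_NUMBER_TYPE in the real struct */
    void *ptr;
} awk_number_t;
typedef enum { AWK_UNDEFINED, AWK_NUMBER, AWK_STRING } awk_valtype_t;

typedef struct awk_value {
    awk_valtype_t val_type;
    union {
        awk_string_t s;
        awk_number_t n;
    } u;
} awk_value_t;

#define str_value u.s
#define num_value u.n.d

/* Hypothetical helper: store a double and tag the value as a number. */
static awk_value_t *sketch_set_number(double d, awk_value_t *result)
{
    assert(result != NULL);
    result->val_type = AWK_NUMBER;
    result->num_value = d;
    return result;
}

/* Hypothetical helper: store a pointer/length pair and tag the value
   as a string.  The buffer is only borrowed in this sketch; memory
   handed over to gawk must come from gawk_malloc() and friends. */
static awk_value_t *sketch_set_string(char *text, size_t len,
                                      awk_value_t *result)
{
    assert(result != NULL);
    result->val_type = AWK_STRING;
    result->str_value.str = text;
    result->str_value.len = len;
    return result;
}
```

   Because every write goes through a helper that also sets
'val_type', the tag cannot drift out of step with the member that was
last stored.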
1 
1    Conceptually, the first three members of the 'union' (string, number,
1 and array) are all that is needed for working with 'awk' values.
1 However, because the API provides routines for accessing and changing
1 the value of a global scalar variable only by using the variable's name,
1 there is a performance penalty: 'gawk' must find the variable each time
1 it is accessed and changed.  This turns out to be a real issue, not just
1 a theoretical one.
1 
1    Thus, if you know that your extension will spend considerable time
1 reading and/or changing the value of one or more scalar variables, you
1 can obtain a "scalar cookie"(1) object for that variable, and then use
1 the cookie for getting the variable's value or for changing the
1 variable's value.  The 'awk_scalar_t' type holds a scalar cookie, and
1 the 'scalar_cookie' macro provides access to the value of that type in
1 the 'awk_value_t' struct.  Given a scalar cookie, 'gawk' can directly
1 retrieve or modify the value, as required, without having to find it
1 first.
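
   The idea can be illustrated with a toy model.  The code below is
_not_ how 'gawk' implements its symbol table, and none of the 'toy_'
names belong to the API; it only contrasts searching by name on every
access with keeping an opaque handle obtained once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy symbol table, for illustration only. */
struct toy_var {
    const char *name;
    double value;
};

static struct toy_var toy_table[] = {
    { "NR", 0.0 }, { "count", 0.0 }, { "total", 0.0 },
};

typedef void *toy_cookie_t;   /* opaque handle, in the spirit of awk_scalar_t */

/* Slow path: search the table by name on every call. */
static struct toy_var *toy_find(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
        if (strcmp(toy_table[i].name, name) == 0)
            return &toy_table[i];
    return NULL;
}

/* Pay for the search once and keep the result as a cookie. */
static toy_cookie_t toy_get_cookie(const char *name)
{
    return toy_find(name);
}

/* Fast path: update through the cookie with no further searching. */
static void toy_add_via_cookie(toy_cookie_t cookie, double amount)
{
    assert(cookie != NULL);
    ((struct toy_var *) cookie)->value += amount;
}
```

   In this model the caller never looks inside the cookie; it only
passes it back, which is the same contract that the text above
describes for scalar cookies.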
1 
1    The 'awk_value_cookie_t' type and 'value_cookie' macro are similar.
1 If you know that you wish to use the same numeric or string _value_ for
1 one or more variables, you can create the value once, retaining a "value
1 cookie" for it, and then pass in that value cookie whenever you wish to
1 set the value of a variable.  This saves storage space within the
1 running 'gawk' process and reduces the time needed to create the value.
1 
1    ---------- Footnotes ----------
1 
1    (1) See the "cookie" entry in the Jargon file
1 (http://catb.org/jargon/html/C/cookie.html) for a definition of
1 "cookie", and the "magic cookie" entry in the Jargon file
1 (http://catb.org/jargon/html/M/magic-cookie.html) for a nice example.
1 See also the entry for "Cookie" in the ⇒Glossary.
1