gawk: Ordinal Functions

10.2.5 Translating Between Characters and Numbers
-------------------------------------------------

One commercial implementation of 'awk' supplies a built-in function,
'ord()', which takes a character and returns the numeric value for that
character in the machine's character set.  If the string passed to
'ord()' has more than one character, only the first one is used.

   The inverse of this function is 'chr()' (from the function of the
same name in Pascal), which takes a number and returns the corresponding
character.  Both functions are easy to write in 'awk' itself; there is
no real reason to build them into the 'awk' interpreter:

     # ord.awk --- do ord and chr

     # Global identifiers:
     #    _ord_:        numerical values indexed by characters
     #    _ord_init:    function to initialize _ord_

     BEGIN    { _ord_init() }

     function _ord_init(    low, high, i, t)
     {
         low = sprintf("%c", 7) # BEL is ascii 7
         if (low == "\a") {    # regular ascii
             low = 0
             high = 127
         } else if (sprintf("%c", 128 + 7) == "\a") {
             # ascii, mark parity
             low = 128
             high = 255
         } else {        # ebcdic(!)
             low = 0
             high = 255
         }

         for (i = low; i <= high; i++) {
             t = sprintf("%c", i)
             _ord_[t] = i
         }
     }

   Some explanation of the numbers used by '_ord_init()' is worthwhile.
The most prominent character set in use today is ASCII.(1)  Although an
8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
defines characters that use the values from 0 to 127.(2)  In the now
distant past, at least one minicomputer manufacturer used ASCII, but
with mark parity, meaning that the leftmost bit in the byte is always 1.
This means that on those systems, characters have numeric values from
128 to 255.  Finally, large mainframe systems use the EBCDIC character
set, which uses all 256 values.  There are other character sets in use
on some older systems, but they are not really worth worrying about.
'_ord_init()' tells the three cases apart by checking where the BEL
character ('"\a"') falls: value 7 means plain ASCII, value 135 (128 + 7)
means ASCII with mark parity, and anything else is treated as EBCDIC.
With the table built, 'ord()' and 'chr()' themselves are short:

     function ord(str,    c)
     {
         # only first character is of interest
         c = substr(str, 1, 1)
         return _ord_[c]
     }

     function chr(c)
     {
         # force c to be numeric by adding 0
         return sprintf("%c", c + 0)
     }

     #### test code ####
     # BEGIN {
     #    for (;;) {
     #        printf("enter a character: ")
     #        if (getline var <= 0)
     #            break
     #        printf("ord(%s) = %d\n", var, ord(var))
     #    }
     # }

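   As a rough illustration of how the two functions behave together,
the following sketch round-trips a few characters through 'ord()' and
'chr()'.  The shell function name 'ord_chr_demo' is hypothetical, and the
table is built for plain ASCII (0 to 127) only; it is a self-contained
approximation for experimenting, not a replacement for 'ord.awk':

```shell
# Hypothetical wrapper (not part of ord.awk): round-trips a few
# characters through ord() and chr().
# Assumption: a plain ASCII system, so the table covers 0..127 only.
ord_chr_demo() {
    awk '
    function ord(str) { return _ord_[substr(str, 1, 1)] }
    function chr(c)   { return sprintf("%c", c + 0) }

    BEGIN {
        # build the lookup table once, as _ord_init() does for ASCII
        for (i = 0; i <= 127; i++)
            _ord_[sprintf("%c", i)] = i

        n = split("A a 0 ~", sample, " ")
        for (k = 1; k <= n; k++) {
            code = ord(sample[k])
            # character, its numeric value, and the value mapped back
            printf("%s %d %s\n", sample[k], code, chr(code))
        }

        # only the first character of a longer string is used
        printf("%d\n", ord("awk"))
    }'
}

ord_chr_demo
```

   Each output line shows a character, its numeric value, and the
character recovered by 'chr()'; the final line prints 97, the value of
the leading 'a' in '"awk"'.
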
   An obvious improvement to these functions is to move the code for the
'_ord_init()' function into the body of the 'BEGIN' rule.  It was
written as a separate function initially for ease of development.  The
file also includes a "test program" in a 'BEGIN' rule to exercise the
functions.  It is commented out for production use.

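   One possible shape for that improvement is sketched here.  It
assumes an ASCII-based system with 8-bit extensions, so the loop simply
runs from 0 to 255 and the character set detection is dropped.  The
wrapper name 'ord_inline' and the loop variable '_ord_i' are
hypothetical; the variable is prefixed because, once the code moves out
of a function, it can no longer be declared local:

```shell
# Sketch: the table is built directly in BEGIN, with no _ord_init().
# Assumption: ASCII-based system, so looping over 0..255 is enough.
ord_inline() {
    awk '
    BEGIN {
        # _ord_i is global here; the prefix avoids clobbering user variables
        for (_ord_i = 0; _ord_i <= 255; _ord_i++)
            _ord_[sprintf("%c", _ord_i)] = _ord_i
    }

    function ord(str) { return _ord_[substr(str, 1, 1)] }
    function chr(c)   { return sprintf("%c", c + 0) }

    # for each input line, print the value of its first character
    # and that value converted back to a character
    { printf("%d %s\n", ord($0), chr(ord($0))) }
    '
}

printf "%s\n" "hello" "World" | ord_inline
```

   For the two input lines shown, this prints '104 h' and '87 W'.
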
   ---------- Footnotes ----------

   (1) This is changing; many systems use Unicode, a very large
character set that includes ASCII as a subset.  On systems with full
Unicode support, a character can occupy up to 32 bits, making simple
tests such as those used here prohibitively expensive.

   (2) ASCII has been extended in many countries to use the values from
128 to 255 for country-specific characters.  If your system uses these
extensions, you can simplify '_ord_init()' to loop from 0 to 255.