gawk: Ordinal Functions

10.2.5 Translating Between Characters and Numbers
-------------------------------------------------

One commercial implementation of 'awk' supplies a built-in function,
'ord()', which takes a character and returns the numeric value for that
character in the machine's character set.  If the string passed to
'ord()' has more than one character, only the first one is used.

The inverse of this function is 'chr()' (from the function of the
same name in Pascal), which takes a number and returns the corresponding
character.  Both functions are written very nicely in 'awk'; there is no
real reason to build them into the 'awk' interpreter:

     # ord.awk --- do ord and chr

     # Global identifiers:
     #    _ord_:        numerical values indexed by characters
     #    _ord_init:    function to initialize _ord_

     BEGIN    { _ord_init() }

     function _ord_init(    low, high, i, t)
     {
         low = sprintf("%c", 7) # BEL is ascii 7
         if (low == "\a") {    # regular ascii
             low = 0
             high = 127
         } else if (sprintf("%c", 128 + 7) == "\a") {
             # ascii, mark parity
             low = 128
             high = 255
         } else {        # ebcdic(!)
             low = 0
             high = 255
         }

         for (i = low; i <= high; i++) {
             t = sprintf("%c", i)
             _ord_[t] = i
         }
     }
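The probe at the top of '_ord_init()' can be run on its own to see which
branch a given system takes.  The following is a minimal sketch, assuming
a POSIX shell with 'awk' on the path; the 'charset' shell variable and the
printed labels are illustrative and not part of 'ord.awk':

```shell
# Report which branch of the BEL probe this awk takes.
charset=$(awk 'BEGIN {
    bel = sprintf("%c", 7)                      # BEL is 7 in ASCII
    if (bel == "\a")
        print "regular ASCII: 0-127"
    else if (sprintf("%c", 128 + 7) == "\a")
        print "ASCII, mark parity: 128-255"
    else
        print "neither (EBCDIC assumed): 0-255"
}')
echo "$charset"
```

On an ASCII system this should take the first branch.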

Some explanation of the numbers used by '_ord_init()' is worthwhile.
The most prominent character set in use today is ASCII.(1)  Although an
8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
defines characters that use the values from 0 to 127.(2)  In the now
distant past, at least one minicomputer manufacturer used ASCII, but
with mark parity, meaning that the leftmost bit in the byte is always 1.
This means that on those systems, characters have numeric values from
128 to 255.  Finally, large mainframe systems use the EBCDIC character
set, which uses all 256 values.  There are other character sets in use
on some older systems, but they are not really worth worrying about:

     function ord(str,    c)
     {
         # only first character is of interest
         c = substr(str, 1, 1)
         return _ord_[c]
     }

     function chr(c)
     {
         # force c to be numeric by adding 0
         return sprintf("%c", c + 0)
     }

     #### test code ####
     # BEGIN {
     #    for (;;) {
     #        printf("enter a character: ")
     #        if ((getline var) <= 0)
     #            break
     #        printf("ord(%s) = %d\n", var, ord(var))
     #    }
     # }
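The 'c + 0' in 'chr()' matters because the '%c' format treats strings and
numbers differently: a string argument yields its first character, while
a numeric argument yields the character with that code.  A small sketch,
assuming an ASCII system and a POSIX shell; the variable names are made
up for illustration:

```shell
result=$(awk '
function chr(c) { return sprintf("%c", c + 0) }

BEGIN {
    s = "65"                     # a string constant, not a number
    first = sprintf("%c", s)     # string value: its first character
    forced = chr(s)              # numeric value 65: that character code
    print first, forced
}')
echo "$result"
```

Without the addition, a caller passing the string "65" would get "6"
back instead of the character whose code is 65.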

An obvious improvement to these functions is to move the code for the
'_ord_init()' function into the body of the 'BEGIN' rule.  It was written
as a separate function initially for ease of development.  There is also
a "test program" in a 'BEGIN' rule, to test 'ord()'.  It is commented out
for production use.
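As a rough sketch of that improvement, the table can be built directly in
a 'BEGIN' rule.  One consequence is that the temporaries are no longer
function-local, so this version gives them '_ord_'-prefixed names
('_ord_lo', '_ord_hi', '_ord_i'); those names are assumptions of this
sketch, not names from 'ord.awk'.  The second 'BEGIN' rule is a
non-interactive stand-in for the commented-out test program:

```shell
roundtrip=$(awk '
BEGIN {
    # Same probe as _ord_init(), inlined; temporaries are now global.
    if (sprintf("%c", 7) == "\a") {                 # regular ascii
        _ord_lo = 0;   _ord_hi = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {    # ascii, mark parity
        _ord_lo = 128; _ord_hi = 255
    } else {                                        # ebcdic
        _ord_lo = 0;   _ord_hi = 255
    }
    for (_ord_i = _ord_lo; _ord_i <= _ord_hi; _ord_i++)
        _ord_[sprintf("%c", _ord_i)] = _ord_i
}

function ord(str) { return _ord_[substr(str, 1, 1)] }
function chr(c)   { return sprintf("%c", c + 0) }

# Only the first character of "ABC" is looked up.
BEGIN { print ord("ABC"), chr(ord("ABC") + 1) }
')
echo "$roundtrip"
```

The trade-off is a few more global names in exchange for one less
function call at startup.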

---------- Footnotes ----------

(1) This is changing; many systems use Unicode, a very large
character set that includes ASCII as a subset.  On systems with full
Unicode support, a character can occupy up to 32 bits, making simple
tests such as those used here prohibitively expensive.

(2) ASCII has been extended in many countries to use the values from
128 to 255 for country-specific characters.  If your system uses these
extensions, you can simplify '_ord_init()' to loop from 0 to 255.
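A sketch of that simplification, assuming an 8-bit extension of ASCII so
that no probing is needed; the shell wrapper and the 'full_range' variable
are only for illustration:

```shell
full_range=$(awk '
# No probe: fill the table for every value from 0 to 255.
function _ord_init(    i)
{
    for (i = 0; i <= 255; i++)
        _ord_[sprintf("%c", i)] = i
}

function ord(str) { return _ord_[substr(str, 1, 1)] }

BEGIN {
    _ord_init()
    print ord("A"), ord("~")
}')
echo "$full_range"
```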