gawk: Translate Program

1 
1 11.3.3 Transliterating Characters
1 ---------------------------------
1 
1 The system 'tr' utility transliterates characters.  For example, it is
1 often used to map uppercase letters into lowercase for further
1 processing:
1 
1      GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ...
1 
1    'tr' requires two lists of characters.(1)  When processing the input,
1 the first character in the first list is replaced with the first
1 character in the second list, the second character in the first list is
1 replaced with the second character in the second list, and so on.  If
1 there are more characters in the "from" list than in the "to" list, the
1 last character of the "to" list is used for the remaining characters in
1 the "from" list.
1 
1    Once upon a time, a user proposed adding a transliteration function
1 to 'gawk'.  The following program was written to prove that character
1 transliteration could be done with a user-level function.  This program
1 is not as complete as the system 'tr' utility, but it does most of the
1 job.
1 
1    The 'translate' program was written long before 'gawk' acquired the
1 ability to split each character in a string into separate array
1 elements.  Thus, it makes repeated use of the 'substr()', 'index()', and
1 'gsub()' built-in functions (⇒String Functions).  There are two
1 functions.  The first, 'stranslate()', takes three arguments:
1 
1 'from'
1      A list of characters from which to translate
1 
1 'to'
1      A list of characters to which to translate
1 
1 'target'
1      The string on which to do the translation
1 
1    Associative arrays make the translation part fairly easy.  't_ar'
1 holds the "to" characters, indexed by the "from" characters.  Then a
1 simple loop goes through 'from', one character at a time.  For each
1 character in 'from', if the character appears in 'target', it is
1 replaced with the corresponding 'to' character.
1 
1    The 'translate()' function calls 'stranslate()', using '$0' as the
1 target.  The main program sets two global variables, 'FROM' and 'TO',
1 from the command line, and then changes 'ARGV' so that 'awk' reads from
1 the standard input.
1 
1    Finally, the processing rule simply calls 'translate()' for each
1 record:
1 
1      # translate.awk --- do tr-like stuff
1      # Bugs: does not handle things like tr A-Z a-z; it has
1      # to be spelled out. However, if `to' is shorter than `from',
1      # the last character in `to' is used for the rest of `from'.
1 
1      function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
1                                                                     result)
1      {
1          lf = length(from)
1          lt = length(to)
1          ltarget = length(target)
1          for (i = 1; i <= lt; i++)
1              t_ar[substr(from, i, 1)] = substr(to, i, 1)
1          if (lt < lf)
1              for (; i <= lf; i++)
1                  t_ar[substr(from, i, 1)] = substr(to, lt, 1)
1          for (i = 1; i <= ltarget; i++) {
1              c = substr(target, i, 1)
1              if (c in t_ar)
1                  c = t_ar[c]
1              result = result c
1          }
1          return result
1      }
1 
1      function translate(from, to)
1      {
1          return $0 = stranslate(from, to, $0)
1      }
1 
1      # main program
1      BEGIN {
1          if (ARGC < 3) {
1              print "usage: translate from to" > "/dev/stderr"
1              exit
1          }
1          FROM = ARGV[1]
1          TO = ARGV[2]
1          ARGC = 2
1          ARGV[1] = "-"
1      }
1 
1      {
1          translate(FROM, TO)
1          print
1      }
1 
1    It is possible to do character transliteration in a user-level
1 function, but it is not necessarily efficient, and we (the 'gawk'
1 developers) started to consider adding a built-in function.  However,
1 shortly after writing this program, we learned that Brian Kernighan had
11 added the 'toupper()' and 'tolower()' functions to his 'awk' (⇒
 String Functions).  These functions handle the vast majority of the
1 cases where character transliteration is necessary, and so we chose to
1 simply add those functions to 'gawk' as well and then leave well enough
1 alone.
1 
1    An obvious improvement to this program would be to set up the 't_ar'
1 array only once, in a 'BEGIN' rule.  However, this assumes that the
1 "from" and "to" lists will never change throughout the lifetime of the
1 program.
1 
1    Another obvious improvement is to enable the use of ranges, such as
1 'a-z', as allowed by the 'tr' utility.  Look at the code for 'cut.awk'
1 (⇒Cut Program) for inspiration.
1 
1    ---------- Footnotes ----------
1 
1    (1) On some older systems, including Solaris, the system version of
1 'tr' may require that the lists be written as range expressions enclosed
1 in square brackets ('[a-z]') and quoted, to prevent the shell from
1 attempting a file name expansion.  This is not a feature.
1