gawk: Controlling Array Traversal

1 
1 12.2.1 Controlling Array Traversal
1 ----------------------------------
1 
1 By default, the order in which a 'for (INDX in ARRAY)' loop scans an
1 array is not defined; it is generally based upon the internal
1 implementation of arrays inside 'awk'.
1 
1    Often, though, it is desirable to be able to loop over the elements
1 in a particular order that you, the programmer, choose.  'gawk' lets you
1 do this.
1 
1    ⇒Controlling Scanning describes how you can assign special,
1 predefined values to 'PROCINFO["sorted_in"]' in order to control the
1 order in which 'gawk' traverses an array during a 'for' loop.
1 
1    In addition, the value of 'PROCINFO["sorted_in"]' can be a function
1 name.(1)  This lets you traverse an array based on any custom criterion.
1 The array elements are ordered according to the return value of this
1 function.  The comparison function should be defined with at least four
1 arguments:
1 
1      function comp_func(i1, v1, i2, v2)
1      {
1          COMPARE ELEMENTS 1 AND 2 IN SOME FASHION
1          RETURN < 0; 0; OR > 0
1      }
1 
1    Here, 'i1' and 'i2' are the indices, and 'v1' and 'v2' are the
1 corresponding values of the two elements being compared.  Either 'v1' or
1 'v2', or both, can be arrays if the array being traversed contains
1 subarrays as values.  (⇒Arrays of Arrays for more information
1 about subarrays.)  The three possible return values are interpreted as
1 follows:
1 
1 'comp_func(i1, v1, i2, v2) < 0'
1      Index 'i1' comes before index 'i2' during loop traversal.
1 
1 'comp_func(i1, v1, i2, v2) == 0'
1      Indices 'i1' and 'i2' come together, but the relative order with
1      respect to each other is undefined.
1 
1 'comp_func(i1, v1, i2, v2) > 0'
1      Index 'i1' comes after index 'i2' during loop traversal.
1 
1    Our first comparison function can be used to scan an array in
1 numerical order of the indices:
1 
1      function cmp_num_idx(i1, v1, i2, v2)
1      {
1           # numerical index comparison, ascending order
1           return (i1 - i2)
1      }
1 
1    Our second function traverses an array based on the string order of
1 the element values rather than by indices:
1 
1      function cmp_str_val(i1, v1, i2, v2)
1      {
1          # string value comparison, ascending order
1          v1 = v1 ""
1          v2 = v2 ""
1          if (v1 < v2)
1              return -1
1          return (v1 != v2)
1      }
1 
1    The third comparison function makes all numbers, and numeric strings
1 without any leading or trailing spaces, come out first during loop
1 traversal:
1 
1      function cmp_num_str_val(i1, v1, i2, v2,   n1, n2)
1      {
1           # numbers before string value comparison, ascending order
1           n1 = v1 + 0
1           n2 = v2 + 0
1           if (n1 == v1)
1               return (n2 == v2) ? (n1 - n2) : -1
1           else if (n2 == v2)
1               return 1
1           return (v1 < v2) ? -1 : (v1 != v2)
1      }
1 
1    Here is a main program to demonstrate how 'gawk' behaves using each
1 of the previous functions:
1 
1      BEGIN {
1          data["one"] = 10
1          data["two"] = 20
1          data[10] = "one"
1          data[100] = 100
1          data[20] = "two"
1 
1          f[1] = "cmp_num_idx"
1          f[2] = "cmp_str_val"
1          f[3] = "cmp_num_str_val"
1          for (i = 1; i <= 3; i++) {
1              printf("Sort function: %s\n", f[i])
1              PROCINFO["sorted_in"] = f[i]
1              for (j in data)
1                  printf("\tdata[%s] = %s\n", j, data[j])
1              print ""
1          }
1      }
1 
1    Here are the results when the program is run:
1 
1      $ gawk -f compdemo.awk
1      -| Sort function: cmp_num_idx      Sort by numeric index
1      -|     data[two] = 20
1      -|     data[one] = 10              Both strings are numerically zero
1      -|     data[10] = one
1      -|     data[20] = two
1      -|     data[100] = 100
1      -|
1      -| Sort function: cmp_str_val      Sort by element values as strings
1      -|     data[one] = 10
1      -|     data[100] = 100             String 100 is less than string 20
1      -|     data[two] = 20
1      -|     data[10] = one
1      -|     data[20] = two
1      -|
1      -| Sort function: cmp_num_str_val  Sort all numeric values before all strings
1      -|     data[one] = 10
1      -|     data[two] = 20
1      -|     data[100] = 100
1      -|     data[10] = one
1      -|     data[20] = two
1 
1    Consider sorting the entries of a GNU/Linux system password file
1 according to login name.  The following program sorts records by a
1 specific field position and can be used for this purpose:
1 
1      # passwd-sort.awk --- simple program to sort by field position
1      # field position is specified by the global variable POS
1 
1      function cmp_field(i1, v1, i2, v2)
1      {
1          # comparison by value, as string, and ascending order
1          return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
1      }
1 
1      {
1          for (i = 1; i <= NF; i++)
1              a[NR][i] = $i
1      }
1 
1      END {
1          PROCINFO["sorted_in"] = "cmp_field"
1          if (POS < 1 || POS > NF)
1              POS = 1
1 
1          for (i in a) {
1              for (j = 1; j <= NF; j++)
1                  printf("%s%c", a[i][j], j < NF ? ":" : "")
1              print ""
1          }
1      }
1 
1    The first field in each entry of the password file is the user's
1 login name, and the fields are separated by colons.  Each record defines
1 a subarray, with each field as an element in the subarray.  Running the
1 program produces the following output:
1 
1      $ gawk -v POS=1 -F: -f sort.awk /etc/passwd
1      -| adm:x:3:4:adm:/var/adm:/sbin/nologin
1      -| apache:x:48:48:Apache:/var/www:/sbin/nologin
1      -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin
1      ...
1 
1    The comparison should normally always return the same value when
1 given a specific pair of array elements as its arguments.  If
1 inconsistent results are returned, then the order is undefined.  This
1 behavior can be exploited to introduce random order into otherwise
1 seemingly ordered data:
1 
1      function cmp_randomize(i1, v1, i2, v2)
1      {
1          # random order (caution: this may never terminate!)
1          return (2 - 4 * rand())
1      }
1 
1    As already mentioned, the order of the indices is arbitrary if two
1 elements compare equal.  This is usually not a problem, but letting the
1 tied elements come out in arbitrary order can be an issue, especially
1 when comparing item values.  The partial ordering of the equal elements
1 may change the next time the array is traversed, if other elements are
1 added to or removed from the array.  One way to resolve ties when
1 comparing elements with otherwise equal values is to include the indices
1 in the comparison rules.  Note that doing this may make the loop
1 traversal less efficient, so consider it only if necessary.  The
1 following comparison functions force a deterministic order, and are
1 based on the fact that the (string) indices of two elements are never
1 equal:
1 
1      function cmp_numeric(i1, v1, i2, v2)
1      {
1          # numerical value (and index) comparison, descending order
1          return (v1 != v2) ? (v2 - v1) : (i2 - i1)
1      }
1 
1      function cmp_string(i1, v1, i2, v2)
1      {
1          # string value (and index) comparison, descending order
1          v1 = v1 i1
1          v2 = v2 i2
1          return (v1 > v2) ? -1 : (v1 != v2)
1      }
1 
1    A custom comparison function can often simplify ordered loop
1 traversal, and the sky is really the limit when it comes to designing
1 such a function.
1 
1    When string comparisons are made during a sort, either for element
1 values where one or both aren't numbers, or for element indices handled
1 as strings, the value of 'IGNORECASE' (⇒Built-in Variables)
1 controls whether the comparisons treat corresponding upper- and
1 lowercase letters as equivalent or distinct.
1 
1    Another point to keep in mind is that in the case of subarrays, the
1 element values can themselves be arrays; a production comparison
1 function should use the 'isarray()' function (⇒Type Functions) to
1 check for this, and choose a defined sorting order for subarrays.
1 
1    All sorting based on 'PROCINFO["sorted_in"]' is disabled in POSIX
1 mode, because the 'PROCINFO' array is not special in that case.
1 
1    As a side note, sorting the array indices before traversing the array
1 has been reported to add a 15% to 20% overhead to the execution time of
1 'awk' programs.  For this reason, sorted array traversal is not the
1 default.
1 
1    ---------- Footnotes ----------
1 
1    (1) This is why the predefined sorting orders start with an '@'
1 character, which cannot be part of an identifier.
1