gawk: Array Sorting Functions

1 
1 12.2.2 Sorting Array Values and Indices with 'gawk'
1 ---------------------------------------------------
1 
1 In most 'awk' implementations, sorting an array requires writing a
1 'sort()' function.  This can be educational for exploring different
1 sorting algorithms, but usually that's not the point of the program.
11 'gawk' provides the built-in 'asort()' and 'asorti()' functions (⇒
 String Functions) for sorting arrays.  For example:
1 
1      POPULATE THE ARRAY data
1      n = asort(data)
1      for (i = 1; i <= n; i++)
1          DO SOMETHING WITH data[i]
1 
1    After the call to 'asort()', the array 'data' is indexed from 1 to
1 some number N, the total number of elements in 'data'.  (This count is
1 'asort()''s return value.)  'data[1]' <= 'data[2]' <= 'data[3]', and so
11 on.  The default comparison is based on the type of the elements (⇒
 Typing and Comparison).  All numeric values come before all string
1 values, which in turn come before all subarrays.
1 
1    An important side effect of calling 'asort()' is that _the array's
1 original indices are irrevocably lost_.  As this isn't always desirable,
1 'asort()' accepts a second argument:
1 
1      POPULATE THE ARRAY source
1      n = asort(source, dest)
1      for (i = 1; i <= n; i++)
1          DO SOMETHING WITH dest[i]
1 
1    In this case, 'gawk' copies the 'source' array into the 'dest' array
1 and then sorts 'dest', destroying its indices.  However, the 'source'
1 array is not affected.
1 
1    Often, what's needed is to sort on the values of the _indices_
1 instead of the values of the elements.  To do that, use the 'asorti()'
1 function.  The interface and behavior are identical to that of
1 'asort()', except that the index values are used for sorting and become
1 the values of the result array:
1 
1      { source[$0] = some_func($0) }
1 
1      END {
1          n = asorti(source, dest)
1          for (i = 1; i <= n; i++) {
1              Work with sorted indices directly:
1              DO SOMETHING WITH dest[i]
1              ...
1              Access original array via sorted indices:
1              DO SOMETHING WITH source[dest[i]]
1          }
1      }
1 
1    So far, so good.  Now it starts to get interesting.  Both 'asort()'
1 and 'asorti()' accept a third string argument to control comparison of
11 array elements.  When we introduced 'asort()' and 'asorti()' in ⇒
 String Functions, we ignored this third argument; however, now is the
1 time to describe how this argument affects these two functions.
1 
1    Basically, the third argument specifies how the array is to be
1 sorted.  There are two possibilities.  As with 'PROCINFO["sorted_in"]',
1 this argument may be one of the predefined names that 'gawk' provides
1 (⇒Controlling Scanning), or it may be the name of a user-defined
1 function (⇒Controlling Array Traversal).
1 
1    In the latter case, _the function can compare elements in any way it
1 chooses_, taking into account just the indices, just the values, or
1 both.  This is extremely powerful.
1 
1    Once the array is sorted, 'asort()' takes the _values_ in their final
1 order and uses them to fill in the result array, whereas 'asorti()'
1 takes the _indices_ in their final order and uses them to fill in the
1 result array.
1 
1      NOTE: Copying array indices and elements isn't expensive in terms
1      of memory.  Internally, 'gawk' maintains "reference counts" to
1      data.  For example, when 'asort()' copies the first array to the
1      second one, there is only one copy of the original array elements'
1      data, even though both arrays use the values.
1 
1    Because 'IGNORECASE' affects string comparisons, the value of
1 'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'.
1 Note also that the locale's sorting order does _not_ come into play;
1 comparisons are based on character values only.(1)
1 
1    The following example demonstrates the use of a comparison function
1 with 'asort()'.  The comparison function, 'case_fold_compare()', maps
1 both values to lowercase in order to compare them ignoring case.
1 
1      # case_fold_compare --- compare as strings, ignoring case
1 
1      function case_fold_compare(i1, v1, i2, v2,    l, r)
1      {
1          l = tolower(v1)
1          r = tolower(v2)
1 
1          if (l < r)
1              return -1
1          else if (l == r)
1              return 0
1          else
1              return 1
1      }
1 
1    And here is the test program for it:
1 
1      # Test program
1 
1      BEGIN {
1          Letters = "abcdefghijklmnopqrstuvwxyz" \
1                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
1          split(Letters, data, "")
1 
1          asort(data, result, "case_fold_compare")
1 
1          j = length(result)
1          for (i = 1; i <= j; i++) {
1              printf("%s", result[i])
1              if (i % (j/2) == 0)
1                  printf("\n")
1              else
1                  printf(" ")
1          }
1      }
1 
1    When run, we get the following:
1 
1      $ gawk -f case_fold_compare.awk
1      -| A a B b c C D d e E F f g G H h i I J j k K l L M m
1      -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
1 
1    ---------- Footnotes ----------
1 
1    (1) This is true because locale-based comparison occurs only when in
1 POSIX-compatibility mode, and because 'asort()' and 'asorti()' are
1 'gawk' extensions, they are not available in that case.
1