gawk: Array Sorting Functions
1
1 12.2.2 Sorting Array Values and Indices with 'gawk'
1 ---------------------------------------------------
1
1 In most 'awk' implementations, sorting an array requires writing a
1 'sort()' function. This can be educational for exploring different
1 sorting algorithms, but usually that's not the point of the program.
11 'gawk' provides the built-in 'asort()' and 'asorti()' functions (⇒
String Functions) for sorting arrays. For example:
1
1 POPULATE THE ARRAY data
1 n = asort(data)
1 for (i = 1; i <= n; i++)
1 DO SOMETHING WITH data[i]
1
1 After the call to 'asort()', the array 'data' is indexed from 1 to
1 some number N, the total number of elements in 'data'. (This count is
1 'asort()''s return value.) 'data[1]' <= 'data[2]' <= 'data[3]', and so
11 on. The default comparison is based on the type of the elements (⇒
Typing and Comparison). All numeric values come before all string
1 values, which in turn come before all subarrays.
1
1 An important side effect of calling 'asort()' is that _the array's
1 original indices are irrevocably lost_. As this isn't always desirable,
1 'asort()' accepts a second argument:
1
1 POPULATE THE ARRAY source
1 n = asort(source, dest)
1 for (i = 1; i <= n; i++)
1 DO SOMETHING WITH dest[i]
1
1 In this case, 'gawk' copies the 'source' array into the 'dest' array
1 and then sorts 'dest', destroying its indices. However, the 'source'
1 array is not affected.
1
1 Often, what's needed is to sort on the values of the _indices_
1 instead of the values of the elements. To do that, use the 'asorti()'
1 function. The interface and behavior are identical to that of
1 'asort()', except that the index values are used for sorting and become
1 the values of the result array:
1
1 { source[$0] = some_func($0) }
1
1 END {
1 n = asorti(source, dest)
1 for (i = 1; i <= n; i++) {
1 Work with sorted indices directly:
1 DO SOMETHING WITH dest[i]
1 ...
1 Access original array via sorted indices:
1 DO SOMETHING WITH source[dest[i]]
1 }
1 }
1
1 So far, so good. Now it starts to get interesting. Both 'asort()'
1 and 'asorti()' accept a third string argument to control comparison of
11 array elements. When we introduced 'asort()' and 'asorti()' in ⇒
String Functions, we ignored this third argument; however, now is the
1 time to describe how this argument affects these two functions.
1
1 Basically, the third argument specifies how the array is to be
1 sorted. There are two possibilities. As with 'PROCINFO["sorted_in"]',
1 this argument may be one of the predefined names that 'gawk' provides
1 (⇒Controlling Scanning), or it may be the name of a user-defined
1 function (⇒Controlling Array Traversal).
1
1 In the latter case, _the function can compare elements in any way it
1 chooses_, taking into account just the indices, just the values, or
1 both. This is extremely powerful.
1
1 Once the array is sorted, 'asort()' takes the _values_ in their final
1 order and uses them to fill in the result array, whereas 'asorti()'
1 takes the _indices_ in their final order and uses them to fill in the
1 result array.
1
1 NOTE: Copying array indices and elements isn't expensive in terms
1 of memory. Internally, 'gawk' maintains "reference counts" to
1 data. For example, when 'asort()' copies the first array to the
1 second one, there is only one copy of the original array elements'
1 data, even though both arrays use the values.
1
1 Because 'IGNORECASE' affects string comparisons, the value of
1 'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'.
1 Note also that the locale's sorting order does _not_ come into play;
1 comparisons are based on character values only.(1)
1
1 The following example demonstrates the use of a comparison function
1 with 'asort()'. The comparison function, 'case_fold_compare()', maps
1 both values to lowercase in order to compare them ignoring case.
1
1 # case_fold_compare --- compare as strings, ignoring case
1
1 function case_fold_compare(i1, v1, i2, v2, l, r)
1 {
1 l = tolower(v1)
1 r = tolower(v2)
1
1 if (l < r)
1 return -1
1 else if (l == r)
1 return 0
1 else
1 return 1
1 }
1
1 And here is the test program for it:
1
1 # Test program
1
1 BEGIN {
1 Letters = "abcdefghijklmnopqrstuvwxyz" \
1 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
1 split(Letters, data, "")
1
1 asort(data, result, "case_fold_compare")
1
1 j = length(result)
1 for (i = 1; i <= j; i++) {
1 printf("%s", result[i])
1 if (i % (j/2) == 0)
1 printf("\n")
1 else
1 printf(" ")
1 }
1 }
1
1 When run, we get the following:
1
1 $ gawk -f case_fold_compare.awk
1 -| A a B b c C D d e E F f g G H h i I J j k K l L M m
1 -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
1
1 ---------- Footnotes ----------
1
1 (1) This is true because locale-based comparison occurs only when in
1 POSIX-compatibility mode, and because 'asort()' and 'asorti()' are
1 'gawk' extensions, they are not available in that case.
1