gawk: Controlling Array Traversal
1
1 12.2.1 Controlling Array Traversal
1 ----------------------------------
1
1 By default, the order in which a 'for (INDX in ARRAY)' loop scans an
1 array is not defined; it is generally based upon the internal
1 implementation of arrays inside 'awk'.
1
1 Often, though, it is desirable to be able to loop over the elements
1 in a particular order that you, the programmer, choose. 'gawk' lets you
1 do this.
1
1 ⇒Controlling Scanning describes how you can assign special,
1 predefined values to 'PROCINFO["sorted_in"]' in order to control the
1 order in which 'gawk' traverses an array during a 'for' loop.
1
1 In addition, the value of 'PROCINFO["sorted_in"]' can be a function
1 name.(1) This lets you traverse an array based on any custom criterion.
1 The array elements are ordered according to the return value of this
1 function. The comparison function should be defined with at least four
1 arguments:
1
1 function comp_func(i1, v1, i2, v2)
1 {
1 COMPARE ELEMENTS 1 AND 2 IN SOME FASHION
1 RETURN < 0; 0; OR > 0
1 }
1
1 Here, 'i1' and 'i2' are the indices, and 'v1' and 'v2' are the
1 corresponding values of the two elements being compared. Either 'v1' or
1 'v2', or both, can be arrays if the array being traversed contains
1 subarrays as values. (⇒Arrays of Arrays for more information
1 about subarrays.) The three possible return values are interpreted as
1 follows:
1
1 'comp_func(i1, v1, i2, v2) < 0'
1 Index 'i1' comes before index 'i2' during loop traversal.
1
1 'comp_func(i1, v1, i2, v2) == 0'
1 Indices 'i1' and 'i2' come together, but the relative order with
1 respect to each other is undefined.
1
1 'comp_func(i1, v1, i2, v2) > 0'
1 Index 'i1' comes after index 'i2' during loop traversal.
1
1 Our first comparison function can be used to scan an array in
1 numerical order of the indices:
1
1 function cmp_num_idx(i1, v1, i2, v2)
1 {
1 # numerical index comparison, ascending order
1 return (i1 - i2)
1 }
1
1 Our second function traverses an array based on the string order of
1 the element values rather than by indices:
1
1 function cmp_str_val(i1, v1, i2, v2)
1 {
1 # string value comparison, ascending order
1 v1 = v1 ""
1 v2 = v2 ""
1 if (v1 < v2)
1 return -1
1 return (v1 != v2)
1 }
1
1 The third comparison function makes all numbers, and numeric strings
1 without any leading or trailing spaces, come out first during loop
1 traversal:
1
1 function cmp_num_str_val(i1, v1, i2, v2, n1, n2)
1 {
1 # numbers before string value comparison, ascending order
1 n1 = v1 + 0
1 n2 = v2 + 0
1 if (n1 == v1)
1 return (n2 == v2) ? (n1 - n2) : -1
1 else if (n2 == v2)
1 return 1
1 return (v1 < v2) ? -1 : (v1 != v2)
1 }
1
1 Here is a main program to demonstrate how 'gawk' behaves using each
1 of the previous functions:
1
1 BEGIN {
1 data["one"] = 10
1 data["two"] = 20
1 data[10] = "one"
1 data[100] = 100
1 data[20] = "two"
1
1 f[1] = "cmp_num_idx"
1 f[2] = "cmp_str_val"
1 f[3] = "cmp_num_str_val"
1 for (i = 1; i <= 3; i++) {
1 printf("Sort function: %s\n", f[i])
1 PROCINFO["sorted_in"] = f[i]
1 for (j in data)
1 printf("\tdata[%s] = %s\n", j, data[j])
1 print ""
1 }
1 }
1
1 Here are the results when the program is run:
1
1 $ gawk -f compdemo.awk
1 -| Sort function: cmp_num_idx Sort by numeric index
1 -| data[two] = 20
1 -| data[one] = 10 Both strings are numerically zero
1 -| data[10] = one
1 -| data[20] = two
1 -| data[100] = 100
1 -|
1 -| Sort function: cmp_str_val Sort by element values as strings
1 -| data[one] = 10
1 -| data[100] = 100 String 100 is less than string 20
1 -| data[two] = 20
1 -| data[10] = one
1 -| data[20] = two
1 -|
1 -| Sort function: cmp_num_str_val Sort all numeric values before all strings
1 -| data[one] = 10
1 -| data[two] = 20
1 -| data[100] = 100
1 -| data[10] = one
1 -| data[20] = two
1
1 Consider sorting the entries of a GNU/Linux system password file
1 according to login name. The following program sorts records by a
1 specific field position and can be used for this purpose:
1
1 # passwd-sort.awk --- simple program to sort by field position
1 # field position is specified by the global variable POS
1
1 function cmp_field(i1, v1, i2, v2)
1 {
1 # comparison by value, as string, and ascending order
1 return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
1 }
1
1 {
1 for (i = 1; i <= NF; i++)
1 a[NR][i] = $i
1 }
1
1 END {
1 PROCINFO["sorted_in"] = "cmp_field"
1 if (POS < 1 || POS > NF)
1 POS = 1
1
1 for (i in a) {
1 for (j = 1; j <= NF; j++)
1 printf("%s%c", a[i][j], j < NF ? ":" : "")
1 print ""
1 }
1 }
1
1 The first field in each entry of the password file is the user's
1 login name, and the fields are separated by colons. Each record defines
1 a subarray, with each field as an element in the subarray. Running the
1 program produces the following output:
1
1 $ gawk -v POS=1 -F: -f sort.awk /etc/passwd
1 -| adm:x:3:4:adm:/var/adm:/sbin/nologin
1 -| apache:x:48:48:Apache:/var/www:/sbin/nologin
1 -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin
1 ...
1
1 The comparison should normally always return the same value when
1 given a specific pair of array elements as its arguments. If
1 inconsistent results are returned, then the order is undefined. This
1 behavior can be exploited to introduce random order into otherwise
1 seemingly ordered data:
1
1 function cmp_randomize(i1, v1, i2, v2)
1 {
1 # random order (caution: this may never terminate!)
1 return (2 - 4 * rand())
1 }
1
1 As already mentioned, the order of the indices is arbitrary if two
1 elements compare equal. This is usually not a problem, but letting the
1 tied elements come out in arbitrary order can be an issue, especially
1 when comparing item values. The partial ordering of the equal elements
1 may change the next time the array is traversed, if other elements are
1 added to or removed from the array. One way to resolve ties when
1 comparing elements with otherwise equal values is to include the indices
1 in the comparison rules. Note that doing this may make the loop
1 traversal less efficient, so consider it only if necessary. The
1 following comparison functions force a deterministic order, and are
1 based on the fact that the (string) indices of two elements are never
1 equal:
1
1 function cmp_numeric(i1, v1, i2, v2)
1 {
1 # numerical value (and index) comparison, descending order
1 return (v1 != v2) ? (v2 - v1) : (i2 - i1)
1 }
1
1 function cmp_string(i1, v1, i2, v2)
1 {
1 # string value (and index) comparison, descending order
1 v1 = v1 i1
1 v2 = v2 i2
1 return (v1 > v2) ? -1 : (v1 != v2)
1 }
1
1 A custom comparison function can often simplify ordered loop
1 traversal, and the sky is really the limit when it comes to designing
1 such a function.
1
1 When string comparisons are made during a sort, either for element
1 values where one or both aren't numbers, or for element indices handled
1 as strings, the value of 'IGNORECASE' (⇒Built-in Variables)
1 controls whether the comparisons treat corresponding upper- and
1 lowercase letters as equivalent or distinct.
1
1 Another point to keep in mind is that in the case of subarrays, the
1 element values can themselves be arrays; a production comparison
1 function should use the 'isarray()' function (⇒Type Functions) to
1 check for this, and choose a defined sorting order for subarrays.
1
1 All sorting based on 'PROCINFO["sorted_in"]' is disabled in POSIX
1 mode, because the 'PROCINFO' array is not special in that case.
1
1 As a side note, sorting the array indices before traversing the array
1 has been reported to add a 15% to 20% overhead to the execution time of
1 'awk' programs. For this reason, sorted array traversal is not the
1 default.
1
1 ---------- Footnotes ----------
1
1 (1) This is why the predefined sorting orders start with an '@'
1 character, which cannot be part of an identifier.
1