9.1.3 String-Manipulation Functions
-----------------------------------

The functions in this section look at or change the text of one or
more strings.

   'gawk' understands locales (⇒Locales) and does all string
processing in terms of _characters_, not _bytes_.  This distinction is
particularly important to understand for locales where one character may
be represented by multiple bytes.  Thus, for example, 'length()' returns
the number of characters in a string, and not the number of bytes used
to represent those characters.  Similarly, 'index()' works with
character indices, and not byte indices.

     CAUTION: A number of functions deal with indices into strings.  For
     these functions, the first character of a string is at position
     (index) one.  This is different from C and the languages descended
     from it, where the first character is at position zero.  You need
     to remember this when doing index calculations, particularly if you
     are used to C.

   In the following list, optional parameters are enclosed in square
brackets ([ ]).  Several functions perform string substitution; the full
discussion is provided in the description of the 'sub()' function, which
comes toward the end, because the list is presented alphabetically.

   Those functions that are specific to 'gawk' are marked with a pound
sign ('#').  They are not available in compatibility mode (⇒Options):

· Gory Details          More than you want to know about '\' and
                        '&' with 'sub()', 'gsub()', and
                        'gensub()'.

'asort('SOURCE [',' DEST [',' HOW ] ]') #'
'asorti('SOURCE [',' DEST [',' HOW ] ]') #'
     These two functions are similar in behavior, so they are described
     together.

          NOTE: The following description ignores the third argument,
          HOW, as it requires understanding features that we have not
          discussed yet.  Thus, the discussion here is a deliberate
          simplification.  (We do provide all the details later on; see
          ⇒Array Sorting Functions for the full story.)

     Both functions return the number of elements in the array SOURCE.
     For 'asort()', 'gawk' sorts the values of SOURCE and replaces the
     indices of the sorted values of SOURCE with sequential integers
     starting with one.  If the optional array DEST is specified, then
     SOURCE is duplicated into DEST.  DEST is then sorted, leaving the
     indices of SOURCE unchanged.

     When comparing strings, 'IGNORECASE' affects the sorting
     (⇒Array Sorting Functions).  If the SOURCE array contains subarrays
     as values (⇒Arrays of Arrays), they will come last, after
     all scalar values.  Subarrays are _not_ recursively sorted.

     For example, if the contents of 'a' are as follows:

          a["last"] = "de"
          a["first"] = "sac"
          a["middle"] = "cul"

     A call to 'asort()':

          asort(a)

     results in the following contents of 'a':

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     The 'asorti()' function works similarly to 'asort()'; however, the
     _indices_ are sorted, instead of the values.  Thus, in the previous
     example, starting with the same initial set of indices and values
     in 'a', calling 'asorti(a)' would yield:

          a[1] = "first"
          a[2] = "last"
          a[3] = "middle"

'gensub(REGEXP, REPLACEMENT, HOW' [', TARGET']') #'
     Search the target string TARGET for matches of the regular
     expression REGEXP.
     If HOW is a string beginning with 'g' or 'G'
     (short for "global"), then replace all matches of REGEXP with
     REPLACEMENT.  Otherwise, HOW is treated as a number indicating
     which match of REGEXP to replace.  If no TARGET is supplied, use
     '$0'.  It returns the modified string as the result of the function
     and the original target string is _not_ changed.

     'gensub()' is a general substitution function.  Its purpose is to
     provide more features than the standard 'sub()' and 'gsub()'
     functions.

     'gensub()' provides an additional feature that is not available in
     'sub()' or 'gsub()': the ability to specify components of a regexp
     in the replacement text.  This is done by using parentheses in the
     regexp to mark the components and then specifying '\N' in the
     replacement text, where N is a digit from 1 to 9.  For example:

          $ gawk '
          > BEGIN {
          >      a = "abc def"
          >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
          >      print b
          > }'
          -| def abc

     As with 'sub()', you must type two backslashes in order to get one
     into the string.  In the replacement text, the sequence '\0'
     represents the entire matched text, as does the character '&'.

     The following example shows how you can use the third argument to
     control which match of the regexp should be changed:

          $ echo a b c a b c |
          > gawk '{ print gensub(/a/, "AA", 2) }'
          -| a b c AA b c

     In this case, '$0' is the default target string.  'gensub()'
     returns the new string as its result, which is passed directly to
     'print' for printing.

     If the HOW argument is a string that does not begin with 'g' or
     'G', or if it is a number that is less than or equal to zero, only
     one substitution is performed.  If HOW is zero, 'gawk' issues a
     warning message.

     If REGEXP does not match TARGET, 'gensub()''s return value is the
     original unchanged value of TARGET.

'gsub(REGEXP, REPLACEMENT' [', TARGET']')'
     Search TARGET for _all_ of the longest, leftmost, _nonoverlapping_
     matching substrings it can find and replace them with REPLACEMENT.
     The 'g' in 'gsub()' stands for "global," which means replace
     everywhere.  For example:

          { gsub(/Britain/, "United Kingdom"); print }

     replaces all occurrences of the string 'Britain' with 'United
     Kingdom' for all input records.

     The 'gsub()' function returns the number of substitutions made.  If
     the variable to search and alter (TARGET) is omitted, then the
     entire input record ('$0') is used.  As in 'sub()', the characters
     '&' and '\' are special, and the third argument must be assignable.

'index(IN, FIND)'
     Search the string IN for the first occurrence of the string FIND,
     and return the position in characters where that occurrence begins
     in the string IN.  Consider the following example:

          $ awk 'BEGIN { print index("peanut", "an") }'
          -| 3

     If FIND is not found, 'index()' returns zero.

     With BWK 'awk' and 'gawk', it is a fatal error to use a regexp
     constant for FIND.  Other implementations allow it, simply treating
     the regexp constant as an expression meaning '$0 ~ /regexp/'.
     (d.c.)

'length('[STRING]')'
     Return the number of characters in STRING.  If STRING is a number,
     the length of the digit string representing that number is
     returned.  For example, 'length("abcde")' is five.  By contrast,
     'length(15 * 35)' works out to three.  In this example, 15 * 35 =
     525, and 525 is then converted to the string '"525"', which has
     three characters.

     If no argument is supplied, 'length()' returns the length of '$0'.

          NOTE: In older versions of 'awk', the 'length()' function
          could be called without any parentheses.  Doing so is
          considered poor practice, although the 2008 POSIX standard
          explicitly allows it, to support historical practice.  For
          programs to be maximally portable, always supply the
          parentheses.
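
     The one-based positions returned by 'index()' and the string
     conversion performed by 'length()' can be checked with a short
     sketch such as the following.  The sample strings and the shell
     variable 'out' are chosen only for illustration, and the code
     assumes nothing beyond a POSIX 'awk'.

```shell
# A minimal sketch using only POSIX awk features; sample data is invented.
out=$(awk 'BEGIN {
    s = "peanut"
    pos = index(s, "nut")    # one-based position where "nut" begins
    print pos
    print index(s, "xyz")    # zero means "not found"
    print length(s)          # number of characters in s
    print length(12 * 12)    # 144 is converted to "144" first
    print substr(s, pos)     # substr() is one-based too, so pos works as is
}')
echo "$out"
```

     Because both 'index()' and 'substr()' count from one, the value of
     'pos' can be passed along without the adjustment that a C
     programmer might expect to need.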

     If 'length()' is called with a variable that has not been used,
     'gawk' forces the variable to be a scalar.  Other implementations
     of 'awk' leave the variable without a type.  (d.c.)  Consider:

          $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0
          error-> gawk: fatal: attempt to use scalar `x' as array

          $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0

     If '--lint' has been specified on the command line, 'gawk' issues a
     warning about this.

     With 'gawk' and several other 'awk' implementations, when given an
     array argument, the 'length()' function returns the number of
     elements in the array.  (c.e.)  This is less useful than it might
     seem at first, as the array is not guaranteed to be indexed from
     one to the number of elements in it.  If '--lint' is provided on
     the command line (⇒Options), 'gawk' warns that passing an
     array argument is not portable.  If '--posix' is supplied, using an
     array argument is a fatal error (⇒Arrays).

'match(STRING, REGEXP' [', ARRAY']')'
     Search STRING for the longest, leftmost substring matched by the
     regular expression REGEXP and return the character position (index)
     at which that substring begins (one, if it starts at the beginning
     of STRING).  If no match is found, return zero.

     The REGEXP argument may be either a regexp constant ('/'...'/') or
     a string constant ('"'...'"').  In the latter case, the string is
     treated as a regexp to be matched.  ⇒Computed Regexps for a
     discussion of the difference between the two forms, and the
     implications for writing your program correctly.

     The order of the first two arguments is the opposite of most other
     string functions that work with regular expressions, such as
     'sub()' and 'gsub()'.  It might help to remember that for
     'match()', the order is the same as for the '~' operator: 'STRING ~
     REGEXP'.

     The 'match()' function sets the predefined variable 'RSTART' to the
     index.
     It also sets the predefined variable 'RLENGTH' to the
     length in characters of the matched substring.  If no match is
     found, 'RSTART' is set to zero, and 'RLENGTH' to -1.

     For example:

          {
              if ($1 == "FIND")
                  regex = $2
              else {
                  where = match($0, regex)
                  if (where != 0)
                      print "Match of", regex, "found at", where, "in", $0
              }
          }

     This program looks for lines that match the regular expression
     stored in the variable 'regex'.  This regular expression can be
     changed.  If the first word on a line is 'FIND', 'regex' is changed
     to be the second word on that line.  Therefore, if given:

          FIND ru+n
          My program runs
          but not very quickly
          FIND Melvin
          JF+KM
          This line is property of Reality Engineering Co.
          Melvin was here.

     'awk' prints:

          Match of ru+n found at 12 in My program runs
          Match of Melvin found at 1 in Melvin was here.

     If ARRAY is present, it is cleared, and then the zeroth element of
     ARRAY is set to the entire portion of STRING matched by REGEXP.  If
     REGEXP contains parentheses, the integer-indexed elements of ARRAY
     are set to contain the portion of STRING matching the corresponding
     parenthesized subexpression.  For example:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >         print arr[1], arr[2] }'
          -| foooo barrrrr

     In addition, multidimensional subscripts are available providing
     the start index and length of each matched subexpression:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2]
          >           print arr[1, "start"], arr[1, "length"]
          >           print arr[2, "start"], arr[2, "length"]
          > }'
          -| foooo barrrrr
          -| 1 5
          -| 9 7

     There may not be subscripts for the start and length of every
     parenthesized subexpression, because they may not all have matched
     text; thus, they should be tested for with the 'in' operator
     (⇒Reference to Elements).

     The ARRAY argument to 'match()' is a 'gawk' extension.
     In compatibility mode (⇒Options), using a third argument is a
     fatal error.

'patsplit(STRING, ARRAY' [', FIELDPAT' [', SEPS' ] ]') #'
     Divide STRING into pieces (or "fields") defined by FIELDPAT and
     store the pieces in ARRAY and the separator strings in the SEPS
     array.  The first piece is stored in 'ARRAY[1]', the second piece
     in 'ARRAY[2]', and so forth.  The third argument, FIELDPAT, is a
     regexp describing the fields in STRING (just as 'FPAT' is a regexp
     describing the fields in input records).  It may be either a regexp
     constant or a string.  If FIELDPAT is omitted, the value of 'FPAT'
     is used.  'patsplit()' returns the number of elements created.
     'SEPS[I]' is the possibly null separator string after 'ARRAY[I]'.
     The possibly null leading separator will be in 'SEPS[0]'.  So a
     non-null STRING with N fields will have N+1 separators.  A null
     STRING has neither fields nor separators.

     The 'patsplit()' function splits strings into pieces in a manner
     similar to the way input lines are split into fields using 'FPAT'
     (⇒Splitting By Content).

     Before splitting the string, 'patsplit()' deletes any previously
     existing elements in the arrays ARRAY and SEPS.

'split(STRING, ARRAY' [', FIELDSEP' [', SEPS' ] ]')'
     Divide STRING into pieces separated by FIELDSEP and store the
     pieces in ARRAY and the separator strings in the SEPS array.  The
     first piece is stored in 'ARRAY[1]', the second piece in
     'ARRAY[2]', and so forth.  The string value of the third argument,
     FIELDSEP, is a regexp describing where to split STRING (much as
     'FS' can be a regexp describing where to split input records).  If
     FIELDSEP is omitted, the value of 'FS' is used.  'split()' returns
     the number of elements created.  SEPS is a 'gawk' extension, with
     'SEPS[I]' being the separator string between 'ARRAY[I]' and
     'ARRAY[I+1]'.
     If FIELDSEP is a single space, then any leading
     whitespace goes into 'SEPS[0]' and any trailing whitespace goes
     into 'SEPS[N]', where N is the return value of 'split()' (i.e., the
     number of elements in ARRAY).

     The 'split()' function splits strings into pieces in a manner
     similar to the way input lines are split into fields.  For example:

          split("cul-de-sac", a, "-", seps)

     splits the string '"cul-de-sac"' into three fields using '-' as the
     separator.  It sets the contents of the array 'a' as follows:

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     and sets the contents of the array 'seps' as follows:

          seps[1] = "-"
          seps[2] = "-"

     The value returned by this call to 'split()' is three.

     As with input field-splitting, when the value of FIELDSEP is '" "',
     leading and trailing whitespace is ignored in values assigned to
     the elements of ARRAY but not in SEPS, and the elements are
     separated by runs of whitespace.  Also, as with input
     field-splitting, if FIELDSEP is the null string, each individual
     character in the string is split into its own array element.
     (c.e.)

     Note, however, that 'RS' has no effect on the way 'split()' works.
     Even though 'RS = ""' causes the newline character to also be an
     input field separator, this does not affect how 'split()' splits
     strings.

     Modern implementations of 'awk', including 'gawk', allow the third
     argument to be a regexp constant ('/'...'/') as well as a string.
     ⇒Computed Regexps for a discussion of the difference between using
     a string constant or a regexp constant, and the implications for
     writing your program correctly.

     Before splitting the string, 'split()' deletes any previously
     existing elements in the arrays ARRAY and SEPS.

     If STRING is null, the array has no elements.  (So this is a
     portable way to delete an entire array with one statement.
     ⇒Delete.)

     If STRING does not match FIELDSEP at all (but is not null), ARRAY
     has one element only.
     The value of that element is the original
     STRING.

     In POSIX mode (⇒Options), the fourth argument is not
     allowed.

'sprintf(FORMAT, EXPRESSION1, ...)'
     Return (without printing) the string that 'printf' would have
     printed out with the same arguments (⇒Printf).  For example:

          pival = sprintf("pi = %.2f (approx.)", 22/7)

     assigns the string 'pi = 3.14 (approx.)' to the variable 'pival'.

'strtonum(STR) #'
     Examine STR and return its numeric value.  If STR begins with a
     leading '0', 'strtonum()' assumes that STR is an octal number.  If
     STR begins with a leading '0x' or '0X', 'strtonum()' assumes that
     STR is a hexadecimal number.  For example:

          $ echo 0x11 |
          > gawk '{ printf "%d\n", strtonum($1) }'
          -| 17

     Using the 'strtonum()' function is _not_ the same as adding zero to
     a string value; the automatic coercion of strings to numbers works
     only for decimal data, not for octal or hexadecimal.(1)

     Note also that 'strtonum()' uses the current locale's decimal point
     for recognizing numbers (⇒Locales).

'sub(REGEXP, REPLACEMENT' [', TARGET']')'
     Search TARGET, which is treated as a string, for the leftmost,
     longest substring matched by the regular expression REGEXP.  Modify
     the entire string by replacing the matched text with REPLACEMENT.
     The modified string becomes the new value of TARGET.  Return the
     number of substitutions made (zero or one).

     The REGEXP argument may be either a regexp constant ('/'...'/') or
     a string constant ('"'...'"').  In the latter case, the string is
     treated as a regexp to be matched.  ⇒Computed Regexps for a
     discussion of the difference between the two forms, and the
     implications for writing your program correctly.

     This function is peculiar because TARGET is not simply used to
     compute a value, and not just any expression will do--it must be a
     variable, field, or array element so that 'sub()' can store a
     modified value there.
     If this argument is omitted, then the
     default is to use and alter '$0'.(2)  For example:

          str = "water, water, everywhere"
          sub(/at/, "ith", str)

     sets 'str' to 'wither, water, everywhere', by replacing the
     leftmost longest occurrence of 'at' with 'ith'.

     If the special character '&' appears in REPLACEMENT, it stands for
     the precise substring that was matched by REGEXP.  (If the regexp
     can match more than one string, then this precise substring may
     vary.)  For example:

          { sub(/candidate/, "& and his wife"); print }

     changes the first occurrence of 'candidate' to 'candidate and his
     wife' on each input line.  Here is another example:

          $ awk 'BEGIN {
          >         str = "daabaaa"
          >         sub(/a+/, "C&C", str)
          >         print str
          > }'
          -| dCaaCbaaa

     This shows how '&' can represent a nonconstant string and also
     illustrates the "leftmost, longest" rule in regexp matching
     (⇒Leftmost Longest).

     The effect of this special character ('&') can be turned off by
     putting a backslash before it in the string.  As usual, to insert
     one backslash in the string, you must write two backslashes.
     Therefore, write '\\&' in a string constant to include a literal
     '&' in the replacement.  For example, the following shows how to
     replace the first '|' on each line with an '&':

          { sub(/\|/, "\\&"); print }

     As mentioned, the third argument to 'sub()' must be a variable,
     field, or array element.  Some versions of 'awk' allow the third
     argument to be an expression that is not an lvalue.  In such a
     case, 'sub()' still searches for the pattern and returns zero or
     one, but the result of the substitution (if any) is thrown away
     because there is no place to put it.  Such versions of 'awk' accept
     expressions like the following:

          sub(/USA/, "United States", "the USA and Canada")

     For historical compatibility, 'gawk' accepts such erroneous code.
     However, using any other nonchangeable object as the third
     parameter causes a fatal error and your program will not run.

     Finally, if the REGEXP is not a regexp constant, it is converted
     into a string, and then the value of that string is treated as the
     regexp to match.

'substr(STRING, START' [', LENGTH' ]')'
     Return a LENGTH-character-long substring of STRING, starting at
     character number START.  The first character of a string is
     character number one.(3)  For example, 'substr("washington", 5, 3)'
     returns '"ing"'.

     If LENGTH is not present, 'substr()' returns the whole suffix of
     STRING that begins at character number START.  For example,
     'substr("washington", 5)' returns '"ington"'.  The whole suffix is
     also returned if LENGTH is greater than the number of characters
     remaining in the string, counting from character START.

     If START is less than one, 'substr()' treats it as if it was one.
     (POSIX doesn't specify what to do in this case: BWK 'awk' acts this
     way, and therefore 'gawk' does too.)  If START is greater than the
     number of characters in the string, 'substr()' returns the null
     string.  Similarly, if LENGTH is present but less than or equal to
     zero, the null string is returned.

     The string returned by 'substr()' _cannot_ be assigned.  Thus, it
     is a mistake to attempt to change a portion of a string, as shown
     in the following example:

          string = "abcdef"
          # try to get "abCDEf", won't work
          substr(string, 3, 3) = "CDE"

     It is also a mistake to use 'substr()' as the third argument of
     'sub()' or 'gsub()':

          gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

     (Some commercial versions of 'awk' treat 'substr()' as assignable,
     but doing so is not portable.)

     If you need to replace bits and pieces of a string, combine
     'substr()' with string concatenation, in the following manner:

          string = "abcdef"
          ...
          string = substr(string, 1, 2) "CDE" substr(string, 6)

'tolower(STRING)'
     Return a copy of STRING, with each uppercase character in the
     string replaced with its corresponding lowercase character.
     Nonalphabetic characters are left unchanged.  For example,
     'tolower("MiXeD cAsE 123")' returns '"mixed case 123"'.

'toupper(STRING)'
     Return a copy of STRING, with each lowercase character in the
     string replaced with its corresponding uppercase character.
     Nonalphabetic characters are left unchanged.  For example,
     'toupper("MiXeD cAsE 123")' returns '"MIXED CASE 123"'.

                        Matching the Null String

   In 'awk', the '*' operator can match the null string.  This is
particularly important for the 'sub()', 'gsub()', and 'gensub()'
functions.  For example:

     $ echo abc | awk '{ gsub(/m*/, "X"); print }'
     -| XaXbXcX

Although this makes a certain amount of sense, it can be surprising.

   ---------- Footnotes ----------

   (1) Unless you use the '--non-decimal-data' option, which isn't
recommended.  ⇒Nondecimal Data for more information.

   (2) Note that this means that the record will first be regenerated
using the value of 'OFS' if any fields have been changed, and that the
fields will be updated after the substitution, even if the operation is
a "no-op" such as 'sub(/^/, "")'.

   (3) This is different from C and C++, in which the first character is
number zero.