Info: (gawk) String Functions

1 
1 9.1.3 String-Manipulation Functions
1 -----------------------------------
1 
1 The functions in this minor node look at or change the text of one or
1 more strings.
1 
1    'gawk' understands locales (⇒Locales) and does all string
1 processing in terms of _characters_, not _bytes_.  This distinction is
1 particularly important to understand for locales where one character may
1 be represented by multiple bytes.  Thus, for example, 'length()' returns
1 the number of characters in a string, and not the number of bytes used
1 to represent those characters.  Similarly, 'index()' works with
1 character indices, and not byte indices.
1 
1      CAUTION: A number of functions deal with indices into strings.  For
1      these functions, the first character of a string is at position
1      (index) one.  This is different from C and the languages descended
1      from it, where the first character is at position zero.  You need
1      to remember this when doing index calculations, particularly if you
1      are used to C.
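
As a quick illustration of one-based positions, the following sketch calls 'substr()' and 'index()' on a sample string.  The shell function name 'one_based_demo' is made up for this example; any POSIX 'awk' is assumed to behave the same way.

```shell
# Hypothetical helper name; it only wraps a tiny awk program.
one_based_demo() {
    awk 'BEGIN {
        s = "peanut"
        # Position 1 is the first character ("p"); "t" sits at position 6.
        print substr(s, 1, 1), index(s, "p"), index(s, "t")
    }'
}
one_based_demo    # prints: p 1 6
```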
1 
1    In the following list, optional parameters are enclosed in square
1 brackets ([ ]).  Several functions perform string substitution; the full
1 discussion is provided in the description of the 'sub()' function, which
1 comes toward the end, because the list is presented alphabetically.
1 
1    Those functions that are specific to 'gawk' are marked with a pound
1 sign ('#').  They are not available in compatibility mode (⇒Options):
1 
· Gory Details                More than you want to know about '\' and
1                                 '&' with 'sub()', 'gsub()', and
1                                 'gensub()'.
1 
1 'asort('SOURCE [',' DEST [',' HOW ] ]') #'
1 'asorti('SOURCE [',' DEST [',' HOW ] ]') #'
1      These two functions are similar in behavior, so they are described
1      together.
1 
1           NOTE: The following description ignores the third argument,
1           HOW, as it requires understanding features that we have not
1           discussed yet.  Thus, the discussion here is a deliberate
1           simplification.  (We do provide all the details later on; see
1           ⇒Array Sorting Functions for the full story.)
1 
1      Both functions return the number of elements in the array SOURCE.
1      For 'asort()', 'gawk' sorts the values of SOURCE and replaces the
1      indices of the sorted values of SOURCE with sequential integers
1      starting with one.  If the optional array DEST is specified, then
1      SOURCE is duplicated into DEST.  DEST is then sorted, leaving the
1      indices of SOURCE unchanged.
1 
1      When comparing strings, 'IGNORECASE' affects the sorting
1      (⇒Array Sorting Functions).  If the SOURCE array contains subarrays
1      as values (⇒Arrays of Arrays), they will come last, after
1      all scalar values.  Subarrays are _not_ recursively sorted.
1 
1      For example, if the contents of 'a' are as follows:
1 
1           a["last"] = "de"
1           a["first"] = "sac"
1           a["middle"] = "cul"
1 
1      A call to 'asort()':
1 
1           asort(a)
1 
1      results in the following contents of 'a':
1 
1           a[1] = "cul"
1           a[2] = "de"
1           a[3] = "sac"
1 
1      The 'asorti()' function works similarly to 'asort()'; however, the
1      _indices_ are sorted, instead of the values.  Thus, in the previous
1      example, starting with the same initial set of indices and values
1      in 'a', calling 'asorti(a)' would yield:
1 
1           a[1] = "first"
1           a[2] = "last"
1           a[3] = "middle"
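
The effect of the optional DEST argument can be sketched as follows.  The example reuses the array contents shown above; the function name 'asort_dest_demo' and the array name 'sorted' are invented for illustration, and 'gawk' is assumed to be installed.

```shell
# Hypothetical helper name; requires gawk because asort() is a gawk extension.
asort_dest_demo() {
    gawk 'BEGIN {
        a["last"] = "de"; a["first"] = "sac"; a["middle"] = "cul"
        n = asort(a, sorted)          # the sorted values go into "sorted"
        for (i = 1; i <= n; i++)
            printf "%s ", sorted[i]
        print a["first"]              # the original index is still there
    }'
}
# Run only if gawk is available.
command -v gawk >/dev/null 2>&1 && asort_dest_demo    # prints: cul de sac sac
```

Because DEST receives the sorted copy, 'a' keeps its original string indices.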
1 
1 'gensub(REGEXP, REPLACEMENT, HOW' [', TARGET']') #'
1      Search the target string TARGET for matches of the regular
1      expression REGEXP.  If HOW is a string beginning with 'g' or 'G'
1      (short for "global"), then replace all matches of REGEXP with
1      REPLACEMENT.  Otherwise, HOW is treated as a number indicating
1      which match of REGEXP to replace.  If no TARGET is supplied, use
1      '$0'.  'gensub()' returns the modified string as its result; the
1      original target string is _not_ changed.
1 
1      'gensub()' is a general substitution function.  Its purpose is to
1      provide more features than the standard 'sub()' and 'gsub()'
1      functions.
1 
1      'gensub()' provides an additional feature that is not available in
1      'sub()' or 'gsub()': the ability to specify components of a regexp
1      in the replacement text.  This is done by using parentheses in the
1      regexp to mark the components and then specifying '\N' in the
1      replacement text, where N is a digit from 1 to 9.  For example:
1 
1           $ gawk '
1           > BEGIN {
1           >      a = "abc def"
1           >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
1           >      print b
1           > }'
1           -| def abc
1 
1      As with 'sub()', you must type two backslashes in order to get one
1      into the string.  In the replacement text, the sequence '\0'
1      represents the entire matched text, as does the character '&'.
1 
1      The following example shows how you can use the third argument to
1      control which match of the regexp should be changed:
1 
1           $ echo a b c a b c |
1           > gawk '{ print gensub(/a/, "AA", 2) }'
1           -| a b c AA b c
1 
1      In this case, '$0' is the default target string.  'gensub()'
1      returns the new string as its result, which is passed directly to
1      'print' for printing.
1 
1      If the HOW argument is a string that does not begin with 'g' or
1      'G', or if it is a number that is less than or equal to zero, only
1      one substitution is performed.  If HOW is zero, 'gawk' issues a
1      warning message.
1 
1      If REGEXP does not match TARGET, 'gensub()''s return value is the
1      original unchanged value of TARGET.
1 
1 'gsub(REGEXP, REPLACEMENT' [', TARGET']')'
1      Search TARGET for _all_ of the longest, leftmost, _nonoverlapping_
1      matching substrings it can find and replace them with REPLACEMENT.
1      The 'g' in 'gsub()' stands for "global," which means replace
1      everywhere.  For example:
1 
1           { gsub(/Britain/, "United Kingdom"); print }
1 
1      replaces all occurrences of the string 'Britain' with 'United
1      Kingdom' for all input records.
1 
1      The 'gsub()' function returns the number of substitutions made.  If
1      the variable to search and alter (TARGET) is omitted, then the
1      entire input record ('$0') is used.  As in 'sub()', the characters
1      '&' and '\' are special, and the third argument must be assignable.
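
The return value is easy to overlook.  The following sketch stores it in a variable and prints it next to the modified string; the function name 'gsub_count_demo' is a placeholder chosen for this example.

```shell
# Hypothetical helper name; works with any POSIX awk.
gsub_count_demo() {
    awk 'BEGIN {
        s = "banana"
        n = gsub(/an/, "AN", s)   # n is the number of replacements made
        print n, s
    }'
}
gsub_count_demo    # prints: 2 bANANa
```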
1 
1 'index(IN, FIND)'
1      Search the string IN for the first occurrence of the string FIND,
1      and return the position in characters where that occurrence begins
1      in the string IN.  Consider the following example:
1 
1           $ awk 'BEGIN { print index("peanut", "an") }'
1           -| 3
1 
1      If FIND is not found, 'index()' returns zero.
1 
1      With BWK 'awk' and 'gawk', it is a fatal error to use a regexp
1      constant for FIND.  Other implementations allow it, simply treating
1      the regexp constant as an expression meaning '$0 ~ /regexp/'.
1      (d.c.)
1 
1 'length('[STRING]')'
1      Return the number of characters in STRING.  If STRING is a number,
1      the length of the digit string representing that number is
1      returned.  For example, 'length("abcde")' is five.  By contrast,
1      'length(15 * 35)' works out to three.  In this example, 15 * 35 =
1      525, and 525 is then converted to the string '"525"', which has
1      three characters.
1 
1      If no argument is supplied, 'length()' returns the length of '$0'.
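
The three cases described so far (a string, a number, and no argument) can be tried side by side.  This is only a sketch; the function name 'length_demo' and the sample input are made up for the example.

```shell
# Hypothetical helper name; feeds one input record to awk.
length_demo() {
    echo "hello world" |
    awk '{
        # string argument, numeric argument (525 -> "525"), and $0
        print length("abcde"), length(15 * 35), length()
    }'
}
length_demo    # prints: 5 3 11
```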
1 
1           NOTE: In older versions of 'awk', the 'length()' function
1           could be called without any parentheses.  Doing so is
1           considered poor practice, although the 2008 POSIX standard
1           explicitly allows it, to support historical practice.  For
1           programs to be maximally portable, always supply the
1           parentheses.
1 
1      If 'length()' is called with a variable that has not been used,
1      'gawk' forces the variable to be a scalar.  Other implementations
1      of 'awk' leave the variable without a type.  (d.c.)  Consider:
1 
1           $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
1           -| 0
1           error-> gawk: fatal: attempt to use scalar `x' as array
1 
1           $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
1           -| 0
1 
1      If '--lint' has been specified on the command line, 'gawk' issues a
1      warning about this.
1 
1      With 'gawk' and several other 'awk' implementations, when given an
1      array argument, the 'length()' function returns the number of
1      elements in the array.  (c.e.)  This is less useful than it might
1      seem at first, as the array is not guaranteed to be indexed from
1      one to the number of elements in it.  If '--lint' is provided on
1      the command line (⇒Options), 'gawk' warns that passing an
1      array argument is not portable.  If '--posix' is supplied, using an
1      array argument is a fatal error (⇒Arrays).
1 
1 'match(STRING, REGEXP' [', ARRAY']')'
1      Search STRING for the longest, leftmost substring matched by the
1      regular expression REGEXP and return the character position (index)
1      at which that substring begins (one, if it starts at the beginning
1      of STRING).  If no match is found, return zero.
1 
1      The REGEXP argument may be either a regexp constant ('/'...'/') or
1      a string constant ('"'...'"').  In the latter case, the string is
1      treated as a regexp to be matched.  ⇒Computed Regexps for a
1      discussion of the difference between the two forms, and the
1      implications for writing your program correctly.
1 
1      The order of the first two arguments is the opposite of most other
1      string functions that work with regular expressions, such as
1      'sub()' and 'gsub()'.  It might help to remember that for
1      'match()', the order is the same as for the '~' operator: 'STRING ~
1      REGEXP'.
1 
1      The 'match()' function sets the predefined variable 'RSTART' to the
1      index.  It also sets the predefined variable 'RLENGTH' to the
1      length in characters of the matched substring.  If no match is
1      found, 'RSTART' is set to zero, and 'RLENGTH' to -1.
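
A common idiom is to feed 'RSTART' and 'RLENGTH' to 'substr()' to extract the matched text.  The sketch below does that and then shows the values left behind by a failed match; the function name 'rstart_demo' is invented for this example.

```shell
# Hypothetical helper name; works with any POSIX awk.
rstart_demo() {
    awk 'BEGIN {
        s = "foooobazbarrrrr"
        if (match(s, /o+/))
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        match(s, /xyz/)            # no match: RSTART = 0, RLENGTH = -1
        print RSTART, RLENGTH
    }'
}
rstart_demo
# prints:
#   2 4 oooo
#   0 -1
```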
1 
1      For example:
1 
1           {
1               if ($1 == "FIND")
1                   regex = $2
1               else {
1                   where = match($0, regex)
1                   if (where != 0)
1                       print "Match of", regex, "found at", where, "in", $0
1               }
1           }
1 
1      This program looks for lines that match the regular expression
1      stored in the variable 'regex'.  This regular expression can be
1      changed.  If the first word on a line is 'FIND', 'regex' is changed
1      to be the second word on that line.  Therefore, if given:
1 
1           FIND ru+n
1           My program runs
1           but not very quickly
1           FIND Melvin
1           JF+KM
1           This line is property of Reality Engineering Co.
1           Melvin was here.
1 
1      'awk' prints:
1 
1           Match of ru+n found at 12 in My program runs
1           Match of Melvin found at 1 in Melvin was here.
1 
1      If ARRAY is present, it is cleared, and then the zeroth element of
1      ARRAY is set to the entire portion of STRING matched by REGEXP.  If
1      REGEXP contains parentheses, the integer-indexed elements of ARRAY
1      are set to contain the portion of STRING matching the corresponding
1      parenthesized subexpression.  For example:
1 
1           $ echo foooobazbarrrrr |
1           > gawk '{ match($0, /(fo+).+(bar*)/, arr)
1           >         print arr[1], arr[2] }'
1           -| foooo barrrrr
1 
1      In addition, multidimensional subscripts are available providing
1      the start index and length of each matched subexpression:
1 
1           $ echo foooobazbarrrrr |
1           > gawk '{ match($0, /(fo+).+(bar*)/, arr)
1           >           print arr[1], arr[2]
1           >           print arr[1, "start"], arr[1, "length"]
1           >           print arr[2, "start"], arr[2, "length"]
1           > }'
1           -| foooo barrrrr
1           -| 1 5
1           -| 9 7
1 
1      There may not be subscripts for the start and length for every
1      parenthesized subexpression, because they may not all have matched
1      text; thus, they should be tested for with the 'in' operator
1      (⇒Reference to Elements).
1 
1      The ARRAY argument to 'match()' is a 'gawk' extension.  In
1      compatibility mode (⇒Options), using a third argument is a
1      fatal error.
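
The need for the 'in' test shows up with alternation, where only one branch can match.  The following sketch assumes 'gawk' is installed; the function name 'match_in_demo' is a placeholder for this example.

```shell
# Hypothetical helper name; requires gawk for the three-argument match().
match_in_demo() {
    gawk 'BEGIN {
        match("b", /(a)|(b)/, arr)
        # Only the second parenthesized subexpression took part in the match.
        g1 = ((1, "start") in arr) ? "matched" : "unmatched"
        print "group 1:", g1
        if ((2, "start") in arr)
            print "group 2:", arr[2], arr[2, "start"], arr[2, "length"]
    }'
}
# Run only if gawk is available.
command -v gawk >/dev/null 2>&1 && match_in_demo
# prints:
#   group 1: unmatched
#   group 2: b 1 1
```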
1 
1 'patsplit(STRING, ARRAY' [', FIELDPAT' [', SEPS' ] ]') #'
1      Divide STRING into pieces (or "fields") defined by FIELDPAT and
1      store the pieces in ARRAY and the separator strings in the SEPS
1      array.  The first piece is stored in 'ARRAY[1]', the second piece
1      in 'ARRAY[2]', and so forth.  The third argument, FIELDPAT, is a
1      regexp describing the fields in STRING (just as 'FPAT' is a regexp
1      describing the fields in input records).  It may be either a regexp
1      constant or a string.  If FIELDPAT is omitted, the value of 'FPAT'
1      is used.  'patsplit()' returns the number of elements created.
1      'SEPS[I]' is the possibly null separator string after 'ARRAY[I]'.
1      The possibly null leading separator will be in 'SEPS[0]'.  So a
1      non-null STRING with N fields will have N+1 separators.  A null
1      STRING will have neither fields nor separators.
1 
1      The 'patsplit()' function splits strings into pieces in a manner
1      similar to the way input lines are split into fields using 'FPAT'
1      (⇒Splitting By Content).
1 
1      Before splitting the string, 'patsplit()' deletes any previously
1      existing elements in the arrays ARRAY and SEPS.
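
As a small illustration, the sketch below picks the runs of digits out of a string and prints both the fields and the separators around them.  It assumes 'gawk' is installed; the function name 'patsplit_demo' and the array names are invented for the example.

```shell
# Hypothetical helper name; requires gawk because patsplit() is a gawk extension.
patsplit_demo() {
    gawk 'BEGIN {
        n = patsplit("a1b22c", parts, /[0-9]+/, seps)
        print n, parts[1], parts[2]        # two fields: "1" and "22"
        print seps[0], seps[1], seps[2]    # N+1 = 3 separators
    }'
}
# Run only if gawk is available.
command -v gawk >/dev/null 2>&1 && patsplit_demo
# prints:
#   2 1 22
#   a b c
```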
1 
1 'split(STRING, ARRAY' [', FIELDSEP' [', SEPS' ] ]')'
1      Divide STRING into pieces separated by FIELDSEP and store the
1      pieces in ARRAY and the separator strings in the SEPS array.  The
1      first piece is stored in 'ARRAY[1]', the second piece in
1      'ARRAY[2]', and so forth.  The string value of the third argument,
1      FIELDSEP, is a regexp describing where to split STRING (much as
1      'FS' can be a regexp describing where to split input records).  If
1      FIELDSEP is omitted, the value of 'FS' is used.  'split()' returns
1      the number of elements created.  SEPS is a 'gawk' extension, with
1      'SEPS[I]' being the separator string between 'ARRAY[I]' and
1      'ARRAY[I+1]'.  If FIELDSEP is a single space, then any leading
1      whitespace goes into 'SEPS[0]' and any trailing whitespace goes
1      into 'SEPS[N]', where N is the return value of 'split()' (i.e., the
1      number of elements in ARRAY).
1 
1      The 'split()' function splits strings into pieces in a manner
1      similar to the way input lines are split into fields.  For example:
1 
1           split("cul-de-sac", a, "-", seps)
1 
1      splits the string '"cul-de-sac"' into three fields using '-' as the
1      separator.  It sets the contents of the array 'a' as follows:
1 
1           a[1] = "cul"
1           a[2] = "de"
1           a[3] = "sac"
1 
1      and sets the contents of the array 'seps' as follows:
1 
1           seps[1] = "-"
1           seps[2] = "-"
1 
1      The value returned by this call to 'split()' is three.
1 
1      As with input field-splitting, when the value of FIELDSEP is '" "',
1      leading and trailing whitespace is ignored in values assigned to
1      the elements of ARRAY but not in SEPS, and the elements are
1      separated by runs of whitespace.  Also, as with input field
1      splitting, if FIELDSEP is the null string, each individual
1      character in the string is split into its own array element.
1      (c.e.)
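
The single-space case can be checked with a string that has leading, trailing, and repeated whitespace.  This is a sketch only; the function name 'split_space_demo' is made up for the example.

```shell
# Hypothetical helper name; works with any POSIX awk.
split_space_demo() {
    awk 'BEGIN {
        # FIELDSEP of " " splits on runs of whitespace and ignores
        # leading and trailing whitespace.
        n = split("  cul   de sac ", a, " ")
        print n, "[" a[1] "]", "[" a[2] "]", "[" a[3] "]"
    }'
}
split_space_demo    # prints: 3 [cul] [de] [sac]
```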
1 
1      Note, however, that 'RS' has no effect on the way 'split()' works.
1      Even though 'RS = ""' causes the newline character to also be an
1      input field separator, this does not affect how 'split()' splits
1      strings.
1 
1      Modern implementations of 'awk', including 'gawk', allow the third
1      argument to be a regexp constant ('/'...'/') as well as a string.
1      ⇒Computed Regexps for a discussion of the difference between using
1      a string constant or a regexp constant, and the implications for
1      writing your program correctly.
1 
1      Before splitting the string, 'split()' deletes any previously
1      existing elements in the arrays ARRAY and SEPS.
1 
1      If STRING is null, the array has no elements.  (So this is a
1      portable way to delete an entire array with one statement.
1      ⇒Delete.)
1 
1      If STRING does not match FIELDSEP at all (but is not null), ARRAY
1      has one element only.  The value of that element is the original
1      STRING.
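
Both boundary cases (a separator that never occurs, and a null STRING) are sketched here.  The function name 'split_edge_demo' is a placeholder for this example.

```shell
# Hypothetical helper name; works with any POSIX awk.
split_edge_demo() {
    awk 'BEGIN {
        n = split("cul-de-sac", a, ":")   # ":" never occurs in the string
        print n, a[1]
        n = split("", a)                  # null string: the array is emptied
        print n
    }'
}
split_edge_demo
# prints:
#   1 cul-de-sac
#   0
```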
1 
1      In POSIX mode (⇒Options), the fourth argument is not
1      allowed.
1 
1 'sprintf(FORMAT, EXPRESSION1, ...)'
1      Return (without printing) the string that 'printf' would have
1      printed out with the same arguments (⇒Printf).  For example:
1 
1           pival = sprintf("pi = %.2f (approx.)", 22/7)
1 
1      assigns the string 'pi = 3.14 (approx.)' to the variable 'pival'.
1 
1 'strtonum(STR) #'
1      Examine STR and return its numeric value.  If STR begins with a
1      leading '0', 'strtonum()' assumes that STR is an octal number.  If
1      STR begins with a leading '0x' or '0X', 'strtonum()' assumes that
1      STR is a hexadecimal number.  For example:
1 
1           $ echo 0x11 |
1           > gawk '{ printf "%d\n", strtonum($1) }'
1           -| 17
1 
1      Using the 'strtonum()' function is _not_ the same as adding zero to
1      a string value; the automatic coercion of strings to numbers works
1      only for decimal data, not for octal or hexadecimal.(1)
1 
1      Note also that 'strtonum()' uses the current locale's decimal point
1      for recognizing numbers (⇒Locales).
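
The difference from ordinary string-to-number coercion can be seen by comparing the two directly.  The sketch assumes 'gawk' is installed and run without '--non-decimal-data' or '--posix'; the function name 'strtonum_demo' is invented for the example.

```shell
# Hypothetical helper name; requires gawk because strtonum() is a gawk extension.
strtonum_demo() {
    gawk 'BEGIN {
        # hexadecimal string, octal string, then plain coercion of "0x11"
        print strtonum("0x11"), strtonum("011"), "0x11" + 0
    }'
}
# Run only if gawk is available.
command -v gawk >/dev/null 2>&1 && strtonum_demo    # prints: 17 9 0
```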
1 
1 'sub(REGEXP, REPLACEMENT' [', TARGET']')'
1      Search TARGET, which is treated as a string, for the leftmost,
1      longest substring matched by the regular expression REGEXP.  Modify
1      the entire string by replacing the matched text with REPLACEMENT.
1      The modified string becomes the new value of TARGET.  Return the
1      number of substitutions made (zero or one).
1 
1      The REGEXP argument may be either a regexp constant ('/'...'/') or
1      a string constant ('"'...'"').  In the latter case, the string is
1      treated as a regexp to be matched.  ⇒Computed Regexps for a
1      discussion of the difference between the two forms, and the
1      implications for writing your program correctly.
1 
1      This function is peculiar because TARGET is not simply used to
1      compute a value, and not just any expression will do--it must be a
1      variable, field, or array element so that 'sub()' can store a
1      modified value there.  If this argument is omitted, then the
1      default is to use and alter '$0'.(2)  For example:
1 
1           str = "water, water, everywhere"
1           sub(/at/, "ith", str)
1 
1      sets 'str' to 'wither, water, everywhere', by replacing the
1      leftmost longest occurrence of 'at' with 'ith'.
1 
1      If the special character '&' appears in REPLACEMENT, it stands for
1      the precise substring that was matched by REGEXP.  (If the regexp
1      can match more than one string, then this precise substring may
1      vary.)  For example:
1 
1           { sub(/candidate/, "& and his wife"); print }
1 
1      changes the first occurrence of 'candidate' to 'candidate and his
1      wife' on each input line.  Here is another example:
1 
1           $ awk 'BEGIN {
1           >         str = "daabaaa"
1           >         sub(/a+/, "C&C", str)
1           >         print str
1           > }'
1           -| dCaaCbaaa
1 
1      This shows how '&' can represent a nonconstant string and also
1      illustrates the "leftmost, longest" rule in regexp matching
1      (⇒Leftmost Longest).
1 
1      The effect of this special character ('&') can be turned off by
1      putting a backslash before it in the string.  As usual, to insert
1      one backslash in the string, you must write two backslashes.
1      Therefore, write '\\&' in a string constant to include a literal
1      '&' in the replacement.  For example, the following shows how to
1      replace the first '|' on each line with an '&':
1 
1           { sub(/\|/, "\\&"); print }
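
The special and the literal forms can appear in the same replacement string.  The sketch below uses both; the function name 'ampersand_demo' is made up for this example, and the program is enclosed in single quotes so that the shell passes the backslashes through unchanged.

```shell
# Hypothetical helper name; works with any POSIX awk.
ampersand_demo() {
    awk 'BEGIN {
        s = "tom"
        # First & stands for the matched text; \\& yields a literal "&".
        sub(/tom/, "& \\& jerry", s)
        print s
    }'
}
ampersand_demo    # prints: tom & jerry
```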
1 
1      As mentioned, the third argument to 'sub()' must be a variable,
1      field, or array element.  Some versions of 'awk' allow the third
1      argument to be an expression that is not an lvalue.  In such a
1      case, 'sub()' still searches for the pattern and returns zero or
1      one, but the result of the substitution (if any) is thrown away
1      because there is no place to put it.  Such versions of 'awk' accept
1      expressions like the following:
1 
1           sub(/USA/, "United States", "the USA and Canada")
1 
1      For historical compatibility, 'gawk' accepts such erroneous code.
1      However, using any other nonchangeable object as the third
1      parameter causes a fatal error and your program will not run.
1 
1      Finally, if the REGEXP is not a regexp constant, it is converted
1      into a string, and then the value of that string is treated as the
1      regexp to match.
1 
1 'substr(STRING, START' [', LENGTH' ]')'
1      Return a LENGTH-character-long substring of STRING, starting at
1      character number START.  The first character of a string is
1      character number one.(3)  For example, 'substr("washington", 5, 3)'
1      returns '"ing"'.
1 
1      If LENGTH is not present, 'substr()' returns the whole suffix of
1      STRING that begins at character number START.  For example,
1      'substr("washington", 5)' returns '"ington"'.  The whole suffix is
1      also returned if LENGTH is greater than the number of characters
1      remaining in the string, counting from character START.
1 
1      If START is less than one, 'substr()' treats it as if it were one.
1      (POSIX doesn't specify what to do in this case: BWK 'awk' acts this
1      way, and therefore 'gawk' does too.)  If START is greater than the
1      number of characters in the string, 'substr()' returns the null
1      string.  Similarly, if LENGTH is present but less than or equal to
1      zero, the null string is returned.
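
Three of the boundary cases described above are sketched here: a LENGTH that runs past the end of the string, a START beyond the end, and a LENGTH of zero.  The function name 'substr_edge_demo' is a placeholder; the brackets only make the null strings visible.

```shell
# Hypothetical helper name; works with any POSIX awk.
substr_edge_demo() {
    awk 'BEGIN {
        s = "washington"
        print "[" substr(s, 5, 100) "]"   # LENGTH too large: whole suffix
        print "[" substr(s, 20) "]"       # START past the end: null string
        print "[" substr(s, 5, 0) "]"     # LENGTH of zero: null string
    }'
}
substr_edge_demo
# prints:
#   [ington]
#   []
#   []
```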
1 
1      The string returned by 'substr()' _cannot_ be assigned.  Thus, it
1      is a mistake to attempt to change a portion of a string, as shown
1      in the following example:
1 
1           string = "abcdef"
1           # try to get "abCDEf", won't work
1           substr(string, 3, 3) = "CDE"
1 
1      It is also a mistake to use 'substr()' as the third argument of
1      'sub()' or 'gsub()':
1 
1           gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
1 
1      (Some commercial versions of 'awk' treat 'substr()' as assignable,
1      but doing so is not portable.)
1 
1      If you need to replace bits and pieces of a string, combine
1      'substr()' with string concatenation, in the following manner:
1 
1           string = "abcdef"
1           ...
1           string = substr(string, 1, 2) "CDE" substr(string, 6)
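
A runnable version of the concatenation approach might look like this.  It is a sketch only; the function name 'splice_demo' is invented for the example.

```shell
# Hypothetical helper name; works with any POSIX awk.
splice_demo() {
    awk 'BEGIN {
        string = "abcdef"
        # Keep characters 1-2, insert the new text, keep everything from 6 on.
        string = substr(string, 1, 2) "CDE" substr(string, 6)
        print string
    }'
}
splice_demo    # prints: abCDEf
```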
1 
1 'tolower(STRING)'
1      Return a copy of STRING, with each uppercase character in the
1      string replaced with its corresponding lowercase character.
1      Nonalphabetic characters are left unchanged.  For example,
1      'tolower("MiXeD cAsE 123")' returns '"mixed case 123"'.
1 
1 'toupper(STRING)'
1      Return a copy of STRING, with each lowercase character in the
1      string replaced with its corresponding uppercase character.
1      Nonalphabetic characters are left unchanged.  For example,
1      'toupper("MiXeD cAsE 123")' returns '"MIXED CASE 123"'.
1 
1                        Matching the Null String
1 
1    In 'awk', the '*' operator can match the null string.  This is
1 particularly important for the 'sub()', 'gsub()', and 'gensub()'
1 functions.  For example:
1 
1      $ echo abc | awk '{ gsub(/m*/, "X"); print }'
1      -| XaXbXcX
1 
1 Although this makes a certain amount of sense, it can be surprising.
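
For contrast, a regexp that cannot match the null string leaves the record alone.  The sketch below swaps '*' for '+' in the example above and also prints the substitution count; the function name 'null_match_demo' is made up for this example.

```shell
# Hypothetical helper name; works with any POSIX awk.
null_match_demo() {
    echo abc |
    awk '{
        n = gsub(/m+/, "X")    # "m+" needs at least one "m", so nothing matches
        print n, $0
    }'
}
null_match_demo    # prints: 0 abc
```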
1 
1    ---------- Footnotes ----------
1 
1    (1) Unless you use the '--non-decimal-data' option, which isn't
1 recommended.  ⇒Nondecimal Data for more information.
1 
1    (2) Note that this means that the record will first be regenerated
1 using the value of 'OFS' if any fields have been changed, and that the
1 fields will be updated after the substitution, even if the operation is
1 a "no-op" such as 'sub(/^/, "")'.
1 
1    (3) This is different from C and C++, in which the first character is
1 number zero.
1
