gawk: Split Program

1 
1 11.2.4 Splitting a Large File into Pieces
1 -----------------------------------------
1 
1 The 'split' program splits large text files into smaller pieces.  Usage
1 is as follows:(1)
1 
1      'split' ['-COUNT'] [FILE] [PREFIX]
1 
1    By default, the output files are named 'xaa', 'xab', and so on.  Each
1 file has 1,000 lines in it, with the likely exception of the last file.
1 To change the number of lines in each file, supply a number on the
1 command line preceded by a minus sign (e.g., '-500' for files with 500
1 lines in them instead of 1,000).  To change the names of the output
1 files to something like 'myfileaa', 'myfileab', and so on, supply an
1 additional argument that specifies the file name prefix.
1 
1    Here is a version of 'split' in 'awk'.  It uses the 'ord()' and
1 'chr()' functions presented in ⇒Ordinal Functions.
1 
1    The program first sets its defaults, and then tests to make sure
1 there are not too many arguments.  It then looks at each argument in
1 turn.  The first argument could be a minus sign followed by a number.
1 If it is, this happens to look like a negative number, so it is made
1 positive, and that is the count of lines.  The data file name is skipped
1 over and the final argument is used as the prefix for the output file
1 names.  The count and prefix arguments are set to the empty string in
1 'ARGV', so that 'awk' does not try to open them as data files:
1 
1      # split.awk --- do split in awk
1      #
1      # Requires ord() and chr() library functions
1      # usage: split [-count] [file] [outname]
1 
1      BEGIN {
1          outfile = "x"    # default
1          count = 1000
1          if (ARGC > 4)
1              usage()
1 
1          i = 1
1          if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) {
1              count = -ARGV[i]
1              ARGV[i] = ""
1              i++
1          }
1          # test ARGV in case input comes from stdin instead of a file
1          if (i in ARGV)
1              i++    # skip datafile name
1          if (i in ARGV) {
1              outfile = ARGV[i]
1              ARGV[i] = ""
1          }
1          s1 = s2 = "a"
1          out = (outfile s1 s2)
1      }
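
   The conversion of the count option can be tried in isolation.  The
following is a minimal sketch, not part of 'split.awk'; the function
name 'parse_count()' is hypothetical and is introduced only for
illustration.  It applies the same regexp test and the same unary minus
as the 'BEGIN' rule above:

```awk
# parse_count --- hypothetical helper, not part of split.awk.
# Returns the line count encoded in an argument such as "-500",
# or 0 if the argument is not a minus sign followed by digits.
function parse_count(arg)
{
    if (arg ~ /^-[[:digit:]]+$/)
        return -arg    # "-500" converts to -500; negating gives 500
    return 0
}

BEGIN {
    print parse_count("-500")      # a valid count option
    print parse_count("bigfile")   # a file name, not a count
    print parse_count("-5x")       # rejected: trailing non-digit
}
```

   This should print '500', '0', and '0'.  The anchored regexp is what
keeps a file name that merely starts with a minus sign from being
mistaken for a count.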
1 
1    The next rule does most of the work.  'tcount' (temporary count)
1 tracks how many lines have been printed to the output file so far.  If
1 counting the current line makes it greater than 'count', it is time to
1 close the current file and start a new one.  's1' and 's2' track the
1 current suffixes for the file name.  Usually, only 's2' moves to the
1 next letter in the alphabet.  When 's2' is already 'z', 's1' moves to
1 the next letter and 's2' starts over again at 'a'.  If they are both
1 'z', the file is just too big:
1 
1      {
1          if (++tcount > count) {
1              close(out)
1              if (s2 == "z") {
1                  if (s1 == "z") {
1                      printf("split: %s is too large to split\n",
1                             FILENAME) > "/dev/stderr"
1                      exit 1
1                  }
1                  s1 = chr(ord(s1) + 1)
1                  s2 = "a"
1              }
1              else
1                  s2 = chr(ord(s2) + 1)
1              out = (outfile s1 s2)
1              tcount = 1
1          }
1          print > out
1      }
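
   The way the two suffix letters advance can be sketched on its own.
The code below is an illustration under stated assumptions, not part of
'split.awk': 'next_suffix()' is a hypothetical helper, and the 'ord()'
and 'chr()' definitions are simplified stand-ins for the library
functions, included only so that the sketch runs by itself:

```awk
# Stand-ins for the ord() and chr() library functions, assumed here
# only so that the sketch is self-contained; the library versions differ.
function ord(c,    i)
{
    for (i = 1; i < 256; i++)
        if (sprintf("%c", i) == c)
            return i
    return -1
}

function chr(n)
{
    return sprintf("%c", n + 0)
}

# next_suffix --- hypothetical helper: given a two-letter suffix,
# return the one that follows it, or "" when "zz" has been used up.
function next_suffix(suffix,    s1, s2)
{
    s1 = substr(suffix, 1, 1)
    s2 = substr(suffix, 2, 1)
    if (s2 == "z") {
        if (s1 == "z")
            return ""            # no names left after "zz"
        s1 = chr(ord(s1) + 1)    # carry into the first letter
        s2 = "a"
    } else
        s2 = chr(ord(s2) + 1)    # usual case: bump the second letter
    return s1 s2
}

BEGIN {
    print next_suffix("aa")    # second letter advances
    print next_suffix("az")    # second letter wraps, first advances

    # Count how many distinct names the scheme can produce.
    n = 0
    for (s = "aa"; s != ""; s = next_suffix(s))
        n++
    print n
}
```

   This should print 'ab', 'ba', and '676'.  With the default of 1,000
lines per file, 676 names cover at most 676,000 input lines before the
"too large" error is reached.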
1 
1 The 'usage()' function simply prints an error message and exits:
1 
1      function usage()
1      {
1          print("usage: split [-count] [file] [outname]") > "/dev/stderr"
1          exit 1
1      }
1 
1    This program is a bit sloppy; it relies on 'awk' to automatically
1 close the last file instead of doing it in an 'END' rule.  It also
1 assumes that letters are contiguous in the character set, which isn't
1 true for EBCDIC systems.
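
   One possible way to avoid the character set assumption is to look
letters up in an explicit alphabet string.  The following is a sketch
only; 'next_letter()' is a hypothetical function, not something the
program above or the library provides.  It could stand in for the
'chr(ord(...) + 1)' expressions:

```awk
# next_letter --- hypothetical replacement for chr(ord(c) + 1) that
# looks the letter up in an explicit alphabet instead of relying on
# character codes.  Returns "" when c is "z" or not a lowercase letter.
function next_letter(c,    letters, pos)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    pos = index(letters, c)
    if (pos == 0 || pos == length(letters))
        return ""
    return substr(letters, pos + 1, 1)
}

BEGIN {
    print next_letter("a")
    print next_letter("i")    # "i" and "j" are not adjacent in EBCDIC
    print next_letter("y")
}
```

   This should print 'b', 'j', and 'z'.  The other point could be
addressed by adding an 'END' rule that calls 'close(out)'.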
1 
1    ---------- Footnotes ----------
1 
1    (1) This is the traditional usage.  The POSIX usage is different, but
1 not relevant for what the program aims to demonstrate.
1