gawk: Split Program

11.2.4 Splitting a Large File into Pieces
-----------------------------------------

The 'split' program splits large text files into smaller pieces. Usage
is as follows:(1)

     'split' ['-COUNT'] [FILE] [PREFIX]

By default, the output files are named 'xaa', 'xab', and so on. Each
file has 1,000 lines in it, with the likely exception of the last file.
To change the number of lines in each file, supply a number on the
command line preceded with a minus sign (e.g., '-500' for files with 500
lines in them instead of 1,000). To change the names of the output
files to something like 'myfileaa', 'myfileab', and so on, supply an
additional argument that specifies the file name prefix.

Here is a version of 'split' in 'awk'. It uses the 'ord()' and
'chr()' functions presented in ⇒Ordinal Functions.

The program first sets its defaults, and then tests to make sure
there are not too many arguments. It then looks at each argument in
turn. The first argument could be a minus sign followed by a number.
If it is, this happens to look like a negative number, so it is made
positive, and that is the count of lines. The data file name is skipped
over and the final argument is used as the prefix for the output file
names:

     # split.awk --- do split in awk
     #
     # Requires ord() and chr() library functions
     # usage: split [-count] [file] [outname]

     BEGIN {
         outfile = "x"    # default
         count = 1000
         if (ARGC > 4)
             usage()

         i = 1
         if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) {
             count = -ARGV[i]
             ARGV[i] = ""
             i++
         }
         # test argv in case reading from stdin instead of file
         if (i in ARGV)
             i++    # skip datafile name
         if (i in ARGV) {
             outfile = ARGV[i]
             ARGV[i] = ""
         }
         s1 = s2 = "a"
         out = (outfile s1 s2)
     }
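The conversion of '-500' into a count of 500 relies on 'awk' treating a string that looks like a number as a number once an arithmetic operator is applied to it. The following is a minimal sketch of just that step, written as a POSIX shell function so it can be tried on its own. The name 'parse_count' is hypothetical and not part of 'split.awk', and '[0-9]' stands in for '[[:digit:]]' only so that the sketch also runs on 'awk' implementations without POSIX character classes.

```sh
# Hypothetical helper (not part of split.awk): print the line count
# that a split-style argument selects, or the default of 1000 when
# the argument is not a minus sign followed by digits.
parse_count() {
    awk -v arg="$1" 'BEGIN {
        count = 1000                 # default, as in split.awk
        if (arg ~ /^-[0-9]+$/)       # minus sign followed by digits
            count = -arg             # unary minus makes it positive
        print count
    }'
}

parse_count -500      # prints 500
parse_count bigfile   # prints 1000
```

An argument that does not match the pattern leaves the default untouched, which is why the real program can then treat it as the data file name.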
1
The next rule does most of the work. 'tcount' (temporary count)
tracks how many lines have been printed to the output file so far. If
it is greater than 'count', it is time to close the current file and
start a new one. 's1' and 's2' track the current suffixes for the file
name. If they are both 'z', the file is just too big. Otherwise, 's2'
moves to the next letter in the alphabet; when 's2' is already 'z',
's1' moves to the next letter instead and 's2' starts over again at
'a':

     {
         if (++tcount > count) {
             close(out)
             if (s2 == "z") {
                 if (s1 == "z") {
                     printf("split: %s is too large to split\n",
                            FILENAME) > "/dev/stderr"
                     exit 1
                 }
                 s1 = chr(ord(s1) + 1)
                 s2 = "a"
             }
             else
                 s2 = chr(ord(s2) + 1)
             out = (outfile s1 s2)
             tcount = 1
         }
         print > out
     }
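To watch the counting logic in isolation, the rule can be reduced to a sketch that leaves out the letter suffixes. This is a simplification for illustration, not the program above: the pieces are given numbered names ('piece.1', 'piece.2', and so on), and the variables 'dir', 'prefix', and 'n' are assumptions introduced here. It assumes a shell that provides 'mktemp -d'.

```sh
# Simplified demo (numbered names instead of aa, ab, ...): split five
# input lines into pieces of two lines each inside a scratch directory.
dir=$(mktemp -d)
printf '%s\n' one two three four five |
awk -v count=2 -v prefix="$dir/piece." '
BEGIN { n = 1; out = prefix n }
{
    if (++tcount > count) {   # current piece is already full
        close(out)            # release the file before moving on
        out = prefix (++n)    # name of the next piece
        tcount = 1            # this line is the first of the new piece
    }
    print > out
}'
wc -l "$dir"/piece.*
```

With this input, 'piece.1' and 'piece.2' each receive two lines and 'piece.3' receives the last one. Resetting 'tcount' to 1 rather than 0 matters, because the line that triggered the switch is itself written to the new file. Remove the scratch directory named by 'dir' when finished.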
1
The 'usage()' function simply prints an error message and exits:

     function usage()
     {
         print("usage: split [-num] [file] [outname]") > "/dev/stderr"
         exit 1
     }

This program is a bit sloppy; it relies on 'awk' to automatically
close the last file instead of doing it in an 'END' rule. It also
assumes that letters are contiguous in the character set, which isn't
true for EBCDIC systems.
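One way to drop the assumption about contiguous letters is to look each letter up in an explicit alphabet string instead of doing arithmetic on character codes. The following is a minimal sketch of that idea, not the approach the program above takes: 'next_suffix' and 'letters' are hypothetical names, and the function expects exactly two lowercase letters.

```sh
# Hypothetical helper: print the suffix that follows "$1" (two
# lowercase letters), or fail with status 1 if "$1" is already "zz".
next_suffix() {
    awk -v cur="$1" 'BEGIN {
        letters = "abcdefghijklmnopqrstuvwxyz"
        p1 = index(letters, substr(cur, 1, 1))   # 1-based position of s1
        p2 = index(letters, substr(cur, 2, 1))   # 1-based position of s2
        if (p2 < 26)
            p2++                 # common case: advance the second letter
        else if (p1 < 26) {
            p1++                 # second letter wrapped: advance the first
            p2 = 1
        } else
            exit 1               # "zz": no suffixes left
        print substr(letters, p1, 1) substr(letters, p2, 1)
    }'
}

next_suffix aa   # prints ab
next_suffix az   # prints ba
```

Because the order of the letters comes from the string rather than from the character set, the same logic works regardless of how the letters are encoded. The other loose end mentioned above could be handled by adding an 'END' rule that calls 'close(out)'.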
1
---------- Footnotes ----------

(1) This is the traditional usage. The POSIX usage is different, but
not relevant for what the program aims to demonstrate.