Putting the Tools Together
==========================

Now, let’s suppose this is a large ISP server system with dozens of
users logged in.  The management wants the system administrator to write
a program that will generate a sorted list of logged-in users.
Furthermore, even if a user is logged in multiple times, his or her name
should only show up in the output once.

   The administrator could sit down with the system documentation and
write a C program that did this.  It would take perhaps a couple of
hundred lines of code and about two hours to write it, test it, and
debug it.  However, knowing the software toolbox, the administrator can
instead start out by generating just a list of logged-on users:

     $ who | cut -c1-8
     ⊣ arnold
     ⊣ miriam
     ⊣ bill
     ⊣ arnold

   Next, sort the list:

     $ who | cut -c1-8 | sort
     ⊣ arnold
     ⊣ arnold
     ⊣ bill
     ⊣ miriam

   Finally, run the sorted list through ‘uniq’, to weed out duplicates:

     $ who | cut -c1-8 | sort | uniq
     ⊣ arnold
     ⊣ bill
     ⊣ miriam

   The ‘sort’ command actually has a ‘-u’ option that does what ‘uniq’
does.  However, ‘uniq’ has other uses for which one cannot substitute
‘sort -u’.

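   For example, ‘uniq -c’ counts repetitions as it removes them,
prefixing each output line with the number of times it occurred,
something ‘sort -u’ cannot do (the session here is the same made-up
one as above):

     $ who | cut -c1-8 | sort | uniq -c
     ⊣       2 arnold
     ⊣       1 bill
     ⊣       1 miriam
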
   The administrator puts this pipeline into a shell script, and makes
it available for all the users on the system (‘#’ is the system
administrator, or ‘root’, prompt):

     # cat > /usr/local/bin/listusers
     who | cut -c1-8 | sort | uniq
     ^D
     # chmod +x /usr/local/bin/listusers

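   From now on, any user can simply type ‘listusers’ to get the report
(assuming ‘/usr/local/bin’ is in the user’s search path; the output
shown is, again, from our made-up session):

     $ listusers
     ⊣ arnold
     ⊣ bill
     ⊣ miriam
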
   There are four major points to note here.  First, with just four
programs, on one command line, the administrator was able to save about
two hours’ worth of work.  Furthermore, the shell pipeline is just about
as efficient as the C program would be, and it is much more efficient in
terms of programmer time.  People time is much more expensive than
computer time, and in our modern “there’s never enough time to do
everything” society, saving two hours of programmer time is no mean
feat.

   Second, it is also important to emphasize that with the _combination_
of the tools, it is possible to do a special-purpose job never imagined
by the authors of the individual programs.

   Third, it is also valuable to build up your pipeline in stages, as we
did here.  This allows you to view the data at each stage in the
pipeline, which helps you acquire the confidence that you are indeed
using these tools correctly.

   Finally, by bundling the pipeline in a shell script, other users can
use your command, without having to remember the fancy plumbing you set
up for them.  In terms of how you run them, shell scripts and compiled
programs are indistinguishable.

   After this warm-up exercise, we’ll look at two additional, more
complicated pipelines.  For them, we need to introduce two more tools.

   The first is the ‘tr’ command, which stands for “transliterate.” The
‘tr’ command works on a character-by-character basis, changing
characters.  Normally it is used for things like mapping upper case to
lower case:

     $ echo ThIs ExAmPlE HaS MIXED case! | tr '[:upper:]' '[:lower:]'
     ⊣ this example has mixed case!

   There are several options of interest:

‘-c’
     work on the complement of the listed characters, i.e., operations
     apply to characters not in the given set

‘-d’
     delete characters in the first set from the output

‘-s’
     squeeze repeated characters in the output into just one character

   We will be using all three options in a moment.

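   Before then, here is a quick taste of each option in isolation (the
input strings are made up purely for illustration):

     $ echo aabbccdd | tr -s 'ab'
     ⊣ abccdd
     $ echo 'hello, world.' | tr -d ',.'
     ⊣ hello world
     $ echo touch-tone | tr -cd 'aeiou\n'
     ⊣ ouoe
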
   The other command we’ll look at is ‘comm’.  The ‘comm’ command takes
two sorted files as input, and prints out the files’ lines in three
columns.  The output columns are the data lines unique to the first
file, the data lines unique to the second file, and the data lines that
are common to both.  The ‘-1’, ‘-2’, and ‘-3’ command line options
_omit_ the respective columns.  (This is non-intuitive and takes a
little getting used to.)  For example:

     $ cat f1
     ⊣ 11111
     ⊣ 22222
     ⊣ 33333
     ⊣ 44444
     $ cat f2
     ⊣ 00000
     ⊣ 22222
     ⊣ 33333
     ⊣ 55555
     $ comm f1 f2
     ⊣         00000
     ⊣ 11111
     ⊣                 22222
     ⊣                 33333
     ⊣ 44444
     ⊣         55555

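   For instance, to see only the lines common to both files, omit
columns one and two:

     $ comm -12 f1 f2
     ⊣ 22222
     ⊣ 33333
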
   The file name ‘-’ tells ‘comm’ to read standard input instead of a
regular file.

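   For example, this pipeline produces exactly the same three-column
output as ‘comm f1 f2’ above:

     $ sort f1 | comm - f2
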
   Now we’re ready to build a fancy pipeline.  The first application is
a word frequency counter.  This helps an author determine if he or she
is over-using certain words.

   The first step is to change the case of all the letters in our input
file to one case.  “The” and “the” are the same word when counting.

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | ...

   The next step is to get rid of punctuation.  Quoted words and
unquoted words should be treated identically; it’s easiest to just get
the punctuation out of the way.

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' | ...

   The second ‘tr’ command operates on the complement of the listed
characters, which are all the letters, the digits, the underscore, and
the blank.  The ‘\n’ represents the newline character; it has to be left
alone.  (The ASCII tab character should also be included for good
measure in a production script.)

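   Following that advice, a production version of the second ‘tr’ stage
might keep tabs as well; GNU ‘tr’ understands the ‘\t’ escape for the
tab character.  The blank-to-newline stage shown next would then become
‘tr -s ' \t' '\n'’ to match:

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \t\n' | ...
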
   At this point, we have data consisting of words separated by blank
space.  The words only contain alphanumeric characters (and the
underscore).  The next step is to break the data apart so that we have
one word per line.  This makes the counting operation much easier, as
we will see shortly.

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | ...

   This command turns blanks into newlines.  The ‘-s’ option squeezes
multiple newline characters in the output into just one, removing blank
lines.  (The ‘>’ is the shell’s “secondary prompt.” This is what the
shell prints when it notices you haven’t finished typing in all of a
command.)

   We now have data consisting of one word per line, no punctuation, all
one case.  We’re ready to count each word:

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | sort | uniq -c | ...

   At this point, the data might look something like this:

          60 a
           2 able
           6 about
           1 above
           2 accomplish
           1 acquire
           1 actually
           2 additional

   The output is sorted by word, not by count!  What we want is the most
frequently used words first.  Fortunately, this is easy to accomplish,
with the help of two more ‘sort’ options:

‘-n’
     do a numeric sort, not a textual one

‘-r’
     reverse the order of the sort

   The final pipeline looks like this:

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | sort | uniq -c | sort -n -r
     ⊣    156 the
     ⊣     60 a
     ⊣     58 to
     ⊣     51 of
     ⊣     51 and
     ...

   Whew!  That’s a lot to digest.  Yet, the same principles apply.  With
six commands, on two lines (really one long one split for convenience),
we’ve created a program that does something interesting and useful, in
much less time than we could have written a C program to do the same
thing.

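   Just as with ‘listusers’, the pipeline can be bundled into a shell
script for everyday use.  Here is a sketch; the name ‘wordfreq’ is made
up, and ‘"$1"’ stands for the file named on the command line:

     # cat > /usr/local/bin/wordfreq
     tr '[:upper:]' '[:lower:]' < "$1" | tr -cd '[:alnum:]_ \n' |
     tr -s ' ' '\n' | sort | uniq -c | sort -n -r
     ^D
     # chmod +x /usr/local/bin/wordfreq

   After that, ‘wordfreq whats.gnu’ reproduces the report above.
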
   A minor modification to the above pipeline can give us a simple
spelling checker!  To determine if you’ve spelled a word correctly, all
you have to do is look it up in a dictionary.  If it is not there, then
chances are that your spelling is incorrect.  So, we need a dictionary.
The conventional location for a dictionary is ‘/usr/dict/words’.  On my
GNU/Linux system,(1) this is a sorted, 45,402-word dictionary.

   Now, how to compare our file with the dictionary?  As before, we
generate a sorted list of words, one per line:

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | sort -u | ...

   Now, all we need is a list of words that are _not_ in the dictionary.
Here is where the ‘comm’ command comes in.

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | sort -u |
     > comm -23 - /usr/dict/words

   The ‘-2’ and ‘-3’ options eliminate lines that are only in the
dictionary (the second file), and lines that are in both files.  Lines
only in the first file (standard input, our stream of words), are words
that are not in the dictionary.  These are likely candidates for
spelling errors.  This pipeline was the first cut at a production
spelling checker on Unix.

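   One caveat: ‘comm’ assumes that both of its inputs are sorted in the
same order.  If your locale’s sort order differs from the one used for
the dictionary, force a uniform order for the comparison; for example
(a sketch, assuming the dictionary is in traditional ASCII order):

     $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
     > tr -s ' ' '\n' | LC_ALL=C sort -u |
     > comm -23 - /usr/dict/words
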
   There are some other tools that deserve brief mention.

‘grep’
     search files for text that matches a regular expression

‘wc’
     count lines, words, characters

‘tee’
     a T-fitting for data pipes, copies data to files and to standard
     output

‘sed’
     the stream editor, an advanced tool

‘awk’
     a data manipulation language, another advanced tool

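   For instance, ‘tee’ makes it easy to capture intermediate data while
still passing it down the pipeline, and ‘wc’ can then summarize what
was captured (the file name ‘wholist’ is arbitrary, and the session is
our made-up one from above):

     $ who | tee wholist | cut -c1-8 | sort | uniq
     ⊣ arnold
     ⊣ bill
     ⊣ miriam
     $ wc -l < wholist
     ⊣ 4
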
   The software tools philosophy also espoused the following bit of
advice: “Let someone else do the hard part.” This means, take something
that gives you most of what you need, and then massage it the rest of
the way until it’s in the form that you want.

   To summarize:

  1. Each program should do one thing well.  No more, no less.

  2. Combining programs with appropriate plumbing leads to results where
     the whole is greater than the sum of the parts.  It also leads to
     novel uses of programs that the authors might never have imagined.

  3. Programs should never print extraneous header or trailer data,
     since these could get sent on down a pipeline.  (A point we didn’t
     mention earlier.)

  4. Let someone else do the hard part.

  5. Know your toolbox!  Use each program appropriately.  If you don’t
     have an appropriate tool, build one.

   All the programs discussed are available as described in GNU core
utilities (https://www.gnu.org/software/coreutils/coreutils.html).

   None of what I have presented in this column is new.  The Software
Tools philosophy was first introduced in the book ‘Software Tools’, by
Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN 0-201-03669-X).
This book showed how to write and use software tools.  It was written in
1976, using a preprocessor for FORTRAN named ‘ratfor’ (RATional
FORtran).  At the time, C was not as ubiquitous as it is now; FORTRAN
was.  The last chapter presented a ‘ratfor’ to FORTRAN processor,
written in ‘ratfor’.  ‘ratfor’ looks an awful lot like C; if you know C,
you won’t have any problem following the code.

   In 1981, the book was updated and made available as ‘Software Tools
in Pascal’ (Addison-Wesley, ISBN 0-201-10342-7).  Both books are still
in print and are well worth reading if you’re a programmer.  They
certainly made a major change in how I view programming.

   The programs in both books are available from Brian Kernighan’s home
page (https://www.cs.princeton.edu/~bwk/).  For a number of years, there
was an active Software Tools Users Group, whose members had ported the
original ‘ratfor’ programs to essentially every computer system with a
FORTRAN compiler.  The popularity of the group waned in the middle 1980s
as Unix began to spread beyond universities.

   With the current proliferation of GNU code and other clones of Unix
programs, these programs now receive little attention; modern C versions
are much more efficient and do more than these programs do.
Nevertheless, as exposition of good programming style, and evangelism
for a still-valuable philosophy, these books are unparalleled, and I
recommend them highly.

   Acknowledgment: I would like to express my gratitude to Brian
Kernighan of Bell Labs, the original Software Toolsmith, for reviewing
this column.

   ---------- Footnotes ----------

   (1) Red Hat Linux 6.1, for the November 2000 revision of this
article.