find: Controlling Parallelism

1 
1 3.3.2.5 Controlling Parallelism
1 ...............................
1 
1 Normally, 'xargs' runs one command at a time.  This is called "serial"
1 execution; the commands happen in a series, one after another.  If you'd
1 like 'xargs' to do things in "parallel", you can ask it to do so, either
1 when you invoke it, or later while it is running.  Running several
1 commands at one time can make the entire operation go more quickly, if
1 the commands are independent, and if your system has enough resources to
1 handle the load.  When parallelism works in your application, 'xargs'
1 provides an easy way to get your work done faster.
1 
1 '--max-procs=MAX-PROCS'
1 '-P MAX-PROCS'
1      Run up to MAX-PROCS processes at a time; the default is 1.  If
1      MAX-PROCS is 0, 'xargs' will run as many processes as possible at a
1      time.  Use the '-n', '-s', or '-L' option with '-P'; otherwise
1      chances are that the command will be run only once.
1 
1    For example, suppose you have a directory tree of large image files
1 and a 'makeallsizes' script that takes a single file name and creates
1 various sized images from it (thumbnail-sized, web-page-sized,
1 printer-sized, and the original large file).  The script is doing enough
1 work that it takes significant time to run, even on a single image.  You
1 could run:
1 
1      find originals -name '*.jpg' | xargs -n 1 makeallsizes
1 
1    This will run 'makeallsizes FILENAME' once for each '.jpg' file in
1 the 'originals' directory.  However, if your system has two central
1 processors, this script will only keep one of them busy.  Instead, you
1 could probably finish in about half the time by running:
1 
1      find originals -name '*.jpg' | xargs -n 1 -P 2 makeallsizes
1 
1    'xargs' will run the first two commands in parallel, and then
1 whenever one of them terminates, it will start another one, until the
1 entire job is done.
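
   As a rough, self-contained illustration of the effect, the following
sketch replaces the hypothetical 'makeallsizes' script with a one-second
'sleep' and times four invocations under '-P 4'.  The item names and the
'sh -c' wrapper are assumptions made only for this demonstration.

```shell
#!/bin/sh
# Stand-in workload (an assumption for this demo): each "job" sleeps
# for one second and then reports the item it was given.
start=$(date +%s)
results=$(printf '%s\n' a b c d |
    xargs -n 1 -P 4 sh -c 'sleep 1; echo "done $1"' sh)
end=$(date +%s)
elapsed=$((end - start))

# Run serially, four one-second jobs need about four seconds; with
# '-P 4' they overlap, so the elapsed time should be noticeably lower.
echo "$results"
echo "elapsed: ${elapsed}s"
```

   The order of the "done" lines may differ from run to run, because
the four jobs finish at nearly the same moment.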
1 
1    The same idea can be generalized to as many processors as you have
1 handy.  It also generalizes to other resources besides processors.  For
1 example, if 'xargs' is running commands that are waiting for a response
1 from a distant network connection, running a few in parallel may reduce
1 the overall latency by overlapping their waiting time.
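
   One way to match the degree of parallelism to the machine is to ask
the system how many processors are online.  The sketch below assumes
that 'getconf _NPROCESSORS_ONLN' is available (it is on many, but not
all, systems) and falls back to 1 otherwise; 'echo' stands in for a real
per-item command.

```shell
#!/bin/sh
# Ask the system how many processors are online; fall back to 1 if the
# query is not supported here.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || ncpu=1

# Placeholder workload: echo stands in for a real per-item command.
count=$(printf '%s\n' a b c d e f g h |
    xargs -n 1 -P "$ncpu" sh -c 'echo "processed $1"' sh |
    grep -c processed)
echo "ran $count jobs with up to $ncpu in parallel"
```

   For commands that mostly wait on the network rather than compute, a
value larger than the processor count may be a better starting point.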
1 
1    If you are running commands in parallel, you need to think about how
1 they should arbitrate access to any resources that they share.  For
1 example, if more than one of them tries to print to stdout, the output
1 will be produced in an indeterminate order (and very likely mixed up)
1 unless the processes collaborate in some way to prevent this.  Using
1 some kind of locking scheme is one way to prevent such problems.  In
1 general, using a locking scheme will help ensure correct output but
1 reduce performance.  If you don't want to tolerate the performance
1 difference, simply arrange for each process to produce a separate output
1 file (or otherwise use separate resources).
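
   The separate-output-file approach can be sketched as follows.  The
'OUTDIR' variable, the 'out.N' file names, and the 'echo' workload are
all hypothetical choices for this example; a real job would substitute
its own command and naming scheme.

```shell
#!/bin/sh
# Scratch directory for the per-process output files.
OUTDIR=$(mktemp -d)
export OUTDIR

# Each job writes only to its own file, named after its item, so no
# two processes ever write to the same place.
printf '%s\n' 1 2 3 4 |
    xargs -n 1 -P 4 sh -c 'echo "result for item $1" > "$OUTDIR/out.$1"' sh

# Once xargs has finished, combine the files in a fixed (sorted) order.
combined=$(cat "$OUTDIR"/out.*)
rm -rf "$OUTDIR"
echo "$combined"
```

   Because the files are combined only after 'xargs' has exited, the
final output is in a predictable order no matter which job finished
first, and no locking is needed.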
1 
1    'xargs' also allows you to "turn up" or "turn down" its parallelism
1 in the middle of a run.  Suppose you are keeping your four-processor
1 system busy for hours, processing thousands of images using '-P 4'.
1 Now, in the middle of the run, you or someone else wants you to reduce
1 your load on the system, so that something else will run faster.  If you
1 interrupt 'xargs', your job will be half-done, and it may take
1 significant manual work to resume it only for the remaining images.  If
1 you suspend 'xargs' using your shell's job controls (e.g., 'control-Z'),
1 then it will get no work done while suspended.
1 
1    Instead, you can adjust the parallelism of the running 'xargs' by
1 sending it signals.  Find out the process ID of the 'xargs' process,
1 either from your shell or with the 'ps' command.  After you send it the
1 signal 'SIGUSR2', 'xargs' will run one fewer command in parallel.  If
1 you send it the signal 'SIGUSR1', it will run one more command in
1 parallel.  For example:
1 
1      shell$ xargs <allimages -n 1 -P 4 makeallsizes &
1      [4] 27643
1         ... at some later point ...
1      shell$ kill -USR2 27643
1      shell$ kill -USR2 %4
1 
1    The first 'kill' command will cause 'xargs' to wait for two commands
1 to terminate before starting the next command (reducing the parallelism
1 from 4 to 3).  The second 'kill' will reduce it from 3 to 2.  ('%4'
1 works in some shells as a shorthand for the process ID of the background
1 job labeled '[4]'.)
1 
1    Similarly, if you started a long 'xargs' job without parallelism, you
1 can easily switch it to start running two commands in parallel by
1 sending it a 'SIGUSR1'.
1 
1    'xargs' will never terminate any existing commands when you ask it to
1 run fewer processes.  It merely waits for the excess commands to finish.
1 If you ask it to run more commands, it will start the next one
1 immediately (if it has more work to do).  If the degree of parallelism
1 is already 1, sending 'SIGUSR2' will have no further effect: the limit
1 cannot be lowered to 0, since '--max-procs=0' means that there should
1 be no limit on the number of processes to run.
1 
1    There is an implementation-defined limit on the number of processes.
1 This limit is shown with 'xargs --show-limits'.  The limit is at least
1 127 on all systems (and on the author's system it is 2147483647).
1 
1    If you send several identical signals quickly, the operating system
1 does not guarantee that each of them will be delivered to 'xargs'.  This
1 means that you can't rapidly increase or decrease the parallelism by
1 more than one command at a time.  You can avoid this problem by sending
1 a signal, observing the result, then sending the next one; or merely by
1 delaying for a few seconds between signals (unless your system is very
1 heavily loaded).
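
   The "signal, pause, signal" approach might look like the sketch
below.  It assumes the 'xargs' described in this manual, which handles
'SIGUSR1' and 'SIGUSR2'; a program that does not handle these signals
would be terminated by them, since that is their default action.  The
twelve dummy items, the one-second 'sleep' workload, and the one-second
pauses are arbitrary choices for the demonstration.

```shell
#!/bin/sh
# Work list: twelve dummy items; each "job" just sleeps for one second.
items=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$items"

# Start with four parallel jobs in the background and remember the PID.
xargs <"$items" -n 1 -P 4 sh -c 'sleep 1' sh &
xargs_pid=$!

# Give xargs a moment to start and install its signal handlers.
sleep 1

# Ask for two fewer parallel jobs, one signal at a time, pausing so
# that the two identical signals are not merged into one.
for step in 1 2; do
    kill -USR2 "$xargs_pid"
    sleep 1
done

# Wait for the remaining jobs; xargs itself should still exit normally.
wait "$xargs_pid"
status=$?
rm -f "$items"
echo "xargs exit status: $status"
```

   On a heavily loaded system the pauses may need to be longer, as noted
above.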
1 
1    Whether or not parallel execution will work well for you depends on
1 the nature of the command you are running in parallel, on the
1 configuration of the system on which you are running the command, and on
1 the other work being done on the system at the time.
1