find: Controlling Parallelism
1
1 3.3.2.5 Controlling Parallelism
1 ...............................
1
1 Normally, 'xargs' runs one command at a time. This is called "serial"
1 execution; the commands happen in a series, one after another. If you'd
1 like 'xargs' to do things in "parallel", you can ask it to do so, either
1 when you invoke it, or later while it is running. Running several
1 commands at one time can make the entire operation go more quickly, if
1 the commands are independent, and if your system has enough resources to
1 handle the load. When parallelism works in your application, 'xargs'
1 provides an easy way to get your work done faster.
1
1 '--max-procs=MAX-PROCS'
1 '-P MAX-PROCS'
1 Run up to MAX-PROCS processes at a time; the default is 1. If
1 MAX-PROCS is 0, 'xargs' will run as many processes as possible at a
1 time. Use the '-n', '-s', or '-L' option with '-P'; otherwise
1 chances are that the command will be run only once.
1
1 For example, suppose you have a directory tree of large image files
1 and a 'makeallsizes' script that takes a single file name and creates
1 various sized images from it (thumbnail-sized, web-page-sized,
1 printer-sized, and the original large file). The script is doing enough
1 work that it takes significant time to run, even on a single image. You
1 could run:
1
1 find originals -name '*.jpg' | xargs -n 1 makeallsizes
1
1 This will run 'makeallsizes FILENAME' once for each '.jpg' file in
1 the 'originals' directory. However, if your system has two central
1 processors, this script will only keep one of them busy. Instead, you
1 could probably finish in about half the time by running:
1
1 find originals -name '*.jpg' | xargs -n 1 -P 2 makeallsizes
1
1 'xargs' will run the first two commands in parallel, and then
1 whenever one of them terminates, it will start another one, until the
1 entire job is done.
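
As a minimal sketch of the pipeline above, the following script sizes '-P' to the number of online processors and passes file names NUL-separated so that names containing spaces survive. It assumes a 'find' with '-print0' and an 'xargs' with '-0' and '-P' (GNU findutils provides these), and 'getconf _NPROCESSORS_ONLN' to count processors. The temporary directory, the fake image names, and the 'sh -c' command standing in for 'makeallsizes' are all hypothetical.

```shell
# Build a throwaway tree with three fake images (hypothetical data).
workdir=$(mktemp -d)
mkdir "$workdir/originals"
for name in a b 'c d'; do
    : > "$workdir/originals/$name.jpg"
done

# Number of online processors; fall back to 2 if getconf cannot say.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)

# Stand-in for 'makeallsizes FILENAME': create FILENAME.thumb.
# -n 1 gives each command one file name; -P runs up to $ncpu at once.
find "$workdir/originals" -name '*.jpg' -print0 |
    xargs -0 -n 1 -P "$ncpu" sh -c ': > "$1.thumb"' sh

thumbs=$(find "$workdir/originals" -name '*.thumb' | wc -l | tr -d ' ')
echo "created $thumbs thumbnails"
```

Matching '-P' to the processor count is only a starting point; a command that mostly waits on disk or network may benefit from a different value.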
1
1 The same idea can be generalized to as many processors as you have
1 handy. It also generalizes to other resources besides processors. For
1 example, if 'xargs' is running commands that are waiting for a response
1 from a distant network connection, running a few in parallel may reduce
1 the overall latency by overlapping their waiting time.
1
1 If you are running commands in parallel, you need to think about how
1 they should arbitrate access to any resources that they share. For
1 example, if more than one of them tries to print to stdout, the output
1 will be produced in an indeterminate order (and very likely mixed up)
1 unless the processes collaborate in some way to prevent this. Using
1 some kind of locking scheme is one way to prevent such problems. In
1 general, using a locking scheme will help ensure correct output but
1 reduce performance. If you don't want to tolerate the performance
1 difference, simply arrange for each process to produce a separate output
1 file (or otherwise use separate resources).
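
The separate-output-file approach might look like the sketch below. The 'sh -c' command is a hypothetical stand-in for a real worker; each invocation writes only to its own file, named after its input item, and the results are merged in a fixed order once 'xargs' has finished.

```shell
outdir=$(mktemp -d)

# Each worker writes to "$outdir/ITEM.out" instead of the shared stdout.
# With 'sh -c SCRIPT NAME ARG', $0 is NAME (here the output directory)
# and $1 is the item appended by xargs.
printf '%s\n' alpha beta gamma delta |
    xargs -n 1 -P 4 sh -c 'echo "result for $1" > "$0/$1.out"' "$outdir"

# Merge afterwards; the shell expands the glob in sorted order, so the
# merged text does not depend on which worker finished first.
merged=$(cat "$outdir"/*.out)
printf '%s\n' "$merged"
```

No locking is needed here because no two workers ever open the same file.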
1
1 'xargs' also allows you to "turn up" or "turn down" its parallelism
1 in the middle of a run. Suppose you are keeping your four-processor
1 system busy for hours, processing thousands of images using '-P 4'.
1 Now, in the middle of the run, you or someone else wants you to reduce
1 your load on the system, so that something else will run faster. If you
1 interrupt 'xargs', your job will be half-done, and it may take
1 significant manual work to resume it for just the remaining images. If
1 you suspend 'xargs' using your shell's job controls (e.g. 'control-Z'),
1 then it will get no work done while suspended.
1
1 Find out the process ID of the 'xargs' process, either from your
1 shell or with the 'ps' command. After you send it the signal 'SIGUSR2',
1 'xargs' will run one fewer command in parallel. If you send it the
1 signal 'SIGUSR1', it will run one more command in parallel. For
1 example:
1
1 shell$ xargs <allimages -n 1 -P 4 makeallsizes &
1 [4] 27643
1 ... at some later point ...
1 shell$ kill -USR2 27643
1 shell$ kill -USR2 %4
1
1 The first 'kill' command will cause 'xargs' to wait for two commands
1 to terminate before starting the next command (reducing the parallelism
1 from 4 to 3). The second 'kill' will reduce it from 3 to 2. ('%4'
1 works in some shells as a shorthand for the process ID of the background
1 job labeled '[4]'.)
1
1 Similarly, if you started a long 'xargs' job without parallelism, you
1 can easily switch it to start running two commands in parallel by
1 sending it a 'SIGUSR1'.
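
In a script, the process ID of a background 'xargs' is available as '$!', so the same adjustment can be made without job control. The sketch below assumes GNU 'xargs'; it sends the signal only after checking 'xargs --version', because the default action for 'SIGUSR2' would terminate a program that does not handle it. The item list and the 'sleep 2' worker are hypothetical stand-ins for a real job.

```shell
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$workdir/items"

# Worker: pretend to work for 2 seconds, then leave a marker file.
# $0 is the work directory, $1 is the item appended by xargs.
xargs -n 1 -P 2 sh -c 'sleep 2; : > "$0/$1.done"' "$workdir" \
    < "$workdir/items" &
xargs_pid=$!

sleep 1    # give xargs time to start its first two commands
if xargs --version </dev/null 2>/dev/null | grep -q GNU; then
    # Ask for one fewer command in parallel (2 -> 1). Commands that
    # are already running are left to finish.
    kill -s USR2 "$xargs_pid"
fi

wait "$xargs_pid"
finished=$(find "$workdir" -name '*.done' | wc -l | tr -d ' ')
echo "finished $finished of 3 items"
```

Every item is still processed after the signal; only the number of commands running at once changes.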
1
1 'xargs' will never terminate any existing commands when you ask it to
1 run fewer processes. It merely waits for the excess commands to finish.
1 If you ask it to run more commands, it will start the next one
1 immediately (if it has more work to do). If the degree of parallelism
1 is already 1, sending 'SIGUSR2' will have no further effect: the limit
1 cannot drop to 0, since '--max-procs=0' means that there should be no
1 limit on the number of processes to run.
1
1 There is an implementation-defined limit on the number of processes.
1 This limit is shown with 'xargs --show-limits'. The limit is at least
1 127 on all systems (and on the author's system it is 2147483647).
1
1 If you send several identical signals quickly, the operating system
1 does not guarantee that each of them will be delivered to 'xargs'. This
1 means that you can't rapidly increase or decrease the parallelism by
1 more than one command at a time. You can avoid this problem by sending
1 a signal, observing the result, then sending the next one; or merely by
1 delaying for a few seconds between signals (unless your system is very
1 heavily loaded).
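
One way to apply that advice is a small helper that pauses between signals. The function name 'send_spaced' and its arguments are hypothetical; it uses only 'kill -s' and 'sleep', and the two-second default delay is an arbitrary choice rather than a documented requirement.

```shell
# send_spaced SIGNAL PID COUNT [DELAY]
# Send SIGNAL to PID COUNT times, sleeping DELAY seconds between sends
# so that each signal is likely to be delivered separately.
send_spaced() {
    sig=$1 pid=$2 count=$3 delay=${4:-2}
    sent=0
    while [ "$sent" -lt "$count" ]; do
        kill -s "$sig" "$pid" || return 1    # process gone or not permitted
        sent=$((sent + 1))
        if [ "$sent" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
}

# Example: raise the parallelism of a running GNU xargs by two.
#   send_spaced USR1 "$xargs_pid" 2
```

The helper stops at the first failed 'kill', which usually means that the target process has already exited.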
1
1 Whether or not parallel execution will work well for you depends on
1 the nature of the command you are running in parallel, on the
1 configuration of the system on which you are running the command, and on
1 the other work being done on the system at the time.
1