find: Deleting Files

1 
1 10.1 Deleting Files
1 ===================
1 
1 One of the most common tasks that 'find' is used for is locating files
1 that can be deleted.  This might include:
1 
1    * Files last modified more than 3 years ago which haven't been
1      accessed for at least 2 years
1    * Files belonging to a certain user
1    * Temporary files which are no longer required
1 
1    This example concentrates on the actual deletion task rather than on
1 sophisticated ways of locating the files that need to be deleted.  We'll
1 assume that the files we want to delete are old files underneath
1 '/var/tmp/stuff'.
1 
1 10.1.1 The Traditional Way
1 --------------------------
1 
1 The traditional way to delete files in '/var/tmp/stuff' that have not
1 been modified in over 90 days would have been:
1 
1      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;
1 
1    The above command uses '-exec' to run the '/bin/rm' command to remove
1 each file.  This approach works and in fact would have worked in Version
1 7 Unix in 1979.  However, there are a number of problems with this
1 approach.
1 
1    The most obvious problem with the approach above is that it causes
1 'find' to fork every time it finds a file that needs deleting, and the
1 child process then has to use the 'exec' system call to launch
1 '/bin/rm'.  All this is quite inefficient.  If we are going to use
1 '/bin/rm' to do this job, it is better to make it delete more than one
1 file at a time.
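
The per-file cost can be observed without deleting anything.  The sketch
below is an illustration only: it assumes a scratch directory created
with 'mktemp' and uses 'echo' as a stand-in for '/bin/rm', so each line
of output corresponds to one fork-and-exec that 'find' would perform.

```shell
# Scratch directory so that nothing real is touched.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# 'echo' stands in for '/bin/rm'.  With ';' the command runs once per
# file, so the number of output lines equals the number of files.
per_file=$(find "$dir" -type f -exec echo rm {} \; | wc -l | tr -d ' ')
echo "commands issued for three files: $per_file"

rm -rf "$dir"
```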
1 
1    The most obvious way of doing this is to use the shell's command
1 substitution feature:
1 
1      /bin/rm `find /var/tmp/stuff -mtime +90 -print`
1    or you could use the more modern form
1      /bin/rm $(find /var/tmp/stuff -mtime +90 -print)
1 
1    The commands above are much more efficient than the first attempt.
1 However, there is a problem with them.  The operating system imposes a
1 maximum length on the argument list passed to a command (the actual
1 limit varies between systems).  This means that while the command
1 substitution technique will usually work, it will suddenly fail when
1 there are lots of files to delete.  Since the task is to delete unwanted
1 files, this is precisely the time we don't want things to go wrong.
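
On POSIX systems the limit in question can usually be inspected with
'getconf'.  This is a small sketch, assuming that 'getconf ARG_MAX'
reports a number on the system at hand; the value covers the arguments
and the environment together, so the space actually available for file
names is somewhat smaller.

```shell
# ARG_MAX is the POSIX name for the limit on the combined size of the
# arguments and environment handed to a newly executed program.
limit=$(getconf ARG_MAX)
echo "argument space limit: $limit bytes"
```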
1 
1 10.1.2 Making Use of 'xargs'
1 ----------------------------
1 
1 So, is there a way to be more efficient in the use of 'fork()' and
1 'exec()' without running up against this limit?  Yes, we can be almost
1 optimally efficient by making use of the 'xargs' command.  The 'xargs'
1 command reads arguments from its standard input and builds them into
1 command lines.  We can use it like this:
1 
1      find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
1 
1    For example, if the files found by 'find' are '/var/tmp/stuff/A',
1 '/var/tmp/stuff/B' and '/var/tmp/stuff/C' then 'xargs' might issue the
1 commands
1 
1      /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
1      /bin/rm /var/tmp/stuff/C
1 
1    The above assumes that 'xargs' has a very small maximum command line
1 length.  The real limit is much larger but the idea is that 'xargs' will
1 run '/bin/rm' as many times as necessary to get the job done, given the
1 limits on command line length.
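
The batching described above can be reproduced harmlessly.  In this
sketch, 'echo' stands in for running '/bin/rm', and the '-n 2' option is
used only to imitate a very small command line limit; neither is part of
the real deletion command.

```shell
# Feed three names to xargs, one per line, as 'find -print' would.
# '-n 2' caps each command line at two arguments; 'echo' prints the
# command that would have been run instead of running it.
out=$(printf '%s\n' /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/C |
      xargs -n 2 echo /bin/rm)
echo "$out"
```

This should print the same two command lines as shown above.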
1 
1    This usage of 'xargs' is pretty efficient, and the 'xargs' command is
1 widely implemented (all modern versions of Unix offer it).  So far then,
1 the news is all good.  However, there is bad news too.
1 
1 10.1.3 Unusual characters in filenames
1 --------------------------------------
1 
1 Unix-like systems allow any characters to appear in file names with the
1 exception of the ASCII NUL character and the slash.  Slashes can occur
1 in path names (as the directory separator) but not in the names of
1 actual directory entries.  This means that the list of files that
1 'xargs' reads could in fact contain white space characters - spaces,
1 tabs and newline characters.  Since by default, 'xargs' assumes that the
1 list of files it is reading uses white space as an argument separator,
1 it cannot correctly handle the case where a filename actually includes
1 white space.  This makes the default behaviour of 'xargs' almost useless
1 for handling arbitrary data.
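
A short demonstration of the splitting problem, using a made-up file
name that contains a single space.  No file is created or deleted;
'echo' simply shows how many arguments 'xargs' thinks it received.

```shell
# One file name containing a space, as 'find -print' would emit it.
name='old report.txt'

# '-n 1' makes xargs run 'echo' once per argument, so each argument it
# recognises appears on its own line.
count=$(printf '%s\n' "$name" | xargs -n 1 echo | wc -l | tr -d ' ')
echo "xargs saw $count arguments for one file name"
```

With a real '/bin/rm' in place of 'echo', the two halves would be
treated as two unrelated files.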
1 
1    To solve this problem, GNU findutils introduced the '-print0' action
1 for 'find'.  This uses the ASCII NUL character to separate the entries
1 in the file list that it produces.  This is the ideal choice of
1 separator since it is the only character that cannot appear within a
1 path name.  The '-0' option to 'xargs' makes it assume that arguments
1 are separated with ASCII NUL instead of white space.  It also turns off
1 another misfeature in the default behaviour of 'xargs', which is that it
1 pays attention to quote characters in its input.  Some versions of
1 'xargs' also terminate when they see a lone '_' in the input, but GNU
1 'xargs' no longer does that (since it has become an optional behaviour in
1 the Unix standard).
1 
1    So, putting 'find -print0' together with 'xargs -0' we get this
1 command:
1 
1      find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm
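
The pipeline can be tried safely on awkward names.  The sketch below
assumes a 'find' and 'xargs' that support '-print0' and '-0', and works
in a scratch directory from 'mktemp' rather than '/var/tmp/stuff'; the
'-mtime' test is left out so that freshly created files match.

```shell
dir=$(mktemp -d)

# Three files: an ordinary name, one with a space, one with a newline.
touch "$dir/plain" "$dir/with space" "$dir/with
newline"

# NUL-separated names pass through xargs intact, whatever they contain.
find "$dir" -type f -print0 | xargs -0 rm

remaining=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "files left: $remaining"

rm -rf "$dir"
```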
1 
1    The result is an efficient way of proceeding that correctly handles
1 all the possible characters that could appear in the list of files to
1 delete.  This is good news.  However, there is, as I'm sure you're
1 expecting, also more bad news.  The problem is that this is not a
1 portable construct; although other versions of Unix (notably BSD-derived
1 ones) support '-print0', it's not universal.  So, is there a more
1 universal mechanism?
1 
1 10.1.4 Going back to '-exec'
1 ----------------------------
1 
1 There is indeed a more universal mechanism, which is a slight
1 modification to the '-exec' action.  The normal '-exec' action assumes
1 that the command to run is terminated with a semicolon (the semicolon
1 normally has to be quoted in order to protect it from interpretation as
1 the shell command separator).  The SVR4 edition of Unix introduced a
1 slight variation, which involves terminating the command with '+'
1 instead:
1 
1      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
1 
1    The above use of '-exec' causes 'find' to build up a long command
1 line and then issue it.  This can be less efficient than some uses of
1 'xargs'; for example 'xargs' allows new command lines to be built up
1 while the previous command is still executing, and allows you to specify
1 a number of commands to run in parallel.  However, the 'find ... -exec
1 ... +' construct has the advantage of wide portability.  GNU findutils
1 did not support '-exec ... +' until version 4.2.12; one of the reasons
1 for this is that it already had the '-print0' action in any case.
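
To see the batching done by the '+' form, 'echo' can again stand in for
'/bin/rm'.  This is an illustrative sketch in a scratch directory; with
only three short names, everything is expected to fit on one command
line.

```shell
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# With '+', find appends as many names as will fit to each command, so
# three files should produce a single line of output.
runs=$(find "$dir" -type f -exec echo rm {} + | wc -l | tr -d ' ')
echo "commands issued for three files: $runs"

rm -rf "$dir"
```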
1 
1 10.1.5 A more secure version of '-exec'
1 ---------------------------------------
1 
1 The command above seems to be efficient and portable.  However, within
1 it lurks a security problem.  The problem is shared with all the
1 commands we've tried in this worked example so far, too.  The security
1 problem is a race condition; that is, if it is possible for somebody to
1 manipulate the filesystem that you are searching while you are searching
1 it, it is possible for them to persuade your 'find' command to cause the
1 deletion of a file that you can delete but they normally cannot.
1 
1    The problem occurs because the '-exec' action is defined by the POSIX
1 standard to invoke its command with the same working directory as 'find'
1 had when it was started.  This means that the arguments which replace
1 the {} include a relative path from 'find''s starting point down to the
1 file that needs to be deleted.  For example,
1 
1      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
1 
1    might actually issue the command:
1 
1      /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd
1 
1    Notice the file '/var/tmp/stuff/passwd'.  Likewise, the command:
1 
1      cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+
1 
1    might actually issue the command:
1 
1      /bin/rm stuff/A stuff/B stuff/passwd
1 
1    If an attacker can rename 'stuff' to something else (making use of
1 their write permissions in '/var/tmp') they can replace it with a
1 symbolic link to '/etc'.  That means that the '/bin/rm' command will be
1 invoked on '/etc/passwd'.  If you are running your 'find' command as
1 root, the attacker has just managed to delete a vital file.  All they
1 needed to do to achieve this was replace a subdirectory with a symbolic
1 link at the vital moment.
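
The effect of the swap can be reproduced in miniature.  Everything in
this sketch lives under a scratch directory, and the directory named
'etc' inside it is only a stand-in for the real '/etc'; the single 'rm'
plays the part of the command that 'find' would have issued after the
attacker's move.

```shell
top=$(mktemp -d)
mkdir "$top/stuff" "$top/etc"
touch "$top/stuff/passwd" "$top/etc/passwd"

# The attacker's move: rename 'stuff' and put a symbolic link in its place.
mv "$top/stuff" "$top/stuff.real"
ln -s etc "$top/stuff"

# The path that find collected earlier now resolves through the link.
rm "$top/stuff/passwd"

if [ -e "$top/etc/passwd" ]; then victim=present; else victim=deleted; fi
if [ -e "$top/stuff.real/passwd" ]; then intended=present; else intended=deleted; fi
echo "stand-in etc/passwd: $victim, intended target: $intended"

rm -rf "$top"
```

The file that was meant to be removed survives, and the one in the
stand-in 'etc' directory is gone.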
1 
1    There is, however, a simple solution to the problem.  This is an
1 action which works a lot like '-exec' but doesn't need to traverse a
1 chain of directories to reach the file that it needs to work on.  This
1 is the '-execdir' action, which was introduced by the BSD family of
1 operating systems.  The command,
1 
1      find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+
1 
1    might delete a set of files by performing these actions:
1 
1   1. Change directory to /var/tmp/stuff/foo
1   2. Invoke '/bin/rm ./file1 ./file2 ./file3'
1   3. Change directory to /var/tmp/stuff/bar
1   4. Invoke '/bin/rm ./file99 ./file100 ./file101'
1 
1    This is a much more secure method.  We are no longer exposed to a
1 race condition.  For many typical uses of 'find', this is the best
1 strategy.  It's reasonably efficient, but the length of the command line
1 is limited not just by the operating system limits, but also by how many
1 files we actually need to delete from each directory.
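
The steps listed above can be observed by letting 'echo' stand in for
'/bin/rm'.  This sketch assumes a 'find' that supports '-execdir ... +'
(GNU 'find' does) and uses made-up names under a scratch directory.  GNU
'find' also refuses to run '-execdir' when the 'PATH' environment
variable contains a relative directory, so the example assumes an
ordinary 'PATH'.

```shell
dir=$(mktemp -d)
mkdir "$dir/foo" "$dir/bar"
touch "$dir/foo/file1" "$dir/bar/file99"

# Each output line is one command that would run, issued from inside the
# directory holding the file.  The names carry no leading directory
# path; GNU find prefixes them with './'.
out=$(find "$dir" -type f -execdir echo rm {} + | sort)
echo "$out"

rm -rf "$dir"
```

One command appears per directory, which is the batching limit
mentioned above.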
1 
1    Is it possible to do any better?  In the case of general file
1 processing, no.  However, in the specific case of deleting files it is
1 indeed possible to do better.
1 
1 10.1.6 Using the '-delete' action
1 ---------------------------------
1 
1 The most efficient and secure method of solving this problem is to use
1 the '-delete' action:
1 
1      find /var/tmp/stuff -mtime +90 -delete
1 
1    This alternative is more efficient than any of the '-exec' or
1 '-execdir' actions, since it entirely avoids the overhead of forking a
1 new process and using 'exec' to run '/bin/rm'.  It is also normally more
1 efficient than 'xargs' for the same reason.  The file deletion is
1 performed from the directory containing the entry to be deleted, so the
1 '-delete' action has the same security advantages as the '-execdir'
1 action has.
1 
1    The '-delete' action was introduced by the BSD family of operating
1 systems.
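
A cautious way to use '-delete' is to run the same expression with
'-print' first and check the list.  The sketch below assumes a 'find'
that supports '-delete' and works in a scratch directory; the file names
and the timestamp given to 'touch -t' are arbitrary choices for the
illustration.

```shell
dir=$(mktemp -d)
touch "$dir/new"
# Give 'old' a modification time far in the past (1 January 2000).
touch -t 200001010000 "$dir/old"

# Dry run: list what the expression matches.
find "$dir" -type f -mtime +90 -print

# Same expression with -delete in place of -print.
find "$dir" -type f -mtime +90 -delete

left=$(ls "$dir")
echo "remaining: $left"

rm -rf "$dir"
```

Note that '-delete' is placed after the tests.  Actions are evaluated in
order, so putting it before '-mtime' would delete files regardless of
their age.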
1 
1 10.1.7 Improving things still further
1 -------------------------------------
1 
1 Is it possible to improve things still further?  Not without either
1 modifying the system library or the operating system, or having more
1 specific knowledge of the layout of the filesystem and disk I/O
1 subsystem, or both.
1 
1    The 'find' command traverses the filesystem, reading directories.  It
1 then issues a separate system call for each file to be deleted.  If we
1 could modify the operating system, there are potential gains that could
1 be made:
1 
1    * We could have a system call to which we pass more than one filename
1      for deletion
1    * Alternatively, we could pass in a list of inode numbers (on
1      GNU/Linux systems, 'readdir()' also returns the inode number of
1      each directory entry) to be deleted.
1 
1    The above possibilities sound interesting, but from the kernel's
1 point of view it is difficult to enforce standard Unix access controls
1 for such processing by inode number.  Such a facility would probably
1 need to be restricted to the superuser.
1 
1    Another way of improving performance would be to increase the
1 parallelism of the process.  For example, if the directory hierarchy we
1 are searching is actually spread across a number of disks, we might
1 somehow be able to arrange for 'find' to process each disk in parallel.
1 In practice GNU 'find' doesn't have such an intimate understanding of
1 the system's filesystem layout and disk I/O subsystem.
1 
1    However, since the system administrator can have such an
1 understanding, they can take advantage of it like so:
1 
1      find /var/tmp/stuff1 -mtime +90 -delete &
1      find /var/tmp/stuff2 -mtime +90 -delete &
1      find /var/tmp/stuff3 -mtime +90 -delete &
1      find /var/tmp/stuff4 -mtime +90 -delete &
1      wait
1 
1    In the example above, four separate instances of 'find' are used to
1 search four subdirectories in parallel.  The 'wait' command simply waits
1 for all of these to complete.  Whether this approach is more or less
1 efficient than a single instance of 'find' depends on a number of
1 things:
1 
1    * Are the directories being searched in parallel actually on separate
1      disks?  If not, this parallel search might just result in a lot of
1      disk head movement and so the speed might even be slower.
1    * Other activity - are other programs also doing things on those
1      disks?
1 
1 10.1.8 Conclusion
1 -----------------
1 
1 The fastest and most secure way to delete files with the help of 'find'
1 is to use '-delete'.  Using 'xargs -0 -P N' can also make effective use
1 of the disk, but it is not as secure.
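
For reference, a parallel 'xargs' invocation might look like the sketch
below.  It assumes an 'xargs' that supports the '-P' extension (GNU and
BSD versions do) and runs in a scratch directory; the values 2 and 2 for
'-P' and '-n' are arbitrary, chosen so that four files are split across
two 'rm' processes.

```shell
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c" "$dir/d"

# -P 2: run up to two 'rm' processes at once.
# -n 2: give each process at most two file names.
find "$dir" -type f -print0 | xargs -0 -P 2 -n 2 rm

left=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "files left: $left"

rm -rf "$dir"
```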
1 
1    In the case where we're doing things other than deleting files, the
1 most secure alternative is '-execdir ... +', but this is not as portable
1 as the insecure action '-exec ... +'.
1 
1    The '-delete' action is not completely portable, but the only other
1 possibility which is as secure ('-execdir') is no more portable.  The
1 most efficient portable alternative is '-exec ... +', but this is
1 insecure and isn't supported by versions of GNU findutils prior to
1 4.2.12.
1