find: Deleting Files

10.1 Deleting Files
===================

One of the most common tasks that 'find' is used for is locating files
that can be deleted. This might include:

   * Files last modified more than 3 years ago which haven't been
     accessed for at least 2 years
   * Files belonging to a certain user
   * Temporary files which are no longer required

This example concentrates on the actual deletion task rather than on
sophisticated ways of locating the files that need to be deleted. We'll
assume that the files we want to delete are old files underneath
'/var/tmp/stuff'.

10.1.1 The Traditional Way
--------------------------

The traditional way to delete files in '/var/tmp/stuff' that have not
been modified in over 90 days would have been:

     find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;

The above command uses '-exec' to run the '/bin/rm' command to remove
each file. This approach works, and in fact would have worked in
Version 7 Unix in 1979. However, there are a number of problems with
it.

The most obvious problem is that it causes 'find' to fork every time it
finds a file that needs to be deleted, and the child process then has
to use the 'exec' system call to launch '/bin/rm'. All this is quite
inefficient. If we are going to use '/bin/rm' to do this job, it is
better to make it delete more than one file at a time.

The most obvious way of doing this is to use the shell's command
expansion feature:

     /bin/rm `find /var/tmp/stuff -mtime +90 -print`

or you could use the more modern form:

     /bin/rm $(find /var/tmp/stuff -mtime +90 -print)

The commands above are much more efficient than the first attempt.
However, there is a problem with them. The shell has a maximum command
line length which is imposed by the operating system (the actual limit
varies between systems). This means that while the command expansion
technique will usually work, it will suddenly fail when there are lots
of files to delete. Since the task is to delete unwanted files, this is
precisely the time we don't want things to go wrong.
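
The limit in question is the system's ARG_MAX; on systems that provide
'getconf' you can inspect it directly (a quick check, not part of the
original example - the value printed varies between systems):

```shell
# ARG_MAX is the maximum number of bytes of arguments (plus environment)
# that the exec() family of calls will accept.
getconf ARG_MAX
```

When the expanded file list exceeds this limit, the shell's attempt to
exec '/bin/rm' fails with "Argument list too long" (E2BIG).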

10.1.2 Making Use of 'xargs'
----------------------------

So, is there a way to be more efficient in the use of 'fork()' and
'exec()' without running up against this limit? Yes, we can be almost
optimally efficient by making use of the 'xargs' command. The 'xargs'
command reads arguments from its standard input and builds them into
command lines. We can use it like this:

     find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm

For example, if the files found by 'find' are '/var/tmp/stuff/A',
'/var/tmp/stuff/B' and '/var/tmp/stuff/C', then 'xargs' might issue
the commands:

     /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
     /bin/rm /var/tmp/stuff/C

The above assumes that 'xargs' has a very small maximum command line
length. The real limit is much larger, but the idea is that 'xargs'
will run '/bin/rm' as many times as necessary to get the job done,
given the limits on command line length.
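
The batching can be simulated by capping the argument count with '-n'
and substituting 'echo' for the real command (a sketch; the file names
are invented):

```shell
# -n 2 caps each generated command line at two arguments, so three
# input names are split across two /bin/rm invocations (echo shows
# the command lines instead of running them).
printf '%s\n' /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/C |
    xargs -n 2 echo /bin/rm
# → /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
# → /bin/rm /var/tmp/stuff/C
```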

This usage of 'xargs' is pretty efficient, and the 'xargs' command is
widely implemented (all modern versions of Unix offer it). So far,
then, the news is all good. However, there is bad news too.

10.1.3 Unusual characters in filenames
--------------------------------------

Unix-like systems allow any characters to appear in file names with the
exception of the ASCII NUL character and the slash. Slashes can occur
in path names (as the directory separator) but not in the names of
actual directory entries. This means that the list of files that
'xargs' reads could in fact contain white space characters - spaces,
tabs and newline characters. Since, by default, 'xargs' assumes that
the list of files it is reading uses white space as an argument
separator, it cannot correctly handle the case where a filename
actually includes white space. This makes the default behaviour of
'xargs' almost useless for handling arbitrary data.
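
The failure is easy to reproduce (an illustration with a made-up file
name; 'echo' again stands in for 'rm'):

```shell
# A single file name containing a space is split into two bogus
# arguments; -n 1 makes the splitting visible as one echo per "name".
printf '%s\n' 'my file.txt' | xargs -n 1 echo would-remove:
# → would-remove: my
# → would-remove: file.txt
```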

To solve this problem, GNU findutils introduced the '-print0' action
for 'find'. This uses the ASCII NUL character to separate the entries
in the file list that it produces. This is the ideal choice of
separator since it is the only character that cannot appear within a
path name. The '-0' option to 'xargs' makes it assume that arguments
are separated with ASCII NUL instead of white space. It also turns off
another misfeature in the default behaviour of 'xargs', which is that
it pays attention to quote characters in its input. Some versions of
'xargs' also terminate when they see a lone '_' in the input, but GNU
'xargs' no longer does that (since it has become an optional behaviour
in the Unix standard).

So, putting 'find -print0' together with 'xargs -0' we get this
command:

     find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm

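What '-0' changes can be demonstrated with 'printf' standing in for
'find -print0' (invented names; 'echo' stands in for 'rm'):

```shell
# Each name is terminated by \0, so xargs -0 keeps 'my file.txt'
# together as one argument despite the embedded space.
printf '%s\0' 'my file.txt' 'other.txt' | xargs -0 -n 1 echo kept:
# → kept: my file.txt
# → kept: other.txt
```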
The result is an efficient way of proceeding that correctly handles
all the possible characters that could appear in the list of files to
delete. This is good news. However, there is, as I'm sure you're
expecting, also more bad news. The problem is that this is not a
portable construct; although other versions of Unix (notably
BSD-derived ones) support '-print0', it's not universal. So, is there
a more universal mechanism?

10.1.4 Going back to '-exec'
----------------------------

There is indeed a more universal mechanism, which is a slight
modification to the '-exec' action. The normal '-exec' action assumes
that the command to run is terminated with a semicolon (the semicolon
normally has to be quoted in order to protect it from interpretation
as the shell command separator). The SVR4 edition of Unix introduced a
slight variation, which involves terminating the command with '+'
instead:

     find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+

The above use of '-exec' causes 'find' to build up a long command line
and then issue it. This can be less efficient than some uses of
'xargs'; for example, 'xargs' allows new command lines to be built up
while the previous command is still executing, and allows you to
specify a number of commands to run in parallel. However, the 'find
... -exec ... +' construct has the advantage of wide portability. GNU
findutils did not support '-exec ... +' until version 4.2.12; one of
the reasons for this is that it already had the '-print0' action in
any case.
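
Substituting 'echo' for '/bin/rm' shows the batching that the '+'
terminator produces (a sketch; the directory contents are whatever the
search finds):

```shell
# With '+', all matched names are appended to a single command line
# (up to the system limit), so echo prints them in one invocation
# rather than one line per file.
find /var/tmp/stuff -mtime +90 -exec echo /bin/rm {} +
```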

10.1.5 A more secure version of '-exec'
---------------------------------------

The command above seems to be efficient and portable. However, within
it lurks a security problem. The problem is shared with all the
commands we've tried in this worked example so far, too. The security
problem is a race condition; that is, if it is possible for somebody
to manipulate the filesystem that you are searching while you are
searching it, it is possible for them to persuade your 'find' command
to cause the deletion of a file that you can delete but they normally
cannot.

The problem occurs because the '-exec' action is defined by the POSIX
standard to invoke its command with the same working directory as
'find' had when it was started. This means that the arguments which
replace the {} include a relative path from 'find''s starting point
down to the file that needs to be deleted. For example,

     find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+

might actually issue the command:

     /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd

Notice the file '/var/tmp/stuff/passwd'. Likewise, the command:

     cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+

might actually issue the command:

     /bin/rm stuff/A stuff/B stuff/passwd

If an attacker can rename 'stuff' to something else (making use of
their write permissions in '/var/tmp') they can replace it with a
symbolic link to '/etc'. That means that the '/bin/rm' command will be
invoked on '/etc/passwd'. If you are running your 'find' command as
root, the attacker has just managed to delete a vital file. All they
needed to do to achieve this was replace a subdirectory with a
symbolic link at the vital moment.
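
The mechanics of the race can be reproduced harmlessly in throwaway
directories (everything below is invented for illustration; nothing
outside a temporary directory is touched):

```shell
# Stand-ins: $d/stuff is the directory being cleaned; $d/etc holds the
# "vital" file the attacker wants deleted.
d=$(mktemp -d)
mkdir "$d/stuff" "$d/etc"
touch "$d/stuff/passwd" "$d/etc/passwd"

# find has already computed the path "$d/stuff/passwd"; before rm
# runs, the attacker swaps the directory for a symbolic link:
mv "$d/stuff" "$d/stuff.bak"
ln -s "$d/etc" "$d/stuff"

# The stale path now resolves through the link: the wrong file dies.
rm "$d/stuff/passwd"
ls "$d/etc"        # empty - the "vital" file is gone
rm -rf "$d"        # tidy up
```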

There is, however, a simple solution to the problem. This is an action
which works a lot like '-exec' but doesn't need to traverse a chain of
directories to reach the file that it needs to work on. This is the
'-execdir' action, which was introduced by the BSD family of operating
systems. The command,

     find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+

might delete a set of files by performing these actions:

  1. Change directory to /var/tmp/stuff/foo
  2. Invoke '/bin/rm ./file1 ./file2 ./file3'
  3. Change directory to /var/tmp/stuff/bar
  4. Invoke '/bin/rm ./file99 ./file100 ./file101'

This is a much more secure method. We are no longer exposed to a race
condition. For many typical uses of 'find', this is the best strategy.
It's reasonably efficient, but the length of the command line is
limited not just by the operating system limits, but also by how many
files we actually need to delete from each directory.
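
Since '-execdir' is not limited to deletion, the same pattern protects
any per-file command. For instance (a hypothetical cleanup, compressing
old logs rather than deleting them):

```shell
# gzip is invoked from each file's own directory, so the arguments are
# plain ./name paths and no attacker-controlled chain of directories
# is re-traversed between the search and the command.
find /var/tmp/stuff -name '*.log' -mtime +90 -execdir gzip {} +
```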

Is it possible to do any better? In the case of general file
processing, no. However, in the specific case of deleting files it is
indeed possible to do better.

10.1.6 Using the '-delete' action
---------------------------------

The most efficient and secure method of solving this problem is to use
the '-delete' action:

     find /var/tmp/stuff -mtime +90 -delete

This alternative is more efficient than any of the '-exec' or
'-execdir' actions, since it entirely avoids the overhead of forking a
new process and using 'exec' to run '/bin/rm'. It is also normally
more efficient than 'xargs' for the same reason. The file deletion is
performed from the directory containing the entry to be deleted, so
the '-delete' action has the same security advantages as the
'-execdir' action has.

The '-delete' action was introduced by the BSD family of operating
systems.
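
Two points worth knowing when using it (true of GNU 'find'; other
implementations may differ): '-delete' implies '-depth', so a
directory's contents are visited before the directory itself, and a
follow-up pass can remove directories that the cleanup has emptied:

```shell
# Delete the old files, then remove any directories left empty.
# Only empty directories match -empty, so nothing that still holds
# newer files is touched by the second pass.
find /var/tmp/stuff -mtime +90 -delete
find /var/tmp/stuff -type d -empty -delete
```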

10.1.7 Improving things still further
-------------------------------------

Is it possible to improve things still further? Not without either
modifying the system-call interface to the operating system or having
more specific knowledge of the layout of the filesystem and disk I/O
subsystem, or both.

The 'find' command traverses the filesystem, reading directories. It
then issues a separate system call for each file to be deleted. If we
could modify the operating system, there are potential gains that
could be made:

   * We could have a system call to which we pass more than one
     filename for deletion
   * Alternatively, we could pass in a list of inode numbers (on
     GNU/Linux systems, 'readdir()' also returns the inode number of
     each directory entry) to be deleted.

The above possibilities sound interesting, but from the kernel's point
of view it is difficult to enforce standard Unix access controls for
such processing by inode number. Such a facility would probably need
to be restricted to the superuser.

Another way of improving performance would be to increase the
parallelism of the process. For example, if the directory hierarchy we
are searching is actually spread across a number of disks, we might
somehow be able to arrange for 'find' to process each disk in
parallel. In practice GNU 'find' doesn't have such an intimate
understanding of the system's filesystem layout and disk I/O
subsystem.

However, since the system administrator can have such an
understanding, they can take advantage of it like so:

     find /var/tmp/stuff1 -mtime +90 -delete &
     find /var/tmp/stuff2 -mtime +90 -delete &
     find /var/tmp/stuff3 -mtime +90 -delete &
     find /var/tmp/stuff4 -mtime +90 -delete &
     wait

In the example above, four separate instances of 'find' are used to
search four subdirectories in parallel. The 'wait' command simply
waits for all of these to complete. Whether this approach is more or
less efficient than a single instance of 'find' depends on a number of
things:

   * Are the directories being searched in parallel actually on
     separate disks? If not, this parallel search might just result in
     a lot of disk head movement and so the speed might even be slower.
   * Other activity - are other programs also doing things on those
     disks?
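
GNU 'xargs' offers another route to parallelism with its '-P' option:
a single 'find' walks the tree while several 'rm' processes run
concurrently (a sketch; note that it inherits the security caveats of
the 'find | xargs' pipeline discussed earlier):

```shell
# -P 4 runs up to four rm processes at once; -n 100 caps each command
# at 100 names so the work is spread across the workers.
find /var/tmp/stuff -mtime +90 -print0 | xargs -0 -P 4 -n 100 /bin/rm -f
```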

10.1.8 Conclusion
-----------------

The fastest and most secure way to delete files with the help of
'find' is to use '-delete'. Using 'xargs -0 -P N' can also make
effective use of the disk, but it is not as secure.

In the case where we're doing things other than deleting files, the
most secure alternative is '-execdir ... +', but this is not as
portable as the insecure action '-exec ... +'.

The '-delete' action is not completely portable, but the only other
possibility which is as secure ('-execdir') is no more portable. The
most efficient portable alternative is '-exec ... +', but this is
insecure and isn't supported by versions of GNU findutils prior to
4.2.12.