
9.1 Robot Exclusion
===================

It is extremely easy to make Wget wander aimlessly around a web site,
sucking up all the available data in the process.  ‘wget -r SITE’, and
you’re set.  Great?  Not for the server admin.

   As long as Wget is only retrieving static pages, and doing it at a
reasonable rate (see the ‘--wait’ option), there’s not much of a
problem.  The trouble is that Wget can’t tell the difference between
the smallest static page and the most demanding CGI.  A site I know
has a section handled by a CGI Perl script that converts Info files to
HTML on the fly.  The script is slow, but works well enough for human
users viewing an occasional Info file.  However, when someone’s
recursive Wget download stumbles upon the index page that links to all
the Info files through the script, the system is brought to its knees
without providing anything useful to the user.  (This conversion could
just as well be done locally; the ‘info’ command provides access to
the Info documentation for all GNU software installed on a system.)
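
   For instance, a recursive download that pauses two seconds between
requests (the host name below is only a placeholder) could be started
with:

     wget -r --wait=2 http://www.example.com/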

   To avoid this kind of accident, as well as to preserve privacy for
documents that need to be protected from well-behaved robots, the
concept of “robot exclusion” was invented.  The idea is that server
administrators and document authors can specify which portions of the
site they wish to protect from robots and which they will permit
robots to access.

   The most popular mechanism, and the de facto standard supported by
all the major robots, is the “Robots Exclusion Standard” (RES) written
by Martijn Koster et al. in 1994.  It specifies the format of a text
file containing directives that instruct the robots which URL paths to
avoid.  To be found by the robots, the specifications must be placed
in ‘/robots.txt’ in the server root, which the robots are expected to
download and parse.
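
   For illustration, a minimal ‘/robots.txt’ in the RES format might
look like this (the paths shown are only examples):

     User-agent: *
     Disallow: /cgi-bin/
     Disallow: /private/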

   Although Wget is not a web robot in the strictest sense of the word,
it can download large parts of a site without the user intervening to
request each individual page.  Because of that, Wget honors RES when
downloading recursively.  For instance, when you issue:

     wget -r http://www.example.com/

   First, the index of ‘www.example.com’ will be downloaded.  If Wget
finds that it wants to download more documents from that server, it
will request ‘http://www.example.com/robots.txt’ and, if found, use it
for further downloads.  ‘robots.txt’ is loaded only once per server.

   Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
<http://www.robotstxt.org/orig.html>.  As of version 1.8, Wget has
supported the additional directives specified in the Internet Draft
‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”.
The draft, which as far as I know never made it to an RFC, is available
at <http://www.robotstxt.org/norobots-rfc.txt>.

   This manual no longer includes the text of the Robot Exclusion
Standard.

   The second, lesser-known mechanism enables the author of an
individual document to specify whether they want the links from the
file to be followed by a robot.  This is achieved using the ‘META’
tag, like this:

     <meta name="robots" content="nofollow">

   This is explained in some detail at
<http://www.robotstxt.org/meta.html>.  Wget supports this method of
robot exclusion in addition to the usual ‘/robots.txt’ exclusion.
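
   For example, a page whose author wants robots neither to index it
nor to follow its links might carry the tag in its ‘<head>’ like this
(only the ‘nofollow’ part is relevant to a mirroring tool such as
Wget, since it does no indexing):

     <head>
       <title>Some page</title>
       <meta name="robots" content="noindex, nofollow">
     </head>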

   If you know what you are doing and really really wish to turn off
the robot exclusion, set the ‘robots’ variable to ‘off’ in your
‘.wgetrc’.  You can achieve the same effect from the command line
using the ‘-e’ switch, e.g.  ‘wget -e robots=off URL...’.
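
   In ‘.wgetrc’ that setting looks like this:

     robots = off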