3 Recursive Download
********************

GNU Wget is capable of traversing parts of the Web (or a single HTTP or
FTP server), following links and directory structure.  We refer to this
as “recursive retrieval”, or “recursion”.
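
   For instance, a minimal recursive retrieval (the URL below is only a
placeholder) might look like this:

     wget -r https://www.example.com/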

   With HTTP URLs, Wget retrieves and parses the HTML or CSS from the
given URL, retrieving the files the document refers to, through markup
like ‘href’ or ‘src’, or CSS URI values specified using the ‘url()’
functional notation.  If the freshly downloaded file is also of type
‘text/html’, ‘application/xhtml+xml’, or ‘text/css’, it will be parsed
and followed further.

   Recursive retrieval of HTTP and HTML/CSS content is “breadth-first”.
This means that Wget first downloads the requested document, then the
documents linked from that document, then the documents linked by them,
and so on.  In other words, Wget first downloads the documents at depth
1, then those at depth 2, and so on until the specified maximum depth.

   The maximum “depth” to which the retrieval may descend is specified
with the ‘-l’ option.  The default maximum depth is five layers.
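
   For example, to stop the recursion at two levels instead (again with
a placeholder URL), one might run:

     wget -r -l 2 https://www.example.com/

   Specifying ‘-l inf’ removes the depth limit altogether.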

   When retrieving an FTP URL recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally.  FTP retrieval is also limited by the ‘depth’ parameter.
Unlike HTTP recursion, FTP recursion is performed depth-first.
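
   A recursive FTP retrieval looks the same on the command line; the
host and path below are placeholders:

     wget -r ftp://ftp.example.com/pub/some/directory/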

   By default, Wget will create a local directory tree, corresponding to
the one found on the remote server.
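
   For example, recursively retrieving ‘https://www.example.com/docs/’
will, by default, save files under a local ‘www.example.com/docs/’
hierarchy; the ‘-P’ option only changes the prefix that tree is placed
under:

     wget -r -P downloads https://www.example.com/docs/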

   Recursive retrieval has a number of applications, the most important
of which is mirroring.  It is also useful for WWW presentations, and in
any other situation where a slow network connection is best worked
around by storing the files locally.
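
   For mirroring in particular, the ‘--mirror’ (‘-m’) shorthand enables
recursion with no depth limit plus time-stamping; adding
‘--convert-links’ makes the saved copy browsable locally (placeholder
URL again):

     wget --mirror --convert-links https://www.example.com/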

   You should be warned that recursive downloads can overload the remote
servers.  Because of that, many administrators frown upon them and may
ban access from your site if they detect very fast downloads of large
amounts of content.  When downloading from Internet servers, consider
using the ‘-w’ option to introduce a delay between accesses to the
server.  The download will take a while longer, but the server
administrator will not be alarmed by your rudeness.
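
   For example, the following asks Wget to pause two seconds between
requests; ‘--random-wait’ additionally varies that delay so the accesses
look less mechanical (placeholder URL):

     wget -r -w 2 --random-wait https://www.example.com/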

   Of course, recursive download may cause problems on your machine.  If
left to run unchecked, it can easily fill up the disk.  If downloading
from a local network, it can also take up bandwidth on the system, as
well as consume memory and CPU.
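
   One way to bound the amount of data retrieved, if it suits your use
case, is the quota option; for instance, the following stops fetching
new files once roughly 100 megabytes have been downloaded:

     wget -r -Q 100m https://www.example.com/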

   Try to specify the criteria that match the kind of download you are
trying to achieve.  If you want to download only one page, use
‘--page-requisites’ without any additional recursion.  If you want to
download things under one directory, use ‘-np’ to avoid downloading
things from other directories.  If you want to download all the files
from one directory, use ‘-l 1’ to make sure the recursion depth never
exceeds one.  See ⇒Following Links, for more information about this.
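
   Concretely, the three cases above might look like this (URLs and
paths are placeholders):

     wget --page-requisites https://www.example.com/page.html
     wget -r -np https://www.example.com/docs/manual/
     wget -r -l 1 https://www.example.com/files/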

   Recursive retrieval should be used with care.  Don’t say you were not
warned.