wget: Recursive Download

3 Recursive Download
********************

GNU Wget is capable of traversing parts of the Web (or a single HTTP or
FTP server), following links and directory structure. We refer to this
as “recursive retrieval”, or “recursion”.

With HTTP URLs, Wget retrieves and parses the HTML or CSS from the
given URL, then retrieves the files the document refers to through
markup such as ‘href’ or ‘src’, or through CSS URI values specified
using the ‘url()’ functional notation. If the freshly downloaded file
is also of type ‘text/html’, ‘application/xhtml+xml’, or ‘text/css’, it
will be parsed and followed further.

Recursive retrieval of HTTP and HTML/CSS content is “breadth-first”.
This means that Wget first downloads the requested document, then the
documents linked from that document, then the documents linked by them,
and so on. In other words, Wget first downloads the documents at depth
1, then those at depth 2, and so on until the specified maximum depth.

The maximum “depth” to which the retrieval may descend is specified
with the ‘-l’ option. The default maximum depth is five layers.
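
As a minimal sketch of limiting the depth, the shell function below
wraps a recursive retrieval. The function name ‘fetch_to_depth’ and the
URL in the usage comment are hypothetical and serve only as
illustration; ‘-r’ is the option that turns recursion on.

```shell
# Hypothetical helper: fetch URL recursively, at most DEPTH levels deep.
# -r turns on recursive retrieval; -l sets the maximum depth.
fetch_to_depth() {
    depth=$1
    url=$2
    wget -r -l "$depth" "$url"
}

# Usage (placeholder URL):
#   fetch_to_depth 2 https://example.com/docs/
```

With a depth of 2, this should retrieve the requested document, the
documents it links to, and the documents those link to in turn.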

When retrieving an FTP URL recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally. FTP retrieval is also limited by the ‘depth’ parameter.
Unlike HTTP recursion, FTP recursion is performed depth-first.

By default, Wget will create a local directory tree corresponding to
the one found on the remote server.

Recursive retrieval has a number of applications, the most important
of which is mirroring. It is also useful for WWW presentations, and in
any other situation where slow network connections can be bypassed by
storing the files locally.

You should be warned that recursive downloads can overload remote
servers. Because of that, many administrators frown upon them and may
ban access from your site if they detect very fast downloads of large
amounts of content. When downloading from Internet servers, consider
using the ‘-w’ option to introduce a delay between accesses to the
server. The download will take a while longer, but the server
administrator will not be alarmed by your rudeness.
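
A minimal sketch of such a delayed retrieval follows. The function
name ‘fetch_politely’ and the URL in the usage comment are hypothetical;
the argument to ‘-w’ is a number of seconds.

```shell
# Hypothetical helper: recursive retrieval that pauses between requests.
# -w SECONDS makes Wget wait that long between retrievals.
fetch_politely() {
    delay=$1
    url=$2
    wget -r -w "$delay" "$url"
}

# Usage (placeholder URL), waiting two seconds between retrievals:
#   fetch_politely 2 https://example.com/docs/
```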

Recursive download may also cause problems on your own machine. If
left to run unchecked, it can easily fill up the disk. When downloading
from a local network, it can also take up bandwidth on the system, as
well as consume memory and CPU.
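
One possible way to keep a retrieval in check is sketched below. It
relies on two options not discussed in this section, ‘-Q’ (download
quota) and ‘--limit-rate’ (bandwidth cap). The function name
‘fetch_capped’ and the values shown are hypothetical, and the quota is
only checked between files, so it does not cut off a single large file.

```shell
# Hypothetical helper: recursive retrieval with a download quota and a
# bandwidth cap.  The values passed in are illustrative only.
fetch_capped() {
    quota=$1    # e.g. 100m: stop fetching new files after about 100 MB
    rate=$2     # e.g. 200k: limit download speed to about 200 KB/s
    url=$3
    wget -r -Q "$quota" --limit-rate="$rate" "$url"
}

# Usage (placeholder URL):
#   fetch_capped 100m 200k https://example.com/docs/
```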

Try to specify criteria that match the kind of download you are
trying to achieve. If you want to download only one page, use
‘--page-requisites’ without any additional recursion. If you want to
download things under one directory, use ‘-np’ to avoid downloading
things from other directories. If you want to download all the files
from one directory, use ‘-l 1’ to make sure the recursion depth never
exceeds one. See ⇒Following Links for more information.
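
The three cases above might be sketched as follows. The function names
and the URLs in the usage comments are hypothetical; the options are
the ones named in the preceding paragraph, combined with ‘-r’ where
recursion is wanted.

```shell
# One page, plus the images, stylesheets and similar files needed to
# display it; no further recursion.
fetch_single_page() {
    wget --page-requisites "$1"
}

# Everything under one directory, never ascending to its parent.
fetch_below_dir() {
    wget -r -np "$1"
}

# The files linked from one directory, with recursion depth held at one.
fetch_dir_files() {
    wget -r -l 1 "$1"
}

# Usage (placeholder URLs):
#   fetch_single_page https://example.com/article.html
#   fetch_below_dir   https://example.com/docs/
#   fetch_dir_files   https://example.com/files/
```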

Recursive retrieval should be used with care. Don’t say you were not
warned.
