2.11 Recursive Retrieval Options
================================

‘-r’
‘--recursive’
     Turn on recursive retrieving.  ⇒Recursive Download, for more
     details.  The default maximum depth is 5.

‘-l DEPTH’
‘--level=DEPTH’
     Specify recursion maximum depth level DEPTH (⇒Recursive
     Download).

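     For example, to retrieve recursively while following links at most
     three levels deep from the starting page, one might run (the URL
     here is purely illustrative):

          wget -r -l 3 http://example.com/index.html
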
‘--delete-after’
     This option tells Wget to delete every single file it downloads,
     _after_ having done so.  It is useful for pre-fetching popular
     pages through a proxy, e.g.:

          wget -r -nd --delete-after http://whatever.com/~popular/page/

     The ‘-r’ option is to retrieve recursively, and ‘-nd’ to not
     create directories.

     Note that ‘--delete-after’ deletes files on the local machine.  It
     does not issue the ‘DELE’ command to remote FTP sites, for
     instance.  Also note that when ‘--delete-after’ is specified,
     ‘--convert-links’ is ignored, so ‘.orig’ files are simply not
     created in the first place.

‘-k’
‘--convert-links’
     After the download is complete, convert the links in the document
     to make them suitable for local viewing.  This affects not only
     the visible hyperlinks, but any part of the document that links to
     external content, such as embedded images, links to style sheets,
     hyperlinks to non-HTML content, etc.

     Each link will be changed in one of two ways:

        • The links to files that have been downloaded by Wget will be
          changed to refer to the file they point to as a relative
          link.

          Example: if the downloaded file ‘/foo/doc.html’ links to
          ‘/bar/img.gif’, also downloaded, then the link in ‘doc.html’
          will be modified to point to ‘../bar/img.gif’.  This kind of
          transformation works reliably for arbitrary combinations of
          directories.

        • The links to files that have not been downloaded by Wget
          will be changed to include host name and absolute path of
          the location they point to.

          Example: if the downloaded file ‘/foo/doc.html’ links to
          ‘/bar/img.gif’ (or to ‘../bar/img.gif’), then the link in
          ‘doc.html’ will be modified to point to
          ‘http://HOSTNAME/bar/img.gif’.

     Because of this, local browsing works reliably: if a linked file
     was downloaded, the link will refer to its local name; if it was
     not downloaded, the link will refer to its full Internet address
     rather than presenting a broken link.  The fact that the former
     links are converted to relative links ensures that you can move
     the downloaded hierarchy to another directory.

     Note that only at the end of the download can Wget know which
     links have been downloaded.  Because of that, the work done by
     ‘-k’ will be performed at the end of all the downloads.

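     For instance, a simple way to fetch a page tree and rewrite its
     links for offline viewing would be (illustrative URL):

          wget -r -k http://example.com/docs/
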
‘--convert-file-only’
     This option converts only the filename part of the URLs, leaving
     the rest of the URLs untouched.  This filename part is sometimes
     referred to as the "basename", although we avoid that term here in
     order not to cause confusion.

     It works particularly well in conjunction with
     ‘--adjust-extension’, although this coupling is not enforced.  It
     is useful for populating Internet caches with files downloaded
     from different hosts.

     Example: if some link points to ‘//foo.com/bar.cgi?xyz’ with
     ‘--adjust-extension’ asserted and its local destination is
     intended to be ‘./foo.com/bar.cgi?xyz.css’, then the link would be
     converted to ‘//foo.com/bar.cgi?xyz.css’.  Note that only the
     filename part has been modified.  The rest of the URL has been
     left untouched, including the net path (‘//’) which would
     otherwise be processed by Wget and converted to the effective
     scheme (i.e. ‘http://’).

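     One illustrative way to run the kind of cache-population job
     described above (the host spanning via ‘-H’ and the URL are merely
     assumptions here) might be:

          wget -r -H --adjust-extension --convert-file-only http://foo.com/
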
‘-K’
‘--backup-converted’
     When converting a file, back up the original version with a
     ‘.orig’ suffix.  Affects the behavior of ‘-N’ (⇒HTTP
     Time-Stamping Internals).

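     For example, the following (illustrative) invocation combines a
     time-stamped recursive download with link conversion, keeping the
     unconverted originals around as ‘.orig’ files:

          wget -r -N -k -K http://example.com/
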
‘-m’
‘--mirror’
     Turn on options suitable for mirroring.  This option turns on
     recursion and time-stamping, sets infinite recursion depth and
     keeps FTP directory listings.  It is currently equivalent to
     ‘-r -N -l inf --no-remove-listing’.

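     In other words, the two commands below (the URL is illustrative)
     currently do the same thing:

          wget -m http://example.com/
          wget -r -N -l inf --no-remove-listing http://example.com/
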
‘-p’
‘--page-requisites’
     This option causes Wget to download all the files that are
     necessary to properly display a given HTML page.  This includes
     such things as inlined images, sounds, and referenced stylesheets.

     Ordinarily, when downloading a single HTML page, any requisite
     documents that may be needed to display it properly are not
     downloaded.  Using ‘-r’ together with ‘-l’ can help, but since
     Wget does not ordinarily distinguish between external and inlined
     documents, one is generally left with “leaf documents” that are
     missing their requisites.

     For instance, say document ‘1.html’ contains an ‘<IMG>’ tag
     referencing ‘1.gif’ and an ‘<A>’ tag pointing to external document
     ‘2.html’.  Say that ‘2.html’ is similar but that its image is
     ‘2.gif’ and it links to ‘3.html’.  Say this continues up to some
     arbitrarily high number.

     If one executes the command:

          wget -r -l 2 http://SITE/1.html

     then ‘1.html’, ‘1.gif’, ‘2.html’, ‘2.gif’, and ‘3.html’ will be
     downloaded.  As you can see, ‘3.html’ is without its requisite
     ‘3.gif’ because Wget is simply counting the number of hops (up to
     2) away from ‘1.html’ in order to determine where to stop the
     recursion.  However, with this command:

          wget -r -l 2 -p http://SITE/1.html

     all the above files _and_ ‘3.html’’s requisite ‘3.gif’ will be
     downloaded.  Similarly,

          wget -r -l 1 -p http://SITE/1.html

     will cause ‘1.html’, ‘1.gif’, ‘2.html’, and ‘2.gif’ to be
     downloaded.  One might think that:

          wget -r -l 0 -p http://SITE/1.html

     would download just ‘1.html’ and ‘1.gif’, but unfortunately this
     is not the case, because ‘-l 0’ is equivalent to ‘-l inf’—that is,
     infinite recursion.  To download a single HTML page (or a handful
     of them, all specified on the command-line or in a ‘-i’ URL input
     file) and its (or their) requisites, simply leave off ‘-r’ and
     ‘-l’:

          wget -p http://SITE/1.html

     Note that Wget will behave as if ‘-r’ had been specified, but only
     that single page and its requisites will be downloaded.  Links
     from that page to external documents will not be followed.
     Actually, to download a single page and all its requisites (even
     if they exist on separate websites), and make sure the lot
     displays properly locally, this author likes to use a few options
     in addition to ‘-p’:

          wget -E -H -k -K -p http://SITE/DOCUMENT

     To finish off this topic, it’s worth knowing that Wget’s idea of
     an external document link is any URL specified in an ‘<A>’ tag, an
     ‘<AREA>’ tag, or a ‘<LINK>’ tag other than ‘<LINK
     REL="stylesheet">’.

‘--strict-comments’
     Turn on strict parsing of HTML comments.  The default is to
     terminate comments at the first occurrence of ‘-->’.

     According to specifications, HTML comments are expressed as SGML
     “declarations”.  A declaration is special markup that begins with
     ‘<!’ and ends with ‘>’, such as ‘<!DOCTYPE ...>’, and may contain
     comments between a pair of ‘--’ delimiters.  HTML comments are
     “empty declarations”, SGML declarations without any non-comment
     text.  Therefore, ‘<!--foo-->’ is a valid comment, and so is
     ‘<!--one-- --two-->’, but ‘<!--1--2-->’ is not.

     On the other hand, most HTML writers don’t perceive comments as
     anything other than text delimited with ‘<!--’ and ‘-->’, which is
     not quite the same.  For example, something like ‘<!------------>’
     works as a valid comment as long as the number of dashes is a
     multiple of four (!).  If not, the comment technically lasts until
     the next ‘--’, which may be at the other end of the document.
     Because of this, many popular browsers completely ignore the
     specification and implement what users have come to expect:
     comments delimited with ‘<!--’ and ‘-->’.

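     To make the multiple-of-four rule concrete, consider these purely
     illustrative fragments:

          <!---->            (4 dashes)
          <!-------->        (8 dashes)
          <!------>          (6 dashes)

     Under strict parsing the first is a single empty comment and the
     second is two empty comments, so both are valid; in the third, the
     final ‘--’ opens a comment that is never closed, so the
     declaration technically runs on until the next ‘--’ in the
     document.
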
     Until version 1.9, Wget interpreted comments strictly, which
     resulted in missing links in many web pages that displayed fine in
     browsers, but had the misfortune of containing non-compliant
     comments.  Beginning with version 1.9, Wget has joined the ranks
     of clients that implement “naive” comments, terminating each
     comment at the first occurrence of ‘-->’.

     If, for whatever reason, you want strict comment parsing, use this
     option to turn it on.

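     For example, a recursive download with strict comment parsing
     enabled would be invoked as (illustrative URL):

          wget --strict-comments -r http://example.com/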