2.11 Recursive Retrieval Options
================================

‘-r’
‘--recursive’
     Turn on recursive retrieving.  ⇒Recursive Download, for more
     details.  The default maximum depth is 5.

‘-l DEPTH’
‘--level=DEPTH’
     Specify recursion maximum depth level DEPTH (⇒Recursive
     Download).
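
     For example, to follow links at most three levels deep from the
     starting page, one might run (the URL here is just a placeholder):

          wget -r -l 3 http://example.com/index.html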

‘--delete-after’
     This option tells Wget to delete every single file it downloads,
     _after_ having done so.  It is useful for pre-fetching popular
     pages through a proxy, e.g.:

          wget -r -nd --delete-after http://whatever.com/~popular/page/

     The ‘-r’ option is to retrieve recursively, and ‘-nd’ to not create
     directories.

     Note that ‘--delete-after’ deletes files on the local machine.  It
     does not issue the ‘DELE’ command to remote FTP sites, for
     instance.  Also note that when ‘--delete-after’ is specified,
     ‘--convert-links’ is ignored, so ‘.orig’ files are simply not
     created in the first place.

‘-k’
‘--convert-links’
     After the download is complete, convert the links in the document
     to make them suitable for local viewing.  This affects not only the
     visible hyperlinks, but any part of the document that links to
     external content, such as embedded images, links to style sheets,
     hyperlinks to non-HTML content, etc.

     Each link will be changed in one of two ways:

        • The links to files that have been downloaded by Wget will be
          changed to refer to the file they point to as a relative link.

          Example: if the downloaded file ‘/foo/doc.html’ links to
          ‘/bar/img.gif’, also downloaded, then the link in ‘doc.html’
          will be modified to point to ‘../bar/img.gif’.  This kind of
          transformation works reliably for arbitrary combinations of
          directories.

        • The links to files that have not been downloaded by Wget will
          be changed to include host name and absolute path of the
          location they point to.

          Example: if the downloaded file ‘/foo/doc.html’ links to
          ‘/bar/img.gif’ (or to ‘../bar/img.gif’), then the link in
          ‘doc.html’ will be modified to point to
          ‘http://HOSTNAME/bar/img.gif’.

     Because of this, local browsing works reliably: if a linked file
     was downloaded, the link will refer to its local name; if it was
     not downloaded, the link will refer to its full Internet address
     rather than presenting a broken link.  The fact that the former
     links are converted to relative links ensures that you can move the
     downloaded hierarchy to another directory.

     Note that only at the end of the download can Wget know which links
     have been downloaded.  Because of that, the work done by ‘-k’ will
     be performed at the end of all the downloads.
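
     For example, one might download a site two levels deep and then
     rewrite its links for local viewing with (placeholder URL):

          wget -r -l 2 -k http://example.com/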

‘--convert-file-only’
     This option converts only the filename part of the URLs, leaving
     the rest of the URLs untouched.  This filename part is sometimes
     referred to as the "basename", although we avoid that term here in
     order not to cause confusion.

     It works particularly well in conjunction with
     ‘--adjust-extension’, although this coupling is not enforced.  It
     proves useful to populate Internet caches with files downloaded
     from different hosts.

     Example: if some link points to ‘//foo.com/bar.cgi?xyz’ with
     ‘--adjust-extension’ asserted and its local destination is intended
     to be ‘./foo.com/bar.cgi?xyz.css’, then the link would be converted
     to ‘//foo.com/bar.cgi?xyz.css’.  Note that only the filename part
     has been modified.  The rest of the URL has been left untouched,
     including the net path (‘//’) which would otherwise be processed by
     Wget and converted to the effective scheme (i.e., ‘http://’).
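
     A possible invocation combining it with ‘--adjust-extension’
     (‘-E’) while spanning hosts (‘-H’); the URL is a placeholder:

          wget -r -H -E --convert-file-only http://example.com/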

‘-K’
‘--backup-converted’
     When converting a file, back up the original version with a ‘.orig’
     suffix.  Affects the behavior of ‘-N’ (⇒HTTP Time-Stamping
     Internals).
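
     For example (placeholder URL), the following keeps pristine ‘.orig’
     copies of the files that ‘-k’ rewrites, so that ‘-N’ has the
     unconverted originals to compare against:

          wget -r -N -k -K http://example.com/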

‘-m’
‘--mirror’
     Turn on options suitable for mirroring.  This option turns on
     recursion and time-stamping, sets infinite recursion depth and
     keeps FTP directory listings.  It is currently equivalent to ‘-r -N
     -l inf --no-remove-listing’.
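
     The following two commands are therefore interchangeable
     (placeholder URL):

          wget -m http://example.com/
          wget -r -N -l inf --no-remove-listing http://example.com/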

‘-p’
‘--page-requisites’
     This option causes Wget to download all the files that are
     necessary to properly display a given HTML page.  This includes
     such things as inlined images, sounds, and referenced stylesheets.

     Ordinarily, when downloading a single HTML page, any requisite
     documents that may be needed to display it properly are not
     downloaded.  Using ‘-r’ together with ‘-l’ can help, but since Wget
     does not ordinarily distinguish between external and inlined
     documents, one is generally left with “leaf documents” that are
     missing their requisites.

     For instance, say document ‘1.html’ contains an ‘<IMG>’ tag
     referencing ‘1.gif’ and an ‘<A>’ tag pointing to external document
     ‘2.html’.  Say that ‘2.html’ is similar but that its image is
     ‘2.gif’ and it links to ‘3.html’.  Say this continues up to some
     arbitrarily high number.

     If one executes the command:

          wget -r -l 2 http://SITE/1.html

     then ‘1.html’, ‘1.gif’, ‘2.html’, ‘2.gif’, and ‘3.html’ will be
     downloaded.  As you can see, ‘3.html’ is without its requisite
     ‘3.gif’ because Wget is simply counting the number of hops (up to
     2) away from ‘1.html’ in order to determine where to stop the
     recursion.  However, with this command:

          wget -r -l 2 -p http://SITE/1.html

     all the above files _and_ ‘3.html’’s requisite ‘3.gif’ will be
     downloaded.  Similarly,

          wget -r -l 1 -p http://SITE/1.html

     will cause ‘1.html’, ‘1.gif’, ‘2.html’, and ‘2.gif’ to be
     downloaded.  One might think that:

          wget -r -l 0 -p http://SITE/1.html

     would download just ‘1.html’ and ‘1.gif’, but unfortunately this is
     not the case, because ‘-l 0’ is equivalent to ‘-l inf’—that is,
     infinite recursion.  To download a single HTML page (or a handful
     of them, all specified on the command-line or in a ‘-i’ URL input
     file) and its (or their) requisites, simply leave off ‘-r’ and
     ‘-l’:

          wget -p http://SITE/1.html

     Note that Wget will behave as if ‘-r’ had been specified, but only
     that single page and its requisites will be downloaded.  Links from
     that page to external documents will not be followed.  Actually, to
     download a single page and all its requisites (even if they exist
     on separate websites), and make sure the lot displays properly
     locally, this author likes to use a few options in addition to
     ‘-p’:

          wget -E -H -k -K -p http://SITE/DOCUMENT

     To finish off this topic, it’s worth knowing that Wget’s idea of an
     external document link is any URL specified in an ‘<A>’ tag, an
     ‘<AREA>’ tag, or a ‘<LINK>’ tag other than ‘<LINK
     REL="stylesheet">’.

‘--strict-comments’
     Turn on strict parsing of HTML comments.  The default is to
     terminate comments at the first occurrence of ‘-->’.

     According to specifications, HTML comments are expressed as SGML
     “declarations”.  A declaration is special markup that begins with
     ‘<!’ and ends with ‘>’, such as ‘<!DOCTYPE ...>’, and may contain
     comments between a pair of ‘--’ delimiters.  HTML comments are
     “empty declarations”, SGML declarations without any non-comment
     text.  Therefore, ‘<!--foo-->’ is a valid comment, and so is
     ‘<!--one-- --two-->’, but ‘<!--1--2-->’ is not.

     On the other hand, most HTML writers don’t perceive comments as
     anything other than text delimited with ‘<!--’ and ‘-->’, which is
     not quite the same.  For example, something like ‘<!------------>’
     works as a valid comment as long as the number of dashes is a
     multiple of four (!).  If not, the comment technically lasts until
     the next ‘--’, which may be at the other end of the document.
     Because of this, many popular browsers completely ignore the
     specification and implement what users have come to expect:
     comments delimited with ‘<!--’ and ‘-->’.

     Until version 1.9, Wget interpreted comments strictly, which
     resulted in missing links in many web pages that displayed fine in
     browsers, but had the misfortune of containing non-compliant
     comments.  Beginning with version 1.9, Wget has joined the ranks of
     clients that implement “naive” comments, terminating each comment
     at the first occurrence of ‘-->’.

     If, for whatever reason, you want strict comment parsing, use this
     option to turn it on.
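
     For example (placeholder URL):

          wget --strict-comments -r http://example.com/index.html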