wget: Spanning Hosts

1 
1 4.1 Spanning Hosts
1 ==================
1 
1 Wget’s recursive retrieval normally refuses to visit hosts other than
1 the one you specified on the command line.  This is a reasonable
1 default; without it, every retrieval would have the potential to turn
1 your Wget into a small version of Google.
1 
1    However, visiting different hosts, or “host spanning,” is sometimes a
1 useful option.  Maybe the images are served from a different server.
1 Maybe you’re mirroring a site that consists of pages interlinked between
1 three servers.  Maybe the server has two equivalent names, and the HTML
1 pages refer to both interchangeably.
1 
1 Span to any host—‘-H’
1 
1      The ‘-H’ option turns on host spanning, thus allowing Wget’s
1      recursive run to visit any host referenced by a link.  Unless
1      sufficient recursion-limiting criteria (such as a depth limit)
1      are applied, these foreign hosts will typically link to yet more
1      hosts, and so on until Wget ends up downloading much more data
1      than you intended.
1 
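One way to keep ‘-H’ in check is to pair it with a depth limit.  The
following is a minimal sketch, not a recommended invocation: it assumes
the ‘-l’ (recursion depth) option, and the helper name
‘span_with_depth’, the ‘DRY_RUN’ variable, and the URL are placeholders
introduced only for illustration.

```shell
# Sketch: host spanning bounded by a recursion depth.
# DRY_RUN defaults to "echo", so the command is printed, not executed.
# Run with DRY_RUN= (empty) to actually start the download.
DRY_RUN=${DRY_RUN-echo}

span_with_depth() {
    # $1 = start URL (placeholder), $2 = maximum recursion depth
    $DRY_RUN wget -r -H -l "$2" "$1"
}

# Follow links to any host, but no deeper than two levels.
span_with_depth http://www.example.com/ 2
```

With the default dry-run setting this only prints the command line, so
the limits can be checked before any data is fetched.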
1 Limit spanning to certain domains—‘-D’
1 
1      The ‘-D’ option allows you to specify the domains that will be
1      followed, thus limiting the recursion only to the hosts that belong
1      to these domains.  Obviously, this makes sense only in conjunction
1      with ‘-H’.  A typical example would be downloading the contents of
1      ‘www.example.com’, but allowing downloads from
1      ‘images.example.com’, etc.:
1 
1           wget -rH -Dexample.com http://www.example.com/
1 
1      You can specify more than one domain by separating them with
1      commas, e.g.  ‘-Ddomain1.com,domain2.com’.
1 
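The same kind of command can be written with long option names, which
may read more clearly in scripts.  This sketch assumes the long forms
‘--recursive’, ‘--span-hosts’, and ‘--domains’; the second domain
‘example.net’, the helper name ‘mirror_two_domains’, and the ‘DRY_RUN’
variable are placeholders introduced for illustration.

```shell
# Sketch: limit host spanning to two domains, using long option names.
# DRY_RUN defaults to "echo", so the command is printed, not executed.
DRY_RUN=${DRY_RUN-echo}

mirror_two_domains() {
    # $1 = start URL (placeholder)
    $DRY_RUN wget --recursive --span-hosts \
        --domains=example.com,example.net \
        "$1"
}

mirror_two_domains http://www.example.com/
```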
1 Keep download off certain domains—‘--exclude-domains’
1 
1      If there are domains you want to exclude specifically, you can do
1      it with ‘--exclude-domains’, which accepts the same type of
1      arguments as ‘-D’, but will _exclude_ all the listed domains.  For
1      example, if you want to download from all the hosts in the
1      ‘foo.edu’ domain, with the exception of ‘sunsite.foo.edu’, you can
1      do it like this:
1 
1           wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
1               http://www.foo.edu/
1
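How the two lists combine can be pictured with a small shell model.
This is a simplified sketch under stated assumptions, not Wget’s actual
code: it assumes a host matches a listed domain when it equals the
domain or ends in ‘.’ followed by the domain, that host names are
already lowercase, and that ‘-H’ is in effect.  Matching details may
differ between Wget versions.  The function names ‘host_in_domains’ and
‘would_follow’ are hypothetical.

```shell
# host_in_domains HOST LIST
# Succeed if HOST equals, or is a subdomain of, any entry in the
# comma-separated LIST (assumed matching rule, see the text above).
host_in_domains() {
    host=$1
    old_ifs=$IFS
    IFS=,
    set -- $2          # split the list on commas
    IFS=$old_ifs
    for domain in "$@"; do
        case $host in
            "$domain" | *."$domain") return 0 ;;
        esac
    done
    return 1
}

# would_follow HOST ACCEPT_LIST REJECT_LIST
# Model of the combined decision for -D and --exclude-domains.
would_follow() {
    # An empty ACCEPT_LIST stands for "no -D given": every host passes.
    if [ -n "$2" ] && ! host_in_domains "$1" "$2"; then
        return 1
    fi
    # Any match in REJECT_LIST vetoes the host.
    if [ -n "$3" ] && host_in_domains "$1" "$3"; then
        return 1
    fi
    return 0
}

for h in www.foo.edu sunsite.foo.edu www.example.com; do
    if would_follow "$h" foo.edu sunsite.foo.edu; then
        echo "$h: follow"
    else
        echo "$h: skip"
    fi
done
```

In this model, the exclusion list is consulted only after the host has
passed the ‘-D’ list, so an excluded domain always wins over an
accepted one.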