7.2 Advanced Usage
==================

   • You have a file that contains the URLs you want to download?  Use
     the ‘-i’ switch:

          wget -i FILE

     If you specify ‘-’ as the file name, the URLs will be read from
     standard input.
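
     The list need not live in a file at all: anything that writes one
     URL per line can be piped straight into Wget.  For instance, with
     a plain-text list named, say, ‘urls.txt’:

          cat urls.txt | wget -i -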

   • Create a five levels deep mirror image of the GNU web site, with
     the same directory structure the original has, with only one try
     per document, saving the log of the activities to ‘gnulog’:

          wget -r -t1 https://www.gnu.org/ -o gnulog
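
     Five levels is Wget’s default recursion depth, so ‘-r’ alone goes
     that deep.  If you want a shallower or deeper mirror, the depth
     can be set explicitly with ‘-l’, for example:

          wget -r -l2 -t1 https://www.gnu.org/ -o gnulog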

   • The same as the above, but convert the links in the downloaded
     files to point to local files, so you can view the documents
     off-line:

          wget --convert-links -r https://www.gnu.org/ -o gnulog

   • Retrieve only one HTML page, but make sure that all the elements
     needed for the page to be displayed, such as inline images and
     external style sheets, are also downloaded.  Also make sure the
     downloaded page references the downloaded links.

          wget -p --convert-links http://www.example.com/dir/page.html

     The HTML page will be saved to ‘www.example.com/dir/page.html’, and
     the images, stylesheets, etc., somewhere under ‘www.example.com/’,
     depending on where they were on the remote server.

   • The same as the above, but without the ‘www.example.com/’
     directory.  In fact, I don’t want to have all those random server
     directories anyway—just save _all_ those files under a ‘download/’
     subdirectory of the current directory.

          wget -p --convert-links -nH -nd -Pdownload \
               http://www.example.com/dir/page.html
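
     If you would rather keep the remote directory layout and only drop
     the host name part, leave out ‘-nd’; with ‘-nH’ and ‘-Pdownload’
     the page is then saved as ‘download/dir/page.html’:

          wget -p --convert-links -nH -Pdownload \
               http://www.example.com/dir/page.html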

   • Retrieve the index.html of ‘www.lycos.com’, showing the original
     server headers:

          wget -S http://www.lycos.com/

   • Save the server headers with the file, perhaps for post-processing.

          wget --save-headers http://www.lycos.com/
          more index.html

   • Retrieve the first two levels of ‘wuarchive.wustl.edu’, saving them
     to ‘/tmp’.

          wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/

   • You want to download all the GIFs from a directory on an HTTP
     server.  You tried ‘wget http://www.example.com/dir/*.gif’, but
     that didn’t work because HTTP retrieval does not support globbing.
     In that case, use:

          wget -r -l1 --no-parent -A.gif http://www.example.com/dir/

     More verbose, but the effect is the same.  ‘-r -l1’ means to
     retrieve recursively (⇒Recursive Download), with maximum
     depth of 1.  ‘--no-parent’ means that references to the parent
     directory are ignored (⇒Directory-Based Limits), and
     ‘-A.gif’ means to download only the GIF files.  ‘-A "*.gif"’ would
     have worked too.
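
     ‘-A’ also accepts a comma-separated list of suffixes or patterns,
     so if you wanted, say, GIFs and JPEGs in one pass, something like
     this would do:

          wget -r -l1 --no-parent -A gif,jpg,jpeg http://www.example.com/dir/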

   • Suppose you were in the middle of downloading when Wget was
     interrupted.  Now you do not want to clobber the files already
     present.  The command would be:

          wget -nc -r https://www.gnu.org/
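
     Note that ‘-nc’ skips a file that is already present even if it is
     incomplete.  If you would rather have Wget resume partially
     retrieved files where they left off, ‘-c’ (‘--continue’) may be
     the better choice:

          wget -c -r https://www.gnu.org/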

   • If you want to supply your own username and password for HTTP or
     FTP, embed them in the URL using the appropriate syntax (⇒URL
     Format).

          wget ftp://hniksic:mypassword@unix.example.com/.emacs

     Note, however, that this usage is not advisable on multi-user
     systems because it reveals your password to anyone who looks at the
     output of ‘ps’.
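
     On such systems it is safer to keep the password off the command
     line entirely, for instance by supplying the username with
     ‘--user’ and letting Wget prompt for the password with
     ‘--ask-password’:

          wget --user=hniksic --ask-password ftp://unix.example.com/.emacs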

   • You would like the output documents to go to standard output
     instead of to files?

          wget -O - http://jagor.srce.hr/ http://www.srce.hr/

     You can also combine the two options (‘-O -’ and ‘-i -’) and make
     pipelines to retrieve the documents from remote hotlists:

          wget -O - http://cool.list.com/ | wget --force-html -i -
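
     Wget’s own progress and status messages go to standard error, so
     they do not end up in the pipe; if you want the run to be
     completely silent, add ‘-q’ to either command, for example:

          wget -q -O - http://cool.list.com/ | wget -q --force-html -i -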