wget: Recursive Accept/Reject Options

2.12 Recursive Accept/Reject Options
====================================

‘-A ACCLIST --accept ACCLIST’
‘-R REJLIST --reject REJLIST’
     Specify comma-separated lists of file name suffixes or patterns to
     accept or reject (⇒Types of Files). Note that if any of the
     wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of
     ACCLIST or REJLIST, it will be treated as a pattern, rather than a
     suffix. In this case, you have to enclose the pattern in quotes
     to prevent your shell from expanding it, as in ‘-A "*.mp3"’ or
     ‘-A '*.mp3'’.
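
     Whether a pattern survives to reach Wget can be checked directly in
     the shell. The snippet below (a sketch using hypothetical file
     names in a scratch directory; it demonstrates shell behaviour, not
     Wget itself) shows how an unquoted glob is expanded before Wget
     ever runs:

```shell
# Sketch: why the -A pattern must be quoted (hypothetical files in a
# scratch directory; this is shell behaviour, not part of Wget).
dir=$(mktemp -d)
cd "$dir"
touch one.mp3 two.mp3

# Unquoted: the shell expands the glob before Wget ever runs, so Wget
# would see "-A one.mp3 two.mp3" instead of the pattern.
printf '%s ' -A *.mp3; echo    # -> -A one.mp3 two.mp3

# Quoted: the pattern reaches Wget intact.
printf '%s ' -A '*.mp3'; echo  # -> -A *.mp3
```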

‘--accept-regex URLREGEX’
‘--reject-regex URLREGEX’
     Specify a regular expression to accept or reject the complete URL.

‘--regex-type REGEXTYPE’
     Specify the regular expression type. Possible types are ‘posix’ or
     ‘pcre’. Note that to be able to use the ‘pcre’ type, Wget has to
     be compiled with libpcre support.
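
     A pattern intended for ‘--accept-regex’ can be rough-tested before
     a crawl with ‘grep -E’, whose POSIX extended syntax is close to
     Wget's ‘posix’ type (a sketch: the URLs below are hypothetical, and
     the two regex engines are similar but not identical):

```shell
# Hypothetical URLs; --accept-regex is matched against the complete URL.
urls='https://example.com/music/a.mp3
https://example.com/docs/readme.html
https://example.com/music/b.ogg'

# Candidate pattern for: wget -r --accept-regex '/music/.*\.(mp3|ogg)$' ...
printf '%s\n' "$urls" | grep -E '/music/.*\.(mp3|ogg)$'
# -> prints only the two /music/ URLs
```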

‘-D DOMAIN-LIST’
‘--domains=DOMAIN-LIST’
     Set domains to be followed. DOMAIN-LIST is a comma-separated list
     of domains. Note that it does _not_ turn on ‘-H’.

‘--exclude-domains DOMAIN-LIST’
     Specify the domains that are _not_ to be followed (⇒Spanning
     Hosts).
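
     A common combination is ‘-H’ together with ‘-D’, which lets the
     crawl span hosts but only within the listed domains, while
     ‘--exclude-domains’ carves exceptions back out. A hypothetical
     invocation (the host names are placeholders, not a real site):

```
wget -r -H -D server.com --exclude-domains images.server.com http://www.server.com/
```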

‘--follow-ftp’
     Follow FTP links from HTML documents. Without this option, Wget
     will ignore all the FTP links.

‘--follow-tags=LIST’
     Wget has an internal table of HTML tag / attribute pairs that it
     considers when looking for linked documents during a recursive
     retrieval. If a user wants only a subset of those tags to be
     considered, however, he or she should specify such tags in a
     comma-separated LIST with this option.

‘--ignore-tags=LIST’
     This is the opposite of the ‘--follow-tags’ option. To skip
     certain HTML tags when recursively looking for documents to
     download, specify them in a comma-separated LIST.

     In the past, this option was the best bet for downloading a single
     page and its requisites, using a command-line like:

          wget --ignore-tags=a,area -H -k -K -r http://SITE/DOCUMENT

     However, the author of this option came across a page with tags
     like ‘<LINK REL="home" HREF="/">’ and came to the realization that
     specifying tags to ignore was not enough. One can’t just tell Wget
     to ignore ‘<LINK>’, because then stylesheets will not be
     downloaded. Now the best bet for downloading a single page and its
     requisites is the dedicated ‘--page-requisites’ option.

‘--ignore-case’
     Ignore case when matching files and directories. This influences
     the behavior of the ‘-R’, ‘-A’, ‘-I’, and ‘-X’ options, as well as
     globbing implemented when downloading from FTP sites. For example,
     with this option, ‘-A "*.txt"’ will match ‘file1.txt’, but also
     ‘file2.TXT’, ‘file3.TxT’, and so on. The quotes in the example are
     to prevent the shell from expanding the pattern.
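
     The effect on ‘-A’ matching can be sketched in the shell by
     lowercasing the candidate name before applying the pattern (an
     illustration of the matching rule with hypothetical file names, not
     Wget's actual implementation):

```shell
# Sketch of --ignore-case semantics for '-A "*.txt"' (hypothetical
# names; this models the matching rule, not Wget's implementation).
matches_ci() {
    # Lowercase the candidate name, then apply the shell glob pattern.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $name in
        $2) echo accept ;;
        *)  echo reject ;;
    esac
}

matches_ci file1.txt '*.txt'    # -> accept
matches_ci file2.TXT '*.txt'    # -> accept
matches_ci file3.TxT '*.txt'    # -> accept
matches_ci notes.md  '*.txt'    # -> reject
```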

‘-H’
‘--span-hosts’
     Enable spanning across hosts when doing recursive retrieving
     (⇒Spanning Hosts).

‘-L’
‘--relative’
     Follow relative links only. Useful for retrieving a specific home
     page without any distractions, not even those from the same hosts
     (⇒Relative Links).

‘-I LIST’
‘--include-directories=LIST’
     Specify a comma-separated list of directories you wish to follow
     when downloading (⇒Directory-Based Limits). Elements of LIST may
     contain wildcards.

‘-X LIST’
‘--exclude-directories=LIST’
     Specify a comma-separated list of directories you wish to exclude
     from download (⇒Directory-Based Limits). Elements of LIST may
     contain wildcards.
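
     These wildcards follow shell-style globbing, so a candidate LIST
     element can be previewed with a ‘case’ pattern (a sketch with
     hypothetical directory names; Wget's own matcher may differ in
     corner cases):

```shell
# Sketch: how a wildcard element such as '/pub/*/docs' in -I or -X LIST
# matches directory paths (shell-style globbing, hypothetical paths).
in_list() { case $1 in $2) echo yes ;; *) echo no ;; esac; }

in_list /pub/gnu/docs '/pub/*/docs'   # -> yes
in_list /pub/docs     '/pub/*/docs'   # -> no
```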

‘-np’
‘--no-parent’
     Do not ever ascend to the parent directory when retrieving
     recursively. This is a useful option, since it guarantees that
     only the files _below_ a certain hierarchy will be downloaded.
     ⇒Directory-Based Limits, for more details.
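
     For example, to mirror one user's archive without climbing into
     the rest of the site (placeholder URL, in the style of this
     manual's examples):

```
wget -r -np http://SITE/~USER/archive/
```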