wget: Types of Files

1 
1 4.2 Types of Files
1 ==================
1 
1 When downloading material from the web, you will often want to restrict
1 the retrieval to only certain file types.  For example, if you are
1 interested in downloading GIFs, you will not be overjoyed to get loads
1 of PostScript documents, and vice versa.
1 
1    Wget offers two options to deal with this problem.  Each option
1 description lists a short name, a long name, and the equivalent command
1 in ‘.wgetrc’.
1 
1 ‘-A ACCLIST’
1 ‘--accept ACCLIST’
1 ‘accept = ACCLIST’
1 ‘--accept-regex URLREGEX’
1 ‘accept-regex = URLREGEX’
1      The argument to the ‘--accept’ option is a list of file suffixes or
1      patterns that Wget will download during recursive retrieval.  A
1      suffix is the ending part of a file name, and consists of “normal”
1      letters, e.g.  ‘gif’ or ‘.jpg’.  A matching pattern contains
1      shell-like wildcards, e.g.  ‘books*’ or ‘zelazny*196[0-9]*’.
1 
1      So, specifying ‘wget -A gif,jpg’ will make Wget download only the
1      files ending with ‘gif’ or ‘jpg’, i.e.  GIFs and JPEGs.  On the
1      other hand, ‘wget -A "zelazny*196[0-9]*"’ will download only files
1      beginning with ‘zelazny’ and containing numbers from 1960 to 1969
1      anywhere within.  Look up the manual of your shell for a
1      description of how pattern matching works.
1 
1      Of course, any number of suffixes and patterns can be combined into
1      a comma-separated list, and given as an argument to ‘-A’.
1 
1      The argument to the ‘--accept-regex’ option is a regular expression
1      which is matched against the complete URL.
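
The difference between suffixes and patterns can be made concrete with a
small shell model.  This is only a sketch of the rule as described above,
not Wget's own code: the function name ‘in_acclist’ and the sample file
names are made up for illustration, and it assumes that an entry
containing ‘*’, ‘?’ or ‘[’ is treated as a pattern matched against the
whole file name, while any other entry is treated as a suffix.

```shell
# in_acclist NAME LIST  (hypothetical helper, not part of Wget)
# Succeeds if NAME matches any entry of the comma-separated LIST.
in_acclist() {
    name=$1
    list=$2
    old_ifs=$IFS
    IFS=,
    set -f                      # keep the list entries from being globbed
    set -- $list                # split the list on commas
    set +f
    IFS=$old_ifs
    for entry in "$@"; do
        case $entry in
            *\**|*\?*|*\[*)     # entry contains a wildcard: pattern match
                case $name in
                    $entry) return 0 ;;
                esac ;;
            *)                  # plain entry: match the end of the name
                case $name in
                    *"$entry") return 0 ;;
                esac ;;
        esac
    done
    return 1
}

# Sample file names, invented for this example.
for f in photo.gif report.ps zelazny-lord-1967.txt zelazny-1975.txt; do
    if in_acclist "$f" 'gif,jpg,zelazny*196[0-9]*'; then
        echo "accept $f"
    else
        echo "skip   $f"
    fi
done
```

In this model ‘photo.gif’ and ‘zelazny-lord-1967.txt’ are accepted and
the other two names are skipped.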
1 
1 ‘-R REJLIST’
1 ‘--reject REJLIST’
1 ‘reject = REJLIST’
1 ‘--reject-regex URLREGEX’
1 ‘reject-regex = URLREGEX’
1      The ‘--reject’ option works the same way as ‘--accept’, only its
1      logic is the reverse; Wget will download all files _except_ the
1      ones matching the suffixes (or patterns) in the list.
1 
1      So, if you want to download a whole page except for the cumbersome
1      MPEGs and .AU files, you can use ‘wget -R mpg,mpeg,au’.
1      Analogously, to download all files except the ones beginning with
1      ‘bjork’, use ‘wget -R "bjork*"’.  The quotes are to prevent
1      expansion by the shell.
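
The effect of the quotes can be checked without running Wget at all.
The sketch below assumes a scratch directory that happens to contain two
files whose names begin with ‘bjork’; the file names and the helper
‘count_args’ are invented for the demonstration.

```shell
# count_args ARGS...  (hypothetical helper): print how many arguments
# the shell actually passed after expansion.
count_args() { echo "$#"; }

# Scratch directory with two files that the unquoted pattern will match.
tmp=$(mktemp -d)
touch "$tmp/bjork-live.mp3" "$tmp/bjork-studio.mp3"

# Unquoted: the shell replaces the pattern with the matching file names.
unquoted=$(cd "$tmp" && count_args -R bjork*)
# Quoted: the pattern reaches the program unchanged, as a single argument.
quoted=$(cd "$tmp" && count_args -R "bjork*")

echo "unquoted: $unquoted arguments"
echo "quoted:   $quoted arguments"

rm -rf "$tmp"
```

Without quotes the program receives the two file names in place of the
pattern, so a reject list written that way would not contain the
pattern that was intended.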
1 
1    The argument to the ‘--reject-regex’ option is a regular expression
1 which is matched against the complete URL; URLs that match are not
1 downloaded.
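
Because the regular-expression options see the complete URL rather than
only the file name, they can select on the host or directory as well.
The following sketch uses ‘grep -E’ as a stand-in for the regular
expression engine, which is an assumption: Wget's own matcher may differ
in detail.  The URLs and the expression are made up for illustration.

```shell
# Hypothetical URLs, for illustration only.
urls='https://example.com/docs/guide.pdf
https://example.com/archive/2019/guide.pdf
https://example.com/docs/index.html'

# A suffix such as "pdf" could not tell the first two URLs apart;
# an expression applied to the whole URL can.
regex='/docs/.*\.pdf$'

matched=$(printf '%s\n' "$urls" | grep -E "$regex")
echo "$matched"
```

Only the URL under ‘/docs/’ that ends in ‘.pdf’ is selected here.  The
same expression could be given to Wget as ‘--accept-regex’ to keep such
URLs, or as ‘--reject-regex’ to skip them.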
1 
1 The ‘-A’ and ‘-R’ options may be combined to achieve even better
1 fine-tuning of which files to retrieve.  E.g.  ‘wget -A "*zelazny*" -R
1 .ps’ will download all the files having ‘zelazny’ as a part of their
1 name, but _not_ the PostScript files.
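
The combined effect of ‘-A "*zelazny*" -R .ps’ can be sketched as a
single decision function.  This is a simplified model that assumes a
file is retrieved only if it matches the accept list and does not match
the reject list; ‘decide’ and the sample names are hypothetical.

```shell
# decide NAME  (hypothetical helper): print "accept" or "reject"
# for the rule set  -A "*zelazny*" -R .ps
decide() {
    case $1 in
        *.ps) echo reject; return ;;    # -R .ps: suffix on the reject list
    esac
    case $1 in
        *zelazny*) echo accept ;;       # -A "*zelazny*": accept pattern
        *) echo reject ;;               # not on the accept list
    esac
}

# Sample file names, invented for this example.
for f in zelazny-amber.txt zelazny-amber.ps notes.txt; do
    echo "$f: $(decide "$f")"
done
```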
1 
1    Note that these two options do not affect the downloading of HTML
1 files (as determined by a ‘.htm’ or ‘.html’ filename suffix).  This
1 behavior may not be desirable for all users, and may be changed for
1 future versions of Wget.
1 
1    Note, too, that query strings (strings at the end of a URL beginning
1 with a question mark, ‘?’) are not included as part of the filename for
1 accept/reject rules, even though these will actually contribute to the
1 name chosen for the local file.  It is expected that a future version of
1 Wget will provide an option to allow matching against query strings.
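
To see what this means in practice, the following sketch derives the
two names involved from one URL using ordinary shell parameter
expansion.  It is a model of the rule described above, not Wget's own
code, and the URL is hypothetical.

```shell
# Hypothetical URL whose query string ends in "pdf".
url='https://example.com/download.php?file=report.pdf'

path=${url%%\?*}          # drop the query string
url_name=${path##*/}      # file name used for accept/reject matching
local_name=${url##*/}     # last component with the query string kept

echo "matched against rules: $url_name"
echo "local file name:       $local_name"

# A suffix such as "pdf" is tested against the name without the query.
case $url_name in
    *pdf) verdict=accept ;;
    *)    verdict=reject ;;
esac
echo "-A pdf would $verdict this URL"
```

Here the name tested against the rules is ‘download.php’, so a rule
such as ‘-A pdf’ does not match even though the URL ends in ‘pdf’.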
1 
1    Finally, it’s worth noting that the accept/reject lists are matched
1 _twice_ against downloaded files: once against the URL’s filename
1 portion, to determine if the file should be downloaded in the first
1 place; then, after it has been accepted and successfully downloaded, the
1 local file’s name is also checked against the accept/reject lists to see
1 if it should be removed.  The rationale was that, since ‘.htm’ and
1 ‘.html’ files are always downloaded regardless of accept/reject rules,
1 they should be removed _after_ being downloaded and scanned for links,
1 if they did match the accept/reject lists.  However, this can lead to
1 unexpected results, since the local filenames can differ from the
1 original URL filenames in the following ways, all of which can change
1 whether an accept/reject rule matches:
1 
1    • If the local file already exists and ‘--no-directories’ was
1      specified, a numeric suffix will be appended to the original name.
1    • If ‘--adjust-extension’ was specified, the local filename might
1      have ‘.html’ appended to it.  If Wget is invoked with ‘-E -A.php’,
1      a filename such as ‘index.php’ will be accepted, but upon
1      download will be named ‘index.php.html’, which no longer matches,
1      and so the file will be deleted.
1    • Query strings do not contribute to URL matching, but are included
1      in local filenames, and so _do_ contribute to filename matching.
1 
1 This behavior, too, is considered less-than-desirable, and may change in
1 a future version of Wget.
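
The ‘-E -A.php’ case from the list above can be traced step by step.
The sketch below only models the two checks as they are described in
this section; ‘matches_php’ is a hypothetical helper, and the renaming
step assumes that ‘--adjust-extension’ appends ‘.html’ to the name.

```shell
# matches_php NAME  (hypothetical helper): succeeds if NAME ends in
# ".php", which is what the suffix in -A.php asks for.
matches_php() {
    case $1 in
        *.php) return 0 ;;
        *)     return 1 ;;
    esac
}

url_name='index.php'
local_name="$url_name.html"     # name after .html has been appended

# First check: the file name taken from the URL.
if matches_php "$url_name"; then first=accepted; else first=rejected; fi
# Second check: the name of the local file after download.
if matches_php "$local_name"; then second=kept; else second=deleted; fi

echo "before download ($url_name): $first"
echo "after download ($local_name): $second"
```

The first check accepts ‘index.php’, while the second no longer matches
‘index.php.html’, which is why the file ends up deleted.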
1