wget: Types of Files
1
1 4.2 Types of Files
1 ==================
1
1 When downloading material from the web, you will often want to restrict
1 the retrieval to only certain file types. For example, if you are
1 interested in downloading GIFs, you will not be overjoyed to get loads
1 of PostScript documents, and vice versa.
1
1 Wget offers two options to deal with this problem. Each option
1 description lists a short name, a long name, and the equivalent command
1 in ‘.wgetrc’.
1
1 ‘-A ACCLIST’
1 ‘--accept ACCLIST’
1 ‘accept = ACCLIST’
1 ‘--accept-regex URLREGEX’
1 ‘accept-regex = URLREGEX’
1 The argument to the ‘--accept’ option is a list of file suffixes or
1 patterns that Wget will download during recursive retrieval. A
1 suffix is the ending part of a file name and consists of “normal”
1 characters with no wildcards, e.g. ‘gif’ or ‘.jpg’. A matching
1 pattern contains shell-like wildcards, e.g. ‘books*’ or
1 ‘zelazny*196[0-9]*’.
1
1 So, specifying ‘wget -A gif,jpg’ will make Wget download only the
1 files ending with ‘gif’ or ‘jpg’, i.e. GIFs and JPEGs. On the
1 other hand, ‘wget -A "zelazny*196[0-9]*"’ will download only files
1 whose names begin with ‘zelazny’ and contain a number from 1960 to
1 1969 somewhere after that prefix. See your shell’s manual for a
1 description of how pattern matching works.
1
1 Of course, any number of suffixes and patterns can be combined into
1 a comma-separated list, and given as an argument to ‘-A’.
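As a rough illustration of the rule above, the following shell sketch mimics how a comma-separated list might be applied to a single file name: entries containing wildcards are treated as patterns for the whole name, all others as suffixes. The function name `matches_list` is hypothetical, the wildcard check covers only `*`, `?` and `[`, and Wget's own implementation may differ in detail.

```shell
# Hypothetical helper, not part of Wget.
# Usage: matches_list NAME LIST
# Succeeds if NAME matches any entry of the comma-separated LIST.
matches_list() {
    name=$1
    rest=$2,                      # trailing comma simplifies the loop
    while [ -n "$rest" ]; do
        entry=${rest%%,*}         # text before the first comma
        rest=${rest#*,}           # text after the first comma
        [ -n "$entry" ] || continue
        case $entry in
            *\**|*\?*|*\[*)
                # Entry has a wildcard: match it against the whole name.
                case $name in
                    $entry) return 0 ;;
                esac
                ;;
            *)
                # No wildcard: treat the entry as a suffix.
                case $name in
                    *"$entry") return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

matches_list "photo.gif" "gif,jpg" && echo "photo.gif: accepted"
matches_list "zelazny-1967.txt" "zelazny*196[0-9]*,gif" && echo "zelazny-1967.txt: accepted"
matches_list "paper.ps" "gif,jpg" || echo "paper.ps: not accepted"
```

Note that a suffix entry such as ‘gif’ matches any name ending in those letters, so an entry with the leading dot (‘.gif’) is the stricter choice.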
1
1 The argument to the ‘--accept-regex’ option is a regular expression
1 that is matched against the complete URL.
1
1 ‘-R REJLIST’
1 ‘--reject REJLIST’
1 ‘reject = REJLIST’
1 ‘--reject-regex URLREGEX’
1 ‘reject-regex = URLREGEX’
1 The ‘--reject’ option works the same way as ‘--accept’, only its
1 logic is the reverse; Wget will download all files _except_ the
1 ones matching the suffixes (or patterns) in the list.
1
1 So, if you want to download a whole page except for the cumbersome
1 MPEGs and .AU files, you can use ‘wget -R mpg,mpeg,au’.
1 Analogously, to download all files except the ones beginning with
1 ‘bjork’, use ‘wget -R "bjork*"’. The quotes are to prevent
1 expansion by the shell.
1
1 The argument to the ‘--reject-regex’ option is a regular expression
1 that is matched against the complete URL; matching URLs are not
1 downloaded.
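The ‘--accept-regex’ and ‘--reject-regex’ options test the complete URL rather than a file name. The sketch below approximates that test with ‘grep -E’, on the assumption that the regular expression is a POSIX extended one and may match anywhere in the URL. The helper name `url_matches` and the ‘example.com’ URL are hypothetical.

```shell
# Hypothetical helper, not part of Wget.
# Usage: url_matches URL REGEX
# Succeeds if REGEX (POSIX extended syntax) matches somewhere in URL.
url_matches() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

url="https://example.com/pub/images/photo.gif"   # hypothetical URL

# Because the whole URL is tested, the expression can refer to
# directory components as well as to the file name.
if url_matches "$url" '/images/.*\.(gif|jpg)$'; then
    echo "would pass an accept regex"
fi
```

When passing such an expression to Wget on the command line, it should be quoted in the same way, so that the shell does not interpret characters such as ‘*’, ‘|’ or ‘$’.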
1
1 The ‘-A’ and ‘-R’ options may be combined for finer control over which
1 files to retrieve. For example, ‘wget -A "*zelazny*" -R .ps’ will
1 download all the files having ‘zelazny’ as a part of their name, but
1 _not_ the PostScript files.
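The combined effect of ‘-A "*zelazny*" -R .ps’ can be pictured as two checks applied in turn: the name must match the accept list and must not match the reject list. The following is only a sketch of that logic, with the two rules hard-coded; `would_fetch` is a hypothetical name.

```shell
# Hypothetical helper illustrating -A "*zelazny*" -R .ps for one file name.
would_fetch() {
    # Accept rule: the name must contain "zelazny".
    case $1 in
        *zelazny*) ;;
        *) return 1 ;;
    esac
    # Reject rule: the name must not end in ".ps".
    case $1 in
        *.ps) return 1 ;;
    esac
    return 0
}

for name in zelazny-amber.txt zelazny-amber.ps leguin-earthsea.txt; do
    if would_fetch "$name"; then
        echo "$name: download"
    else
        echo "$name: skip"
    fi
done
```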
1
1 Note that these two options do not affect the downloading of HTML
1 files (as determined by a ‘.htm’ or ‘.html’ filename suffix). This
1 behavior may not be desirable for all users, and may be changed for
1 future versions of Wget.
1
1 Note, too, that query strings (strings at the end of a URL beginning
1 with a question mark, ‘?’) are not included as part of the filename for
1 accept/reject rules, even though these will actually contribute to the
1 name chosen for the local file. It is expected that a future version of
1 Wget will provide an option to allow matching against query strings.
1
1 Finally, it’s worth noting that the accept/reject lists are matched
1 _twice_ against downloaded files: once against the URL’s filename
1 portion, to determine if the file should be downloaded in the first
1 place; then, after it has been accepted and successfully downloaded, the
1 local file’s name is also checked against the accept/reject lists to see
1 if it should be removed. The rationale was that, since ‘.htm’ and
1 ‘.html’ files are always downloaded regardless of accept/reject rules,
1 they should be removed _after_ being downloaded and scanned for links,
1 if they did match the accept/reject lists. However, this can lead to
1 unexpected results, since the local filenames can differ from the
1 original URL filenames in the following ways, all of which can change
1 whether an accept/reject rule matches:
1
1 • If the local file already exists and ‘--no-directories’ was
1 specified, a numeric suffix will be appended to the original name.
1 • If ‘--adjust-extension’ was specified, the local filename might
1 have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’,
1 a filename such as ‘index.php’ will match and be accepted, but upon
1 download will be named ‘index.php.html’, which no longer matches,
1 and so the file will be deleted.
1 • Query strings do not contribute to URL matching, but are included
1 in local filenames, and so _do_ contribute to filename matching.
1
1 This behavior, too, is considered less-than-desirable, and may change in
1 a future version of Wget.
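The ‘-E -A.php’ case described above can be traced by hand. The sketch below applies the same suffix test twice, first to the URL's filename and then to the local name, assuming that ‘--adjust-extension’ simply appends ‘.html’. The helper `ends_with` and the variable names are hypothetical.

```shell
# Hypothetical helper: succeeds when NAME ($1) ends with SUFFIX ($2),
# which is how a wildcard-free accept entry is described as working.
ends_with() {
    case $1 in
        *"$2") return 0 ;;
        *)     return 1 ;;
    esac
}

suffix=".php"                   # from -A.php
url_name="index.php"            # filename portion of the URL
local_name="$url_name.html"     # assumed name on disk after -E

# First check: should the file be downloaded at all?
if ends_with "$url_name" "$suffix"; then first_pass=download; else first_pass=skip; fi

# Second check: should the downloaded file be kept?
if ends_with "$local_name" "$suffix"; then second_pass=keep; else second_pass=delete; fi

echo "$url_name -> $first_pass, $local_name -> $second_pass"
# prints: index.php -> download, index.php.html -> delete
```

The same reasoning applies to a local name that carries a query string: a name such as ‘view.php?id=7’ no longer ends in ‘.php’, so it would fail the second check.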
1