wget: HTTP Options

1 
1 2.7 HTTP Options
1 ================
1 
1 ‘--default-page=NAME’
1      Use NAME as the default file name when it isn’t known (i.e., for
1      URLs that end in a slash), instead of ‘index.html’.
1 
1 ‘-E’
1 ‘--adjust-extension’
1      If a file of type ‘application/xhtml+xml’ or ‘text/html’ is
1      downloaded and the URL does not end in a match for the regexp
1      ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to
1      be appended to the local filename.  This is useful, for instance,
1      when you’re mirroring a remote site that uses ‘.asp’ pages, but you
1      want the mirrored pages to be viewable on your stock Apache server.
1      Another good use for this is when you’re downloading CGI-generated
1      materials.  A URL like ‘http://site.com/article.cgi?25’ will be
1      saved as ‘article.cgi?25.html’.
1 
1      Note that filenames changed in this way will be re-downloaded every
1      time you re-mirror a site, because Wget can’t tell that the local
1      ‘X.html’ file corresponds to remote URL ‘X’ (since it doesn’t yet
1      know that the URL produces output of type ‘text/html’ or
1      ‘application/xhtml+xml’).
1 
1      As of version 1.12, Wget will also ensure that any downloaded files
1      of type ‘text/css’ end in the suffix ‘.css’, and the option was
1      renamed from ‘--html-extension’, to better reflect its new
1      behavior.  The old option name is still acceptable, but should now
1      be considered deprecated.
1 
1      As of version 1.19.2, Wget will also ensure that any downloaded
1      files with a ‘Content-Encoding’ of ‘br’, ‘compress’, ‘deflate’ or
1      ‘gzip’ end in the suffix ‘.br’, ‘.Z’, ‘.zlib’ and ‘.gz’
1      respectively.
1 
1      At some point in the future, this option may well be expanded to
1      include suffixes for other types of content, including content
1      types that are not parsed by Wget.
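
     The HTML part of the renaming rule described above can be sketched
     as a small shell function.  This is only an illustration of the
     rule as documented here, not Wget's own code; the function name
     ‘adjusted_name’ and its two arguments are assumptions made for the
     example, and the CSS and ‘Content-Encoding’ cases are left out.

```shell
# Sketch of the HTML suffix rule only; this is not Wget's own code.
# adjusted_name NAME CONTENT-TYPE  (hypothetical helper)
adjusted_name() {
  name=$1
  ctype=$2
  case $ctype in
    text/html|application/xhtml+xml)
      case $name in
        # Already ends in .htm or .html (any letter case): keep as-is.
        *.[Hh][Tt][Mm]|*.[Hh][Tt][Mm][Ll]) printf '%s\n' "$name" ;;
        # Otherwise append the suffix.
        *) printf '%s\n' "$name.html" ;;
      esac
      ;;
    # Other content types are left alone in this sketch.
    *) printf '%s\n' "$name" ;;
  esac
}

adjusted_name 'article.cgi?25' text/html
adjusted_name 'index.HTM' text/html
adjusted_name 'logo.png' image/png
```

     The first call reproduces the ‘article.cgi?25’ case mentioned
     above; the other two show names that are left unchanged.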
1 
1 ‘--http-user=USER’
1 ‘--http-password=PASSWORD’
1      Specify the username USER and password PASSWORD on an HTTP server.
1      According to the type of the challenge, Wget will encode them using
1      either the ‘basic’ (insecure), the ‘digest’, or the Windows ‘NTLM’
1      authentication scheme.
1 
1      Another way to specify username and password is in the URL itself
1      (⇒URL Format).  Either method reveals your password to
1      anyone who bothers to run ‘ps’.  To prevent the passwords from
1      being seen, use ‘--use-askpass’ or store them in ‘.wgetrc’ or
1      ‘.netrc’, and make sure to protect those files from other users
1      with ‘chmod’.  If the passwords are really important, do not leave
1      them lying in those files either—edit the files and delete them
1      after Wget has started the download.
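
     As a rough illustration of the ‘chmod’ advice above, the following
     sketch writes placeholder credentials to a startup file that only
     its owner can read.  It works in a temporary directory so that a
     real ‘~/.wgetrc’ is not touched; the directory, the file name and
     the credentials are all made up for the example.

```shell
# Work in a throwaway directory so the real ~/.wgetrc is not touched.
demo_dir=$(mktemp -d)
rc="$demo_dir/wgetrc"

# Create the file with owner-only permissions before any secret is written.
( umask 077 && : > "$rc" )
chmod 600 "$rc"

# Placeholder credentials, written with the http_user and http_password
# startup-file commands.
cat > "$rc" <<'EOF'
http_user = alice
http_password = example-secret
EOF

# Record the mode string so a script can check it later.
mode=$(ls -l "$rc" | cut -c1-10)
echo "$mode"
```

     A file prepared this way could then be handed to Wget, for example
     through the ‘WGETRC’ environment variable, instead of putting the
     password on the command line.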
1 
1 ‘--no-http-keep-alive’
1      Turn off the “keep-alive” feature for HTTP downloads.  Normally,
1      Wget asks the server to keep the connection open so that, when you
1      download more than one document from the same server, they get
1      transferred over the same TCP connection.  This saves time and at
1      the same time reduces the load on the server.
1 
1      This option is useful when, for some reason, persistent
1      (keep-alive) connections don’t work for you, for example due to a
1      server bug or due to the inability of server-side scripts to cope
1      with the connections.
1 
1 ‘--no-cache’
1      Disable server-side cache.  In this case, Wget will send the remote
1      server an appropriate directive (‘Pragma: no-cache’) to get the
1      file from the remote service, rather than returning the cached
1      version.  This is especially useful for retrieving and flushing
1      out-of-date documents on proxy servers.
1 
1      Caching is allowed by default.
1 
1 ‘--no-cookies’
1      Disable the use of cookies.  Cookies are a mechanism for
1      maintaining server-side state.  The server sends the client a
1      cookie using the ‘Set-Cookie’ header, and the client responds with
1      the same cookie upon further requests.  Since cookies allow the
1      server owners to keep track of visitors and for sites to exchange
1      this information, some consider them a breach of privacy.  The
1      default is to use cookies; however, _storing_ cookies is not on by
1      default.
1 
1 ‘--load-cookies FILE’
1      Load cookies from FILE before the first HTTP retrieval.  FILE is a
1      textual file in the format originally used by Netscape’s
1      ‘cookies.txt’ file.
1 
1      You will typically use this option when mirroring sites that
1      require that you be logged in to access some or all of their
1      content.  The login process typically works by the web server
1      issuing an HTTP cookie upon receiving and verifying your
1      credentials.  The cookie is then resent by the browser when
1      accessing that part of the site, and so proves your identity.
1 
1      Mirroring such a site requires Wget to send the same cookies your
1      browser sends when communicating with the site.  This is achieved
1      by ‘--load-cookies’—simply point Wget to the location of the
1      ‘cookies.txt’ file, and it will send the same cookies your browser
1      would send in the same situation.  Different browsers keep textual
1      cookie files in different locations:
1 
1      Netscape 4.x.
1           The cookies are in ‘~/.netscape/cookies.txt’.
1 
1      Mozilla and Netscape 6.x.
1           Mozilla’s cookie file is also named ‘cookies.txt’, located
1           somewhere under ‘~/.mozilla’, in the directory of your
1           profile.  The full path usually ends up looking somewhat like
1           ‘~/.mozilla/default/SOME-WEIRD-STRING/cookies.txt’.
1 
1      Internet Explorer.
1           You can produce a cookie file Wget can use by using the File
1           menu, Import and Export, Export Cookies.  This has been tested
1           with Internet Explorer 5; it is not guaranteed to work with
1           earlier versions.
1 
1      Other browsers.
1           If you are using a different browser to create your cookies,
1           ‘--load-cookies’ will only work if you can locate or produce a
1           cookie file in the Netscape format that Wget expects.
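
     When no browser export is available, a cookie file can also be
     written by hand.  The sketch below assumes the commonly described
     Netscape layout of seven tab-separated fields (domain, subdomain
     flag, path, secure flag, expiry time in seconds since the epoch,
     name, value) and that lines starting with ‘#’ are treated as
     comments.  The domain, cookie name and value are placeholders.

```shell
cookie_dir=$(mktemp -d)
cookie_file="$cookie_dir/cookies.txt"

# Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
{
  printf '# HTTP Cookie File\n'
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    'example.com' 'FALSE' '/' 'FALSE' '2147483647' 'NAME' 'VALUE'
} > "$cookie_file"

# Sanity check: count the fields on every non-comment line.
fields=$(awk -F '\t' '!/^#/ { print NF }' "$cookie_file")
echo "$fields"
```

     The resulting file would be passed to Wget with
     ‘--load-cookies "$cookie_file"’.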
1 
1      If you cannot use ‘--load-cookies’, there might still be an
1      alternative.  If your browser supports a “cookie manager”, you can
1      use it to view the cookies used when accessing the site you’re
1      mirroring.  Write down the name and value of the cookie, and
1      manually instruct Wget to send those cookies, bypassing the
1      “official” cookie support:
1 
1           wget --no-cookies --header "Cookie: NAME=VALUE"
1 
1 ‘--save-cookies FILE’
1      Save cookies to FILE before exiting.  This will not save cookies
1      that have expired or that have no expiry time (so-called “session
1      cookies”), but also see ‘--keep-session-cookies’.
1 
1 ‘--keep-session-cookies’
1      When specified, causes ‘--save-cookies’ to also save session
1      cookies.  Session cookies are normally not saved because they are
1      meant to be kept in memory and forgotten when you exit the browser.
1      Saving them is useful on sites that require you to log in or to
1      visit the home page before you can access some pages.  With this
1      option, multiple Wget runs are considered a single browser session
1      as far as the site is concerned.
1 
1      Since the cookie file format does not normally carry session
1      cookies, Wget marks them with an expiry timestamp of 0.  Wget’s
1      ‘--load-cookies’ recognizes those as session cookies, but it might
1      confuse other browsers.  Also note that cookies so loaded will be
1      treated as other session cookies, which means that if you want
1      ‘--save-cookies’ to preserve them again, you must use
1      ‘--keep-session-cookies’ again.
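
     To see which entries in a saved cookie file are session cookies,
     one can look for the expiry timestamp of 0 described above.  The
     following sketch assumes the seven-field, tab-separated cookie file
     layout with the expiry time in the fifth field and the cookie name
     in the sixth; the sample cookies are invented for the example.

```shell
cookie_dir=$(mktemp -d)
cookie_file="$cookie_dir/cookies.txt"

# Two sample cookies: a session cookie (expiry 0) and a persistent one.
# printf reuses the format for each group of seven arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  'example.com' 'FALSE' '/' 'FALSE' '0'          'SESSIONID' 'abc123' \
  'example.com' 'FALSE' '/' 'FALSE' '2147483647' 'theme'     'dark' \
  > "$cookie_file"

# Print the names of cookies whose expiry field is 0.
session_names=$(awk -F '\t' '!/^#/ && NF == 7 && $5 == 0 { print $6 }' "$cookie_file")
echo "$session_names"
```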
1 
1 ‘--ignore-length’
1      Unfortunately, some HTTP servers (CGI programs, to be more precise)
1      send out bogus ‘Content-Length’ headers, which makes Wget go wild,
1      as it thinks not all the document was retrieved.  You can spot this
1      syndrome if Wget retries getting the same document again and again,
1      each time claiming that the (otherwise normal) connection has
1      closed on the very same byte.
1 
1      With this option, Wget will ignore the ‘Content-Length’ header—as
1      if it never existed.
1 
1 ‘--header=HEADER-LINE’
1      Send HEADER-LINE along with the rest of the headers in each HTTP
1      request.  The supplied header is sent as-is, which means it must
1      contain a name and a value separated by a colon, and must not
1      contain newlines.
1 
1      You may define more than one additional header by specifying
1      ‘--header’ more than once.
1 
1           wget --header='Accept-Charset: iso-8859-2' \
1                --header='Accept-Language: hr'        \
1                  http://fly.srk.fer.hr/
1 
1      Specification of an empty string as the header value will clear all
1      previous user-defined headers.
1 
1      As of Wget 1.10, this option can be used to override headers
1      otherwise generated automatically.  This example instructs Wget to
1      connect to localhost, but to specify ‘foo.bar’ in the ‘Host’
1      header:
1 
1           wget --header="Host: foo.bar" http://localhost/
1 
1      In versions of Wget prior to 1.10 such use of ‘--header’ caused
1      sending of duplicate headers.
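
     Because the header line is sent as-is, a script that builds header
     values from variables may want to check them first.  The following
     is a minimal sketch of such a check, based only on the two
     requirements stated above (a colon, no newlines); the function name
     ‘valid_header’ is hypothetical, and Wget itself may apply different
     or stricter checks.

```shell
# valid_header LINE: succeed if LINE looks usable as a --header argument.
valid_header() {
  nl='
'
  case $1 in
    '') return 0 ;;          # empty string: clears earlier user headers
    *"$nl"*) return 1 ;;     # embedded newline: rejected
    *:*) return 0 ;;         # has a colon between name and value
    *) return 1 ;;           # no colon at all
  esac
}

for h in 'Accept-Language: hr' 'no colon here'; do
  if valid_header "$h"; then
    echo "ok:  $h"
  else
    echo "bad: $h"
  fi
done
```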
1 
1 ‘--compression=TYPE’
1      Choose the type of compression to be used.  Legal values are
1      ‘auto’, ‘gzip’ and ‘none’.
1 
1      If ‘auto’ or ‘gzip’ is specified, Wget asks the server to compress
1      the file using the gzip compression format.  If the server
1      compresses the file and responds with the ‘Content-Encoding’ header
1      field set appropriately, the file will be decompressed
1      automatically.
1 
1      If ‘none’ is specified, Wget will not ask the server to compress
1      the file and will not decompress any server responses.  This is the
1      default.
1 
1      Compression support is currently experimental.  In case it is
1      turned on, please report any bugs to ‘bug-wget@gnu.org’.
1 
1 ‘--max-redirect=NUMBER’
1      Specifies the maximum number of redirections to follow for a
1      resource.  The default is 20, which is usually far more than
1      necessary.  However, on those occasions where you want to allow
1      more (or fewer), this is the option to use.
1 
1 ‘--proxy-user=USER’
1 ‘--proxy-password=PASSWORD’
1      Specify the username USER and password PASSWORD for authentication
1      on a proxy server.  Wget will encode them using the ‘basic’
1      authentication scheme.
1 
1      Security considerations similar to those with ‘--http-password’
1      pertain here as well.
1 
1 ‘--referer=URL’
1      Include a ‘Referer: URL’ header in the HTTP request.  Useful for
1      retrieving documents with server-side processing that assume they
1      are always being retrieved by interactive web browsers and only
1      come out properly when Referer is set to one of the pages that
1      point to them.
1 
1 ‘--save-headers’
1      Save the headers sent by the HTTP server to the file, preceding the
1      actual contents, with an empty line as the separator.
1 
1 ‘-U AGENT-STRING’
1 ‘--user-agent=AGENT-STRING’
1      Identify as AGENT-STRING to the HTTP server.
1 
1      The HTTP protocol allows the clients to identify themselves using a
1      ‘User-Agent’ header field.  This enables distinguishing the WWW
1      software, usually for statistical purposes or for tracing of
1      protocol violations.  Wget normally identifies as ‘Wget/VERSION’,
1      VERSION being the current version number of Wget.
1 
1      However, some sites have been known to impose the policy of
1      tailoring the output according to the ‘User-Agent’-supplied
1      information.  While this is not such a bad idea in theory, it has
1      been abused by servers denying information to clients other than
1      (historically) Netscape or, more frequently, Microsoft Internet
1      Explorer.  This option allows you to change the ‘User-Agent’ line
1      issued by Wget.  Use of this option is discouraged, unless you
1      really know what you are doing.
1 
1      Specifying an empty user agent with ‘--user-agent=""’ instructs Wget
1      not to send the ‘User-Agent’ header in HTTP requests.
1 
1 ‘--post-data=STRING’
1 ‘--post-file=FILE’
1      Use POST as the method for all HTTP requests and send the specified
1      data in the request body.  ‘--post-data’ sends STRING as data,
1      whereas ‘--post-file’ sends the contents of FILE.  Other than that,
1      they work in exactly the same way: _both_ expect content of the
1      form ‘key1=value1&key2=value2’, with percent-encoding for special
1      characters.  Note that ‘--post-file’ is _not_ for transmitting
1      files as form attachments: those must appear as ‘key=value’ data
1      (with appropriate percent-encoding) just like everything else.
1      Wget does not currently support ‘multipart/form-data’ for
1      transmitting POST data, only ‘application/x-www-form-urlencoded’.
1      Only one of ‘--post-data’ and ‘--post-file’ should be specified.
1 
1      Please note that Wget does not require the content to be of the
1      form ‘key1=value1&key2=value2’, and neither does it test for it.
1      Wget will simply transmit whatever data is provided to it.  Most
1      servers, however, expect the POST data to be in the above format
1      when processing HTML forms.
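
     Since Wget transmits the data unchanged, any percent-encoding has
     to be done before the string is passed to ‘--post-data’.  The
     sketch below shows one way to do that in portable shell.  The
     helper name ‘urlencode’ is hypothetical, it is meant for ASCII
     input only, and it encodes a space as ‘%20’ (form submissions
     often use ‘+’ instead; servers generally accept both).

```shell
# urlencode STRING: percent-encode everything except the RFC 3986
# unreserved characters.  Hypothetical helper, meant for ASCII input.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Build a key=value body from placeholder values.
body="user=$(urlencode 'foo bar')&password=$(urlencode 'p&ss=1')"
printf '%s\n' "$body"
```

     The resulting string would then be used as
     ‘wget --post-data "$body" URL’.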
1 
1      When sending a POST request using the ‘--post-file’ option, Wget
1      treats the file as a binary file and will send every character in
1      the POST request without stripping trailing newline or formfeed
1      characters.  Any other control characters in the text will also be
1      sent as-is in the POST request.
1 
1      Please be aware that Wget needs to know the size of the POST data
1      in advance.  Therefore the argument to ‘--post-file’ must be a
1      regular file; specifying a FIFO or something like ‘/dev/stdin’
1      won’t work.  It’s not quite clear how to work around this
1      limitation inherent in HTTP/1.0.  Although HTTP/1.1 introduces
1      “chunked” transfer that doesn’t require knowing the request length
1      in advance, a client can’t use chunked unless it knows it’s talking
1      to an HTTP/1.1 server.  And it can’t know that until it receives a
1      response, which in turn requires the request to have been completed
1      – a chicken-and-egg problem.
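
     When the POST data comes from a pipeline, one practical approach
     is to copy it into a regular temporary file first and pass that
     file to ‘--post-file’.  The following is a sketch of that idea;
     the helper name ‘spool_body’ is hypothetical.  It also uses
     ‘printf’ rather than ‘echo’ to produce the data, since a trailing
     newline would be sent as part of the request body.

```shell
# spool_body: copy standard input into a regular temporary file and
# print the file's path (hypothetical helper).
spool_body() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"
  printf '%s\n' "$tmp"
}

# printf '%s' adds no trailing newline, so only these bytes are stored.
body_file=$(printf '%s' 'user=foo&password=bar' | spool_body)

# Confirm that the result is a regular file and report its size.
if [ -f "$body_file" ]; then
  echo "regular file, $(wc -c < "$body_file" | tr -d ' ') bytes"
fi
```

     The file can then be used as ‘wget --post-file="$body_file" URL’
     and removed once the download has finished.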
1 
1      Note: As of version 1.15, if Wget is redirected after the POST
1      request is completed, its behaviour will depend on the response
1      code returned by the server.  In case of a 301 Moved Permanently,
1      302 Moved Temporarily or 307 Temporary Redirect, Wget will, in
1      accordance with RFC 2616, continue to send a POST request.  In case
1      a server wants the client to change the Request method upon
1      redirection, it should send a 303 See Other response code.
1 
1      This example shows how to log in to a server using POST and then
1      proceed to download the desired pages, presumably only accessible
1      to authorized users:
1 
1           # Log in to the server.  This can be done only once.
1           wget --save-cookies cookies.txt \
1                --post-data 'user=foo&password=bar' \
1                http://example.com/auth.php
1 
1           # Now grab the page or pages we care about.
1           wget --load-cookies cookies.txt \
1                -p http://example.com/interesting/article.php
1 
1      If the server is using session cookies to track user
1      authentication, the above will not work because ‘--save-cookies’
1      will not save them (and neither will browsers) and the
1      ‘cookies.txt’ file will be empty.  In that case use
1      ‘--keep-session-cookies’ along with ‘--save-cookies’ to force
1      saving of session cookies.
1 
1 ‘--method=HTTP-METHOD’
1      For the purpose of RESTful scripting, Wget allows sending of other
1      HTTP Methods without the need to explicitly set them using
1      ‘--header=Header-Line’.  Wget will use whatever string is passed to
1      it after ‘--method’ as the HTTP Method to the server.
1 
1 ‘--body-data=DATA-STRING’
1 ‘--body-file=DATA-FILE’
1      Must be set when additional data needs to be sent to the server
1      along with the Method specified using ‘--method’.  ‘--body-data’
1      sends DATA-STRING as data, whereas ‘--body-file’ sends the contents
1      of DATA-FILE.  Other than that, they work in exactly the same way.
1 
1      Currently, ‘--body-file’ is _not_ for transmitting files as a
1      whole.  Wget does not currently support ‘multipart/form-data’ for
1      transmitting data; only ‘application/x-www-form-urlencoded’.  In
1      the future, this may be changed so that Wget sends the
1      ‘--body-file’ as a complete file instead of sending its contents to
1      the server.  Please be aware that Wget needs to know the contents
1      of the body data in advance, and hence the argument to ‘--body-file’
1      should be a regular file.  See ‘--post-file’ for a more detailed
1      explanation.  Only one of ‘--body-data’ and ‘--body-file’ should be
1      specified.
1 
1      If Wget is redirected after the request is completed, Wget will
1      suspend the current method and send a GET request until the
1      redirection is completed.  This is true for all redirection
1      response codes except 307 Temporary Redirect which is used to
1      explicitly specify that the request method should _not_ change.
1      Another exception is when the method is set to ‘POST’, in which
1      case the redirection rules specified under ‘--post-data’ are
1      followed.
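
     The redirection rules described above and under ‘--post-data’ can
     be summarised as a small lookup.  This sketch restates the text
     only and is not taken from Wget's source; the function name
     ‘method_after_redirect’ is hypothetical, and redirect codes that
     the text does not mention for POST are reported as ‘unspecified’.

```shell
# method_after_redirect METHOD CODE: print the method the text says is
# used for the follow-up request after a redirect with status CODE.
method_after_redirect() {
  method=$1
  code=$2
  if [ "$method" = POST ]; then
    # Rules given under --post-data.
    case $code in
      301|302|307) echo POST ;;
      303) echo GET ;;
      *) echo unspecified ;;
    esac
  else
    # Rules given for --method with any other method.
    case $code in
      307) echo "$method" ;;
      *) echo GET ;;
    esac
  fi
}

method_after_redirect PUT 302
method_after_redirect PUT 307
method_after_redirect POST 303
```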
1 
1 ‘--content-disposition’
1 
1      If this is set to on, experimental (not fully-functional) support
1      for ‘Content-Disposition’ headers is enabled.  This can currently
1      result in extra round-trips to the server for a ‘HEAD’ request, and
1      is known to suffer from a few bugs, which is why it is not
1      currently enabled by default.
1 
1      This option is useful for some file-downloading CGI programs that
1      use ‘Content-Disposition’ headers to describe what the name of a
1      downloaded file should be.
1 
1      When combined with ‘--metalink-over-http’ and
1      ‘--trust-server-names’, a ‘Content-Type: application/metalink4+xml’
1      file is named using the ‘Content-Disposition’ filename field, if
1      available.
1 
1 ‘--content-on-error’
1 
1      If this is set to on, Wget will not skip the content when the
1      server responds with an HTTP status code that indicates an error.
1 
1 ‘--trust-server-names’
1 
1      If this is set, on a redirect, the local file name will be based on
1      the redirection URL. By default the local file name is based on the
1      original URL. When doing recursive retrieving this can be helpful
1      because in many web sites redirected URLs correspond to an
1      underlying file structure, while link URLs do not.
1 
1 ‘--auth-no-challenge’
1 
1      If this option is given, Wget will send Basic HTTP authentication
1      information (plaintext username and password) for all requests,
1      just like Wget 1.10.2 and prior did by default.
1 
1      Use of this option is not recommended, and is intended only to
1      support a few obscure servers that never send HTTP authentication
1      challenges, but accept unsolicited auth info, say, in addition to
1      form-based authentication.
1 
1 ‘--retry-on-http-error=CODE[,CODE,...]’
1      Consider the given HTTP response codes as non-fatal, transient errors.
1      Supply a comma-separated list of 3-digit HTTP response codes as
1      argument.  Useful to work around special circumstances where
1      retries are required, but the server responds with an error code
1      normally not retried by Wget.  Such errors might be 503 (Service
1      Unavailable) and 429 (Too Many Requests).  Retries enabled by this
1      option are performed subject to the normal retry timing and retry
1      count limitations of Wget.
1 
1      Using this option is intended to support special use cases only and
1      is generally not recommended, as it can force retries even in cases
1      where the server is actually trying to decrease its load.  Please
1      use wisely and only if you know what you are doing.
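
     A wrapper script that accepts such a list from its own users might
     check the format before handing it to Wget.  The following sketch
     tests only for a comma-separated list of 3-digit numbers, as
     described above; the function name ‘valid_code_list’ is
     hypothetical, and Wget may validate its argument differently.

```shell
# valid_code_list LIST: succeed if LIST is a comma-separated list of
# 3-digit numbers such as 503,429.
valid_code_list() {
  case $1 in
    # Reject empty input, stray characters, and misplaced commas.
    ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=,
  set -- $1            # split the list on commas
  IFS=$old_ifs
  for code in "$@"; do
    [ "${#code}" -eq 3 ] || return 1
  done
  return 0
}

if valid_code_list '503,429'; then
  echo 'list accepted'
fi
```

     An accepted list would then be passed on as
     ‘wget --retry-on-http-error=503,429 URL’.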
1