gawkinet: GETURL

3.2 GETURL: Retrieving Web Pages
================================

GETURL is a versatile building block for shell scripts that need to
retrieve files from the Internet.  It takes a web address as a
command-line parameter and tries to retrieve the contents of this
address.  The contents are printed to standard output, while the header
is printed to '/dev/stderr'.  A surrounding shell script could analyze
the contents and extract the text or the links.  An ASCII browser could
be written around GETURL.  More interestingly, web robots are
straightforward to write on top of GETURL.  On the Internet, you can
find several programs of the same name that do the same job.  They are
usually much more complex internally and at least 10 times longer.
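
As an illustration of such a surrounding script, here is a minimal
sketch of a link extractor.  It is not part of GETURL; the function
name 'extract_links' and the file name 'geturl.awk' are assumptions made
for this example, and only lowercase, double-quoted 'href' attributes
are recognized.

```shell
# Print the value of every href="..." attribute found on standard input.
# Hypothetical helper, not part of GETURL.
extract_links() {
  awk '{
    line = $0
    while (match(line, /href="[^"]*"/)) {
      # Skip the 6 characters of href=" and drop the closing quote.
      print substr(line, RSTART + 6, RLENGTH - 7)
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

# Possible use, assuming the GETURL program of this section is saved
# as geturl.awk (hypothetical file name); the header is discarded:
#   gawk -f geturl.awk http://www.gnu.org/ 2>/dev/null | extract_links
```

Since GETURL writes the header to standard error, only the page
content travels down the pipe.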

   First, GETURL checks whether it was called with exactly one web
address.  Then, it checks whether the user chose a special proxy server,
whose name is handed over in the variable 'Proxy'.  By default, the
local machine ('127.0.0.1', port 80) is assumed to serve as proxy.
GETURL uses the 'GET' method by default to access the web page.  Handing
over the name of a different method (such as 'HEAD') in the variable
'Method' selects a different behavior.  With the 'HEAD' method, the user
does not receive the body of the page, but does receive the header:

     BEGIN {
       if (ARGC != 2) {
         print "GETURL - retrieve Web page via HTTP 1.0"
         print "IN:\n    the URL as a command-line parameter"
         print "PARAM(S):\n    -v Proxy=MyProxy"
         print "OUT:\n    the page content on stdout"
         print "    the page header on stderr"
         print "JK 16.05.1997"
         print "ADR 13.08.2000"
         exit
       }
       # Clear ARGV[1] so gawk never treats the URL as an input file
       URL = ARGV[1]; ARGV[1] = ""
       if (Proxy     == "")  Proxy     = "127.0.0.1"
       if (ProxyPort ==  0)  ProxyPort = 80
       if (Method    == "")  Method    = "GET"
       HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
       # An empty line (CR LF CR LF) ends the request and the header
       ORS = RS = "\r\n\r\n"
       print Method " " URL " HTTP/1.0" |& HttpService
       HttpService                      |& getline Header
       print Header > "/dev/stderr"
       # Copy the body; RS is stripped from each record on input
       while ((HttpService |& getline) > 0)
         printf "%s", $0
       close(HttpService)
     }
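
The variables 'Proxy' and 'Method' used in the listing can be set with
gawk's '-v' option.  The following is a minimal sketch of two wrapper
functions, assuming the listing is saved as 'geturl.awk'; that file
name, the function names, and the host 'proxy.example.com' are
hypothetical.

```shell
# Hypothetical file name: the GETURL listing saved as geturl.awk.
GETURL=${GETURL:-geturl.awk}

# Fetch a page through a proxy: body on stdout, header on stderr.
# $1 = proxy host, $2 = URL
geturl_via() {
  gawk -v Proxy="$1" -f "$GETURL" "$2"
}

# Ask only for the header by switching the request method to HEAD.
# $1 = proxy host, $2 = URL
geturl_head() {
  gawk -v Proxy="$1" -v Method=HEAD -f "$GETURL" "$2"
}

# Example calls (these need network access and a reachable proxy):
#   geturl_via  proxy.example.com http://www.gnu.org/ > page.html 2> header.txt
#   geturl_head proxy.example.com http://www.gnu.org/
```

Because content and header arrive on different streams, each can be
redirected to its own file, as in the first example call.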

   This program can be changed as needed, but be careful with the last
lines.  Make sure transmission of binary data is not corrupted by
additional line breaks.  Even as it is now, the byte sequence
'"\r\n\r\n"' would disappear if it were contained in binary data,
because it is consumed as the record separator 'RS' and 'printf' does
not write it back.  Don't get caught in a trap when trying a quick fix
on this one.
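
The loss of the separator bytes can be observed without any network
connection.  The following sketch pipes a made-up 8-byte payload
through the same record handling as the last lines of GETURL and counts
the bytes before and after.  The function names are hypothetical, and
the multi-character 'RS' assumes gawk or another awk that supports it.

```shell
# Prefer gawk; fall back to another awk (multi-character RS is an extension).
AWK=$(command -v gawk || command -v awk)

# An 8-byte payload that happens to contain CR LF CR LF in the middle.
payload() { printf 'AB\r\n\r\nCD'; }

# Read and write records the way the last lines of GETURL do.
copy_like_geturl() {
  "$AWK" 'BEGIN { RS = "\r\n\r\n" } { printf "%s", $0 }'
}

# The arithmetic expansion strips any padding that wc may print.
in_bytes=$(( $(payload | wc -c) ))
out_bytes=$(( $(payload | copy_like_geturl | wc -c) ))
echo "in: $in_bytes bytes, out: $out_bytes bytes"
```

With gawk this reports 8 bytes in and 4 bytes out: the four separator
bytes are silently dropped from the copy.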