gawkinet: GETURL

3.2 GETURL: Retrieving Web Pages
================================

GETURL is a versatile building block for shell scripts that need to
retrieve files from the Internet. It takes a web address as a
command-line parameter and tries to retrieve the contents of this
address. The contents are printed to standard output, while the header
is printed to '/dev/stderr'. A surrounding shell script could analyze
the contents and extract the text or the links. An ASCII browser could
be written around GETURL. More interestingly, web robots are
straightforward to write on top of GETURL. On the Internet, you can find
several programs of the same name that do the same job. They are
usually much more complex internally and at least 10 times longer.

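As a rough illustration of such a surrounding script, the following
sketch defines a shell function that lists the link targets found in
HTML read from standard input. The function name 'extract_links' and
the file name 'geturl.awk' in the usage note are assumptions, and the
pattern only recognizes double-quoted 'href' attributes, so treat it as
a starting point rather than a real HTML parser.

```shell
# extract_links: print the target of every href="..." attribute found
# in the HTML text read from standard input, one target per line.
# Hypothetical helper; only double-quoted attribute values are handled.
extract_links() {
    awk '{
        line = $0
        # match() sets RSTART and RLENGTH for the leftmost match
        while (match(line, /[Hh][Rr][Ee][Ff]="[^"]*"/)) {
            # skip the 6 characters of href=" and drop the closing quote
            print substr(line, RSTART + 6, RLENGTH - 7)
            # continue searching in the rest of the line
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}
```

Assuming the program shown later in this section is saved as
'geturl.awk', the function could be used like this:

     gawk -f geturl.awk http://www.gnu.org/ 2>/dev/null | extract_links
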
First, GETURL checks that it was called with exactly one web address;
otherwise, it prints a short usage message and exits. Then, it checks
whether the user chose a special proxy server, whose name is handed
over in the variable 'Proxy' and whose port is handed over in
'ProxyPort' (80 by default). By default, it is assumed that the local
machine ('127.0.0.1') serves as proxy. GETURL uses the 'GET' method by
default to access the web page. By handing over the name of a
different method in the variable 'Method' (such as 'HEAD'), it is
possible to choose a different behavior. With the 'HEAD' method, the
user does not receive the body of the page content, but does receive
the header:

     BEGIN {
       # Exactly one argument (the URL) is required; otherwise show usage
       if (ARGC != 2) {
         print "GETURL - retrieve Web page via HTTP 1.0"
         print "IN:\n    the URL as a command-line parameter"
         print "PARAM(S):\n    -v Proxy=MyProxy"
         print "OUT:\n    the page content on stdout"
         print "    the page header on stderr"
         print "JK 16.05.1997"
         print "ADR 13.08.2000"
         exit
       }
       URL = ARGV[1]; ARGV[1] = ""
       # Defaults for variables that may be set with -v on the command line
       if (Proxy     == "")  Proxy     = "127.0.0.1"
       if (ProxyPort ==  0)  ProxyPort = 80
       if (Method    == "")  Method    = "GET"
       HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
       # An empty line ends the request as well as the reply header
       ORS = RS = "\r\n\r\n"
       print Method " " URL " HTTP/1.0" |& HttpService
       # The first record is the header; all following records are the body
       HttpService |& getline Header
       print Header > "/dev/stderr"
       while ((HttpService |& getline) > 0)
         printf "%s", $0
       close(HttpService)
     }

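To see what the choice of 'Method' changes, it can help to look at the
request itself. The sketch below builds the same request line as the
program above but writes it to standard output instead of a network
connection, so it can be inspected without a proxy. The shell function
name 'show_request' is an assumption made for this illustration.

```shell
# show_request URL [METHOD]: print the HTTP/1.0 request that GETURL
# would send for URL, using GET when no METHOD is given.
# Hypothetical helper; nothing is sent over the network.
show_request() {
    awk -v URL="$1" -v Method="${2:-}" 'BEGIN {
        if (Method == "") Method = "GET"   # same default as GETURL
        ORS = "\r\n\r\n"                   # request line plus an empty line
        print Method " " URL " HTTP/1.0"
    }'
}
```

Piping the result through 'od -c' makes the carriage returns and line
feeds visible:

     show_request http://www.gnu.org/ HEAD | od -c
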
This program can be changed as needed, but be careful with the last
lines. Make sure transmission of binary data is not corrupted by
additional line breaks. Even as it is now, the byte sequence
'"\r\n\r\n"' would disappear if it were contained in binary data,
because it serves as the record separator 'RS' and 'printf "%s", $0'
prints each record without it. Don't get caught in a trap when trying
a quick fix on this one.

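The loss of the separator can be reproduced locally, without any
network connection. The sketch below feeds input through the same 'RS'
setting and output statement as the last lines of GETURL. The function
name 'copy_like_geturl' is an assumption made for this illustration,
and the sketch relies on an awk that accepts a multi-character 'RS', as
gawk does.

```shell
# copy_like_geturl: copy standard input to standard output the way the
# read loop of GETURL does. Hypothetical helper for illustration only.
copy_like_geturl() {
    awk '
        BEGIN { RS = "\r\n\r\n" }   # same record separator as GETURL
        { printf "%s", $0 }         # the separator itself is not printed
    '
}
```

With gawk, the following command shows that the four separator bytes
between 'AB' and 'CD' are missing from the output:

     printf 'AB\r\n\r\nCD' | copy_like_geturl | od -c

In gawk, the variable 'RT' holds the text that matched 'RS' for the
current record, so printing '$0' followed by 'RT' is one possible way
to keep the separator. Whether that is sufficient for arbitrary binary
data is not examined here.
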