gawkinet: URLCHK

3.4 URLCHK: Look for Changed Web Pages
======================================

Most people who make heavy use of Internet resources have a large
bookmark file with pointers to interesting web sites.  It is
impossible to check all of these sites for changes by hand on a
regular basis.  What is needed is a program that automatically looks
at the headers of web pages and reports which ones have changed.
URLCHK does this job: it calls GETURL with the 'HEAD' method to
retrieve only the header of each page and compares the reported
length of the page's body with the length recorded on the previous
run.

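   To see why comparing lengths is useful at all, it helps to recall
what such a header contains.  The response to a 'HEAD' request might
look roughly like this (the server name and the values shown are made
up for illustration):

     HTTP/1.0 200 OK
     Date: Mon, 02 Mar 1998 10:04:11 GMT
     Server: Apache
     Last-Modified: Fri, 27 Feb 1998 14:21:07 GMT
     Content-Length: 4711
     Content-Type: text/html

The 'Content-Length:' line is the one URLCHK extracts; the
'Last-Modified:' line becomes relevant again at the end of this
section.
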
   Like GETURL, this program first checks that it is called with
exactly one command-line parameter.  URLCHK also accepts the same
command-line variables 'Proxy' and 'ProxyPort' as GETURL, because
these variables are handed over to GETURL for each URL that gets
checked.  The one and only parameter is the name of a file that
contains one line for each URL.  The first column of each line holds
the URL, and the second and third columns hold the length of the
URL's body the last two times it was checked.  Now, we follow this
plan:

  1. Read the URLs from the file and remember their most recent lengths

  2. Delete the contents of the file

  3. For each URL, check its new length and write it into the file

  4. If the most recent and the new length differ, tell the user

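   After the program has run a couple of times, such a URL file might
look like this (the addresses and lengths are made up for
illustration):

     http://www.example.com/index.html 4711 4711
     http://www.example.com/news.html 3217 3298

The second line records a page whose body length changed between the
last two checks.
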
   It may seem a bit peculiar to read the URLs from a file together
with their two most recent lengths, but this approach has several
advantages.  You can call the program again and again with the same
file.  After running the program, you can pick out the changed URLs
by extracting those lines whose second and third columns differ (a
one-line command for doing so is shown after the program below):

     BEGIN {
       if (ARGC != 2) {
         print "URLCHK - check if URLs have changed"
         print "IN:\n    the file with URLs as a command-line parameter"
         print "    file contains URL, old length, new length"
         print "PARAMS:\n    -v Proxy=MyProxy -v ProxyPort=8080"
         print "OUT:\n    same as file with URLs"
         print "JK 02.03.1998"
         exit
       }
       URLfile = ARGV[1]; ARGV[1] = ""
       if (Proxy     != "") Proxy     = " -v Proxy="     Proxy
       if (ProxyPort != "") ProxyPort = " -v ProxyPort=" ProxyPort
       while ((getline < URLfile) > 0)
          Length[$1] = $3 + 0
       close(URLfile)      # now, URLfile is read in and can be updated
       GetHeader = "gawk " Proxy ProxyPort " -v Method=\"HEAD\" -f geturl.awk "
       for (i in Length) {
         GetThisHeader = GetHeader i " 2>&1"
         NewLength = 0       # in case no Content-Length header is found
         while ((GetThisHeader | getline) > 0)
           if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
         close(GetThisHeader)
         print i, Length[i], NewLength > URLfile
         if (Length[i] != NewLength)  # report only changed URLs
           print i, Length[i], NewLength
       }
       close(URLfile)
     }

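   Assuming the program is saved as 'urlchk.awk' and the URL file is
called 'urls' (both file names are arbitrary), a typical session
might look like this:

     gawk -f urlchk.awk urls
     gawk -v Proxy=MyProxy -v ProxyPort=8080 -f urlchk.awk urls
     gawk '$2 != $3' urls

The first two commands check all URLs, the second one through a
proxy; the last command prints only those lines of the file whose
second and third columns differ, that is, the URLs that changed.
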
   Another thing that may look strange is the way GETURL is called.
Before calling GETURL, we have to check if the proxy variables need
to be passed on.  If so, we prepare strings that will become part of
the command line later.  In 'GetHeader', we store these strings
together with the constant part of the command line.  Later, in the
loop over the URLs, the URL itself and a redirection operator are
appended to 'GetHeader' to form the command that reads the URL's
header over the Internet.  GETURL always writes the headers to
'/dev/stderr'; that is why the '2>&1' redirection is needed, so that
the headers arrive on standard output and can be read through the
pipe with 'getline'.

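   Without a proxy, the command stored in 'GetThisHeader' for a
single URL therefore looks roughly like this (the URL is a
placeholder); when 'Proxy' and 'ProxyPort' are set, the corresponding
'-v' options appear right after 'gawk':

     gawk  -v Method="HEAD" -f geturl.awk http://www.example.com/ 2>&1
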
   This program is not perfect because it assumes that a change to a
page also changes the length of its body, which is not necessarily
true.  A more advanced approach is to look at some other header line
that holds time information.  But, as always when things get a bit
more complicated, this is left as an exercise to the reader.

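   As a hint for that exercise: one possible starting point (only a
sketch, not tested here) is to remember the 'Last-Modified:' header
line in a new variable (called 'NewStamp' in this sketch) instead of
the content length, for example with a match like the following in
the inner loop:

     if (toupper($0) ~ /LAST-MODIFIED/) NewStamp = $0  # date of last change

Because the date in 'NewStamp' contains spaces, the format of the URL
file would have to change as well; working this out is the exercise.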