gawkinet: URLCHK

3.4 URLCHK: Look for Changed Web Pages
======================================

Most people who make heavy use of Internet resources have a large
bookmark file with pointers to interesting web sites. It is impossible
to regularly check by hand if any of these sites have changed. A
program is needed to automatically look at the headers of web pages and
tell which ones have changed. URLCHK does the comparison after using
GETURL with the 'HEAD' method to retrieve the header.
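
Such a header might look like the following (the values here are
invented); the 'Content-Length' line is the one URLCHK is interested
in:

```
HTTP/1.0 200 OK
Date: Mon, 02 Mar 1998 09:00:00 GMT
Content-Length: 4096
Content-Type: text/html
```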

Like GETURL, this program first checks that it is called with exactly
one command-line parameter. URLCHK also takes the same command-line
variables 'Proxy' and 'ProxyPort' as GETURL, because these variables are
handed over to GETURL for each URL that gets checked. The one and only
parameter is the name of a file that contains one line for each URL.
The first column holds the URL, and the second and third columns hold
the length of the URL's body from the last two times it was checked.
Now, we follow this plan:

  1. Read the URLs from the file and remember their most recent lengths

  2. Delete the contents of the file

  3. For each URL, check its new length and write it into the file

  4. If the most recent and the new length differ, tell the user
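
Such a URL file might look like this (the URLs and lengths are, of
course, made up); the second line is one whose length has changed
since the previous run:

```
http://www.example.com/ 4096 4096
http://www.example.org/ 2048 3172
```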

It may seem a bit peculiar to read the URLs from a file together with
their two most recent lengths, but this approach has several advantages.
You can call the program again and again with the same file. After
running the program, you can recover the changed URLs by extracting
those lines that differ in their second and third columns:

     BEGIN {
       if (ARGC != 2) {
         print "URLCHK - check if URLs have changed"
         print "IN:\n    the file with URLs as a command-line parameter"
         print "    file contains URL, old length, new length"
         print "PARAMS:\n    -v Proxy=MyProxy -v ProxyPort=8080"
         print "OUT:\n    same as file with URLs"
         print "JK 02.03.1998"
         exit
       }
       URLfile = ARGV[1]; ARGV[1] = ""
       if (Proxy     != "") Proxy     = " -v Proxy="     Proxy
       if (ProxyPort != "") ProxyPort = " -v ProxyPort=" ProxyPort
       while ((getline < URLfile) > 0)
         Length[$1] = $3 + 0
       close(URLfile)      # now, URLfile is read in and can be updated
       GetHeader = "gawk " Proxy ProxyPort " -v Method=\"HEAD\" -f geturl.awk "
       for (i in Length) {
         NewLength = 0     # reset, so a missing Content-Length header
                           # does not reuse the previous URL's length
         GetThisHeader = GetHeader i " 2>&1"
         while ((GetThisHeader | getline) > 0)
           if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
         close(GetThisHeader)
         print i, Length[i], NewLength > URLfile
         if (Length[i] != NewLength)   # report only changed URLs
           print i, Length[i], NewLength
       }
       close(URLfile)
     }
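
The extraction of changed lines mentioned above needs nothing more than
a one-line awk call. In this sketch, 'urls.txt' is a hypothetical name
for the URL file:

```shell
# Build a sample URL file (hypothetical URLs and lengths).
printf '%s\n' \
    'http://www.example.com/ 4096 4096' \
    'http://www.example.org/ 2048 3172' > urls.txt

# Print only the lines whose second and third columns differ,
# i.e., the URLs whose body length has changed.
awk '$2 != $3' urls.txt
```

This prints only the second line of the sample file, the one whose old
and new lengths differ.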

Another thing that may look strange is the way GETURL is called.
Before calling GETURL, we have to check if the proxy variables need to
be passed on. If so, we prepare strings that will become part of the
command line later. In the variable 'GetHeader', we store these strings
together with the constant part of the command line. Later, in the loop
over the URLs, the URL itself and a redirection operator are appended to
'GetHeader' to form the command that reads the URL's header over the
Internet. GETURL always writes the headers to '/dev/stderr'. That is
the reason why we need the '2>&1' redirection operator to have the
header piped in.
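
The effect of that redirection can be seen in isolation. In the
following sketch, a dummy command stands in for the GETURL call and
writes a made-up header line to standard error only; without the
trailing '2>&1', 'getline' would read nothing:

```shell
awk 'BEGIN {
  # A stand-in for the GETURL call: a command that writes its
  # "header" line to standard error only.
  cmd = "sh -c \"echo CONTENT-LENGTH: 1234 1>&2\""
  # As in URLCHK, append the redirection so getline can read stderr.
  cmd = cmd " 2>&1"
  while ((cmd | getline line) > 0)
    if (toupper(line) ~ /CONTENT-LENGTH/) {
      split(line, f, " ")
      print "new length is", f[2] + 0
    }
  close(cmd)
}'
```

This prints 'new length is 1234'; dropping the '" 2>&1"' line makes the
loop body never execute, because the header bypasses the pipe.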

This program is not perfect because it assumes that a change in a
page's contents also changes its length, which is not necessarily true.
A more advanced approach is to look at some other header line that
holds time information. But, as always when things get a bit more
complicated, this is left as an exercise to the reader.
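
As a starting point for that exercise, the 'Last-Modified' line of a
header can be extracted much like 'Content-Length' is above. A minimal
sketch, with a made-up header fed in via 'printf':

```shell
# A sample header as GETURL might deliver it (contents invented).
printf '%s\n' \
    'HTTP/1.0 200 OK' \
    'Content-Length: 4096' \
    'Last-Modified: Mon, 02 Mar 1998 09:00:00 GMT' |
awk 'toupper($0) ~ /LAST-MODIFIED/ {
  sub(/^[^:]*: */, "")    # keep only the time stamp after the colon
  print
}'
```

This prints 'Mon, 02 Mar 1998 09:00:00 GMT'; storing and comparing that
string instead of the length would catch changes that leave the length
untouched.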