3.5 WEBGRAB: Extract Links from a Page
======================================

Sometimes it is necessary to extract links from web pages.  Browsers do
it, web robots do it, and sometimes even humans do it.  Since we have a
tool like GETURL at hand, we can solve this problem with some help from
the Bourne shell:

     BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }
     RT != "" {
        command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \
                    " > doc" NR ".html")
        print command
     }

   Notice that the regular expression for URLs is rather crude.  A
precise regular expression is much more complex.  But this one works
rather well.  One problem is that it is unable to find internal links of
an HTML document.  Another problem is that 'ftp', 'telnet', 'news',
'mailto', and other kinds of links are missing in the regular
expression.  However, it is straightforward to add them, if doing so is
necessary for other tasks.

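   A possible sketch of such an extension (not part of the original
program) broadens the record separator to recognize a few more schemes.
The '@' is added to the character class because 'mailto' addresses need
it, and the '//' is made optional because 'news' and 'mailto' URLs omit
it:

     BEGIN {
        # Hypothetical variant of the record separator: also match
        # ftp, telnet, news, and mailto links.
        RS = "(http|ftp|telnet|news|mailto):(//)?" \
             "[#%&\\+\\-\\./0-9\\:;\\?@A-Z_a-z\\~]*"
     }
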
   This program reads an HTML file and prints all the HTTP links that it
finds.  It relies on 'gawk''s ability to use regular expressions as
record separators.  With 'RS' set to a regular expression that matches
links, the second action is executed each time a non-empty link is
found.  We can find the matching link itself in 'RT'.

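   As a small illustration (the input file and URLs here are invented
for this example), suppose a file 'links.html' contains two links.
Each link terminates a record, shows up in 'RT', and produces one
command on standard output:

     $ cat links.html
     <p><a href="http://www.suse.de/">SuSE</a>
     <a href="http://www.gnu.org/">GNU</a></p>
     $ gawk -f webgrab.awk links.html
     gawk -v Proxy=MyProxy -f geturl.awk http://www.suse.de/ > doc1.html
     gawk -v Proxy=MyProxy -f geturl.awk http://www.gnu.org/ > doc2.html
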
   The action could use the 'system()' function to let another GETURL
retrieve the page, but here we use a different approach.  This simple
program prints shell commands that can be piped into 'sh' for execution.
This way it is possible to first extract the links, wrap shell commands
around them, and pipe all the shell commands into a file.  After editing
the file, execution of the file retrieves exactly those files that we
really need.  In case we do not want to edit, we can retrieve all the
pages like this:

     gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh

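   If we do want to review the links first, the edit-then-retrieve
workflow described above might look like the following sketch (the file
name 'fetch.sh' is made up for this example):

     # Collect the generated commands in a file instead of running them.
     gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk > fetch.sh
     # Edit fetch.sh, deleting the lines for pages we do not need ...
     # ... then execute whatever is left.
     sh fetch.sh
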
   After this, you will find the contents of all referenced documents in
files named 'doc*.html', even if they do not contain HTML code.  The
most annoying thing is that we always have to pass the proxy to GETURL.
If you do not want to see the headers of the web pages appear on the
screen, you can redirect them to '/dev/null'.  Watching the headers
appear can be quite interesting, because they reveal details such as
which web server the companies use.  Now, it is clear how the clever
marketing people use web robots to determine the market shares of
Microsoft and Netscape in the web server market.

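   Assuming GETURL behaves as described in its own section and writes
the header to standard error, the headers of the start page and of all
retrieved pages can be discarded by redirecting standard error:

     # Throw away the headers written to standard error by the first
     # geturl.awk call and by each command that webgrab.awk generates.
     gawk -f geturl.awk http://www.suse.de 2> /dev/null |
        gawk -f webgrab.awk | sh 2> /dev/null
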
   Port 80 of any web server is like a small hole in a repellent
firewall.  After attaching a browser to port 80, we usually catch a
glimpse of the bright side of the server (its home page).  With a tool
like GETURL at hand, we are able to discover some of the more concealed
or even "indecent" services (i.e., lacking conformity to standards of
quality).  It can be exciting to see the fancy CGI scripts that lie
there, revealing the inner workings of the server, ready to be called:

   * With a command such as:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/

     some servers give you a directory listing of the CGI files.
     Knowing the names, you can try to call some of them and watch for
     useful results.  Sometimes there are executables in such
     directories (such as Perl interpreters) that you may call remotely.
     If there are subdirectories with configuration data of the web
     server, this can also be quite interesting to read.

   * The well-known Apache web server usually has its CGI files in the
     directory '/cgi-bin'.  There you can often find the scripts
     'test-cgi' and 'printenv'.  Both tell you some things about the
     current connection and the installation of the web server.  Just
     call:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv

   * Sometimes it is even possible to retrieve system files like the web
     server's log file--possibly containing customer data--or even the
     file '/etc/passwd'.  (We don't recommend this!)

   *Caution:* Although this may sound funny or simply irrelevant, we are
talking about severe security holes.  Try to explore your own system
this way and make sure that none of the above reveals too much
information about your system.