gawkinet: WEBGRAB
1
1 3.5 WEBGRAB: Extract Links from a Page
1 ======================================
1
1 Sometimes it is necessary to extract links from web pages. Browsers do
1 it, web robots do it, and sometimes even humans do it. Since we have a
1 tool like GETURL at hand, we can solve this problem with some help from
1 the Bourne shell:
1
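1      BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }
1      RT != "" {
1         # Quote the URL: it may contain '&', ';', or '?', which the shell
1         # would otherwise interpret.
1         command = ("gawk -v Proxy=MyProxy -f geturl.awk '" RT \
1                     "' > doc" NR ".html")
1         print command
1      }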
1
1 Notice that the regular expression for URLs is rather crude. A
1 precise regular expression would be much more complex, but this one
1 works rather well in practice. One limitation is that it cannot find
1 the internal links of an HTML document, since these are written without
1 the leading 'http://'. Another is that 'ftp', 'telnet', 'news',
1 'mailto', and other kinds of links are not covered by the regular
1 expression. However, it is straightforward to add them, if doing so is
1 necessary for other tasks.
1
1 This program reads an HTML file and prints a shell command for each
1 HTTP link that it finds. It relies on 'gawk''s ability to use regular
1 expressions as record separators. With 'RS' set to a regular expression
1 that matches links, the action of the second rule is executed each time
1 a link is found, that is, whenever 'RT' is not empty. The matching link
1 itself is available in 'RT'.
1
1 The action could use the 'system()' function to let another GETURL
1 retrieve the page, but here we use a different approach. This simple
1 program prints shell commands that can be piped into 'sh' for execution.
1 This way it is possible to first extract the links, wrap shell commands
1 around them, and redirect all the shell commands into a file. After
1 editing the file, executing it retrieves exactly those pages that we
1 really need. If we do not want to edit, we can retrieve all the
1 pages like this:
1
1 gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh
1
1 After this, you will find the contents of all referenced documents in
1 files named 'doc*.html', even if they do not contain HTML code. The most
1 annoying thing is that we always have to pass the proxy to GETURL. If
1 you do not like to see the headers of the web pages appear on the
1 screen, you can redirect them to '/dev/null'. Watching the headers
1 appear can be quite instructive, though, because they reveal details
1 such as which web server software each site runs. Now it is clear how
1 clever marketing people use web robots to determine the market shares
1 of Microsoft and Netscape in the web server market.
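The edit-then-run workflow and the header redirection can be sketched as a few shell functions. This is only an outline under stated assumptions: the function names and the file name 'commands.sh' are hypothetical, 'geturl.awk' and 'webgrab.awk' are assumed to be in the current directory, and GETURL is assumed to write the headers to standard error, as in its listing earlier in this chapter.

```shell
# save_commands URL FILE: write the generated fetch commands to FILE,
# discarding the headers of the start page.
save_commands() {
    gawk -f geturl.awk "$1" 2>/dev/null | gawk -f webgrab.awk > "$2"
}

# keep_matching STRING: keep only the command lines that contain STRING.
# A crude stand-in for editing the file by hand.
keep_matching() {
    grep -F -e "$1"
}

# run_commands FILE: execute the reviewed commands; the headers, and any
# error messages, go to /dev/null.
run_commands() {
    sh "$1" 2>/dev/null
}

# Offline demonstration of the filtering step on two made-up commands.
printf '%s\n' 'gawk -f geturl.awk http://a.example/ > doc1.html' \
              'gawk -f geturl.awk http://b.example/ > doc2.html' |
    keep_matching 'b.example'
```

A session might then run 'save_commands http://www.suse.de commands.sh', review or filter 'commands.sh', and finish with 'run_commands commands.sh'. Only the filtering step runs without network access.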
1
1 Port 80 of any web server is like a small hole in a repellent
1 firewall. After attaching a browser to port 80, we usually catch a
1 glimpse of the bright side of the server (its home page). With a tool
1 like GETURL at hand, we are able to discover some of the more concealed
1 or even "indecent" services (i.e., lacking conformity to standards of
1 quality). It can be exciting to see the fancy CGI scripts that lie
1 there, revealing the inner workings of the server, ready to be called:
1
1 * With a command such as:
1
1 gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/
1
1 some servers give you a directory listing of the CGI files.
1 Knowing the names, you can try to call some of them and watch for
1 useful results. Sometimes there are executables in such
1 directories (such as Perl interpreters) that you may call remotely.
1 If there are subdirectories with configuration data of the web
1 server, this can also be quite interesting to read.
1
1 * The well-known Apache web server usually has its CGI files in the
1 directory '/cgi-bin'. There you can often find the scripts
1 'test-cgi' and 'printenv'. Both tell you some things about the
1 current connection and the installation of the web server. Just
1 call:
1
1 gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
1 gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv
1
1 * Sometimes it is even possible to retrieve system files like the web
1 server's log file--possibly containing customer data--or even the
1 file '/etc/passwd'. (We don't recommend this!)
1
1 *Caution:* Although this may sound funny or simply irrelevant, we are
1 talking about severe security holes. Try to explore your own system
1 this way and make sure that none of the above reveals too much
1 information about your system.
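One way to check your own server is to generate the GETURL commands first and review them before running anything, in the same spirit as WEBGRAB. The sketch below is an illustration only: the function name 'audit_commands' is hypothetical, and the path list merely repeats the examples given above.

```shell
# audit_commands HOST: print one GETURL command per path to check on HOST.
# Nothing is fetched; the commands are only printed for review.
audit_commands() {
    for p in /cgi-bin/ /cgi-bin/test-cgi /cgi-bin/printenv; do
        printf 'gawk -f geturl.awk http://%s%s\n' "$1" "$p"
    done
}

# Print the commands for a server running on the local machine.
audit_commands localhost
```

Piping the output into 'sh' runs the checks. If your setup needs a proxy, add the '-v Proxy=...' option to the generated commands as WEBGRAB does. Point this only at hosts that you administer.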
1