gawkinet: STATIST
1
1 3.6 STATIST: Graphing a Statistical Distribution
1 ================================================
1
1 In the HTTP server examples we've shown thus far, we never present an
1 image to the browser and its user. Presenting images is one task.
1 Generating images that reflect some user input and presenting these
1 dynamically generated images is another. In this node, we use GNUPlot
1 for generating '.png', '.ps', or '.gif' files.(1)
1
1 The program we develop takes the statistical parameters of two
1 samples and computes the t-test statistics. As a result, we get the
1 probabilities that the means and the variances of both samples are the
1 same. In order to let the user check plausibility, the program presents
1 an image of the distributions. The statistical computation follows
1 'Numerical Recipes in C: The Art of Scientific Computing' by William H.
1 Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery.
1 Since 'gawk' does not have a built-in function for the computation of
1 the beta function, we use the 'ibeta()' function of GNUPlot. As a side
1 effect, we learn how to use GNUPlot as a sophisticated calculator. The
1 comparison of means is done as in 'tutest', paragraph 14.2, page 613,
1 and the comparison of variances is done as in 'ftest', page 611 in
1 'Numerical Recipes'.
1
1 As usual, we take the site-independent code for servers and append
1 our own functions 'SetUpServer()' and 'HandleGET()':
1
1 function SetUpServer() {
1 TopHeader = "<HTML><title>Statistics with GAWK</title>"
1 TopDoc = "<BODY>\
1 <h2>Please choose one of the following actions:</h2>\
1 <UL>\
1 <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\
1 <LI><A HREF=" MyPrefix "/EnterParameters>Enter Parameters</A></LI>\
1 </UL>"
1 TopFooter = "</BODY></HTML>"
1 GnuPlot = "gnuplot 2>&1"
1 m1=m2=0; v1=v2=1; n1=n2=10
1 }
1
1 Here, you see the menu structure that the user sees. Later, we will
1 see how the program structure of the 'HandleGET()' function reflects the
1 menu structure. What is missing here is the link for the image we
1 generate. In an event-driven environment, request, generation, and
1 delivery of images are separated.
1
1 Notice the way we initialize the 'GnuPlot' command string for the
1 pipe. By default, GNUPlot outputs the generated image via standard
1 output, as well as the results of 'print'(ed) calculations via standard
1 error. The redirection causes standard error to be mixed into standard
1 output, enabling us to read results of calculations with 'getline'. By
1 initializing the statistical parameters with some meaningful defaults,
1 we make sure the user gets an image the first time he uses the program.
1
1 Following is the rather long function 'HandleGET()', which implements
1 the contents of this service by reacting to the different kinds of
1 requests from the browser. Before you start playing with this script,
1 make sure that your browser supports JavaScript and that it also has
1 this option switched on. The script uses a short snippet of JavaScript
1 code for delayed opening of a window with an image. A more detailed
1 explanation follows:
1
1 function HandleGET() {
1 if(MENU[2] == "AboutServer") {
1 Document = "This is a GUI for a statistical computation.\
1 It compares means and variances of two distributions.\
1 It is implemented as one GAWK script and uses GNUPLOT."
1 } else if (MENU[2] == "EnterParameters") {
1 Document = ""
1 if ("m1" in GETARG) { # are there parameters to compare?
1 Document = Document "<SCRIPT LANGUAGE=\"JavaScript\">\
1 setTimeout(\"window.open(\\\"" MyPrefix "/Image" systime()\
1 "\\\",\\\"dist\\\", \\\"status=no\\\");\", 1000); </SCRIPT>"
1 m1 = GETARG["m1"]; v1 = GETARG["v1"]; n1 = GETARG["n1"]
1 m2 = GETARG["m2"]; v2 = GETARG["v2"]; n2 = GETARG["n2"]
1 t = (m1-m2)/sqrt(v1/n1+v2/n2)
1 df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) \
1 + (v2/n2)*(v2/n2) /(n2-1))
1 if (v1>v2) {
1 f = v1/v2
1 df1 = n1 - 1
1 df2 = n2 - 1
1 } else {
1 f = v2/v1
1 df1 = n2 - 1
1 df2 = n1 - 1
1 }
1 print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")" |& GnuPlot
1 print "pF=2.0*ibeta(" df2/2 "," df1/2 "," \
1 df2/(df2+df1*f) ")" |& GnuPlot
1 print "print pt, pF" |& GnuPlot
1 RS="\n"; GnuPlot |& getline; RS="\r\n" # $1 is pt, $2 is pF
1 print "invsqrt2pi=1.0/sqrt(2.0*pi)" |& GnuPlot
1 print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)" |& GnuPlot
1 print "set term png small color" |& GnuPlot
1 #print "set term postscript color" |& GnuPlot
1 #print "set term gif medium size 320,240" |& GnuPlot
1 print "set yrange[-0.3:]" |& GnuPlot
1 print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left" |& GnuPlot
1 print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left" |& GnuPlot
1 print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\
1 mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot
1 print "quit" |& GnuPlot
1 GnuPlot |& getline Image
1 while ((GnuPlot |& getline) > 0)
1 Image = Image RS $0
1 close(GnuPlot)
1 }
1 Document = Document "\
1 <h3>Do these samples have the same Gaussian distribution?</h3>\
1 <FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\
1 <TR>\
1 <TD>1. Mean </TD>
1 <TD><input type=text name=m1 value=" m1 " size=8></TD>\
1 <TD>1. Variance</TD>
1 <TD><input type=text name=v1 value=" v1 " size=8></TD>\
1 <TD>1. Count </TD>
1 <TD><input type=text name=n1 value=" n1 " size=8></TD>\
1 </TR><TR>\
1 <TD>2. Mean </TD>
1 <TD><input type=text name=m2 value=" m2 " size=8></TD>\
1 <TD>2. Variance</TD>
1 <TD><input type=text name=v2 value=" v2 " size=8></TD>\
1 <TD>2. Count </TD>
1 <TD><input type=text name=n2 value=" n2 " size=8></TD>\
1 </TR> <input type=submit value=\"Compute\">\
1 </TABLE></FORM><BR>"
1 } else if (MENU[2] ~ "Image") {
1 Reason = "OK" ORS "Content-type: image/png"
1 #Reason = "OK" ORS "Content-type: application/x-postscript"
1 #Reason = "OK" ORS "Content-type: image/gif"
1 Header = Footer = ""
1 Document = Image
1 }
1 }
1
1 As usual, we give a short description of the service in the first
1 menu choice. The third menu choice shows us that generation and
1 presentation of an image are two separate actions. While the latter
1 takes place quite instantly in the third menu choice, the former takes
1 place in the much longer second choice. Image data passes from the
1 generating action to the presenting action via the variable 'Image' that
1 contains a complete '.png' image, which is otherwise stored in a file.
1 If you prefer '.ps' or '.gif' images over the default '.png' images, you
1 may select these options by uncommenting the appropriate lines. But
1 remember to do so in two places: when telling GNUPlot which kind of
1 images to generate, and when transmitting the image at the end of the
1 program.
1
1 Looking at the end of the program, the way we pass the 'Content-type'
1 to the browser is a bit unusual. It is appended to the 'OK' of the
1 first header line to make sure the type information becomes part of the
1 header. The other variables that get transmitted across the network are
1 made empty, because in this case we do not have an HTML document to
1 transmit, but rather raw image data to contain in the body.
1
1 Most of the work is done in the second menu choice. It starts with a
1 strange JavaScript code snippet. When first implementing this server,
1 we used a short '"<IMG SRC=" MyPrefix "/Image>"' here. But then
1 browsers got smarter and tried to improve on speed by requesting the
1 image and the HTML code at the same time. When doing this, the browser
1 tries to build up a connection for the image request while the request
1 for the HTML text is not yet completed. The browser tries to connect to
1 the 'gawk' server on port 8080 while port 8080 is still in use for
1 transmission of the HTML text. The connection for the image cannot be
1 built up, so the image appears as "broken" in the browser window. We
1 solved this problem by telling the browser to open a separate window for
1 the image, but only after a delay of 1000 milliseconds. By this time,
1 the server should be ready for serving the next request.
1
1 But there is one more subtlety in the JavaScript code. Each time the
1 JavaScript code opens a window for the image, the name of the image is
1 appended with a timestamp ('systime()'). Why this constant change of
1 name for the image? Initially, we always named the image 'Image', but
1 then the Netscape browser noticed the name had _not_ changed since the
1 previous request and displayed the previous image (caching behavior).
1 The server core is implemented so that browsers are told _not_ to cache
1 anything. Obviously HTTP requests do not always work as expected. One
1 way to circumvent the cache of such overly smart browsers is to change
1 the name of the image with each request. These three lines of
1 JavaScript caused us a lot of trouble.
1
1 The rest can be broken down into two phases. At first, we check if
1 there are statistical parameters. When the program is first started,
1 there usually are no parameters because it enters the page coming from
1 the top menu. Then, we only have to present the user a form that he can
1 use to change statistical parameters and submit them. Subsequently, the
1 submission of the form causes the execution of the first phase because
1 _now_ there _are_ parameters to handle.
1
1 Now that we have parameters, we know there will be an image
1 available. Therefore we insert the JavaScript code here to initiate the
1 opening of the image in a separate window. Then, we prepare some
1 variables that will be passed to GNUPlot for calculation of the
1 probabilities. Prior to reading the results, we must temporarily change
1 'RS' because GNUPlot separates lines with newlines. After instructing
1 GNUPlot to generate a '.png' (or '.ps' or '.gif') image, we initiate the
1 insertion of some text, explaining the resulting probabilities. The
1 final 'plot' command actually generates the image data. This raw binary
1 has to be read in carefully without adding, changing, or deleting a
1 single byte. Hence the unusual initialization of 'Image' and completion
1 with a 'while' loop.
1
1 When using this server, it soon becomes clear that it is far from
1 being perfect. It mixes source code of six scripting languages or
1 protocols:
1
1 * GNU 'awk' implements a server for the protocol:
1 * HTTP which transmits:
1 * HTML text which contains a short piece of:
1 * JavaScript code opening a separate window.
1 * A Bourne shell script is used for piping commands into:
1 * GNUPlot to generate the image to be opened.
1
1 After all this work, the GNUPlot image opens in the JavaScript window
1 where it can be viewed by the user.
1
1 It is probably better not to mix up so many different languages. The
1 result is not very readable. Furthermore, the statistical part of the
1 server does not take care of invalid input. Among others, using
1 negative variances will cause invalid results.
1
1 ---------- Footnotes ----------
1
1 (1) Due to licensing problems, the default installation of GNUPlot
1 disables the generation of '.gif' files. If your installed version does
1 not accept 'set term gif', just download and install the most recent
1 version of GNUPlot and the GD library (http://www.boutell.com/gd/) by
1 Thomas Boutell. Otherwise you still have the chance to generate some
1 ASCII-art style images with GNUPlot by using 'set term dumb'. (We tried
1 it and it worked.)
1