gawkinet: STATIST

1 
1 3.6 STATIST: Graphing a Statistical Distribution
1 ================================================
1 
1 In the HTTP server examples we've shown thus far, we never present an
1 image to the browser and its user.  Presenting images is one task.
1 Generating images that reflect some user input and presenting these
1 dynamically generated images is another.  In this node, we use GNUPlot
1 for generating '.png', '.ps', or '.gif' files.(1)
1 
1    The program we develop takes the statistical parameters of two
1 samples and computes the t-test statistics.  As a result, we get the
1 probabilities that the means and the variances of both samples are the
1 same.  In order to let the user check plausibility, the program presents
1 an image of the distributions.  The statistical computation follows
1 'Numerical Recipes in C: The Art of Scientific Computing' by William H.
1 Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery.
1 Since 'gawk' does not have a built-in function for the computation of
1 the beta function, we use the 'ibeta()' function of GNUPlot.  As a side
1 effect, we learn how to use GNUPlot as a sophisticated calculator.  The
1 comparison of means is done as in 'tutest', paragraph 14.2, page 613,
1 and the comparison of variances is done as in 'ftest', page 611 in
1 'Numerical Recipes'.
1 
1    As usual, we take the site-independent code for servers and append
1 our own functions 'SetUpServer()' and 'HandleGET()':
1 
1      function SetUpServer() {
1        TopHeader = "<HTML><title>Statistics with GAWK</title>"
1        TopDoc = "<BODY>\
1         <h2>Please choose one of the following actions:</h2>\
1         <UL>\
1          <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\
1          <LI><A HREF=" MyPrefix "/EnterParameters>Enter Parameters</A></LI>\
1         </UL>"
1        TopFooter  = "</BODY></HTML>"
1        GnuPlot    = "gnuplot 2>&1"
1        m1=m2=0;    v1=v2=1;    n1=n2=10
1      }
1 
1    Here, you see the menu structure that the user sees.  Later, we will
1 see how the program structure of the 'HandleGET()' function reflects the
1 menu structure.  What is missing here is the link for the image we
1 generate.  In an event-driven environment, request, generation, and
1 delivery of images are separated.
1 
1    Notice the way we initialize the 'GnuPlot' command string for the
1 pipe.  By default, GNUPlot outputs the generated image via standard
1 output, as well as the results of 'print'(ed) calculations via standard
1 error.  The redirection causes standard error to be mixed into standard
1 output, enabling us to read results of calculations with 'getline'.  By
1 initializing the statistical parameters with some meaningful defaults,
1 we make sure the user gets an image the first time he uses the program.
1 
1    Following is the rather long function 'HandleGET()', which implements
1 the contents of this service by reacting to the different kinds of
1 requests from the browser.  Before you start playing with this script,
1 make sure that your browser supports JavaScript and that it also has
1 this option switched on.  The script uses a short snippet of JavaScript
1 code for delayed opening of a window with an image.  A more detailed
1 explanation follows:
1 
1      function HandleGET() {
1        if(MENU[2] == "AboutServer") {
1          Document  = "This is a GUI for a statistical computation.\
1            It compares means and variances of two distributions.\
1            It is implemented as one GAWK script and uses GNUPLOT."
1        } else if (MENU[2] == "EnterParameters") {
1          Document = ""
1          if ("m1" in GETARG) {     # are there parameters to compare?
1            Document = Document "<SCRIPT LANGUAGE=\"JavaScript\">\
1              setTimeout(\"window.open(\\\"" MyPrefix "/Image" systime()\
1               "\\\",\\\"dist\\\", \\\"status=no\\\");\", 1000); </SCRIPT>"
1            m1 = GETARG["m1"]; v1 = GETARG["v1"]; n1 = GETARG["n1"]
1            m2 = GETARG["m2"]; v2 = GETARG["v2"]; n2 = GETARG["n2"]
1            t = (m1-m2)/sqrt(v1/n1+v2/n2)
1            df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) \
1                 + (v2/n2)*(v2/n2) /(n2-1))
1            if (v1>v2) {
1                f = v1/v2
1                df1 = n1 - 1
1                df2 = n2 - 1
1            } else {
1                f = v2/v1
1                df1 = n2 - 1
1                df2 = n1 - 1
1            }
1            print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")"  |& GnuPlot
1            print "pF=2.0*ibeta(" df2/2 "," df1/2 "," \
1                  df2/(df2+df1*f) ")"                    |& GnuPlot
1            print "print pt, pF"                         |& GnuPlot
1            RS="\n"; GnuPlot |& getline; RS="\r\n"    # $1 is pt, $2 is pF
1            print "invsqrt2pi=1.0/sqrt(2.0*pi)"          |& GnuPlot
1            print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)" |& GnuPlot
1            print "set term png small color"             |& GnuPlot
1            #print "set term postscript color"           |& GnuPlot
1            #print "set term gif medium size 320,240"    |& GnuPlot
1            print "set yrange[-0.3:]"                    |& GnuPlot
1            print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left"  |& GnuPlot
1            print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left"  |& GnuPlot
1            print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\
1              mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot
1            print "quit"                                         |& GnuPlot
1            GnuPlot |& getline Image
1            while ((GnuPlot |& getline) > 0)
1                Image = Image RS $0
1            close(GnuPlot)
1          }
1          Document = Document "\
1          <h3>Do these samples have the same Gaussian distribution?</h3>\
1          <FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\
1          <TR>\
1          <TD>1. Mean    </TD>
1          <TD><input type=text name=m1 value=" m1 " size=8></TD>\
1          <TD>1. Variance</TD>
1          <TD><input type=text name=v1 value=" v1 " size=8></TD>\
1          <TD>1. Count   </TD>
1          <TD><input type=text name=n1 value=" n1 " size=8></TD>\
1          </TR><TR>\
1          <TD>2. Mean    </TD>
1          <TD><input type=text name=m2 value=" m2 " size=8></TD>\
1          <TD>2. Variance</TD>
1          <TD><input type=text name=v2 value=" v2 " size=8></TD>\
1          <TD>2. Count   </TD>
1          <TD><input type=text name=n2 value=" n2 " size=8></TD>\
1          </TR>                   <input type=submit value=\"Compute\">\
1          </TABLE></FORM><BR>"
1        } else if (MENU[2] ~ "Image") {
1          Reason = "OK" ORS "Content-type: image/png"
1          #Reason = "OK" ORS "Content-type: application/x-postscript"
1          #Reason = "OK" ORS "Content-type: image/gif"
1          Header = Footer = ""
1          Document = Image
1        }
1      }
1 
1    As usual, we give a short description of the service in the first
1 menu choice.  The third menu choice shows us that generation and
1 presentation of an image are two separate actions.  While the latter
1 takes place quite instantly in the third menu choice, the former takes
1 place in the much longer second choice.  Image data passes from the
1 generating action to the presenting action via the variable 'Image' that
1 contains a complete '.png' image, which is otherwise stored in a file.
1 If you prefer '.ps' or '.gif' images over the default '.png' images, you
1 may select these options by uncommenting the appropriate lines.  But
1 remember to do so in two places: when telling GNUPlot which kind of
1 images to generate, and when transmitting the image at the end of the
1 program.
1 
1    Looking at the end of the program, the way we pass the 'Content-type'
1 to the browser is a bit unusual.  It is appended to the 'OK' of the
1 first header line to make sure the type information becomes part of the
1 header.  The other variables that get transmitted across the network are
1 made empty, because in this case we do not have an HTML document to
1 transmit, but rather raw image data to contain in the body.
1 
1    Most of the work is done in the second menu choice.  It starts with a
1 strange JavaScript code snippet.  When first implementing this server,
1 we used a short '"<IMG SRC=" MyPrefix "/Image>"' here.  But then
1 browsers got smarter and tried to improve on speed by requesting the
1 image and the HTML code at the same time.  When doing this, the browser
1 tries to build up a connection for the image request while the request
1 for the HTML text is not yet completed.  The browser tries to connect to
1 the 'gawk' server on port 8080 while port 8080 is still in use for
1 transmission of the HTML text.  The connection for the image cannot be
1 built up, so the image appears as "broken" in the browser window.  We
1 solved this problem by telling the browser to open a separate window for
1 the image, but only after a delay of 1000 milliseconds.  By this time,
1 the server should be ready for serving the next request.
1 
1    But there is one more subtlety in the JavaScript code.  Each time the
1 JavaScript code opens a window for the image, the name of the image is
1 appended with a timestamp ('systime()').  Why this constant change of
1 name for the image?  Initially, we always named the image 'Image', but
1 then the Netscape browser noticed the name had _not_ changed since the
1 previous request and displayed the previous image (caching behavior).
1 The server core is implemented so that browsers are told _not_ to cache
1 anything.  Obviously HTTP requests do not always work as expected.  One
1 way to circumvent the cache of such overly smart browsers is to change
1 the name of the image with each request.  These three lines of
1 JavaScript caused us a lot of trouble.
1 
1    The rest can be broken down into two phases.  At first, we check if
1 there are statistical parameters.  When the program is first started,
1 there usually are no parameters because it enters the page coming from
1 the top menu.  Then, we only have to present the user a form that he can
1 use to change statistical parameters and submit them.  Subsequently, the
1 submission of the form causes the execution of the first phase because
1 _now_ there _are_ parameters to handle.
1 
1    Now that we have parameters, we know there will be an image
1 available.  Therefore we insert the JavaScript code here to initiate the
1 opening of the image in a separate window.  Then, we prepare some
1 variables that will be passed to GNUPlot for calculation of the
1 probabilities.  Prior to reading the results, we must temporarily change
1 'RS' because GNUPlot separates lines with newlines.  After instructing
1 GNUPlot to generate a '.png' (or '.ps' or '.gif') image, we initiate the
1 insertion of some text, explaining the resulting probabilities.  The
1 final 'plot' command actually generates the image data.  This raw binary
1 has to be read in carefully without adding, changing, or deleting a
1 single byte.  Hence the unusual initialization of 'Image' and completion
1 with a 'while' loop.
1 
1    When using this server, it soon becomes clear that it is far from
1 being perfect.  It mixes source code of six scripting languages or
1 protocols:
1 
1    * GNU 'awk' implements a server for the protocol:
1    * HTTP which transmits:
1    * HTML text which contains a short piece of:
1    * JavaScript code opening a separate window.
1    * A Bourne shell script is used for piping commands into:
1    * GNUPlot to generate the image to be opened.
1 
1    After all this work, the GNUPlot image opens in the JavaScript window
1 where it can be viewed by the user.
1 
1    It is probably better not to mix up so many different languages.  The
1 result is not very readable.  Furthermore, the statistical part of the
1 server does not take care of invalid input.  Among others, using
1 negative variances will cause invalid results.
1 
1    ---------- Footnotes ----------
1 
1    (1) Due to licensing problems, the default installation of GNUPlot
1 disables the generation of '.gif' files.  If your installed version does
1 not accept 'set term gif', just download and install the most recent
1 version of GNUPlot and the GD library (http://www.boutell.com/gd/) by
1 Thomas Boutell.  Otherwise you still have the chance to generate some
1 ASCII-art style images with GNUPlot by using 'set term dumb'.  (We tried
1 it and it worked.)
1