gawkinet: PROTBASE

1 
1 3.10 PROTBASE: Searching Through A Protein Database
1 ===================================================
1 
1      Hoare's Law of Large Problems: Inside every large problem is a
1      small problem struggling to get out.
1 
1    Yahoo's database of stock market data is just one among the many
1 large databases on the Internet.  Another one is located at NCBI
1 (National Center for Biotechnology Information).  Established in 1988 as
1 a national resource for molecular biology information, NCBI creates
1 public databases, conducts research in computational biology, develops
1 software tools for analyzing genome data, and disseminates biomedical
1 information.  In this section, we look at one of NCBI's public services,
1 which is called BLAST (Basic Local Alignment Search Tool).
1 
1    You probably know that the information necessary for reproducing
1 living cells is encoded in the genetic material of the cells.  The
1 genetic material is a very long chain of four base nucleotides.  It is
1 the order of appearance (the sequence) of nucleotides which contains the
1 information about the substance to be produced.  Scientists in
1 biotechnology often find a specific fragment, determine the nucleotide
1 sequence, and need to know where the sequence at hand comes from.  This
1 is where the large databases enter the game.  At NCBI, databases store
1 the knowledge about which sequences have ever been found and where they
1 have been found.  When the scientist sends his sequence to the BLAST
1 service, the server looks for regions of genetic material in its
1 database which look the most similar to the delivered nucleotide
1 sequence.  After a search time of some seconds or minutes the server
1 sends an answer to the scientist.  In order to make access simple, NCBI
1 chose to offer their database service through popular Internet
1 protocols.  There are four basic ways to use the so-called BLAST
1 services:
1 
1    * The easiest way to use BLAST is through the web.  Users may simply
1      point their browsers at the NCBI home page and link to the BLAST
1      pages.  NCBI provides a stable URL that may be used to perform
1      BLAST searches without interactive use of a web browser.  This is
1      what we will do later in this section.  A demonstration client and
1      a 'README' file demonstrate how to access this URL.
1 
1    * Currently, 'blastcl3' is the standard network BLAST client.  You
1      can download 'blastcl3' from the anonymous FTP location.
1 
1    * BLAST 2.0 can be run locally as a full executable and can be used
1      to run BLAST searches against private local databases, or
1      downloaded copies of the NCBI databases.  BLAST 2.0 executables may
1      be found on the NCBI anonymous FTP server.
1 
1    * The NCBI BLAST Email server is the best option for people without
1      convenient access to the web.  A similarity search can be performed
1      by sending a properly formatted mail message containing the
1      nucleotide or protein query sequence to <blast@ncbi.nlm.nih.gov>.
1      The query sequence is compared against the specified database using
1      the BLAST algorithm and the results are returned in an email
1      message.  For more information on formulating email BLAST searches,
1      you can send a message consisting of the word "HELP" to the same
1      address, <blast@ncbi.nlm.nih.gov>.
1 
1    Our starting point is the demonstration client mentioned in the first
1 option.  The 'README' file that comes along with the client explains the
1 whole process in a nutshell.  In the rest of this section, we first show
1 what such requests look like.  Then we show how to use 'gawk' to
1 implement a client in about 10 lines of code.  Finally, we show how to
1 interpret the result returned from the service.
1 
1    Sequences are expected to be represented in the standard IUB/IUPAC
1 amino acid and nucleic acid codes, with these exceptions: lower-case
1 letters are accepted and are mapped into upper-case; a single hyphen or
1 dash can be used to represent a gap of indeterminate length; and in
1 amino acid sequences, 'U' and '*' are acceptable letters (see below).
1 Before submitting a request, any numerical digits in the query sequence
1 should either be removed or replaced by appropriate letter codes (e.g.,
1 'N' for unknown nucleic acid residue or 'X' for unknown amino acid
1 residue).  The nucleic acid codes supported are:
1 
1      A --> adenosine               M --> A C (amino)
1      C --> cytidine                S --> G C (strong)
1      G --> guanine                 W --> A T (weak)
1      T --> thymidine               B --> G T C
1      U --> uridine                 D --> G A T
1      R --> G A (purine)            H --> A C T
1      Y --> T C (pyrimidine)        V --> G C A
1      K --> G T (keto)              N --> A G C T (any)
1                                    -  gap of indeterminate length
1 
1    Now you know the alphabet of nucleotide sequences.  The last two
1 lines of the following example query show you such a sequence, which is
1 obviously made up only of elements of the alphabet just described.
1 Store this example query into a file named 'protbase.request'.  You are
1 now ready to send it to the server with the demonstration client.
1 
1      PROGRAM blastn
1      DATALIB month
1      EXPECT 0.75
1      BEGIN
1      >GAWK310 the gawking gene GNU AWK
1      tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
1      caccaccatggacagcaaa
1 
1    The actual search request begins with the mandatory parameter
1 'PROGRAM' in the first column followed by the value 'blastn' (the name
1 of the program) for searching nucleic acids.  The next line contains the
1 mandatory search parameter 'DATALIB' with the value 'month' for the
1 newest nucleic acid sequences.  The third line contains an optional
1 'EXPECT' parameter and the value desired for it.  The fourth line
1 contains the mandatory 'BEGIN' directive, followed by the query sequence
1 in FASTA/Pearson format.  Each line of information must be less than 80
1 characters in length.
1 
1    The "month" database contains all new or revised sequences released
1 in the last 30 days and is useful for searching against new sequences.
1 There are five different blast programs, 'blastn' being the one that
1 compares a nucleotide query sequence against a nucleotide sequence
1 database.
1 
1    The last server directive that must appear in every request is the
1 'BEGIN' directive.  The query sequence should immediately follow the
1 'BEGIN' directive and must appear in FASTA/Pearson format.  A sequence
1 in FASTA/Pearson format begins with a single-line description.  The
1 description line, which is required, is distinguished from the lines of
1 sequence data that follow it by having a greater-than ('>') symbol in
1 the first column.  For the purposes of the BLAST server, the text of the
1 description is arbitrary.
1 
1    If you prefer to use a client written in 'gawk', just store the
1 following 10 lines of code into a file named 'protbase.awk' and use this
1 client instead.  Invoke it with 'gawk -f protbase.awk protbase.request'.
1 Then wait a minute and watch the result coming in.  In order to
1 replicate the demonstration client's behavior as closely as possible,
1 this client does not use a proxy server.  We could also have extended
1 the client program in ⇒Retrieving Web Pages GETURL, to implement
1 the client request from 'protbase.awk' as a special case.
1 
1      { request = request "\n" $0 }
1 
1      END {
1        BLASTService     = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80"
1        printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService
1        printf "Content-Length: " length(request) "\n\n"         |& BLASTService
1        printf request                                           |& BLASTService
1        while ((BLASTService |& getline) > 0)
1            print $0
1        close(BLASTService)
1      }
1 
1    The demonstration client from NCBI is 214 lines long (written in C)
1 and it is not immediately obvious what it does.  Our client is so short
1 that it _is_ obvious what it does.  First it loops over all lines of the
1 query and stores the whole query into a variable.  Then the script
1 establishes an Internet connection to the NCBI server and transmits the
1 query by framing it with a proper HTTP request.  Finally it receives and
1 prints the complete result coming from the server.
1 
1    Now, let us look at the result.  It begins with an HTTP header, which
1 you can ignore.  Then there are some comments about the query having
1 been filtered to avoid spuriously high scores.  After this, there is a
1 reference to the paper that describes the software being used for
1 searching the data base.  After a repetition of the original query's
1 description we find the list of significant alignments:
1 
1      Sequences producing significant alignments:                        (bits)  Value
1 
1      gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733...    38  0.20
1      gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115...    38  0.20
1      emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57...    38  0.20
1      emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35...    38  0.20
1      emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H...    38  0.20
1      emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276...    38  0.20
1      gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169...    38  0.20
1 
1    This means that the query sequence was found in seven human
1 chromosomes.  But the value 0.20 (20%) means that the probability of an
1 accidental match is rather high (20%) in all cases and should be taken
1 into account.  You may wonder what the first column means.  It is a key
1 to the specific database in which this occurrence was found.  The unique
1 sequence identifiers reported in the search results can be used as
1 sequence retrieval keys via the NCBI server.  The syntax of sequence
1 header lines used by the NCBI BLAST server depends on the database from
1 which each sequence was obtained.  The table below lists the identifiers
1 for the databases from which the sequences were derived.
1 
1      Database Name                     Identifier Syntax
1      ============================      ========================
1      GenBank                           gb|accession|locus
1      EMBL Data Library                 emb|accession|locus
1      DDBJ, DNA Database of Japan       dbj|accession|locus
1      NBRF PIR                          pir||entry
1      Protein Research Foundation       prf||name
1      SWISS-PROT                        sp|accession|entry name
1      Brookhaven Protein Data Bank      pdb|entry|chain
1      Kabat's Sequences of Immuno...    gnl|kabat|identifier
1      Patents                           pat|country|number
1      GenInfo Backbone Id               bbs|number
1 
1    For example, an identifier might be 'gb|AC021182.14|AC021182', where
1 the 'gb' tag indicates that the identifier refers to a GenBank sequence,
1 'AC021182.14' is its GenBank ACCESSION, and 'AC021182' is the GenBank
1 LOCUS. The identifier contains no spaces, so that a space indicates the
1 end of the identifier.
1 
1    Let us continue in the result listing.  Each of the seven alignments
1 mentioned above is subsequently described in detail.  We will have a
1 closer look at the first of them.
1 
1      >gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4
1                   unordered pieces
1                Length = 176383
1 
1       Score = 38.2 bits (19), Expect = 0.20
1       Identities = 19/19 (100%)
1       Strand = Plus / Plus
1 
1      Query: 35    tggtgaagtgtgtttcttg 53
1                   |||||||||||||||||||
1      Sbjct: 69786 tggtgaagtgtgtttcttg 69804
1 
1    This alignment was located on the human chromosome 7.  The fragment
1 on which part of the query was found had a total length of 176383.  Only
1 19 of the nucleotides matched and the matching sequence ran from
1 character 35 to 53 in the query sequence and from 69786 to 69804 in the
1 fragment on chromosome 7.  If you are still reading at this point, you
1 are probably interested in finding out more about Computational Biology
1 and you might appreciate the following hints.
1 
1   1. There is a book called 'Introduction to Computational Biology' by
1      Michael S. Waterman, which is worth reading if you are seriously
1      interested.  You can find a good book review on the Internet.
1 
1   2. While Waterman's book can explain to you the algorithms employed
1      internally in the database search engines, most practitioners
1      prefer to approach the subject differently.  The applied side of
1      Computational Biology is called Bioinformatics, and emphasizes the
1      tools available for day-to-day work as well as how to actually
1      _use_ them.  One of the very few affordable books on Bioinformatics
1      is 'Developing Bioinformatics Computer Skills'.
1 
1   3. The sequences _gawk_ and _gnuawk_ are in widespread use in the
1      genetic material of virtually every earthly living being.  Let us
1      take this as a clear indication that the divine creator has
1      intended 'gawk' to prevail over other scripting languages such as
1      'perl', 'tcl', or 'python' which are not even proper sequences.
1      (:-)
1