gawkinet: PROTBASE
1
1 3.10 PROTBASE: Searching Through A Protein Database
1 ===================================================
1
1 Hoare's Law of Large Problems: Inside every large problem is a
1 small problem struggling to get out.
1
1 Yahoo's database of stock market data is just one among the many
1 large databases on the Internet. Another one is located at NCBI
1 (National Center for Biotechnology Information). Established in 1988 as
1 a national resource for molecular biology information, NCBI creates
1 public databases, conducts research in computational biology, develops
1 software tools for analyzing genome data, and disseminates biomedical
1 information. In this section, we look at one of NCBI's public services,
1 which is called BLAST (Basic Local Alignment Search Tool).
1
1 You probably know that the information necessary for reproducing
1 living cells is encoded in the genetic material of the cells. The
1 genetic material is a very long chain of four base nucleotides. It is
1 the order of appearance (the sequence) of nucleotides which contains the
1 information about the substance to be produced. Scientists in
1 biotechnology often find a specific fragment, determine the nucleotide
1 sequence, and need to know where the sequence at hand comes from. This
1 is where the large databases enter the game. At NCBI, databases store
1 the knowledge about which sequences have ever been found and where they
1 have been found. When the scientist sends his sequence to the BLAST
1 service, the server looks for regions of genetic material in its
1 database which look the most similar to the delivered nucleotide
1 sequence. After a search time of some seconds or minutes the server
1 sends an answer to the scientist. In order to make access simple, NCBI
1 chose to offer their database service through popular Internet
1 protocols. There are four basic ways to use the so-called BLAST
1 services:
1
1 * The easiest way to use BLAST is through the web. Users may simply
1 point their browsers at the NCBI home page and link to the BLAST
1 pages. NCBI provides a stable URL that may be used to perform
1 BLAST searches without interactive use of a web browser. This is
1 what we will do later in this section. A demonstration client and
1 a 'README' file demonstrate how to access this URL.
1
1 * Currently, 'blastcl3' is the standard network BLAST client. You
1 can download 'blastcl3' from the anonymous FTP location.
1
1 * BLAST 2.0 can be run locally as a full executable and can be used
1 to run BLAST searches against private local databases, or
1 downloaded copies of the NCBI databases. BLAST 2.0 executables may
1 be found on the NCBI anonymous FTP server.
1
1 * The NCBI BLAST Email server is the best option for people without
1 convenient access to the web. A similarity search can be performed
1 by sending a properly formatted mail message containing the
1 nucleotide or protein query sequence to <blast@ncbi.nlm.nih.gov>.
1 The query sequence is compared against the specified database using
1 the BLAST algorithm and the results are returned in an email
1 message. For more information on formulating email BLAST searches,
1 you can send a message consisting of the word "HELP" to the same
1 address, <blast@ncbi.nlm.nih.gov>.
1
1 Our starting point is the demonstration client mentioned in the first
1 option. The 'README' file that comes along with the client explains the
1 whole process in a nutshell. In the rest of this section, we first show
1 what such requests look like. Then we show how to use 'gawk' to
1 implement a client in about 10 lines of code. Finally, we show how to
1 interpret the result returned from the service.
1
1 Sequences are expected to be represented in the standard IUB/IUPAC
1 amino acid and nucleic acid codes, with these exceptions: lower-case
1 letters are accepted and are mapped into upper-case; a single hyphen or
1 dash can be used to represent a gap of indeterminate length; and in
1 amino acid sequences, 'U' and '*' are acceptable letters (see below).
1 Before submitting a request, any numerical digits in the query sequence
1 should either be removed or replaced by appropriate letter codes (e.g.,
1 'N' for unknown nucleic acid residue or 'X' for unknown amino acid
1 residue). The nucleic acid codes supported are:
1
1 A --> adenosine M --> A C (amino)
1 C --> cytidine S --> G C (strong)
1 G --> guanine W --> A T (weak)
1 T --> thymidine B --> G T C
1 U --> uridine D --> G A T
1 R --> G A (purine) H --> A C T
1 Y --> T C (pyrimidine) V --> G C A
1 K --> G T (keto) N --> A G C T (any)
1 - gap of indeterminate length
1
1 Now you know the alphabet of nucleotide sequences. The last two
1 lines of the following example query show you such a sequence, which is
1 obviously made up only of elements of the alphabet just described.
1 Store this example query into a file named 'protbase.request'. You are
1 now ready to send it to the server with the demonstration client.
1
1 PROGRAM blastn
1 DATALIB month
1 EXPECT 0.75
1 BEGIN
1 >GAWK310 the gawking gene GNU AWK
1 tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
1 caccaccatggacagcaaa
1
1 The actual search request begins with the mandatory parameter
1 'PROGRAM' in the first column followed by the value 'blastn' (the name
1 of the program) for searching nucleic acids. The next line contains the
1 mandatory search parameter 'DATALIB' with the value 'month' for the
1 newest nucleic acid sequences. The third line contains an optional
1 'EXPECT' parameter and the value desired for it. The fourth line
1 contains the mandatory 'BEGIN' directive, followed by the query sequence
1 in FASTA/Pearson format. Each line of information must be less than 80
1 characters in length.
1
1 The "month" database contains all new or revised sequences released
1 in the last 30 days and is useful for searching against new sequences.
1 There are five different blast programs, 'blastn' being the one that
1 compares a nucleotide query sequence against a nucleotide sequence
1 database.
1
1 The last server directive that must appear in every request is the
1 'BEGIN' directive. The query sequence should immediately follow the
1 'BEGIN' directive and must appear in FASTA/Pearson format. A sequence
1 in FASTA/Pearson format begins with a single-line description. The
1 description line, which is required, is distinguished from the lines of
1 sequence data that follow it by having a greater-than ('>') symbol in
1 the first column. For the purposes of the BLAST server, the text of the
1 description is arbitrary.
1
1 If you prefer to use a client written in 'gawk', just store the
1 following 10 lines of code into a file named 'protbase.awk' and use this
1 client instead. Invoke it with 'gawk -f protbase.awk protbase.request'.
1 Then wait a minute and watch the result coming in. In order to
1 replicate the demonstration client's behavior as closely as possible,
1 this client does not use a proxy server. We could also have extended
1 the client program in ⇒Retrieving Web Pages GETURL, to implement
1 the client request from 'protbase.awk' as a special case.
1
1 { request = request "\n" $0 }
1
1 END {
1 BLASTService = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80"
1 printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService
1 printf "Content-Length: " length(request) "\n\n" |& BLASTService
1 printf request |& BLASTService
1 while ((BLASTService |& getline) > 0)
1 print $0
1 close(BLASTService)
1 }
1
1 The demonstration client from NCBI is 214 lines long (written in C)
1 and it is not immediately obvious what it does. Our client is so short
1 that it _is_ obvious what it does. First it loops over all lines of the
1 query and stores the whole query into a variable. Then the script
1 establishes an Internet connection to the NCBI server and transmits the
1 query by framing it with a proper HTTP request. Finally it receives and
1 prints the complete result coming from the server.
1
1 Now, let us look at the result. It begins with an HTTP header, which
1 you can ignore. Then there are some comments about the query having
1 been filtered to avoid spuriously high scores. After this, there is a
1 reference to the paper that describes the software being used for
1 searching the data base. After a repetition of the original query's
1 description we find the list of significant alignments:
1
1 Sequences producing significant alignments: (bits) Value
1
1 gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733... 38 0.20
1 gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115... 38 0.20
1 emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57... 38 0.20
1 emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35... 38 0.20
1 emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H... 38 0.20
1 emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276... 38 0.20
1 gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169... 38 0.20
1
1 This means that the query sequence was found in seven human
1 chromosomes. But the value 0.20 (20%) means that the probability of an
1 accidental match is rather high (20%) in all cases and should be taken
1 into account. You may wonder what the first column means. It is a key
1 to the specific database in which this occurrence was found. The unique
1 sequence identifiers reported in the search results can be used as
1 sequence retrieval keys via the NCBI server. The syntax of sequence
1 header lines used by the NCBI BLAST server depends on the database from
1 which each sequence was obtained. The table below lists the identifiers
1 for the databases from which the sequences were derived.
1
1 Database Name Identifier Syntax
1 ============================ ========================
1 GenBank gb|accession|locus
1 EMBL Data Library emb|accession|locus
1 DDBJ, DNA Database of Japan dbj|accession|locus
1 NBRF PIR pir||entry
1 Protein Research Foundation prf||name
1 SWISS-PROT sp|accession|entry name
1 Brookhaven Protein Data Bank pdb|entry|chain
1 Kabat's Sequences of Immuno... gnl|kabat|identifier
1 Patents pat|country|number
1 GenInfo Backbone Id bbs|number
1
1 For example, an identifier might be 'gb|AC021182.14|AC021182', where
1 the 'gb' tag indicates that the identifier refers to a GenBank sequence,
1 'AC021182.14' is its GenBank ACCESSION, and 'AC021182' is the GenBank
1 LOCUS. The identifier contains no spaces, so that a space indicates the
1 end of the identifier.
1
1 Let us continue in the result listing. Each of the seven alignments
1 mentioned above is subsequently described in detail. We will have a
1 closer look at the first of them.
1
1 >gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4
1 unordered pieces
1 Length = 176383
1
1 Score = 38.2 bits (19), Expect = 0.20
1 Identities = 19/19 (100%)
1 Strand = Plus / Plus
1
1 Query: 35 tggtgaagtgtgtttcttg 53
1 |||||||||||||||||||
1 Sbjct: 69786 tggtgaagtgtgtttcttg 69804
1
1 This alignment was located on the human chromosome 7. The fragment
1 on which part of the query was found had a total length of 176383. Only
1 19 of the nucleotides matched and the matching sequence ran from
1 character 35 to 53 in the query sequence and from 69786 to 69804 in the
1 fragment on chromosome 7. If you are still reading at this point, you
1 are probably interested in finding out more about Computational Biology
1 and you might appreciate the following hints.
1
1 1. There is a book called 'Introduction to Computational Biology' by
1 Michael S. Waterman, which is worth reading if you are seriously
1 interested. You can find a good book review on the Internet.
1
1 2. While Waterman's book can explain to you the algorithms employed
1 internally in the database search engines, most practitioners
1 prefer to approach the subject differently. The applied side of
1 Computational Biology is called Bioinformatics, and emphasizes the
1 tools available for day-to-day work as well as how to actually
1 _use_ them. One of the very few affordable books on Bioinformatics
1 is 'Developing Bioinformatics Computer Skills'.
1
1 3. The sequences _gawk_ and _gnuawk_ are in widespread use in the
1 genetic material of virtually every earthly living being. Let us
1 take this as a clear indication that the divine creator has
1 intended 'gawk' to prevail over other scripting languages such as
1 'perl', 'tcl', or 'python' which are not even proper sequences.
1 (:-)
1