LifeFormulae Blog » Posts for tag 'Sequences'

ASN.1 to XML: The Process No comments yet

asn2xml

Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.

While working on the filters for the LARTS product, most specifically, the GenBank-like report, I realized how tightly-coupled the NCBI ASN.1/XML is to the toolkit. 

Basically, you’ve got to understand the toolkit code in order to translate what the XML is saying. The infinite extendability and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the of the ASN.1 data structures supporting GenBank data - Bioseq-set. For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and translating protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.

After we developed LARTS, I wrote the GB filter as a Java object.  It was an interesting experience. 

NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on  the NCBI toolkit code for proper interpretation.  

The two most glaring examples are listed below.

Sequence Locations

Determing the location of sequence features for a GenBank data report, is a prime example.  Here are a few simple examples:

primer_bind   order(complement(1..19), 332..350)
gene                complement(join(1560..2030, 3304..3321))
CDS               complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA  join(<1..156, 445..478, 1199..>1559) 5231, 76582..76767, 77517..77720, 78409..78490))
primer_bind   order(complement(1..19), 1106..1124)

For Segmented-sequences:
CDS         join(162922:124..144; 162923: 647..889, 1298..1570)

CD regions locations have frames, bonds have points (that can be packed), strand minus denotes a complement (reverse order), a set of sequence locations for a sequence feature (packed-seqint) denotes a join, and locations can be “order(”ed, or “one-of”, and fuzz-from and fuzz-to has to taken into account for points and sequence intervals.

Sequence Format

DNA sequences are stored in a packed 2-bit or 4-bit per letter format (ncbi2na and ncbi4na).  2na is used if the sequence does not contain ambiguity, otherwise 4na is the format of choice. The sequence must be unpacked to be useful. This takes a basic understanding of Hex(adecimal).
Toolkit

The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file.  (The code is there, but you have to figure out how to compile it into an executable.)

We took the toolkit code and converted  it to Java to produce the GenBank-style output format.  It differs from the actual NCBI GenBank Report in that the LARTS report lists a FASTA-formatted sequence instead of the 10-base pairs per column that the NCBI GenBank Report produces.

The Many Variations of LARTS

GenBankReportFilter.java is provided as an example with Stand-Alone LARTS.  The LARTS Reader enables the GenBank-style report.

Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.

A third option, would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the dowloaded XML data into a GenBank-style report.

Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database.  This includes sample code for  the BLOB and CLOB objects.

The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.

These options provide ready access to NCBI data for your research.

HSEMB Conference and GenBank Release 168.0 No comments yet

Events of particular note this week – 

The HSEMB Conference –

The 26th Annual Houston Conference on Biomedical
Engineering Research (
http://www.hsemb.org) on 19-20 March 2009 at the University of Houston Hilton Hotel and Convention Center.

HSEMB has established the John Halter Award for Professional Achievement in Bioinformatics and Computational Biology.   The late Dr. John Halter is the founder of LifeFormulae, LLC.  http://www.lifeformulae.com/pages/about_jah_memorial.aspx is the link to our memorial to John.

Super Computing 2008

SC08 - Super Computing 2008, the International Conference for High
Performance Computing, Networking, Storage and Analysis. November
15-21, Austin Convention Center, Austin, Texas.
http://sc08.supercomputing.org/

And GenBank Release 168.0. –

Genbank Release 168.- flat files require roughtly 3871 GB for the sequence files only or 396 GB if you include the ’short directory’, ‘index’ and the *.txt files.  The ASN.1 data files require approximately 338 GB.

Recent statistics for non-WGS (Whole Genome Sequence), non-CON (Contig)  sequences are given below.

  Release  Date       Base Pairs   Entries

  167      Aug 2008   95033791652  92748599
  168      Oct 2008   97381682336  96400790

Recent statistics for WGS sequences:

  Release  Date       Base Pairs   Entries

  167      Aug 2008  118593509342  40214247
  168      Oct 2008  136085973423  46108952
 

During the 69 days between the close dates for GenBank Releases 167.0
and 168.0, the non-WGS/non-CON portion of GenBank grew by 2,347,890,684 basepairs and by 3,652,191 sequence records.

During that same period,1,111,311 records were updated. An average of about 69,036 non-WGS/non-CON records were added and/or updated per day.

Between releases 167.0 and 168.0, the WGS component of GenBank grew by 17,492,464,081 basepairs and by 5,894,705 records.

The combined WGS/non-WGS single-release increase of 19.84 Gbp for
Release 168.0 is the largest that GenBank has experienced, to date.
 

That’s a lot data. It’s a long, long way from the set of CDs that came out 4 times a year back in the late 90’s. Somewhere there are drawers and drawers of old Entrez CDs!  (Entrez is the engine used to search NCBI Life Sciences data.)

GenBank is considered an archive of information about sequences.  The nine-digit GI number, once the unique sequence identifier, have been supplanted by the Accession Number.

Speaking of NCBI data, we now have the complete set of the human (Homo Sapiens) data from NCBI’s dbSNP available through our LARTS product.  Currently, the files are not searchable by keyword, or keyword/element path.  This capability should be available to you the early part of next week.

Which brings me to the question.  Is GenBank data as important today as it was, say, 5 years ago?  If not GenBank, what NCBI data is considered critical to your current research and bioinformatics methods, and, if I might also ask, what are you doing with it?

Top of page / Subscribe to new Entries (RSS)