ASN.1 to XML: The Process
asn2xml
Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.
While working on the filters for the LARTS product, most specifically the GenBank-like report, I realized how tightly coupled the NCBI ASN.1/XML is to the toolkit.
Basically, you’ve got to understand the toolkit code in order to translate what the XML is saying. The infinite extensibility and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the ASN.1 data structure supporting GenBank data: the Bioseq-set. For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and its translated protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.
After we developed LARTS, I wrote the GB filter as a Java object. It was an interesting experience.
NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on the NCBI toolkit code for proper interpretation.
The two most glaring examples are listed below.
Sequence Locations
Determining the location of sequence features for a GenBank data report is a prime example. Here are a few simple examples:
primer_bind order(complement(1..19), 332..350)
gene complement(join(1560..2030, 3304..3321))
CDS complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA join(<1..156, 445..478, 1199..>1559)
5231, 76582..76767, 77517..77720, 78409..78490))
primer_bind order(complement(1..19), 1106..1124)
For Segmented-sequences:
CDS join(162922:124..144; 162923: 647..889, 1298..1570)
CD region locations have frames; bonds have points (which can be packed); a minus strand denotes a complement (reverse order); a set of sequence locations for a sequence feature (packed-seqint) denotes a join; locations can be “order(”ed or “one-of”; and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.
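To make a few of these rules concrete, here is a minimal Java sketch of how a list of intervals might be turned into a GenBank-style location string. It is not the LARTS filter or the toolkit code: the class, method, and field names are all hypothetical. It assumes zero-based from/to values (shown one-based in the report), treats the fuzz limits as simple "<" and ">" flags, and assumes that an all-minus-strand set is listed in reverse order and rendered as a single complement around the join. Frames, bonds, "one-of", and segmented sequences are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these names come from LARTS or the NCBI toolkit.
final class LocationSketch {

    /** One sequence interval; from/to are assumed zero-based, as in the ASN.1 data. */
    static final class Interval {
        final int from;
        final int to;
        final boolean minus;        // strand minus
        final boolean partialFrom;  // fuzz-from with a "less than" limit
        final boolean partialTo;    // fuzz-to with a "greater than" limit

        Interval(int from, int to, boolean minus, boolean partialFrom, boolean partialTo) {
            this.from = from;
            this.to = to;
            this.minus = minus;
            this.partialFrom = partialFrom;
            this.partialTo = partialTo;
        }
    }

    /** Renders one interval as a one-based span such as 332..350 or <1..156. */
    static String span(Interval iv) {
        if (iv.from == iv.to && !iv.partialFrom && !iv.partialTo) {
            return String.valueOf(iv.from + 1); // a plain point
        }
        String start = (iv.partialFrom ? "<" : "") + (iv.from + 1);
        String end = (iv.partialTo ? ">" : "") + (iv.to + 1);
        return start + ".." + end;
    }

    /** Renders a set of intervals; op is "join" or "order". */
    static String render(String op, List<Interval> parts) {
        boolean allMinus = !parts.isEmpty();
        for (Interval iv : parts) {
            allMinus &= iv.minus;
        }
        List<Interval> ordered = new ArrayList<>(parts);
        if (allMinus) {
            Collections.reverse(ordered); // assumed: minus-strand parts are stored in reverse order
        }
        List<String> pieces = new ArrayList<>();
        for (Interval iv : ordered) {
            String s = span(iv);
            // Wrap individual minus-strand parts only when the set is mixed-strand.
            pieces.add(iv.minus && !allMinus ? "complement(" + s + ")" : s);
        }
        String body = pieces.size() == 1
                ? pieces.get(0)
                : op + "(" + String.join(", ", pieces) + ")";
        return allMinus ? "complement(" + body + ")" : body;
    }

    public static void main(String[] args) {
        System.out.println(render("join", Arrays.asList(
                new Interval(3303, 3320, true, false, false),
                new Interval(1559, 2029, true, false, false))));
    }
}
```

Run as written, the main method prints complement(join(1560..2030, 3304..3321)), which matches the gene example above. The real toolkit logic handles many more cases than this.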
Sequence Format
DNA sequences are stored in a packed format of 2 or 4 bits per letter (ncbi2na and ncbi4na). 2na is used if the sequence does not contain ambiguity; otherwise 4na is the format of choice. The sequence must be unpacked to be useful. This takes a basic understanding of hexadecimal.
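As an illustration of the unpacking step, here is a small Java sketch. It assumes the packed data arrives as a hexadecimal string, that ncbi2na stores four bases per byte with 0=A, 1=C, 2=G, 3=T starting at the high-order bits, and that ncbi4na stores two bases per byte using one bit per base (1=A, 2=C, 4=G, 8=T, 15=N). The class and method names are hypothetical and are not the LARTS or toolkit code. The sequence length is passed in separately because the last byte may be only partly used.

```java
// Hypothetical sketch: names are illustrative, not from LARTS or the NCBI toolkit.
final class SeqUnpackSketch {

    // 2-bit codes: 0=A, 1=C, 2=G, 3=T
    private static final char[] NCBI2NA = {'A', 'C', 'G', 'T'};

    // 4-bit codes, index 0..15; each set bit adds a base (1=A, 2=C, 4=G, 8=T)
    private static final char[] NCBI4NA = "-ACMGRSVTWYHKDBN".toCharArray();

    /** Converts a hex string such as "1B" into raw bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Unpacks 'length' bases from ncbi2na data, four bases per byte. */
    static String unpack2na(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = packed[i / 4] & 0xFF;   // treat the byte as unsigned
            int shift = 6 - 2 * (i % 4);    // first base sits in the top two bits
            sb.append(NCBI2NA[(b >> shift) & 0x3]);
        }
        return sb.toString();
    }

    /** Unpacks 'length' bases from ncbi4na data, two bases per byte. */
    static String unpack4na(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int shift = (i % 2 == 0) ? 4 : 0; // high nibble first
            sb.append(NCBI4NA[(b >> shift) & 0xF]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unpack2na(fromHex("1B"), 4));
        System.out.println(unpack4na(fromHex("1248"), 4));
    }
}
```

Both calls in main print ACGT: the byte 1B is 00 01 10 11 in binary, and the nibbles 1, 2, 4, 8 are the single-base ncbi4na codes.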
Toolkit
The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file. (The code is there, but you have to figure out how to compile it into an executable.)
We took the toolkit code and converted it to Java to produce the GenBank-style output format. It differs from the actual NCBI GenBank Report in that the LARTS report lists a FASTA-formatted sequence instead of the columns of 10 bases that the NCBI GenBank Report produces.
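The difference between the two sequence layouts can be sketched in a few lines of Java. This is an illustration only, with hypothetical names, and is not the LARTS filter. It assumes the conventional GenBank layout of 60 lowercase bases per line in blocks of ten, preceded by a right-justified starting position, and a FASTA layout with a ">" header line followed by fixed-width sequence lines.

```java
// Hypothetical sketch: names are illustrative, not from LARTS or the NCBI toolkit.
final class SequenceLayoutSketch {

    /** FASTA layout: a ">" header line, then the sequence wrapped at 'width' letters. */
    static String toFasta(String header, String seq, int width) {
        StringBuilder sb = new StringBuilder(">").append(header).append('\n');
        for (int i = 0; i < seq.length(); i += width) {
            sb.append(seq, i, Math.min(i + width, seq.length())).append('\n');
        }
        return sb.toString();
    }

    /** GenBank-style layout: position, then up to six blocks of ten lowercase bases. */
    static String toTenBaseBlocks(String seq) {
        String lower = seq.toLowerCase();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lower.length(); i += 60) {
            sb.append(String.format("%9d", i + 1)); // one-based position of the line's first base
            int lineEnd = Math.min(i + 60, lower.length());
            for (int j = i; j < lineEnd; j += 10) {
                sb.append(' ').append(lower, j, Math.min(j + 10, lineEnd));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String seq = "ACGTACGTACGT";
        System.out.print(toFasta("example", seq, 70));
        System.out.print(toTenBaseBlocks(seq));
    }
}
```

The FASTA form is easier to hand to other tools, since the sequence lines contain nothing but letters.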
The Many Variations of LARTS
GenBankReportFilter.java is provided as an example with Stand-Alone LARTS. The LARTS Reader enables the GenBank-style report.
Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.
A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.
Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database. This includes sample code for the BLOB and CLOB objects.
The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.
These options provide ready access to NCBI data for your research.