LifeFormulae Blog » Posts for tag 'GenBank'

Effective Bioinformatics Programming - Part 5

First, a little irony. In the late ’90s I interviewed with BMC Software in Houston. At that time, BMC was a supporter of big iron, providing report facilities, etc.

When asked what software I currently used, I replied with “GNU software”. The interviewer asked, “What is GNU? I’ve never heard of it.”

I explained that it was free software that you could download from the web, etc. But they weren’t really interested.

Anyway, eWEEK.com ran a feature this week, “MindTouch Names 20 Most Powerful Open-Source Voices of 2010.” The first name mentioned was William Hurley, chief architect of open-source strategy at BMC (http://www.eweek.com/c/a/IT-Management/OSBC-Names-20-Most-Powerful-Open-Source-Voices-of-2010-758420/?kc=EWKNLEDP03232010A).

I guess they’re interested now.

Data Standards

There are any number of sequence data formats. This link at EBI – http://www.ebi.ac.uk/2can/tutorials/formats.html describes several.

What is really astounding is that most of these formats have remained the same over the years. Tab-delimited and CSV (comma-separated values) formats are as prolific as ever, as is the GenBank report.

And equally astonishing is the fact that manipulating the data (e.g. parsing GenBank reports) is done the same way it always was.

True, the Bio libraries such as BioPerl, BioJava, and BioRuby now provide modules that make this easier (if you can install them), but it is still the same old download and parse.
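To show what that “download and parse” still looks like in practice, here is a minimal sketch in plain Java that pulls the LOCUS and DEFINITION lines and the ORIGIN sequence block out of a GenBank flat file. It handles only those fields (real reports have many more, plus continuation lines), and the class name is invented for the example.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Minimal "download and parse" loop for a GenBank flat file.
    // Only LOCUS, DEFINITION, and the ORIGIN sequence block are handled.
    public class GenBankSketch {
        public static void main(String[] args) throws IOException {
            StringBuilder sequence = new StringBuilder();
            boolean inOrigin = false;
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("LOCUS")) {
                        System.out.println("Locus:      " + line.substring(12).trim());
                    } else if (line.startsWith("DEFINITION")) {
                        System.out.println("Definition: " + line.substring(12).trim());
                    } else if (line.startsWith("ORIGIN")) {
                        inOrigin = true;           // numbered sequence lines follow
                    } else if (line.startsWith("//")) {
                        inOrigin = false;          // record terminator
                    } else if (inOrigin) {
                        // strip the base-count columns and whitespace, keep the letters
                        sequence.append(line.replaceAll("[0-9\\s]", ""));
                    }
                }
            }
            System.out.println("Sequence length: " + sequence.length());
        }
    }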

There are also several groups trying to standardize sequence data. The SO (Sequence Ontology) group (http://www.sequenceontology.org) is trying to do for sequence annotations what GO (Gene Ontology - http://www.geneontology.org) did for genes and gene product attributes.

MIGS (the Minimum Information About a Genome Sequence spec, at http://nora.nerc.ac.uk/5548/) is following the course of the MAGE MIAME standard (Minimum Information About a Microarray Experiment, at http://www.mged.org/Workgroups/MIAME/miame.html). Good luck with that, as many scientists have openly voiced objections to that standard.

XML and the Web

XML (eXtensible Markup Language) and WSDL (Web Services Description Language) are one method of easing the interchange of data. Links at – http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/Web_Services_Description_Language.

There are a number of drawbacks to this setup.

Not all of the sequence data is available in XML, or in well-formed XML.

Some XML, such as NCBI XML, needs further interpretation. For example, the sequence feature (annotation) locations must be “translated” for further use.

XSLT has performance issues and is size-limited. We tried processing LARTS-converted NCBI ASN.1 GenBank XML data with XSLT and found there were definite size limitations; a streaming approach (sketched after this list) is one way around them.

Using WSDL means exposing yourself to the world via the web.

JavaScript has too many security questions to be considered seriously.
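One way around those XSLT size limits is to stream the XML instead of loading the whole document. Here is a minimal sketch using Java’s built-in StAX API, which reads one event at a time and so is not bound by document size. The element name Seqdesc_title is only an assumed example of a tag to look for, not a guaranteed NCBI element.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Streaming sketch: scan an arbitrarily large XML file for one element
    // without building a tree, so document size is not the limit it is
    // with XSLT. The element name below is an assumed example.
    public class StreamScan {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Seqdesc_title".equals(reader.getLocalName())) {
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }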

Software Development

Software development takes time and the right people. True, there is a lot of open source software out there, but I’ve mentioned the perils of that method in a previous blog.

A scientist with a grant to produce results dependent on computer analysis is only going to write code that is just good enough to back up those findings (or find someone, read: a post-doc, who can write that code very cheaply).

Has the code been extensively tested? Are the results produced by the code valid? Can the code be used by future projects? Is the software portable? Is it robust? Can it be ported to different hardware environments?

There is a great article, “Are we taking supercomputing code seriously?”, at http://www.zdnet.co.uk/news/it-strategy/2010/01/28/are-we-taking-supercomputing-code-seriously-40004192/. That article, in turn, links to other articles on methods and algorithms, and on error behavior, for example. One of them, on scientific software, considers how multi-processing has influenced algorithm development, and the problem of different multi-processors co-existing on the same machine (http://www.scientific-computing.com/features/feature.php?feature_id=262).

The author states that in the rush to do science, scientists fail to spot software for what it is: the analogue of the experimental instrument. The software must therefore be treated with the same respect a physical experiment would be.

When I started my career, I worked on a totally integrated database system for hospitals. It was one of those systems so far ahead of its time (mid-’80s) that a corporation bought the product and squashed it.

Anyway, our Systems and Extensions group supported the 6 compilers that comprised the system software that made the system function. The tailoring group wrote the code that created the screens that drove the system.

At the inception of the system, a decision had to be made about the makeup of the tailoring group: should they be programmers who would be taught medical jargon, terms, etc., or medical personnel (doctors, nurses, techs) who would be taught programming?

The decision was to go with medical personnel, as it was surmised they would understand hospitals better.

At the same time, a decision to limit the number of screens a hospital could request (called tailoring) to 500 was discussed. The decision was to let the hospital have however many screens it wanted.

The tailoring group got their training and set to programming. After a period of time, it was realized that the group had, in essence, created one bad program and copied it thousands of times.

It was so bad that we did two things. First, we created a program profiler that produced a performance summary of the programming aspects of a tailored program. (We were immediately asked to remove it by the tailoring group, as it was too confusing.) Second, we created an automated programming module that generated the code from the display widgets on the screen designed by the tailoring group.

This approach was helping, but people were abandoning ship as talk of an acquisition was surfacing. Our junior programmer went from new-hire to senior team member in 30 days.

I think we would have done a lot better with programmers learning medical terms.

As for the hospital screen limit, we had hospitals with 10,000 individual screens. We should have stuck with 500.

One last thing. When looking at any piece of scientific programming, please realize that the author accreditation usually starts with the PI. The people who did the actual work are generally listed at the end of the line. The PI may have had the idea but, likely as not, could not have coded it.

ASN.1 to XML: The Process

asn2xml

Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.

While working on the filters for the LARTS product, most specifically the GenBank-like report, I realized how tightly coupled the NCBI ASN.1/XML is to the toolkit.

Basically, you’ve got to understand the toolkit code in order to translate what the XML is saying. The infinite extensibility and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the ASN.1 data structures supporting GenBank data: Bioseq-set. For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and translating protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.

After we developed LARTS, I wrote the GB filter as a Java object.  It was an interesting experience. 

NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on  the NCBI toolkit code for proper interpretation.  

The two most glaring examples are listed below.

Sequence Locations

Determining the location of sequence features for a GenBank data report is a prime example. Here are a few simple examples:

primer_bind   order(complement(1..19), 332..350)
gene          complement(join(1560..2030, 3304..3321))
CDS           complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA          join(<1..156, 445..478, 1199..>1559)
primer_bind   order(complement(1..19), 1106..1124)

For segmented sequences:
CDS           join(162922:124..144; 162923:647..889, 1298..1570)

CDS locations have frames; bonds have points (which can be packed); strand “minus” denotes a complement (reverse order); a set of sequence locations for a sequence feature (packed-seqint) denotes a join; locations can be “order”ed or “one-of”; and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.
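As a rough illustration of the translation involved, here is a recursive-descent sketch in Java that flattens a location string such as complement(join(1560..2030, 3304..3321)) into strand-tagged intervals. It is only a sketch under simplifying assumptions: fuzzy bounds are stripped rather than tracked, and points, remote references, and one-of locations are not handled. All names are invented for the example.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a recursive-descent walk over GenBank feature-location
    // strings such as complement(join(1560..2030, 3304..3321)).
    public class LocationSketch {

        public static List<String> intervals(String loc, boolean minus) {
            List<String> out = new ArrayList<>();
            loc = loc.trim();
            if (loc.startsWith("complement(")) {
                out.addAll(intervals(inner(loc), !minus));     // flip strand
            } else if (loc.startsWith("join(") || loc.startsWith("order(")) {
                for (String part : split(inner(loc))) {
                    out.addAll(intervals(part, minus));
                }
            } else {
                // a plain interval like 1560..2030, fuzz markers stripped
                out.add(loc.replaceAll("[<>]", "") + (minus ? " (minus)" : " (plus)"));
            }
            return out;
        }

        // text between the first '(' and the matching final ')'
        private static String inner(String s) {
            return s.substring(s.indexOf('(') + 1, s.lastIndexOf(')'));
        }

        // split on commas that are not nested inside parentheses
        private static List<String> split(String s) {
            List<String> parts = new ArrayList<>();
            int depth = 0, start = 0;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '(') depth++;
                else if (c == ')') depth--;
                else if (c == ',' && depth == 0) {
                    parts.add(s.substring(start, i));
                    start = i + 1;
                }
            }
            parts.add(s.substring(start));
            return parts;
        }

        public static void main(String[] args) {
            // prints [1560..2030 (minus), 3304..3321 (minus)]
            System.out.println(intervals("complement(join(1560..2030, 3304..3321))", false));
        }
    }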

Sequence Format

DNA sequences are stored in a packed 2-bit or 4-bit per-letter format (ncbi2na and ncbi4na). ncbi2na is used if the sequence contains no ambiguity; otherwise ncbi4na is the format of choice. The sequence must be unpacked to be useful, which takes a basic understanding of hex(adecimal).
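Here is a minimal unpacking sketch, assuming the standard ncbi2na mapping (A=0, C=1, G=2, T=3, four bases per byte, high-order bits first). The caller must supply the true residue count, since the final byte of a packed sequence may be only partly used; the class name is invented for the example.

    // Unpack ncbi2na: four bases per byte, two bits each, high bits first.
    // Assumes the standard mapping A=0, C=1, G=2, T=3; the caller must
    // supply the true residue count, since the last byte may be padded.
    public class Ncbi2naSketch {
        private static final char[] BASES = {'A', 'C', 'G', 'T'};

        public static String unpack(byte[] packed, int residues) {
            StringBuilder seq = new StringBuilder(residues);
            for (int i = 0; i < residues; i++) {
                int b = packed[i / 4] & 0xFF;           // byte holding base i
                int shift = (3 - (i % 4)) * 2;          // high-order bases first
                seq.append(BASES[(b >> shift) & 0x3]);  // extract two bits
            }
            return seq.toString();
        }

        public static void main(String[] args) {
            // 0x1B = binary 00 01 10 11 -> prints "ACGT"
            System.out.println(unpack(new byte[]{0x1B}, 4));
        }
    }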
Toolkit

The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file.  (The code is there, but you have to figure out how to compile it into an executable.)

We took the toolkit code and converted it to Java to produce the GenBank-style output format. It differs from the actual NCBI GenBank report in that the LARTS report lists a FASTA-formatted sequence instead of the 10-base-pairs-per-column layout that the NCBI GenBank report produces.

The Many Variations of LARTS

GenBankReportFilter.java is provided as an example with Stand-Alone LARTS.  The LARTS Reader enables the GenBank-style report.

Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.

A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.

Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database.  This includes sample code for  the BLOB and CLOB objects.
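As a rough idea of what such loading code looks like, here is a JDBC sketch that stores one converted XML record as a CLOB. The connection URL, table, and column names are invented for the illustration; they are not taken from the LARTS samples.

    import java.io.Reader;
    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // JDBC sketch: load one converted XML record into a local database as
    // a CLOB. Table and column names here are invented for illustration.
    public class LoadXmlSketch {
        public static void main(String[] args) throws Exception {
            String xml = "<Bioseq-set>...</Bioseq-set>";  // converted record
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/genbank", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO gb_record (accession, record_xml) VALUES (?, ?)")) {
                ps.setString(1, "NM_000000");               // placeholder accession
                Reader r = new StringReader(xml);
                ps.setCharacterStream(2, r, xml.length());  // stream in the CLOB
                ps.executeUpdate();
            }
        }
    }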

The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.
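The idea behind a FASTA filter is simple enough to sketch: emit a “>” definition line, then wrap the sequence at a fixed width. The 70-column width and the names below are assumptions for the example, not necessarily what the LARTS filter uses.

    // Sketch of FASTA formatting: a '>' definition line followed by the
    // sequence wrapped at a fixed width (70 columns is assumed here).
    public class FastaSketch {
        public static String toFasta(String defline, String seq, int width) {
            StringBuilder out = new StringBuilder(">").append(defline).append('\n');
            for (int i = 0; i < seq.length(); i += width) {
                out.append(seq, i, Math.min(i + width, seq.length())).append('\n');
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.print(toFasta("example sequence", "ACGTACGTACGT", 70));
        }
    }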

These options provide ready access to NCBI data for your research.

Interpreting Standards

I found an article in the December 2008 issue of Nature Methods to be of particular interest, not least because I personally know the authors.

The article, under CORRESPONDENCE on page 991, surveyed a series of papers from the 2007 issues of 20 journals. The purpose was to refute the Nature Methods editorial of March 2008, which asserted that the deposition of supporting raw microarray datasets is “routine”. Data cited in the papers was compared to that currently available in public databases. The authors found that the rate of deposition of datasets was less than 50%: only half of the discovery data that was the basis of the articles was available to the public.

They further blamed the MIAME (Minimum Information About a Microarray Experiment) standard, asserting that microarray data, “owing to their highly contextual nature, have a more complex metadata structure than sequence data.”

The MIAME standard was forged by MGED (the Microarray and Gene Expression Data society) and published in Nature Genetics, 29, 365-371 (2001). MGED also houses the Microarray and Gene Expression (MAGE) Object Model, which defines the entire environment of the experiment (e.g. organism, array design, etc.). MIAME is the standard; MAGE adheres to the MIAME standard and suggests formats for representation and submission of microarray data.

The premier microarray data repository is ArrayExpress located at http://www.ebi.ac.uk/microarray-as/ae/.

ArrayExpress is a public repository for transcriptomics data. It is aimed at storing data that is MIAME-compliant, in accordance with MGED recommendations (http://www.mged.org/recommendations), and MINSEQE-compliant for high-throughput data (http://www.mged.org/minseqe/). The ArrayExpress Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository.

Other sites are GEO (Gene Expression Omnibus - http://www.ncbi.nlm.nih.gov/geo/) and CIBEX (Center for Information Biology gene EXpression database - http://cibex.nig.ac.jp/index.jsp).

MIAMExpress (a MIAME-compliant microarray data submission tool) is currently available at http://sourceforge.net/projects/miamexpress/ and is the submission tool for microarray experiments. It is downloaded to your local system and must be compiled into an executable on your system in order to use it. A local installation of the MySQL database is required, as well as the Perl programming language.

The MIAME/MAGE meta-data model is described in UML (Unified Modeling Language). They suggest mark-up languages for data submissions: they provide MAGE-ML, which is defined by an XML DTD. In addition, a spreadsheet-like tabular format (MAGE-TAB) has just been announced.

This data model is difficult to interpret. Fitting your data to this model can be a real trick. I know, I’ve tried, and I’ve got years of work with formal data specifications behind me. For the average lab tech it is almost impossible to interpret. A bioinformatics programmer whose exposure is to MS Word and MS Excel (which I have read are the two most important requirements to succeed in bioinformatics (!)) would be in the same boat.

I have nothing against models and standards.  Standards bring order to chaos — if they are simple enough to interpret and implement.

The article goes on to call for microarray data to be interpreted in the GenBank format.

Just about everybody in the biosciences field is familiar with this format. Most important, they know how to submit data that will be interpreted as GenBank data.

GenBank data is stored internally at NCBI in ASN.1.  The ASN.1 format is extensively used in telecommunications and other areas.  After years of working with ASN.1 and especially NCBI ASN.1, I have to say that it is ideal for the storage of sequence and other data.

ASN.1 is infinitely extensible through its recursive abilities.   This is great in that it can encompass all the data for a particular data object.  However, the nesting nature of the ASN.1 construct can cause one to literally pull out one’s hair. 

ASN.1 doesn’t gracefully translate into SQL. It is possible, but it is not very pretty, and the queries are ridiculously complex.

Using NCBI toolkit code to access ASN.1 data works if one knows C/C++ and has lots of experience working with suites of large, complex software.

Our product (LARTS) was developed to make working with NCBI ASN.1 data a little easier and to create a new paradigm of searching NCBI ASN.1 data.

NCBI ASN.1 was distilled into a grammar that is parsed much like a programming language, or the way a sentence is parsed for English class. That grammar translates the ASN.1 into XML Schema. This XML can then be filtered for specific values or formatted for specific output, such as a GenBank-like report.

The new paradigm means that the serious user should become somewhat familiar with the NCBI ASN.1 data structures. By serious, I mean someone who wants to go beyond the currently offered output formats.

Our ncbixref link (http://www.lifeformulae.com/lartsonline/docs/ncbixref/NCBI-Seqset.html#Bioseq-set) provides a way to traverse these structures, starting with the top-level Bioseq-set.

In some instances, the ASN.1 data structure names don’t really describe the data they define.  For example, the ASN.1 data structure for dbSNP is ExchangeSet (http://www.lifeformulae.com/lartsonline/docs/ncbixref/Docsum-3-0.html#ExchangeSet).

Yet Another Standard

The Genomics Standards Consortium has a suggested format for next-generation sequencing experiments called MIGS (http://gensc.org/gc_wiki/index.php/Main_Page), or minimum information about a genome sequence. Its extension is MIMS (Minimum Information about a Metagenomic Sequence). The MIGS/MIMS data models are expressed in GCDML (Genomic Contextual Data Markup Language - http://gensc.org/gc_wiki/index.php/GCDML), which is implemented using XML Schema.

Let’s hope the meta-data is kept to that “minimum”, but looking at http://www.nature.com/nbt/journal/v26/n5/box/nbt1360_BX1.html, it doesn’t seem so.

At any rate, the move toward XML Schema is a good thing and fits in well with our thinking.

HSEMB Conference and GenBank Release 168.0

Events of particular note this week – 

The HSEMB Conference –

The 26th Annual Houston Conference on Biomedical Engineering Research (http://www.hsemb.org), 19-20 March 2009 at the University of Houston Hilton Hotel and Convention Center.

HSEMB has established the John Halter Award for Professional Achievement in Bioinformatics and Computational Biology.   The late Dr. John Halter is the founder of LifeFormulae, LLC.  http://www.lifeformulae.com/pages/about_jah_memorial.aspx is the link to our memorial to John.

Super Computing 2008

SC08 - Super Computing 2008, the International Conference for High Performance Computing, Networking, Storage and Analysis, November 15-21, Austin Convention Center, Austin, Texas. http://sc08.supercomputing.org/

And GenBank Release 168.0 –

GenBank Release 168.0 flat files require roughly 387 GB for the sequence files only, or 396 GB if you include the ’short directory’, ’index’ and the *.txt files. The ASN.1 data files require approximately 338 GB.

Recent statistics for non-WGS (Whole Genome Sequence), non-CON (Contig)  sequences are given below.

  Release  Date       Base Pairs   Entries

  167      Aug 2008   95033791652  92748599
  168      Oct 2008   97381682336  96400790

Recent statistics for WGS sequences:

  Release  Date       Base Pairs   Entries

  167      Aug 2008  118593509342  40214247
  168      Oct 2008  136085973423  46108952
 

During the 69 days between the close dates for GenBank Releases 167.0 and 168.0, the non-WGS/non-CON portion of GenBank grew by 2,347,890,684 basepairs and by 3,652,191 sequence records.

During that same period, 1,111,311 records were updated. An average of about 69,036 non-WGS/non-CON records were added and/or updated per day.

Between releases 167.0 and 168.0, the WGS component of GenBank grew by 17,492,464,081 basepairs and by 5,894,705 records.

The combined WGS/non-WGS single-release increase of 19.84 Gbp for Release 168.0 is the largest that GenBank has experienced to date.
 

That’s a lot of data. It’s a long, long way from the set of CDs that came out four times a year back in the late ’90s. Somewhere there are drawers and drawers of old Entrez CDs! (Entrez is the engine used to search NCBI Life Sciences data.)

GenBank is considered an archive of information about sequences. The nine-digit GI number, once the unique sequence identifier, has been supplanted by the accession number.

Speaking of NCBI data, we now have the complete set of human (Homo sapiens) data from NCBI’s dbSNP available through our LARTS product. Currently, the files are not searchable by keyword or keyword/element path. This capability should be available early next week.

Which brings me to a question: is GenBank data as important today as it was, say, five years ago? If not GenBank, what NCBI data is considered critical to your current research and bioinformatics methods, and, if I might also ask, what are you doing with it?
