LifeFormulae Blog » Posts for tag 'Standards'

Interpreting Standards

I found an article in the December 2008 issue of Nature Methods to be of particular interest, not least because I personally know the authors.

The article, under CORRESPONDENCE on page 991, surveyed a series of papers from the 2007 issues of 20 journals. Its purpose was to refute the Nature Methods editorial of March 2008, which asserted that the deposition of supporting raw microarray datasets is “routine”. The authors compared the data cited in those papers with what is currently available in public databases and found that the rate of deposition was less than 50%. In other words, fewer than half of the datasets underlying the published findings were available to the public.

They further cited the MIAME (Minimum Information About a Microarray Experiment) standard as being at fault. They assert that microarray data, “owing to their highly contextual nature, have a more complex metadata structure than sequence data.”

The MIAME standard was forged by the MGED (Microarray and Gene Expression Data) Society and published in Nature Genetics 29, 365-371 (2001). MGED also houses the MicroArray and Gene Expression (MAGE) Object Model, which defines the entire environment of the experiment (e.g. organism, array design, etc.). MIAME is the standard; MAGE adheres to it and suggests formats for the representation and submission of microarray data.

The premier microarray data repository is ArrayExpress located at http://www.ebi.ac.uk/microarray-as/ae/.

ArrayExpress is a public repository for transcriptomics data. It is aimed at storing MIAME-compliant data, and MINSEQE-compliant data for high-throughput sequencing (http://www.mged.org/minseqe/), in accordance with MGED recommendations (http://www.mged.org/recommendations). The ArrayExpress Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository.

Other sites are GEO (Gene Expression Omnibus - http://www.ncbi.nlm.nih.gov/geo/) and CIBEX (Center for Information Biology gene EXpression database - http://cibex.nig.ac.jp/index.jsp).

MIAMExpress (a MIAME-compliant microarray data submission tool) is currently available at http://sourceforge.net/projects/miamexpress/ and is the submission tool for microarray experiments. It is downloaded to your local system and must be made executable (compiled) there before you can use it. A local installation of the MySQL database is required, as well as the Perl programming language.

The MIAME/MAGE meta-data model is described in UML (Unified Modeling Language).
MGED suggests mark-up languages for data submissions. They provide MAGE-ML, which is defined by an XML DTD. In addition, a spreadsheet-like tabular format (MAGE-TAB) has just been announced.
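To give a feel for why a spreadsheet-like format is attractive, here is a minimal Python sketch that reads a tab-delimited sample sheet and reports empty required cells. The column headings, sample values, and the `missing_fields` helper are all made up for illustration; they have not been checked against the MAGE-TAB specification.

```python
import csv
import io

# Hypothetical sample-to-file table. Headings and values are invented for
# illustration and are NOT taken from the MAGE-TAB specification.
sheet = (
    "Sample Name\tOrganism\tArray Design\tData File\n"
    "liver_1\tMus musculus\tA-DEMO-1\tliver_1.txt\n"
    "liver_2\tMus musculus\tA-DEMO-1\t\n"
)

# Columns this sketch treats as mandatory (an assumption, not a standard).
REQUIRED = ["Sample Name", "Organism", "Array Design", "Data File"]


def missing_fields(text):
    """Return (row_number, column) pairs for every empty required cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append((row_number, column))
    return problems


problems = missing_fields(sheet)
print(problems)
```

A lab tech can fill in a table like this in any spreadsheet program, and a check of this kind takes a dozen lines, which is a very different proposition from hand-writing XML against a DTD.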

This data model is difficult to interpret. Fitting your data to this model can be a real trick.  I know, I’ve tried. And I’ve got years of work with formal data specifications behind me.  For the average lab tech it is almost impossible to interpret.  A bioinformatics programmer with exposure to MS Word and MS Excel (which I have read are the two most important requirements to succeed in bioinformatics (!)) would be in the same boat. 

I have nothing against models and standards.  Standards bring order to chaos — if they are simple enough to interpret and implement.

The article goes on to call for the interpretation of the microarray data in the GenBank format.

Just about everybody in the biosciences is familiar with this format. Most importantly, they know how to submit data that will be interpreted as GenBank data.

GenBank data is stored internally at NCBI in ASN.1.  The ASN.1 format is extensively used in telecommunications and other areas.  After years of working with ASN.1 and especially NCBI ASN.1, I have to say that it is ideal for the storage of sequence and other data.

ASN.1 is infinitely extensible through its recursive abilities.   This is great in that it can encompass all the data for a particular data object.  However, the nesting nature of the ASN.1 construct can cause one to literally pull out one’s hair. 
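The nesting can be pictured with a small Python sketch. The dictionaries below are a simplified stand-in for a set of sequence entries in which any entry may itself be another set; the field names are loosely modelled on the NCBI structures, and the identifiers are made up.

```python
# Toy stand-in for a nested NCBI-style record. Field names are simplified
# and the identifiers are invented for illustration.
record = {
    "set": {
        "class": "nuc-prot",
        "seq-set": [
            {"seq": {"id": "NUC_0001"}},
            {"set": {
                "class": "segset",
                "seq-set": [
                    {"seq": {"id": "PROT_0001"}},
                    {"seq": {"id": "PROT_0002"}},
                ],
            }},
        ],
    }
}


def walk(entry, depth=0):
    """Yield (depth, id) for every sequence, however deeply it is nested."""
    if "seq" in entry:
        yield depth, entry["seq"]["id"]
    elif "set" in entry:
        # A set holds more entries, each of which may be a set again.
        for child in entry["set"]["seq-set"]:
            yield from walk(child, depth + 1)


found = list(walk(record))
for depth, seq_id in found:
    print("  " * depth + seq_id)
```

The recursion is what makes the format so accommodating, and it is also why any code that reads it has to be prepared for a set inside a set inside a set.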

ASN.1 doesn’t gracefully translate into SQL.  It is possible, but not very pretty and the queries are ridiculously complex.
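As a rough illustration of the mismatch, the Python sketch below stores a tiny nested record in SQLite as parent/child rows and then needs a recursive query just to list the sequences with their nesting depth. The `node` table, its columns, and the identifiers are assumptions made for this example; this is one possible mapping, not how NCBI or any particular product stores the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per node; parent_id points at the enclosing set (NULL at the top).
conn.execute(
    "CREATE TABLE node ("
    "id INTEGER PRIMARY KEY, parent_id INTEGER, kind TEXT, label TEXT)"
)
rows = [
    (1, None, "set", "nuc-prot"),
    (2, 1, "seq", "NUC_0001"),
    (3, 1, "set", "segset"),
    (4, 3, "seq", "PROT_0001"),
    (5, 3, "seq", "PROT_0002"),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# Listing the sequences with their depth already takes a recursive
# common table expression, even for a five-row tree.
query = """
WITH RECURSIVE subtree(id, kind, label, depth) AS (
    SELECT id, kind, label, 0 FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.id, n.kind, n.label, s.depth + 1
    FROM node AS n JOIN subtree AS s ON n.parent_id = s.id
)
SELECT label, depth FROM subtree WHERE kind = 'seq' ORDER BY id
"""
seqs = conn.execute(query).fetchall()
print(seqs)
```

Real records carry dozens of nested types rather than one, so each additional level of structure tends to mean another table and another join.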

Using NCBI toolkit code to access ASN.1 data works if one knows C/C++ and has lots of experience working with suites of large complex software.

Our product (LARTS) was developed to make working with NCBI ASN.1 data a little easier and to create a new paradigm of searching NCBI ASN.1 data.

NCBI ASN.1 was distilled into a grammar that is parsed much like a programming language or the way a sentence is parsed for English class.  That grammar translates the ASN.1 into XML Schema.  This XML can then be filtered for specific values or formatted for specific output such as a GenBank-like report.
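The general idea of parsing with a grammar and then filtering the resulting XML can be sketched in a few lines of Python. This is emphatically not the LARTS grammar: it handles only a toy subset of ASN.1-style text in which every value is named (`field := NAME value`, `value := '{' field (',' field)* '}' | atom`), and the sample record is made up.

```python
import re
import xml.etree.ElementTree as ET

# Tokens: braces and commas, quoted strings, or bare words/numbers.
TOKEN = re.compile(r'\s*(?:(\{|\}|,)|"([^"]*)"|([A-Za-z0-9-]+))')


def tokenize(text):
    """Split the text into ("sym", value) and ("str", value) tokens."""
    text, pos, tokens = text.strip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        if m.group(2) is not None:
            tokens.append(("str", m.group(2)))
        else:
            tokens.append(("sym", m.group(1) or m.group(3)))
        pos = m.end()
    return tokens


def parse_field(tokens, i, parent):
    """Parse one 'NAME value' pair into an XML child; return the next index."""
    elem = ET.SubElement(parent, tokens[i][1])
    i += 1
    if tokens[i] == ("sym", "{"):
        i += 1
        while True:
            i = parse_field(tokens, i, elem)
            if tokens[i] != ("sym", ","):
                break
            i += 1
        if tokens[i] != ("sym", "}"):
            raise ValueError("expected '}'")
        return i + 1
    elem.text = tokens[i][1]  # atom: bare word, number, or quoted string
    return i + 1


sample = 'seq { id { local "demo" } , inst { repr raw , mol dna , length 4 } }'
root = ET.Element("asn1")
parse_field(tokenize(sample), 0, root)

print(ET.tostring(root, encoding="unicode"))
print(root.findtext("seq/inst/mol"))  # filter for one specific value
```

Once the data is in XML, pulling out a single value is a one-line path expression rather than a walk through nested C structures. The sketch does no error recovery and would need considerably more grammar to cope with real NCBI records.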

The new paradigm means that the serious user should become somewhat familiar with the NCBI ASN.1 data structures.  By serious, I mean someone who wants to go beyond the currently offered output formats.

Our ncbixref link (http://www.lifeformulae.com/lartsonline/docs/ncbixref/NCBI-Seqset.html#Bioseq-set) provides a way to traverse these structures, starting with the top-level Bioseq-set.

In some instances, the ASN.1 data structure names don’t really describe the data they define.  For example, the ASN.1 data structure for dbSNP is ExchangeSet (http://www.lifeformulae.com/lartsonline/docs/ncbixref/Docsum-3-0.html#ExchangeSet).

Yet Another Standard

The Genomic Standards Consortium has a suggested format for next-generation sequencing experiments called MIGS (http://gensc.org/gc_wiki/index.php/Main_Page), or Minimum Information about a Genome Sequence. Its extension is MIMS (Minimum Information about a Metagenomic Sequence). The MIGS/MIMS data models are expressed in GCDML (Genomic Contextual Data Markup Language - http://gensc.org/gc_wiki/index.php/GCDML). GCDML is implemented using XML Schema.

Let’s hope the meta-data is kept to that “minimum”, but looking at http://www.nature.com/nbt/journal/v26/n5/box/nbt1360_BX1.html, it doesn’t seem so.

At any rate, the move toward XML Schema is a good thing and fits in well with our thinking.
