LifeFormulae Blog » Posts in 'News' category

The End of Bioinformatics?!

I read with some interest the announcement of Wolfram Alpha.  Wolfram intends it to be the be-all and end-all of data-mining systems, and some say it makes bioinformatics obsolete.

Wolfram’s basis is a formal Mathematica representation.  Its inference engine is a large number of hand-written scripts that access data that has been accumulated and curated.  The developers stress that the system is not Artificial Intelligence and is not aiming to be.  For instance, a sample query,

“List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes ‘muscle’”

could currently be executed with SQL, provided the underlying data is there. 
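As a rough illustration, the query might be expressed in SQL along these lines. This is a minimal sketch in Python over SQLite; every table and column name in it (genes, selection_stats, go_terms, omim_entries) is invented for the example, and any real schema would differ:

    # Hypothetical sketch of the sample query as SQL, embedded in Python.
    # All table and column names are invented for illustration.
    import sqlite3

    QUERY = """
    SELECT DISTINCT g.symbol
      FROM genes g
      JOIN selection_stats s   ON s.gene_id = g.id
      LEFT JOIN go_terms t     ON t.gene_id = g.id
      LEFT JOIN omim_entries o ON o.gene_id = g.id
     WHERE s.lineage = 'human-chimp common ancestor'
       AND s.p_value < 0.05   -- 'significant evidence of positive selection'
       AND (t.category LIKE '%muscle%' OR o.entry LIKE '%muscle%');
    """

    conn = sqlite3.connect("genomes.db")  # hypothetical database file
    for (symbol,) in conn.execute(QUERY):
        print(symbol)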

Wolfram won’t replace bioinformatics.  What it will do is make it easier for a neophyte to get answers to his or her questions because they can be asked in a simpler format.

I would guess Wolfram uses one or more of these scripts to address a specific data set, in conjunction with a natural language parser.  The scripts would move the data into a common model that could then be rendered on a web page.
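Purely as speculation, the architecture might look something like this minimal sketch (the "parser" and all names below are invented placeholders):

    # Speculative sketch: a natural-language parser picks a hand-written,
    # dataset-specific script; every script returns the same common model.

    def gene_script(terms):
        # ...query the curated gene data here...
        return {"title": "Genes", "rows": [("MYH16", "muscle")]}  # placeholder

    SCRIPTS = {"gene": gene_script}  # one entry per curated data set

    def answer(question):
        terms = question.lower().split()  # stand-in for a real NL parser
        for keyword, script in SCRIPTS.items():
            if keyword in terms:
                return script(terms)      # common model: a title plus rows
        return {"title": "No answer", "rows": []}

    print(answer("list gene entries mentioning muscle"))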

But why not AI?  Why not replace all those “hand-written” scripts with a real inference engine?

I rode the first AI wave.  I was a member of the first group of 25 engineers selected to be part of the McAir AI Initiative at McDonnell Aircraft Company (“There is AI in McAir”).  In all, 100 engineers were chosen from engineering departments to attend courses leading to a Certificate in Artificial Intelligence from Washington University in St. Louis.

One of the neat things about the course was the purchase of at least 30 workstations (maybe as many as 60) from a young company called Sun, which were loaned to Washington University for the duration of the course.  Afterwards, we got a few Symbolics machines for our CADD project.

Other than Lisp and Prolog, the software we used was called KEE (Knowledge Engineering Environment).  There was also a DEC (Digital Equipment Corporation) language called OPS5.

The course was quite fast-paced but very extensive.  We had the best AI consultants available at the time lecture and give assignments in epistemology, interviewing techniques, and so on. I had a whole stack of books.

The only problem was that no money was budgeted (or so I was told) for AI development in the departments the engineers returned to, eager to AI everything.  A lot of people left.

Anyway, my group of three developed a “Battle Damage Repair” system that basically “patched up” the composite wing skins of combat aircraft.   Given the size and location of the damage, the system would certify whether the aircraft would be able to return to combat, and would output the size and substance of the patch if the damage wasn’t that bad.
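The decision logic ran along these lines; here is a toy reconstruction (every threshold, location, and material below is invented for illustration, and the real knowledge base was far richer):

    # Toy reconstruction of the battle-damage-repair logic.
    # All numbers, locations, and materials are invented placeholders.

    def assess_damage(size_cm, location):
        critical = {"main spar", "fuel cell"}  # hypothetical no-go locations
        if location in critical or size_cm > 30:
            return "Not field-repairable; aircraft cannot return to combat."
        patch_cm = size_cm * 2  # hypothetical 2x margin around the damage
        material = ("graphite-epoxy laminate" if size_cm > 10
                    else "boron-epoxy doubler")
        return f"Cleared for combat: apply a {patch_cm:.0f} cm {material} patch."

    print(assess_damage(8, "wing skin"))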

One interesting tidbit:  We wanted to present our system at a conference in San Antonio and had a picture of a battle-damaged F-15 we wanted to use.  We were told that the picture was classified and, as such, we couldn’t use it.  Yet at about the same time, a glossy McAir brochure featuring our system and that photo was distributed to thousands of people at the AAAI (American Association for Artificial Intelligence) conference.

Another system I developed dealt with engineering schematics.  These schematics were layered.  Some layers and circuits were classified.  Still another system scheduled aircraft for painting, and yet another charted a path for aircraft through hostile territory, activating electronic countermeasures as necessary.

I guess the most sophisticated system I worked on was with the B-2 program.  The B-2 skin is a composite material.  This material has to be removed from a freezer, molded into its final shape, and cooked in a huge autoclave before it completely thaws.

We had to schedule materials, accounting for the behavior of that material under various circumstances, as well as people and equipment.  The purpose was to avoid “bottlenecks” in people and equipment.  I was exposed to the Texas Instruments Explorer and Smalltalk-80 on an Apple.  I’ve been in love with Smalltalk ever since.

The system was developed, but it was never used.  The problem was that we had to rank workers by expertise.  Those were union workers, and that wasn’t allowed.

It was a nice system that integrated a lot of subsystems and worked well.  Our RFP (Request for Proposals) went out to people like Carnegie Mellon.  We had certain performance and date requirements that we wanted to see in the final system.  We were told that the benchmarks would be difficult, if not impossible, to attain.  Well, we did it, on our own, without their help.

We also had a neural net solution that inspected completed composite parts. The parts were submerged in water and bombarded with sound waves.  The echoes were used by the system to determine part quality.

AI promised the world, and then it couldn’t really deliver.  So it kind of went to the back burner.

One problem with the be-all and end-all:  it will only be as good as your model.  It will only be as good as the developers’ ability to determine the behavior of the parts and how they interact with the whole.  Currently, this is a moving target, changing day to day.  Good luck.

Links -

Will Wolfram Make Bioinformatics Obsolete? - http://johnhawks.net/weblog/reviews/genomics/bioinformatics/wolfram-alpha-bioinformatics-2009.html

Computer Science Wild

(I’m delaying the “horror stories” until next week, because I want to fully document them all.)

I ran across the phrase “computer science wild” at a recent conference.  I’ve got my own thoughts, especially since the list of the top 25 coding errors was released yesterday.  The link to the article is http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9125678&source=NLT_SIT&nlid=91.

I think any programmer should have the opportunity to write software that might kill someone, blow up an extremely expensive piece of equipment, or waste thousands of dollars because the system is down.  Maybe then they would think, write better code, and debug the software thoroughly before releasing it into the wild.

The Ariane 5 rocket blew up on its maiden flight because the software didn’t contain an exception handler for an overflow!  (A 64-bit floating-point value was converted to a 16-bit integer too small to hold it.  A handler would catch the overflow and recover, much the way a full data-capture buffer is handled by switching capture to a second buffer while the full one is written out to I/O.)
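The missing mechanism amounts to a few lines of code.  A minimal sketch in Python follows (the actual flight software was written in Ada; the point is the check, not the language):

    # Checking a narrowing conversion instead of letting it crash the system.
    # Ariane 5's failure involved a 64-bit float forced into a 16-bit field.

    INT16_MAX, INT16_MIN = 32767, -32768

    def to_int16(value: float) -> int:
        converted = int(value)
        if not INT16_MIN <= converted <= INT16_MAX:
            raise OverflowError(f"{value} does not fit in 16 bits")
        return converted

    try:
        bias = to_int16(65871.0)  # out-of-range reading
    except OverflowError:
        bias = INT16_MAX          # degrade gracefully instead of aborting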


The excuse for the disaster was that the specifications didn’t spell out the need for that programming mechanism.   An exception handler is a very basic mechanism for catching and correcting errors.  There is no excuse for this oversight.


One major project I worked on was acoustical (noise) testing of aircraft engines.  Our crew would go to some really great places like Roswell, NM, Moses Lake, WA, or Uvalde, TX.  We would record and analyze the noise of the engines as the aircraft flew over at different altitudes, with variable loads, in various approach patterns.

There were several pieces of software that had to work in tandem.  The airborne system, the ground-based weather station, the meteorological (met) plane, the acoustic data analyzer, and the analysis station all had to work together to get the required results.

There was no room for error.  Measurements had to be exact, even out to 16 places after the decimal point.
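Sixteen places after the decimal point is right at the edge of what 64-bit floating point can carry (roughly 15 to 17 significant decimal digits), as a quick check shows:

    # 64-bit IEEE doubles hold only ~15-17 significant decimal digits,
    # so digits beyond roughly the 16th place are not faithfully stored.
    x = 0.1234567890123456789
    print(f"{x:.19f}")  # the printed digits drift from the literal above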


Modeling techniques, programming languages and IDEs (Integrated Development Environments) have become very sophisticated and complex.  A programmer today can “gee whiz” just about anything.

“Because we can” has become the norm.


This is great, but I’ve run into lab techs and others who are just this side of computer illiterate.  Like my dad, they adhere to a limited number of computer applications, accessed by a few keystrokes or mouse clicks they have memorized.


And don’t think that engineers are immune.  They had to be dragged “kicking and screaming” away from their slide rules.


I’m for simple to start.  You can always add more “bells and whistles” as the system (and its users) matures.

Interpreting Standards

I found an article in the December 2008 issue of Nature Methods to be of particular interest, not least because I personally know the authors.

The article, under CORRESPONDENCE on page 991, surveyed a series of papers from the 2007 issues of 20 journals.  The purpose was to refute the Nature Methods editorial of March 2008, which asserted that the deposition of supporting raw microarray datasets is “routine”.  Data cited in the articles were compared to what is currently available in public databases.  The authors found that the rate of deposition of datasets was less than 50%: only half of the discovery data on which the articles were based was available to the public.

They further blamed the MIAME (Minimum Information About a Microarray Experiment) standard, asserting that microarray data, “owing to their highly contextual nature, have a more complex metadata structure than sequence data.”

The MIAME standard was forged by the MGED (Microarray Gene Expression Data) Society and published in Nature Genetics 29, 365-371 (2001).  MGED also houses the MicroArray Gene Expression (MAGE) Object Model, which defines the entire environment of the experiment (e.g., organism, array design, etc.).  MIAME is the standard; MAGE adheres to it and suggests formats for representing and submitting microarray data.

The premier microarray data repository is ArrayExpress located at http://www.ebi.ac.uk/microarray-as/ae/.

ArrayExpress is a public repository for transcriptomics data, aimed at storing MIAME-compliant data in accordance with MGED recommendations (http://www.mged.org/recommendations) and MINSEQE-compliant data for high-throughput sequencing (http://www.mged.org/minseqe/).  The ArrayExpress Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository.

Other sites are GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) and CIBEX (Center for Information Biology gene EXpression database, http://cibex.nig.ac.jp/index.jsp).

MIAMExpress (a MIAME-compliant microarray data submission tool) is currently available at http://sourceforge.net/projects/miamexpress/ and is the submission tool for microarray experiments.  It is downloaded to your local system and must be built and made executable on your system in order to use it.  A local installation of the MySQL database is required, as well as the Perl programming language.

The MIAME/MAGE meta-data model is described in UML (Unified Modeling Language).  For data submissions they suggest mark-up languages: MAGE-ML, which is defined by an XML DTD, and, just announced, a spreadsheet-like tabular format called MAGE-TAB.

This data model is difficult to interpret. Fitting your data to this model can be a real trick.  I know, I’ve tried. And I’ve got years of work with formal data specifications behind me.  For the average lab tech it is almost impossible to interpret.  A bioinformatics programmer with exposure to MS Word and MS Excel (which I have read are the two most important requirements to succeed in bioinformatics (!)) would be in the same boat. 

I have nothing against models and standards.  Standards bring order to chaos — if they are simple enough to interpret and implement.

The article goes on to call for microarray data to be represented in the GenBank format.

Just about everybody in the biosciences field is familiar with this format.  Most important, they know how to submit data that will be interpreted as GenBank data.

GenBank data is stored internally at NCBI in ASN.1.  The ASN.1 format is extensively used in telecommunications and other areas.  After years of working with ASN.1 and especially NCBI ASN.1, I have to say that it is ideal for the storage of sequence and other data.

ASN.1 is infinitely extensible through its recursive abilities.   This is great in that it can encompass all the data for a particular data object.  However, the nesting nature of the ASN.1 construct can cause one to literally pull out one’s hair. 
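A schematic, heavily abbreviated fragment in the NCBI ASN.1 value notation gives the flavor of that nesting (the identifiers and values here are invented, and real records nest much deeper):

    Seq-entry ::= seq {
      id {
        local str "example-seq" } ,
      descr {
        title "a hypothetical sequence" } ,
      inst {
        repr raw ,
        mol dna ,
        length 12 ,
        seq-data iupacna "ACGTACGTACGT" } }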

ASN.1 doesn’t gracefully translate into SQL.  It is possible, but not very pretty, and the queries are ridiculously complex.

Using NCBI toolkit code to access ASN.1 data works if one knows C/C++ and has lots of experience working with suites of large complex software.

Our product (LARTS) was developed to make working with NCBI ASN.1 data a little easier and to create a new paradigm of searching NCBI ASN.1 data.

NCBI ASN.1 was distilled into a grammar that is parsed much like a programming language, or the way a sentence is parsed for English class.  That grammar translates the ASN.1 into XML Schema.  The XML can then be filtered for specific values or formatted for specific output, such as a GenBank-like report.
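For instance, the schematic ASN.1 fragment shown earlier might emerge as XML along these lines (the element names here are illustrative, not the exact output of LARTS):

    <Seq-entry>
      <Bioseq>
        <Seq-id><local><str>example-seq</str></local></Seq-id>
        <Seq-descr><title>a hypothetical sequence</title></Seq-descr>
        <Seq-inst>
          <repr>raw</repr>
          <mol>dna</mol>
          <length>12</length>
          <seq-data><iupacna>ACGTACGTACGT</iupacna></seq-data>
        </Seq-inst>
      </Bioseq>
    </Seq-entry>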

The new paradigm means that the serious user should become somewhat familiar with the NCBI ASN.1 data structures.  By serious, I mean someone who wants to go beyond the currently offered output formats.

Our ncbixref link (http://www.lifeformulae.com/lartsonline/docs/ncbixref/NCBI-Seqset.html#Bioseq-set) provides a way to traverse these structures, starting with the top-level Bioseq-set.

In some instances, the ASN.1 data structure names don’t really describe the data they define.  For example, the ASN.1 data structure for dbSNP is ExchangeSet (http://www.lifeformulae.com/lartsonline/docs/ncbixref/Docsum-3-0.html#ExchangeSet).

Yet Another Standard

The Genomic Standards Consortium has a suggested format for next-generation sequencing experiments called MIGS (http://gensc.org/gc_wiki/index.php/Main_Page), or minimum information about a genome sequence.  Its extension is MIMS, Minimum Information about a Metagenomic Sequence.  The MIGS/MIMS data models are expressed in GCDML, the Genomic Contextual Data Markup Language (http://gensc.org/gc_wiki/index.php/GCDML).  GCDML is implemented using XML Schema.

Let’s hope the meta-data is kept to that “minimum”, but looking at http://www.nature.com/nbt/journal/v26/n5/box/nbt1360_BX1.html, it doesn’t seem so.

At any rate, the move toward XML Schema is a good thing and fits in well with our thinking.

HSEMB Conference and GenBank Release 168.0

Events of particular note this week – 

The HSEMB Conference –

The 26th Annual Houston Conference on Biomedical Engineering Research (http://www.hsemb.org), 19-20 March 2009, at the University of Houston Hilton Hotel and Convention Center.

HSEMB has established the John Halter Award for Professional Achievement in Bioinformatics and Computational Biology.  The late Dr. John Halter was the founder of LifeFormulae, LLC.  Our memorial to John is at http://www.lifeformulae.com/pages/about_jah_memorial.aspx.

Super Computing 2008

SC08 - Super Computing 2008, the International Conference for High Performance Computing, Networking, Storage and Analysis.  November 15-21, Austin Convention Center, Austin, Texas.  http://sc08.supercomputing.org/

And GenBank Release 168.0 –

GenBank Release 168.0 flat files require roughly 387 GB for the sequence files only, or 396 GB including the ’short directory’, ‘index’ and the *.txt files.  The ASN.1 data files require approximately 338 GB.

Recent statistics for non-WGS (Whole Genome Sequence), non-CON (Contig)  sequences are given below.

  Release  Date       Base Pairs   Entries

  167      Aug 2008   95033791652  92748599
  168      Oct 2008   97381682336  96400790

Recent statistics for WGS sequences:

  Release  Date       Base Pairs   Entries

  167      Aug 2008  118593509342  40214247
  168      Oct 2008  136085973423  46108952

During the 69 days between the close dates for GenBank Releases 167.0 and 168.0, the non-WGS/non-CON portion of GenBank grew by 2,347,890,684 basepairs and by 3,652,191 sequence records.

During that same period, 1,111,311 records were updated.  An average of about 69,036 non-WGS/non-CON records were added and/or updated per day.

Between releases 167.0 and 168.0, the WGS component of GenBank grew by 17,492,464,081 basepairs and by 5,894,705 records.

The combined WGS/non-WGS single-release increase of 19.84 Gbp for Release 168.0 is the largest that GenBank has experienced, to date.
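Those growth figures are easy to verify from the release tables above:

    # Sanity-checking the GenBank 167.0 -> 168.0 growth figures.
    non_wgs_bp  = 97381682336 - 95033791652    # 2,347,890,684 basepairs
    non_wgs_rec = 96400790 - 92748599          # 3,652,191 records
    wgs_bp      = 136085973423 - 118593509342  # 17,492,464,081 basepairs
    wgs_rec     = 46108952 - 40214247          # 5,894,705 records

    combined_gbp = (non_wgs_bp + wgs_bp) / 1e9  # ~19.84 Gbp
    per_day = (non_wgs_rec + 1111311) / 69      # added + updated, ~69,036/day

    print(f"{combined_gbp:.2f} Gbp combined; about {per_day:,.0f} records/day")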

That’s a lot of data.  It’s a long, long way from the set of CDs that came out 4 times a year back in the late ’90s.  Somewhere there are drawers and drawers of old Entrez CDs!  (Entrez is the engine used to search NCBI Life Sciences data.)

GenBank is considered an archive of information about sequences.  The nine-digit GI number, once the unique sequence identifier, has been supplanted by the Accession Number.

Speaking of NCBI data, we now have the complete set of human (Homo sapiens) data from NCBI’s dbSNP available through our LARTS product.  Currently, the files are not searchable by keyword or keyword/element path.  This capability should be available early next week.

Which brings me to a question:  Is GenBank data as important today as it was, say, 5 years ago?  If not GenBank, what NCBI data is considered critical to your current research and bioinformatics methods, and, if I might also ask, what are you doing with it?

It’s official: the LifeFormulae Blog!

Welcome to LifeFormulae’s official Blog site. Thank you for checking us out. Feel free to post comments for us, including any topics you would like us to cover. The purpose of this blog is to bring current events within the life sciences and bioinformatics communities to the forefront of our thoughts, to stay up-to-date on what’s going on in the research community, and to create a forum of discussion about the ever-changing environment to which we, as researchers, have become accustomed.

As you may know, Cambridge Healthtech Institute’s Data-Driven Discovery Summit 2008 was held in Rhode Island at the end of September. We had so many great conversations and were introduced to so many great people, we wanted to make sure those conversations continued. There were so many questions that covered diverse topics, we couldn’t find room on our website to answer them all comprehensively. So, we decided that we wanted a built-in community to foster communication on any topic related to bioinformatics, or any sub-topic beyond that.

We all read the industry newsletters and follow the latest publications when we get to them, but we want you to be able to ask questions about the topic, share feedback, let people know how it’s affecting you, vent, enlighten, inquire, observe, remark, express yourself.

At LifeFormulae, we have some everyday people who have been in the business for over 20 years. We think you might like what they have to say. If you don’t, just let us know. Nothing would make us happier. We plan to have alternating bloggers, as well as a few guest bloggers from time to time (let us know if you’re interested). We’ll try to keep it interesting (and pertinent), but remember that feedback always helps!

Talk to you soon!

The LifeFormulae staff
