LifeFormulae Blog » Posts for tag 'scientific programming'

Effective Bioinformatics Programming - Part 5

First, a little irony. In the late ’90s I interviewed with BMC Software in Houston. At that time, BMC was a supporter of big iron, providing report facilities, etc.

When asked what software I currently used, I replied with “GNU software”. The interviewer asked, “What is GNU? I’ve never heard of it.”

I explained that it was free software that you could download from the web, etc. But they weren’t really interested.

Anyway, a feature ran this week: ‘MindTouch Names 20 Most Powerful Open-Source Voices of 2010’. The first name mentioned was William Hurley, the chief architect of open source strategy at BMC.

I guess they’re interested now.

Data Standards

There are any number of sequence data formats. A page at EBI describes several.

What is really astounding is that most of these formats have remained the same over the years. The tab-delimited and CSV (comma-separated values) formats are as widespread as ever, as is the GenBank report.

And equally astonishing is the fact that manipulating the data (e.g. parsing GenBank reports) is still the same.

True, Bio libraries such as BioPerl, BioJava, and BioRuby now provide modules that make this easier (if you can install them), but it is still the same old download and parse.
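To give a feel for what "download and parse" means in practice, here is a minimal sketch of a hand-rolled GenBank flat-file reader using only the Python standard library. It pulls out just the locus name, the definition, and the sequence, and it ignores features and every other section. The sample record and the function name `parse_genbank` are made up for illustration; a real project would more likely lean on one of the Bio libraries.

```python
import io

# Hypothetical demo record, invented for this example (not a real GenBank entry).
SAMPLE = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2010
DEFINITION  Hypothetical demo record used only to
            illustrate parsing.
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""


def parse_genbank(handle):
    """Yield one dict per record with its locus, definition and sequence."""
    record = None
    section = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("LOCUS"):
            # A LOCUS line opens a new record; the name is the second token.
            record = {"locus": line.split()[1], "definition": "", "sequence": []}
            section = "LOCUS"
        elif record is None:
            continue  # skip anything that appears before the first LOCUS line
        elif line.startswith("//"):
            # "//" terminates the record.
            record["sequence"] = "".join(record["sequence"])
            record["definition"] = record["definition"].strip()
            yield record
            record, section = None, None
        elif line.startswith("DEFINITION"):
            record["definition"] = line[12:].strip()
            section = "DEFINITION"
        elif line.startswith("ORIGIN"):
            section = "ORIGIN"
        elif line[:1] != " ":
            # Any other top-level keyword starts a section we do not handle.
            tokens = line.split()
            section = tokens[0] if tokens else None
        elif section == "DEFINITION":
            # Indented lines continue the definition text.
            record["definition"] += " " + line.strip()
        elif section == "ORIGIN":
            # Sequence lines: a position number, then blocks of up to 10 bases.
            record["sequence"].append("".join(line.split()[1:]))


records = list(parse_genbank(io.StringIO(SAMPLE)))
```

Even this small amount of parsing depends on column positions and indentation, which is exactly why the report format has been so sticky: every lab has code like this lying around.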

There are also several groups trying to standardize sequence data. The SO (Sequence Ontology) group is trying to do for sequence annotations what GO (Gene Ontology) did for genes and gene product attributes.

MIGS (the Minimum Information About a Genome Sequence spec) is following the course of the MAGE MIAME standard (Minimum Information About a Microarray Experiment). Good luck with that, as many scientists have openly voiced objections to that standard.

XML and the Web

XML (eXtensible Markup Language) and WSDL (Web Services Description Language) offer one way of easing the interchange of data.

There are a number of drawbacks to this setup.

Not all of the sequence data is available in XML or well-formed XML.

Some XML, such as NCBI XML, needs further interpretation. For example, the sequence feature (annotation) locations must be “translated” for further use.

XSLT has performance issues, and is size-limited. We tried processing LARTS-converted NCBI ASN.1 GenBank XML data with XSLT and found there were definite size limitations.

Using WSDL means exposing yourself to the world via the web.

JavaScript raises too many security questions to be considered seriously.
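As an illustration of the "translation" of feature locations mentioned above, here is a minimal sketch. It assumes the location arrives as an INSDC-style string such as `complement(join(12..78,134..202))` and converts it into 0-based, half-open coordinates that can be used directly for slicing. The function name `translate_location` is hypothetical, and the sketch deliberately covers only plain spans, a single `join`, and an outer `complement`; anything else (remote references, `order`, between-base positions) is rejected rather than guessed at.

```python
import re

# One span: an optional partial marker (< or >), a start, and an optional "..end".
_SPAN = re.compile(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?")


def translate_location(location):
    """Turn a location string into a list of (start, end, strand) tuples.

    Coordinates are returned 0-based and half-open, so seq[start:end]
    slices the span. Strand is 1 for forward and -1 for complement.
    """
    text = location.replace(" ", "")
    strand = 1
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]

    spans = []
    for part in text.split(","):
        match = _SPAN.fullmatch(part)
        if match is None:
            # Nested operators, remote accessions, etc. are out of scope here.
            raise ValueError(f"unsupported location: {part!r}")
        first = int(match.group(1))
        # A single-base location like "42" has no end, so reuse the start.
        last = int(match.group(2) or match.group(1))
        spans.append((first - 1, last, strand))
    return spans


example = translate_location("complement(join(12..78,134..202))")
```

Note that the partial markers are simply dropped, and that spans on the complement strand are returned in the order written; a caller extracting sequence would still need to reverse-complement the result.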

Software Development

Software development takes time and the right people. True, there is a lot of open source software out there, but I’ve mentioned the perils of that method in a previous post.

A scientist with a grant to produce results that depend on computer analysis is only going to write code that is good enough to back up those findings (or find someone, read: a post-doc, who can write that code very cheaply).

Has the code been extensively tested? Are the results produced by the code valid? Can the code be used by future projects? Is the software portable? Is it robust? Can it be ported to different hardware environments?

There is a great article, “Are we taking supercomputing code seriously?”. This article, in turn, has links to other articles on methods and algorithms, and on error behavior, for example. One on scientific software considers how multi-processing has influenced algorithm development and the problem of different multi-processors co-existing on the same machine.

The author states that in the rush to do science, scientists fail to see software for what it is: the analogue of the experimental instrument. Therefore the software must be treated with the same respect that a physical experiment would be.

When I started my career, I worked on a totally integrated database system for hospitals. It was one of those systems that was so far ahead of its time (mid-’80s) that a corporation bought the product and squashed it.

Anyway, our Systems and Extensions group supported the 6 compilers that comprised the system software that made the system function. The tailoring group wrote the code that created the screens that drove the system.

At the inception of the system, a decision had to be made about the makeup of the tailoring group: should they be programmers who would be taught medical jargon, terms, etc., or should they be medical personnel (doctors, nurses, techs) who would be taught programming?

The decision was to go with medical personnel, as it was surmised they would understand hospitals better.

At the same time, a decision to limit the number of screens a hospital could request (called tailoring) to 500 was discussed. The decision was to let the hospital have however many screens it wanted.

The tailoring group got their training and set about programming. After a period of time, it was realized that the group had, in essence, created one bad program and copied it thousands of times.

It was so bad, we did two things. First, we created a program profiler that produced a performance summary of the programming aspects of a given program. (The tailoring group immediately asked us to remove it, as it was too confusing.) Second, we created an automated programming module that would generate the code from the display widgets on the screen designed by the tailoring group.

This approach was helping, but people were abandoning ship as talk of an acquisition was surfacing. Our junior programmer went from new-hire to senior team member in 30 days.

I think we would have done a lot better with programmers learning medical terms.

As for the hospital screen limit, we had hospitals with 10,000 individual screens. We should have stuck with 500.

One last thing. When looking at any piece of scientific programming, please realize that the author list usually starts with the PI. The people who did the actual work are generally listed at the end of the line. The PI may have had the idea, but likely as not could not code it.
