Effective Bioinformatics Programming – Part 2
Instrumentation Programming
Instrumentation Programming usually concerns computer control over the actions of an instrument and/or the streaming or download of data from the device. Instrumentation in the Life Sciences covers data loggers, waveform data acquisition systems, pulse generators, image capture, and others used extensively in LIMS (Laboratory Information Management Systems), Spectroscopy, and other scientific arenas.
Most instruments are controlled by codes called “control codes”. These codes are usually sent or received by a C/C++ program. Some instrumentation manufacturers, however, have a proprietary programming language that must be used to “talk” to the instrument.
Some companies are nice enough to provide information on the structure of the data that comes from their instrument. When they don’t, you may have to use good old “reverse engineering”. That’s where the Unix/Linux od utility comes in handy, because lots of time will be spent poring over hex dumps.
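As a rough illustration of that hex-dump workflow, here is a minimal Python sketch that prints bytes in an od-like layout and then unpacks them once a layout has been guessed. The record layout (a 4-byte tag, a 2-byte count, a 4-byte float) and the names hex_dump and parse_record are invented for this example; a real instrument's format will differ.

```python
import struct


def hex_dump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII, `width` bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  |{text_part}|")
    return "\n".join(rows)


# Hypothetical layout, assumed for illustration only:
# 4-byte ASCII tag, 2-byte unsigned count, 4-byte float, all little-endian.
RECORD = struct.Struct("<4sHf")


def parse_record(data):
    """Unpack one record from the start of `data` using the assumed layout."""
    tag, count, reading = RECORD.unpack_from(data)
    return {"tag": tag.decode("ascii"), "count": count, "reading": reading}


if __name__ == "__main__":
    sample = RECORD.pack(b"SPEC", 3, 1.5)  # stand-in for bytes read from a device
    print(hex_dump(sample))
    print(parse_record(sample))
```

The dump is for the human doing the guessing; the struct format string is where each guess about field sizes and byte order gets written down and tested.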
As you can tell, programming instruments requires a lot of patience. This is especially true if everything hangs or gets into a confused state. There is nothing you can do but cycle the power on everything and start over. This is usually accompanied by a banging of keyboards and the muttering of a few choice words.
Development Platforms or IDEs (Integrated Development Environments)
I have to mention development platforms as they can be useful, but also problematic. My favorite is Eclipse (http://www.eclipse.org). Originating at IBM, Eclipse was supported by a consortium of software vendors. Eclipse has now become the Eclipse open source community, supported by the Eclipse Foundation.
Eclipse is a development platform for programmers composed of extensible frameworks, tools and runtimes for building, deploying and managing software across the lifecycle. You can find plug-ins that will enable you to accomplish just about anything you want to do. A plug-in is an addition to the Eclipse platform that is not included in the base package, like an Eclipse memory manager or a tool for debugging a Tomcat servlet.
Sun offers NetBeans (“The only IDE you need.”). I used NetBeans (http://netbeans.org) a lot on the Mac. Previously, Sun offered StudioOne and Creator. I used StudioOne (on Unix) and Creator (on Linux). I haven’t worked with NetBeans lately because it is currently geared mostly toward Swing (GUI) development and is not fully JSF (JavaServer Faces) aware. NetBeans will make a template for JSF but doesn’t (as yet) provide an easy way to create a JSF interface.
There are two main problems with development platforms. For one, the learning curve is fairly steep. There are a lot of tutorials and examples available, but you still have to take the time to work through them.
The best way to use a development platform is to divide the work. One group does web content, one group does database, one group does middleware (the glue that holds everything together), etc. Each group or person can then become knowledgeable in their area and move on or absorb other areas as needed.
The second problem with these tools is that you are stuck with their developmental approach.
You have to do things a certain way and adhere to a certain structure. Flexibility can be a problem.
This is especially true of interface building. You are stuck with the code the tool generates and the files and file structures created. With most tools, you have to use that tool to access files that the tool created.
IDEs can be useful in that they will perform mundane coding tasks for you. For instance, given a database record, the IDE can use the fields of that record to generate web forms and the SQL queries driving those forms. You can then expand the simple framework or leave it as is.
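To give a feel for what that kind of generation amounts to, here is a minimal Python sketch that turns a list of column names into a SELECT statement and a bare HTML form. It is not how Eclipse or NetBeans do it; the function names and the samples table with its columns are hypothetical, and the table and column names are assumed to come from a trusted schema rather than from user input.

```python
from html import escape


def build_select(table, columns, key):
    """Return a parameterized SELECT that fetches one row by `key`.

    Identifiers are interpolated directly, so they must come from a
    trusted schema definition; only the key value is a bound parameter.
    """
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {table} WHERE {key} = ?"


def build_form(table, columns):
    """Return a plain HTML form with one text input per column."""
    lines = [f'<form method="post" action="/{escape(table)}">']
    for column in columns:
        name = escape(column, quote=True)
        lines.append(f'  <label>{name} <input type="text" name="{name}"></label>')
    lines.append('  <input type="submit" value="Save">')
    lines.append("</form>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical table and columns, used only for this example.
    columns = ["sample_id", "organism", "collected_on"]
    print(build_select("samples", columns, key="sample_id"))
    print(build_form("samples", columns))
```

The generated output is the "simple framework" mentioned above: a starting point to extend by hand, or to leave alone if it already does the job.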
Open Source/Free Software and Bioinformatics Libraries
There is a lot of good and not-so-good Open Source code out there for the Life Sciences.
There are several “gotchas” to look out for, including:
Is the code reliable? Are others using it? Are they having problems?
Will the code run on your architecture? What will it take to install it?
What kind of user support is available? What’s the response time?
Is there a mailing list available for the library, package, or project of interest?
There are several bioinformatics software libraries available for various languages. All of these libraries are Open Source/Free Software. Installing these libraries takes a little more than just downloading and uncompressing a package. There are “dependencies” (other libraries, modules, programs, and access to external sites) that must be resident or accessible before a complete build of these libraries is possible.
The following is a list of the most popular libraries and their respective dependencies.
BioPerl 1.6.1: Modules section of http://www.cpan.org/
Required modules:
perl => 5.6.1
IO::String => 0
DB_File => 0
Data::Stag => 0.11
Scalar::Util => 0
ExtUtils::Manifest => 1.52
Required modules for source build:
Test::More => 0
Module::Build => 0.2805
Test::Harness => 2.62
CPAN => 1.81
Recommended modules: some of these have circular dependencies
Ace => 0
Algorithm::Munkres => 0
Array::Compare => 0
Bio::ASN1::EntrezGene => 0
Clone => 0
Convert::Binary::C => 0
Graph => 0
GraphViz => 0
HTML::Entities => 0
HTML::HeadParser => 3
HTTP::Request::Common => 0
List::MoreUtils => 0
LWP::UserAgent => 0
Math::Random => 0
PostScript::TextBlock => 0
Set::Scalar => 0
SOAP::Lite => 0
Spreadsheet::ParseExcel => 0
Spreadsheet::WriteExcel => 0
Storable => 2.05
SVG => 2.26
SVG::Graph => 0.01
Text::ParseWords => 0
URI::Escape => 0
XML::Parser => 0
XML::Parser::PerlSAX => 0
XML::SAX => 0.15
XML::SAX::Writer => 0
XML::Simple => 0
XML::Twig => 0
XML::Writer => 0.4
Some of these modules, such as SOAP::Lite, depend upon many other modules.
BioPython 1.53: http://biopython.org/
Additional packages:
NumPy (recommended) http://numpy.scipy.org/
ReportLab (optional) http://www.reportlab.com/software/opensource/
MySQLdb (optional) Not part of the core Python distribution; install it separately.
BioRuby 1.4.0: http://www.bioruby.org/
The base distribution is self-contained and uses the RubyGems installer.
Optional packages.
RAA:xmlparser
RAA:bdb
RubyForge:ActiveRecord and at least one driver (or adapter) from RubyForge:MySQL/Ruby, RubyForge:postgres-pr, or RubyForge:ActiveRecord Oracle enhanced adapter.
RubyForge:libxml-ruby (Ruby language bindings for the GNOME Libxml2 XML toolkit)
BioJava 1.7.1: http://www.biojava.org/
biojava-1.7.1-all.jar: self-contained binary distribution with all dependencies included.
biojava-1.7.1.jar: bare distribution that requires the following additional jar files. These are required for building from source code. Most are from http://www.apache.org/
bytecode.jar: required to run BioJava
commons-cli.jar: used by some demos.
commons-collections-2.1.jar: demos, BioSQL Access
commons-dbcp-1.1.jar: legacy BioSQL access
commons-pool-1.1.jar: legacy BioSQL access
jgraph-jdk1.5.jar: NEXUS file parsing
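Before starting a build, it can save time to check which dependencies are already resident. For the BioPython entries in the list above, a minimal sketch in Python might look like the following. The function name missing_modules is made up for this example; the import names Bio, numpy, reportlab, and MySQLdb are the ones those packages install under.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for BioPython and its recommended and optional packages.
    wanted = ["Bio", "numpy", "reportlab", "MySQLdb"]
    missing = missing_modules(wanted)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

This only checks that a module can be found, not that its version is new enough, so it is a first pass rather than a full dependency check.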
Don’t forget to sign up for the mailing list for the library or libraries of interest to get the latest news, problems, solutions, etc. for that library or just life science topics in general.
Software Hosting and Indexing Sites
There are several Software Hosting and Indexing Sites that serve as software distribution points for bioinformatics software.
SourceForge.net – Search on bioinformatics for a list of software available. Projects include: MIAMExpress - http://sourceforge.net/projects/miamexpress/
freshmeat – The Web’s largest index of Unix and cross-platform software
Bioinformatics Organization – The Open Access Institute
Open Bioinformatics Foundation (O|B|F) - Hosts Many Open Bioinformatics Projects
Public Domain Manifesto
In this time of curtailment of civil rights, the Public Domain Manifesto seems appropriate (http://www.publicdomainmanifesto.org/node/8). Sign the petition while you’re there.
This is the end of Part 2. Part 3 will explore more software skills, project management, and other computational topics.
Programming Practices
I’ve been privy to all sorts of coding adventures. I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure. I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun. I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.
There were days when I would have killed for a well-documented program. Instead, I had to spend weeks in debug, poking around, learning how things worked. In both cases, the developers of said projects were no longer available for consultation.
Here are a few programming practices that I’ve tried to adhere to when writing code.
Document, Document, Document
I am a big proponent of Javadoc. Javadoc is a tool for generating API documentation in HTML format from doc comments in source code. It can be downloaded only as part of the Java 2 SDK. To see documentation generated by the Javadoc tool, go to the J2SE 1.5.0 API Documentation at http://java.sun.com/j2se/1.5.0/docs/api/index.html. Go to “How to Write Doc Comments for the Javadoc Tool” at http://java.sun.com/j2se/javadoc/writingdoccomments for more information.
Other languages have similar markup languages for source code documentation.
Perl has perlpod - http://perldoc.perl.org/perlpod.html
Python has pydoc - http://docs.python.org/library/pydoc.html
Ruby has RDoc - http://rdoc.sourceforge.net
I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish. It’s also helpful to list the version number of the development language. List any dependencies such as support modules that are not part of the main install that were downloaded for the application.
Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
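Here is a minimal sketch of that header-plus-doc-comment habit, written in Python so that pydoc can pick it up. The file name gc_report.py, the function gc_content, and the placeholder header fields are invented for this example; fill in the real details for your own program.

```python
"""
gc_report.py: report the GC content of DNA sequences.

Author:        (developer name)
Date:          (date written)
Written for:   (requesting person or group)
Purpose:       (why the program exists and what it should accomplish)
Language:      Python 3 (list the exact version used)
Dependencies:  none outside the standard library
"""


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Parameters:
        sequence (str): DNA bases; upper or lower case is accepted.

    Returns:
        float: a value between 0.0 and 1.0.

    Raises:
        ValueError: if the sequence is empty.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    bases = sequence.upper()
    return (bases.count("G") + bases.count("C")) / len(bases)


if __name__ == "__main__":
    print(gc_content("ATGC"))  # prints 0.5
```

If the file is saved as gc_report.py, running python -m pydoc gc_report displays the header and the function documentation without anyone having to open the source.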
Coding practices
I think all code should read like a book. Beyond that:
- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.
I’ve mostly found that peer review can be a great time-waster.
There are several design methodologies, such as Extreme Programming, that are the flavor du jour. None have been completely successful in producing perfect software.
To-Do Lists - Use them!
There are project management and other tools available for this, but a plain text To-Do file that lists system extensions, enhancements, and fixes is a good thing. It’s simple: you don’t need access to a complex piece of software or have to learn how to navigate one.
Find Out Who Knows What and GO ASK THEM
I worked on a project and shared a cube with a guy named Al. Al was not the most pleasant person (he was the resident curmudgeon), but we got along. Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written. If I couldn’t understand something, a brief conversation with Al was all I needed.
Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al. However, I ended up as the “Go Ask Al” person. I didn’t mind, as we became the top development group in that environment.
Use Code Repositories
The determination of which code repository to use - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue. Each of these has its pros and cons. JUST USE ONE!
Integrated Development Environments (IDE)
There are several of these available. My favorite is Eclipse (www.eclipse.org). I’ve also used SunOne, which became Creator, which was rolled into NetBeans, which was out there all along.
There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment. The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.
I’ve used Visual Studio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.
The problem with most IDEs is that they are complex and it takes a while for a programmer to become adept at using them. You’re also locked into that development methodology, which may prove inflexible, depending on the applications under development.
We used EMACS on Linux for all LARTS development (LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi. I can do things faster in vi, mainly because I’ve used it for so long.
Which Language?
My favorite language of all time is Smalltalk. If things had worked out, we would all be doing Smalltalk instead of Java.
Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic. Spot programming used infrequently is okay. However, if everything you are doing is panic programming, your department needs to re-think its software development practices.
Lately, I’ve been working in Java. Java is powerful, but it also has its drawbacks.
We will always have C. According to slashdot.org, most open source projects submitted in 2008 were in C, and this was by a very wide margin. C has a degenerative partner, C++. C++ does not clean up after itself: anything you allocate with new, you have to release with delete. The syntax can also be ambiguous: Foo(); can be either a function call or the construction of a temporary object, depending on whether Foo names a function or a type.
Fortran is another one that will always be around. I’ve done a lot of Fortran programming. It is used quite extensively in engineering, as is Assembly Language. I have been called a “bit-twiddler” because I knew how to use assembler.
Variable Names
This is a touchy subject. I’ve been around programmers who have said code should be self-documenting. Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.
Let’s just say variable names should be descriptive.
Debugging
This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.
The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.
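A minimal sketch of print-statement debugging in Python follows. The function mean_quality and its debug flag are invented for this example; the point is only to show a variable being examined at each stage of a loop.

```python
import sys


def mean_quality(scores, debug=False):
    """Return the average of a list of scores, optionally tracing each step."""
    total = 0
    for index, score in enumerate(scores):
        total += score
        if debug:
            # Send trace output to stderr so it stays out of the program's results.
            print(f"DEBUG index={index} score={score} total={total}", file=sys.stderr)
    return total / len(scores)


if __name__ == "__main__":
    print(mean_quality([30, 40, 20], debug=True))  # prints 30.0
```

Putting the trace behind a flag means it can stay in the code for the next person who has to debug it, instead of being deleted and retyped every time.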
Some debugging efforts can become quite complex, such as debugging a process that utilizes several machines or pieces of equipment, such as an airplane.
Sun Solaris has an application, truss, that lets you debug things at the system level and follow the system calls. The Linux equivalent is strace.
I am most familiar with gdb - The GNU Debugger. I’ve also used dbx on Unix.
The IDEs offer a rich debugging experience, and plug-ins let you debug almost anything.
Closing Thoughts
Always program for those coming behind you. They will appreciate your effort.
It’s best to keep it simple. Especially the user interface.
Speaking of users, talk to them. Get their feedback on everything you do to the user interface. I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe. The moral: keep your users happy.
Technology is wonderful, but technology for technology’s sake (“because we can!”) is usually overkill.
The guiding principle should be: Is it necessary?
By the way, going back to that certification project. I found that instead of spending $15K for a disk drive, the company was conned into spending $750K for custom developed software that I had to throw in the trash, except for one tiny piece. Was it necessary? The point is that the best answer is not always the most expensive one.
(I’m delaying the “horror stories” until next week, because I want to fully document them all.)
I ran across the phrase “computer science wild” at a recent conference. I’ve got my own thoughts, especially since the top 25 coding errors were released yesterday. The link to the article is http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9125678&source=NLT_SIT&nlid=91.
I think any programmer should have the opportunity to write software that might kill someone, blow up an extremely expensive piece of equipment, or cause a waste of thousands of dollars because the system is down. Maybe then they would think, write better code, and debug the software thoroughly before they released it into the wild.
The Ariane 5 rocket blew up shortly after lift-off because the software didn’t contain an exception handler for an overflow! (The failure was an arithmetic overflow: a 64-bit floating-point value was converted to a 16-bit signed integer too small to hold it. The same discipline applies to a buffer or array overflow, where the overflow should trigger a programming mechanism that writes out the buffer contents and clears it. The usual device is to transfer data capture to another buffer while the full buffer is written to I/O.)
The excuse for the disaster was that the specifications didn’t spell out the need for that programming mechanism. An exception handler is a very basic mechanism for catching and correcting errors. There is no excuse for this oversight.
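The buffer hand-off described above (capture switches to a second buffer while the full one is written out) can be sketched in a few lines of Python. This is an illustration only, not the flight software's design: the class DoubleBuffer, its write_out callback, and the failed list are assumptions made for this example, and a real acquisition system would normally do the write on a separate thread or in hardware.

```python
import sys


class DoubleBuffer:
    """Capture samples into one buffer while the other, full one is written out."""

    def __init__(self, capacity, write_out):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.write_out = write_out  # callable that receives a full list of samples
        self.active = []            # buffer currently receiving samples
        self.failed = []            # batches whose write failed, kept for retry

    def add(self, sample):
        """Store one sample; hand the buffer off as soon as it is full."""
        self.active.append(sample)
        if len(self.active) >= self.capacity:
            self.flush()

    def flush(self):
        """Swap in an empty buffer, then write the full one."""
        if not self.active:
            return
        full, self.active = self.active, []  # swap first so capture can continue
        try:
            self.write_out(full)
        except OSError as err:
            # The exception handler: keep the data and report, rather than crash.
            self.failed.append(full)
            print(f"write failed, batch of {len(full)} kept for retry: {err}",
                  file=sys.stderr)


if __name__ == "__main__":
    written = []
    buffer = DoubleBuffer(capacity=3, write_out=written.append)
    for sample in range(7):
        buffer.add(sample)
    buffer.flush()  # write out the partial last batch
    print(written)  # prints [[0, 1, 2], [3, 4, 5], [6]]
```

The buffer can never grow past its capacity, and a failed write costs a log message rather than the whole run.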
One major project I worked on was acoustical (noise) testing of aircraft engines. Our crew would go to some really great places like Roswell, NM, Moses Lake, WA, or Uvalde, TX. We would record and analyze the noise of the engines as the aircraft flew over at different altitudes with variable loads in various approach patterns.
There were several pieces of software that had to work in tandem. The airborne system, the ground-based weather station, the meteorological (met) plane, the acoustic data analyzer, and the analysis station all had to work together to get the required results.
There was no room for error. Measurements had to be exact, even out to 16 places after the decimal point.
Modeling techniques, programming languages and IDEs (Integrated Development Environments) have become very sophisticated and complex. A programmer today can “gee whiz” just about anything.
“Because we can” has become the norm.
This is great, but I’ve run into lab techs, etc. who were just this side of computer illiterate. Like my dad, they adhere to a limited number of computer applications, accessed by a few key strokes or mouse clicks they have memorized.
And don’t think that engineers are immune. They had to be dragged “kicking and screaming” away from their slide rules.
I’m for simple to start. You can always add more “bells and whistles” as the system (and its users) matures.