LifeFormulae Blog » Posts for tag 'Bioinformatics'

Cue the Hallelujah Chorus! No comments yet

The Volume 29 Number 1 January 2011 issue of nature biotechnology ( finally puts in print what I’ve been recommending all along. The Feature article on computational BIOLOGY, “Trends in computation biology – 2010” on page 45 states, “Interviews with leading scientists highlight several notable breakthroughs in computational biology from the past year and suggest areas where computation may drive biological discovery,”

The researchers were asked to nominate papers of particular interest published in the previous year that have influenced the direction of their research.

The article is good, but what was really interesting was Box 2 – Cross-functional individuals on page 49. To quote, “Our analysis…suggests that researchers of a particular type are driving much of cutting-edge computational biology. Read on to find out what characterizes them.”

I’m going to re-print Box 2 Cross-functional individuals in it’s entirety since it’s short and the message is so very important.

Box 2 Cross-functional individuals

In the courses of compiling this survey, several investigators remarked that it tends to be easier for computer scientists to learn biology that for biologists to learn computer science. Even so, it is hard to believe that learning the central dogma and the Krebs cycle will enable your typical programmer-turned-computational biologist to stumble upon a project that yields important novel biological insights. So what characterizes successful computational biologists?

George Church, whose laboratory at Harvard Medical School (Cambridge, MA USA) has a history of producing bleeding-edge research in many cross-disciplinary domains, including computational biology, say, “Individuals in my lab tend to be curious and somewhat dissatisfied with the way things are. They are comfortable in two domains simultaneously. This has allowed us to go after problems in the space between traditional research projects.”

A former Church lab member, Greg Porreca, articulates this idea further, “I’ve found that many advances in computational biology start with simple solutions written by cross-functional individuals to accomplish simple tasks. Bigger problems are hard to address with those rudimentary algorithms, so folks with classical training in computer science step in and devise highly optimized solutions that are faster and more flexible.”

An overarching theme that also emerges from this survey suggests that tools for computational analysis permeated biological research according to three states: first, a cross-functional individual sees a problem and devises a solution good enough to demonstrate the feasibility of a type of analysis; second, robust tools are created, often utilizing the specialized knowledge of formally trained computer scientists; and third, the tools reach biologists focused on understanding specific phenomena, who incorporate the tools into everyday use. These stages echo existing broader literature on disruptive innovations1 and technology-adoption life cycles2,3, which may suggest how breakthroughs in computational biology can be nurtured.

  1. Christiansen, C.M. & Bower, J.I., Disruptive technologies: catching the wave. Harvard Business Review (1995).

  2. Moore, G.A. Crossing the Chase: Marketing and Selling High-Tech Products to Mainstream Customers (Harvard Business, 1999)

  3. Rogers, E.M. Diffusion of Innovations (Free Press, 2003).

Biologists must become aware of what the disciplines of computer science and engineering can offer computational biology. Until this happens, forward progress in computational biological innovations and discovery will be unnecessarily hampered by a number of superfluous factors not the least of which is complacence.

Effective Bioinformatics Programming - Part 3 No comments yet

All Things Unix

Bioinformatics started with Unix. At the Human Genome Center, for a long time, I had the one and only PC. (We got a request from our users for a PC-based client for the Search Launcher). Everything else was Solaris (Unix) and Mac, which was followed by Linux.

Unix supports a number of nifty commands like grep, strings, df, du, ls, etc. These commands are run inside the shell, or command line interpreter, for the operating system (Unix). There have been a number of these shells in the history of Unix development.

The bash shell is the default shell for the Linux environment. This shell provides several unique capabilities over other shells. For instance, bash supports a history buffer of system commands. With the history buffer, the “up” arrow will return the previous command. The history command lets you view a history of past commands. The bang operator (!) lets you rerun a previous command from the history buffer. (Which saves a lot of typing!)

bash enables a user to redirect program output. The pipeline feature allows the user to connect a series of commands. With the pipeline (“|”) operator, a chain of commands can be linked together where the output of one command is the input to the next command an so forth.

A shell script ( is script written for the shell or command line interpreter. Shell scripts enable batch processing. Together with the cron command, these scripts can be set to run automatically at times when system usage is minimum.

For general information about bash, go to the Bash Reference Manual at

A whole wealth of bash shell script examples is available at -

Unix on Other Platforms

Cygwin ( is a Linux-like environment for windows. The basic download installs a minimum environment, but you can add additional packages at any time. Go to for a list of Cygwin packages available for download.

Apple’s OS X is based on Unix. Other than the MACH kernel, the OS is BSD-derived. Their Java package is usually not the latest as Apple has to port Java due to differences such as the graphics portion.

All Things Software – Documenting and Archiving

I’ve run into all sorts of approaches to program code documentation in my career. A lead engineer demanded that every line of assembler code be documented. A senior programmer insisted that code should be self-documenting.

By that, she used variable names such as save_the_file_to_the_home_directory, and so on. Debugging these programs was a real pain. The first thing you had to do was set up aliases for all the unwieldy names.

The FORTRAN programmers cried when variable names longer than 6 characters were allowed in version 77 of VAX FORTRAN.. Personally, I thought it was great. The same with IMPLICIT NONE.

In the ancient times, FORTRAN integers variables had to start with i thru n. Real variables could use the other letters. The IMPLICIT NONE directive told the compiler to shut that off.

All FORTRAN variables had to be in capital letters. But you could stuff strings into integer variables which I found extremely useful. All FORTRAN statements had to begin with a number. This number usually started at 10 and went up in increments of 10.

At one time Microsoft used Hungarian notation ( for variables in most of their documentation. In this method, the name of the variable indicated it’s use. For example, lAccountNumber was a long integer.

The IDEs (Eclipse, NetBeans, and others) will automatically create the header comment with a list of variables. The user just adds the proper definitions. (If you’re using Java, the auto comment is JavaDoc compatible, etc.)

Otherwise, Java supports the JavaDoc tool, Python has PyDoc, and Ruby has RDoc.

Personally, I feel that software programs should be read like a book, with documentation providing the footnotes, such as an overview of what the code in question does and a definition of the main variables for both input and output. Module/Object documentation should also note who uses the function and why. Keep variable names short but descriptive and make comments meaningful.

Keep code clean, but don’t go overboard. I worked with one programmer who stated, “My code is so clean you could eat off it.” I found that a little too obnoxious, not to mention overly optimistic as a number of bugs popped out as time went by.

Archiving Code

Version Control Systems (VCS) have evolved as source code projects became larger and more complex.

RCS (Revision Control System) meant that the days of the keeping the Emacs numbered files (e.g. foo.~1~) as backups were over. RCS used the diff concept (just kept a list of the changes make to a file as a backup strategy).

I found this unsuited for what I had to do – revert to an old version in a matter of seconds.

CVS was much, much better. CVS was replaced by Subversion. But they’re centralized repository structure can create problems. You basically check out what you want to work on from a library and check it back in when you’re done. This can be a slow process depending on network usage or central server available.

The current favorite is Git. Git was created by Linus Torvalds (of Linux fame). Git is a free, open source distributed version control system. (

Everyone on the project has a copy of all project files complete with revision histories and tracking capabilities. Permissions allow exchanges between users and merging to a central location is fast.

The IDE’s (Eclipse and NetBeans) will have CVS and Subversion plug ins already configured for accessing those repositories. NetBeans also supports Mercurical. Plug ins for the other versioning software modules are available on the web. The Eclipse plug in for Git is available at

System Backup

Always have a plan B. My plan A had IT backup my systems on a weekly to monthly basis based on usage. A natural disaster completely decimated my systems. No problem, I thought, I have system backup. Imagine how I felt when I heard that IT had not archived a single on of my systems in over three years! Well, I had a plan B. I had a mirror of the most important stuff on an old machine and other media. We were back up almost immediately.

The early Tandem NonStop systems (now known as HP Integrity NonStop) automatically mirrored your system in real-time, so down time was not a problem.

Real-time backup is expensive and unless you’re a bank or airline, it’s not necessary.

Snapshot Backup on Linux with rsync

If you’re running Linux, Mac, Solaris, or any Unix-based system, you can use rsync for generating automatic rotating “snapshot” style back-ups. These systems generally have rsync already installed. If not, the source is available at –

This website - will tell you everything you need to know to implement rsync based backups, complete with sample scripts.

Properly configured, the method can also protect against hard disk failure, root compromises, or even back up a network of heterogeneous desktops automatically.

Acknowledgment – Thanks, Bill!

I want to thank Bill Eaton for his assistance with these blog entries on Effective Bioinformatics Programming. He filled in a lot of the technical details, performed product analysis, and gave me direction in writing these blog entries.

To Be Continued - Part 4

Part 4 will cover relational database management systems (RDBMS), HPC (high performance computing) - parallel processing, FPGC, clusters, grids, and other topics.

Effective Bioinformatics Programming - Part 2 No comments yet

Effective Bioinformatics Programming – Part 2

Instrumentation Programming

Instrumentation Programming usually concerns computer control over the actions of an instrument and/or the streaming or download of data from the device. Instrumentation in the Life Sciences covers data loggers, waveform data acquisition systems, pulse generators, image capture, and others used extensively in LIMS (Laboratory Information Management Systems), Spectroscopy, and other scientific arenas.

Most instruments are controlled by codes called “control codes”. These codes are usually sent or received by a C/C++ program. Some instrumentation manufacturers, however, have a proprietary programming language that must be used to “talk” to the instrument.

Some companies are nice enough to provide information on the structure of the data that comes from their instrument. When they don’t you may have to use good old “reverse engineering”. That’s where the Unix/Linux od utility comes in handy, because lots of time will be spent poring over hex dumps.

As you can tell, programming instruments requires a lot of patience. This is especially true if everything hangs or gets into a confused state. There is nothing you can do but recycle the power to everything and start over. This is usually accompanied by a banging of keyboards and the muttering of a few choice words.

Development Platforms or IDEs (Integrated Development Environment)

I have to mention development platforms as they can be useful, but also problematic. My favorite is Eclipse ( Originating at IBM, Eclipse was supported by a consortium of software vendors. Eclipse has now become the Eclipse open source community, supported by the Eclipse Foundation.

Eclipse is a development platform for programmers comprised of extensible frameworks, tools and runtimes for building, deploying and managing software across the lifecycle. You can find plug-ins that will enable you to accomplish just about anything you want to do. A plug-in is an addition to the Eclipse platform that is not included in the base package, like an Eclipse memory manager or a debugging a Tomcat servlet.

Sun offers NetBeans (“The only IDE you need.”). I used NetBeans ( at lot on the Mac. Previously, Sun offered StudioOne and Creator. I used StudioOne (on Unix) and Creator (on Linux). I haven’t worked with NetBeans lately because they’re currently mostly Swing-centric (GUI) development and are not fully JSF (java Server Faces) aware. NetBeans will make a template for JSF but doesn’t (as yet) provide an easy way to create a JSF interface.

There are two main problems with development platforms. For one, the learning curve is fairly steep. There area lot of tutorials and examples available, but you still have take the time to do it.

The best way to use a development platform is to divide the work. One group does web content, one group does database, one group does middleware (the glue that holds everything together), etc. Each group or person can then become knowledgeable in their area and move on or absorb other areas as needed.

The second problem with these tools in that you are stuck with their developmental approach.

You have to do things a certain way and adhere to a certain structure. Flexibility can be a problem.

This is especially true of interface building. You are stuck with the code the tool generates and the files and file structures created. With most tools, you have to use that tool to access files that the tool created.

IDEs can be useful in that they will perform mundane coding tasks for you. For instance, given a database record, the IDE can use those table elements to generate web forms and the SQL queries driving those forms. You can then expand the simple framework or leave as is.

Open Source/Free Software and Bioinformatics Libraries

There a lot of good an not-so-good Open Source code out there for the Life Sciences.

There are several “gotchas” to look out for, including –

Is the code reliable? Are others using it? Are they having problems?

Will the code run on your architecture? What will it take to install

What kind of user support is available? What’s the response time?

Is there a mailing list available for the library, package, or project of interest?

The are several bioinformatics software libraries available for various languages. All of these libraries are OpenSource/Free Software. Installing these libraries takes a little more that just downloading and uncompressing a package. There are “dependencies” (other libraries, modules, programs, and access to external sites) that must be resident or accessible before a complete build of these libraries is possible.

The following is a list of the most popular libraries and their respective dependencies.

BioPerl 1.6.1: Modules section of

Required modules:
perl               => 5.6.1
IO::String         => 0
DB_File            => 0
Data::Stag         => 0.11
Scalar::Util       => 0
ExtUtils::Manifest => 1.52

Required modules for source build:
Test::More    => 0
Module::Build => 0.2805
Test::Harness => 2.62
CPAN          => 1.81

Recommended modules:  some of these have circular dependencies
Ace                       => 0
Algorithm::Munkres        => 0
Array::Compare            => 0
Bio::ASN1::EntrezGene     => 0
Clone                     => 0
Convert::Binary::C        => 0
Graph                     => 0
GraphViz                  => 0
HTML::Entities            => 0
HTML::HeadParser          => 3
HTTP::Request::Common     => 0
List::MoreUtils           => 0
LWP::UserAgent            => 0
Math::Random              => 0
PostScript::TextBlock     => 0
Set::Scalar               => 0
SOAP::Lite                => 0
Spreadsheet::ParseExcel   => 0
Spreadsheet::WriteExcel   => 0
Storable                  => 2.05
SVG                       => 2.26
SVG::Graph                => 0.01
Text::ParseWords          => 0
URI::Escape               => 0
XML::Parser               => 0
XML::Parser::PerlSAX      => 0
XML::SAX                  => 0.15
XML::SAX::Writer          => 0
XML::Simple               => 0
XML::Twig                 => 0
XML::Writer               => 0.4

Some of these modules such as SOAP::Lite depend upon many other

BioPython 1.53:

Additional packages:
NumPy     (recommended)
ReportLab (optional)
MySQLdb   (optional)    May be in core Python distribution.

BioRuby 1.4.0:

The base distribution is self-contained and uses the RubyGems installer.
Optional packages.

RubyForge:ActiveRecord and at least one driver (or adapter) from
   RubyForge:MySQL/Ruby, RubyForge:postgres-pr, or RubyForge:ActiveRecord
   Oracle enhanced adapter.
RubyForge:libxml-ruby (Ruby language bindings for the GNOME Libxml2 XML toolkit)

BioJava 1.7.1:

biojava-1.7.1-all.jar:  self-contained binary distribution with
  all dependencies included.

biojava-1.7.1.jar:  bare distribution that requires the following additional
  jar files.  These are required for building from source code.
  Most are from

bytecode.jar:                  required to run BioJava
commons-cli.jar:               used by some demos.
commons-collections-2.1.jar:   demos, BioSQL Access
commons-dbcp-1.1.jar:          legacy BioSQL access
commons-pool-1.1.jar:          legacy BioSQL access
jgraph-jdk1.5.jar:          NEXUS file parsing

Don’t forget to sign up for the mailing list for that library or libraries of interest to get the lastest news, problems, solutions, etc. for that library or just life science topics in general.

Software Hosting and Indexing Sites

There are several Software Hosting and Indexing Sites that serve as software distribution points for bioinformatics software. – Search on bioinformatics for a list of software available. Projects include:MIAMExpress -

freshmeat– The Web’s largest index of Unix and cross-platform software

Bioinformatics Organization – The Open Access Institute

Open Bioinformatics Foundation (O|B|F) - Hosts Many Open Bioinformatics Projects

Public Domain Manifesto

In this time of curtailment of civil rights, the Public Domain Manifesto seems appropriate ( Sign the petition while you’re there.

This is the end of Part 2. Part 3 will explore more software skills, project management, and other computational topics.

Effective Bioinformatics Programming - Part 1 No comments yet

The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte (

This article is a good that survey covers all the latest topics and mentions all the currently-popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.

I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.

They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”

Programming Concepts

Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.

To get a good handle on programming concepts, I recommend at book. This book, Structure and Implementation of Computer Programs from MIT Press (,is the basis for an intro to computer science at MIT. It’s called the Wizard Book or the Purple Book.

I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.

Best of all course (and the down loadable book) are freely available from MIT through the MIT OpenCourseWare website –

There’s a blog entry - - that goes into further explanation about the course and the book..

And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for a engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”

According to MIT’s purple book –” In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”

You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)

Programming Languages

The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.

These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”

Let me add the following. There are differences in programming languages. By that, I mean compiled vs scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, or a language specific to the computer architecture the code is to run on. Compiled code has a definite speed advantage, but if the code is the main or any supporting module is changed, the entire project must be recompiled. Since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.

Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.

An interpreted program runs more slowly than a compiled program. Every line of an interpreted program must be analyzed as it is read. But the code isn’t particularly tied to one machine architecture making portability easier (provided the byte code interpreter is present). Since code is only interpreted at run time, extensions and modifications to the code base is easier, making these languages great for beginning programmers or rapid prototyping.

But, let’s get back to the memory management. This, and processing speed will be a huge deal in next gen data analysis and management.

Perl automatic memory management has a problem with circularity, as Perl (and Ruby and Python) count references.

If object 1 points to object 2 and object 2 points back to 1 , but nothing else in the program points to either object 1 or object 2 (this is a weak reference), these objects don’t get destroyed. They remain in memory. If these objects get created again and again, it’s called a memory leak.

I also have to ask – What about C/C++ , Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot Fortran.


You can’t mention programming with mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.

Here’s a link to a ebook, The Art of Debugging It’s mainly Unix-based, C-centric and a little dated. But good stuff never goes out of style.

Chapter 4, Debugging: Theory explains various debugging techniques. Chapter 5 – Profiling talks about profiling your code, or determining where your program is spending most of its time.

He also mentions core dumps. A core is what happens when your C/C++/Fortran program crashes in Unix/Linux. You can examine this core to determine where your program went wrong. (It gives you a place to start.)

The Linux Foundation Developer Network has an on-line tutorial – Zen and the Art of Debugging C/C++ in Linux with GDB – They write a C program (incorporating a bug), create a make file, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.

You can debug Perl by invoking it with the -d switch. Perl usually crashes at the line number that caused the problem and some explanation of what went wrong.

The -d option also turns on parser debugging output for Python.

Object Dumps

One of the most useful utilities in Unix/Linux is od (object dump). You can examine files in octal (default), hex, or ASCII characters

od is very handy for examining data structures, finding hidden characters, and reverse engineering.

If you think you’re code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.

That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.

“You can’t do Bioinformatics if you haven’t worked in a wet lab” No comments yet

The line “you can’t do bioinformatics if you haven’t worked in a wet lab”, has been used as the basis for the “you need to know where the data comes from” argument time and time again.   I actually saw this in print in a slide presentation at the Next-Generation Sequencing Data Analysis conference in Providence, RI, in September 2008. 

I can sympathize with this viewpoint, but I don’t agree with it.  For instance, I designed the data system, compiled the data, and did the field testing that certified a re-engined aircraft, but I can’t pilot a plane.   I did do a lot of field laboratory work and it was “wet” - if snow, sleet, and rain count, along with desert dust and volcanic ash.

Knowing where the data comes from is very important, but what is of more importance is whether or not the data is actually measuring what it is supposed to measure  — data validity (are your instruments correctly calibrated and is the sampling rate sufficient), what is the format of the data, what is the size of the data, and to what sort of analysis will the data be subjected.

If the lab experience is so very important, a simple systems analysis is a very good tool to use.  As I’ve done it, the observer/programmer/engineer would “live” in the lab for a period of time — usually two to four weeks, or until they have a good grasp of the processes involved, taking copious notes and asking lots of questions.  That person may actually perform some of the work involved if desired.
This person should have some understanding of molecular biology, etc. to fully appreciate the lab experience. 

This activity has the potential of illuminating possible bottlenecks or methods that may need modification or fine tuning.  If more that one site is involved, so much the better, as discrepancies in processes will be made obvious.

My biological wet lab experience got me a “you have excellent lab technique” and a job offer, which I declined.

Bioinformatics training also comes into question.  Many courses just help the student determine which internet site to go to for information, or how to construct a FASTA-formatted sequence, or parse a BLAST output or a GenBank report.  They can’t do much except offer a survey of things “bioinformatic”. Not much time is spent on information management or engineering approaches.

I jumped from engineering to bioinformatics in the early 90’s.  The object-oriented data model I presented apparently found an audience.  I did some reading up on genetics, etc. before the interview, but most of the knowledge used to answer interview questions such as, “what are the four basic building block of life”, came from watching X-Files.  Things have gotten a lot more complicated (the textbooks have gotten heavier), and keeping up with new discoveries can become quite a task.

Next week I will offer a series of “horror stories”, or some of my experiences in the bioinformatics arena.

It’s official: the LifeFormulae Blog! No comments yet

Welcome to LifeFormulae’s official Blog site. Thank you for checking us out. Feel free to post comments for us, including any topics you would like us to cover. The purpose of this blog is to bring current events within the life sciences and bioinformatics communities to the forefront of our thoughts, to stay up-to-date on what’s going on in the research community, and to create a forum of discussion about the ever-changing environment to which we, as researchers, have become accustomed.

As you may know, Cambridge Healthtech Institute’s Data-Driven Discovery Summit 2008 was held in Rhode Island at the end of September. We had so many great conversations and were introduced to so many great people, we wanted to make sure those conversations continued. There were so many questions that covered diverse topics, we couldn’t find room on our website to answer them all comprehensively. So, we decided that we wanted a built-in community to foster communication on any topic related to bioinformatics, or any sub-topic beyond that.

We all read the industry newsletters and follow the latest publications when we get to them, but we want you to be able to ask questions about the topic, share feedback, let people know how it’s affecting you, vent, enlighten, inquire, observe, remark, express yourself.

At LifeFormulae, we have some every-day people who have been in the business for over 20 years. We think you might like what they have to say. If you don’t, just let us know. Nothing would make us happier. We plan to have alternating bloggers, as well as a few guest bloggers from time to time (let us know if you’re interested). We’ll try to keep it interesting (and pertinent), but remember that feedback always helps!

Talk to you soon!

The LifeFormulae staff

Top of page / Subscribe to new Entries (RSS)