LifeFormulae Blog » Page 'Effective Bioinformatics Programming - Part 3'

Effective Bioinformatics Programming - Part 3

All Things Unix

Bioinformatics started with Unix. At the Human Genome Center, for a long time, I had the one and only PC. (We got a request from our users for a PC-based client for the Search Launcher). Everything else was Solaris (Unix) and Mac, which was followed by Linux.

Unix supports a number of nifty commands like grep, strings, df, du, ls, etc. These commands are run inside the shell, or command line interpreter, for the operating system (Unix). There have been a number of these shells in the history of Unix development.

The bash shell http://en.wikipedia.org/wiki/Bash is the default shell for the Linux environment. This shell provides several unique capabilities over other shells. For instance, bash supports a history buffer of system commands. With the history buffer, the “up” arrow will return the previous command. The history command lets you view a history of past commands. The bang operator (!) lets you rerun a previous command from the history buffer. (Which saves a lot of typing!)

bash enables a user to redirect program output. The pipeline feature allows the user to connect a series of commands. With the pipeline (“|”) operator, a chain of commands can be linked together where the output of one command is the input to the next command an so forth.

A shell script (http://en.wikipedia.org/wiki/Shell_script) is script written for the shell or command line interpreter. Shell scripts enable batch processing. Together with the cron command, these scripts can be set to run automatically at times when system usage is minimum.

For general information about bash, go to the Bash Reference Manual at http://www.gnu.org/software/bash/manual/bashref.html.

A whole wealth of bash shell script examples is available at - http://tldp.org/LDP/abs/html/.

Unix on Other Platforms

Cygwin (http://www.cygwin.com/) is a Linux-like environment for windows. The basic download installs a minimum environment, but you can add additional packages at any time. Go to http://cygwin.com/packages/ for a list of Cygwin packages available for download.

Apple’s OS X is based on Unix. Other than the MACH kernel, the OS is BSD-derived. Their Java package is usually not the latest as Apple has to port Java due to differences such as the graphics portion.

All Things Software – Documenting and Archiving

I’ve run into all sorts of approaches to program code documentation in my career. A lead engineer demanded that every line of assembler code be documented. A senior programmer insisted that code should be self-documenting.

By that, she used variable names such as save_the_file_to_the_home_directory, and so on. Debugging these programs was a real pain. The first thing you had to do was set up aliases for all the unwieldy names.

The FORTRAN programmers cried when variable names longer than 6 characters were allowed in version 77 of VAX FORTRAN.. Personally, I thought it was great. The same with IMPLICIT NONE.

In the ancient times, FORTRAN integers variables had to start with i thru n. Real variables could use the other letters. The IMPLICIT NONE directive told the compiler to shut that off.

All FORTRAN variables had to be in capital letters. But you could stuff strings into integer variables which I found extremely useful. All FORTRAN statements had to begin with a number. This number usually started at 10 and went up in increments of 10.

At one time Microsoft used Hungarian notation (http://en.wikipedia.org/wiki/Hungarian_notation) for variables in most of their documentation. In this method, the name of the variable indicated it’s use. For example, lAccountNumber was a long integer.

The IDEs (Eclipse, NetBeans, and others) will automatically create the header comment with a list of variables. The user just adds the proper definitions. (If you’re using Java, the auto comment is JavaDoc compatible, etc.)

Otherwise, Java supports the JavaDoc tool, Python has PyDoc, and Ruby has RDoc.

Personally, I feel that software programs should be read like a book, with documentation providing the footnotes, such as an overview of what the code in question does and a definition of the main variables for both input and output. Module/Object documentation should also note who uses the function and why. Keep variable names short but descriptive and make comments meaningful.

Keep code clean, but don’t go overboard. I worked with one programmer who stated, “My code is so clean you could eat off it.” I found that a little too obnoxious, not to mention overly optimistic as a number of bugs popped out as time went by.

Archiving Code

Version Control Systems (VCS) have evolved as source code projects became larger and more complex.

RCS (Revision Control System) meant that the days of the keeping the Emacs numbered files (e.g. foo.~1~) as backups were over. RCS used the diff concept (just kept a list of the changes make to a file as a backup strategy).

I found this unsuited for what I had to do – revert to an old version in a matter of seconds.

CVS was much, much better. CVS was replaced by Subversion. But they’re centralized repository structure can create problems. You basically check out what you want to work on from a library and check it back in when you’re done. This can be a slow process depending on network usage or central server available.

The current favorite is Git. Git was created by Linus Torvalds (of Linux fame). Git is a free, open source distributed version control system. (http://git-scm.com/).

Everyone on the project has a copy of all project files complete with revision histories and tracking capabilities. Permissions allow exchanges between users and merging to a central location is fast.

The IDE’s (Eclipse and NetBeans) will have CVS and Subversion plug ins already configured for accessing those repositories. NetBeans also supports Mercurical. Plug ins for the other versioning software modules are available on the web. The Eclipse plug in for Git is available at http://git.wiki.kernel.org/index.php/EclipsePlugin.

System Backup

Always have a plan B. My plan A had IT backup my systems on a weekly to monthly basis based on usage. A natural disaster completely decimated my systems. No problem, I thought, I have system backup. Imagine how I felt when I heard that IT had not archived a single on of my systems in over three years! Well, I had a plan B. I had a mirror of the most important stuff on an old machine and other media. We were back up almost immediately.

The early Tandem NonStop systems (now known as HP Integrity NonStop) automatically mirrored your system in real-time, so down time was not a problem.

Real-time backup is expensive and unless you’re a bank or airline, it’s not necessary.

Snapshot Backup on Linux with rsync

If you’re running Linux, Mac, Solaris, or any Unix-based system, you can use rsync for generating automatic rotating “snapshot” style back-ups. These systems generally have rsync already installed. If not, the source is available at – http://rsync.samba.org/.

This website - http://www.mikerubel.org/computers/rsync_snapshots/ will tell you everything you need to know to implement rsync based backups, complete with sample scripts.

Properly configured, the method can also protect against hard disk failure, root compromises, or even back up a network of heterogeneous desktops automatically.

Acknowledgment – Thanks, Bill!

I want to thank Bill Eaton for his assistance with these blog entries on Effective Bioinformatics Programming. He filled in a lot of the technical details, performed product analysis, and gave me direction in writing these blog entries.

To Be Continued - Part 4

Part 4 will cover relational database management systems (RDBMS), HPC (high performance computing) - parallel processing, FPGC, clusters, grids, and other topics.

Like this post? Spread the word!
delicious digg google
stumbleupon technorati Yahoo!

Leave a comment

You need to log in to comment.

Top of page / Subscribe to new Entries (RSS)