LifeFormulae Blog » Posts in 'Bioinformatics' category

Effective Bioinformatics Programming - Part 3

All Things Unix

Bioinformatics started with Unix. At the Human Genome Center, for a long time, I had the one and only PC. (We got a request from our users for a PC-based client for the Search Launcher). Everything else was Solaris (Unix) and Mac, which was followed by Linux.

Unix supports a number of nifty commands like grep, strings, df, du, ls, etc. These commands are run inside the shell, or command line interpreter, for the operating system (Unix). There have been a number of these shells in the history of Unix development.

The bash shell is the default shell on most Linux systems. This shell provides several conveniences. For instance, bash supports a history buffer of commands. With the history buffer, the “up” arrow recalls the previous command. The history command lets you view a list of past commands. The bang operator (!) lets you rerun a previous command from the history buffer, which saves a lot of typing!

bash enables a user to redirect program output. With the pipeline (“|”) operator, a chain of commands can be linked together so that the output of one command is the input to the next, and so forth.
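To make the idea concrete, here is a minimal Python sketch of what a pipeline does: the output of each command becomes the input of the next. The run_pipeline helper and the two stand-in commands are made up for illustration. Note that a real shell runs the stages concurrently, while this sketch runs them one after another.

```python
import subprocess
import sys


def run_pipeline(commands, input_text=""):
    """Feed input_text to the first command, then pass each command's
    standard output to the next command's standard input."""
    data = input_text
    for argv in commands:
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# Two stand-in commands written as Python one-liners so the sketch does not
# depend on any particular Unix tool being installed (an assumption made
# purely for portability).
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
count = [sys.executable, "-c",
         "import sys; print(len(sys.stdin.read().split()))"]

if __name__ == "__main__":
    # Roughly equivalent to:  echo "acgt ttga" | upper | count
    print(run_pipeline([upper, count], "acgt ttga\n"), end="")
```

On a Unix system the same helper could be given real commands, for example `run_pipeline([["sort"], ["uniq", "-c"]], text)`.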

A shell script is a script written for the shell, or command line interpreter. Shell scripts enable batch processing. Together with the cron scheduler, these scripts can be set to run automatically at times when system usage is at a minimum.

For general information about bash, see the Bash Reference Manual.

A whole wealth of bash shell script examples is also available online.

Unix on Other Platforms

Cygwin is a Linux-like environment for Windows. The basic download installs a minimal environment, but you can add packages at any time. The Cygwin site lists the packages available for download.

Apple’s OS X is based on Unix. Other than the Mach kernel, the OS is BSD-derived. Its Java package is usually not the latest, as Apple has to port Java itself because of differences in areas such as graphics.

All Things Software – Documenting and Archiving

I’ve run into all sorts of approaches to program code documentation in my career. A lead engineer demanded that every line of assembler code be documented. A senior programmer insisted that code should be self-documenting.

By that, she meant using variable names such as save_the_file_to_the_home_directory. Debugging these programs was a real pain. The first thing you had to do was set up aliases for all the unwieldy names.

The FORTRAN programmers cried when variable names longer than 6 characters were allowed in version 77 of VAX FORTRAN. Personally, I thought it was great. The same with IMPLICIT NONE.

In the ancient times, FORTRAN integer variables had to start with the letters I through N. Real variables could use the other letters. The IMPLICIT NONE directive told the compiler to shut that off.

All FORTRAN variables had to be in capital letters. But you could stuff strings into integer variables, which I found extremely useful. FORTRAN statements could carry a numeric label. By convention these labels started at 10 and went up in increments of 10.

At one time Microsoft used Hungarian notation for variables in most of their documentation. In this method, the name of the variable indicated its type or use. For example, lAccountNumber was a long integer.

The IDEs (Eclipse, NetBeans, and others) will automatically create the header comment with a list of variables. The user just adds the proper definitions. (If you’re using Java, the auto comment is JavaDoc compatible, etc.)

Otherwise, Java supports the JavaDoc tool, Python has PyDoc, and Ruby has RDoc.

Personally, I feel that software programs should read like a book, with documentation providing the footnotes, such as an overview of what the code in question does and a definition of the main variables for both input and output. Module/Object documentation should also note who uses the function and why. Keep variable names short but descriptive and make comments meaningful.
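As a small illustration of that style, here is a sketch of a documented Python function. The function, its name, and the callers listed under “Used by” are hypothetical and serve only to show the layout: an overview, the inputs and outputs, and a note on who uses it.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Overview:  counts G and C (case-insensitive) and divides by the length.
    Input:     sequence - DNA string, e.g. "ACGT"
    Output:    float between 0.0 and 1.0; 0.0 for an empty sequence
    Used by:   (hypothetical) primer-design and QC report scripts
    """
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

The names stay short (sequence, seq) while the docstring carries the footnotes, and PyDoc can turn that docstring into reference documentation.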

Keep code clean, but don’t go overboard. I worked with one programmer who stated, “My code is so clean you could eat off it.” I found that a little too obnoxious, not to mention overly optimistic as a number of bugs popped out as time went by.

Archiving Code

Version Control Systems (VCS) have evolved as source code projects became larger and more complex.

RCS (Revision Control System) meant that the days of keeping Emacs numbered backup files (e.g. foo.~1~) were over. RCS used the diff concept: it just kept a list of the changes made to a file as a backup strategy.
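The diff concept can be sketched with Python’s standard difflib module. This is an illustration of the idea only, not the storage format RCS actually uses; the helper names and the two sample revisions are invented for the example. Note that ndiff keeps unchanged lines as well, so it is not a space-efficient delta.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Record the line-by-line differences between two revisions."""
    return list(difflib.ndiff(old_lines, new_lines))


def restore_revision(delta, which):
    """Rebuild revision 1 (old) or revision 2 (new) from a stored delta."""
    return list(difflib.restore(delta, which))


rev1 = ["seq = read_fasta(path)\n", "print(len(seq))\n"]
rev2 = ["seq = read_fasta(path)\n", "seq = seq.upper()\n", "print(len(seq))\n"]

delta = make_delta(rev1, rev2)

if __name__ == "__main__":
    # Lines starting with "+ " were added, "- " removed, "  " unchanged.
    print("".join(delta), end="")
```

Given only the delta, either revision can be rebuilt with restore_revision, which is the essence of reverting to an old version.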

I found this unsuited for what I had to do – revert to an old version in a matter of seconds.

CVS was much, much better. CVS was replaced by Subversion. But their centralized repository structure can create problems. You basically check out what you want to work on from a library and check it back in when you’re done. This can be a slow process depending on network usage or central server availability.

The current favorite is Git. Git was created by Linus Torvalds (of Linux fame). Git is a free, open source distributed version control system.

Everyone on the project has a copy of all project files complete with revision histories and tracking capabilities. Permissions allow exchanges between users and merging to a central location is fast.

The IDEs (Eclipse and NetBeans) have CVS and Subversion plug-ins already configured for accessing those repositories. NetBeans also supports Mercurial. Plug-ins for other version control systems are available on the web, including an Eclipse plug-in for Git.

System Backup

Always have a plan B. My plan A had IT back up my systems on a weekly to monthly basis based on usage. A natural disaster completely wiped out my systems. No problem, I thought, I have system backups. Imagine how I felt when I heard that IT had not archived a single one of my systems in over three years! Well, I had a plan B. I had a mirror of the most important stuff on an old machine and other media. We were back up almost immediately.

The early Tandem NonStop systems (now known as HP Integrity NonStop) automatically mirrored your system in real-time, so down time was not a problem.

Real-time backup is expensive and unless you’re a bank or airline, it’s not necessary.

Snapshot Backup on Linux with rsync

If you’re running Linux, Mac, Solaris, or any Unix-based system, you can use rsync to generate automatic rotating “snapshot” style backups. These systems generally have rsync already installed. If not, the source is freely available.

Guides on the web will tell you everything you need to know to implement rsync-based backups, complete with sample scripts.

Properly configured, the method can also protect against hard disk failure, root compromises, or even back up a network of heterogeneous desktops automatically.
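Here is a rough sketch of the rotation idea in Python. It assumes a directory layout of snapshot.0 (newest) through snapshot.N (oldest); that layout and both helper names are assumptions for illustration. The rsync options used (-a, --delete, --link-dest) are real, and --link-dest makes unchanged files hard links into the previous snapshot, so each snapshot costs little extra disk space. The sketch only builds the rsync command; running it is left to the caller.

```python
import shutil
from pathlib import Path


def rotate_snapshots(root, keep=4):
    """Shift snapshot.0 -> snapshot.1 -> ... and drop the oldest one."""
    root = Path(root)
    oldest = root / f"snapshot.{keep - 1}"
    if oldest.exists():
        shutil.rmtree(oldest)
    # Rename from the oldest surviving snapshot down to the newest.
    for index in range(keep - 2, -1, -1):
        current = root / f"snapshot.{index}"
        if current.exists():
            current.rename(root / f"snapshot.{index + 1}")


def rsync_command(source, root):
    """Build the rsync call that fills a fresh snapshot.0, hard-linking
    unchanged files to snapshot.1 via --link-dest."""
    root = Path(root).resolve()
    return [
        "rsync", "-a", "--delete",
        f"--link-dest={root / 'snapshot.1'}",
        str(source).rstrip("/") + "/",
        str(root / "snapshot.0"),
    ]
```

A nightly cron job could call rotate_snapshots and then pass the command list to subprocess.run.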

Acknowledgment – Thanks, Bill!

I want to thank Bill Eaton for his assistance with these blog entries on Effective Bioinformatics Programming. He filled in a lot of the technical details, performed product analysis, and gave me direction in writing these blog entries.

To Be Continued - Part 4

Part 4 will cover relational database management systems (RDBMS), HPC (high performance computing), parallel processing, FPGAs, clusters, grids, and other topics.

Effective Bioinformatics Programming - Part 2


Instrumentation Programming

Instrumentation Programming usually concerns computer control over the actions of an instrument and/or the streaming or download of data from the device. Instrumentation in the Life Sciences covers data loggers, waveform data acquisition systems, pulse generators, image capture, and others used extensively in LIMS (Laboratory Information Management Systems), Spectroscopy, and other scientific arenas.

Most instruments are controlled by codes called “control codes”. These codes are usually sent or received by a C/C++ program. Some instrumentation manufacturers, however, have a proprietary programming language that must be used to “talk” to the instrument.

Some companies are nice enough to provide information on the structure of the data that comes from their instrument. When they don’t, you may have to use good old “reverse engineering”. That’s where the Unix/Linux od utility comes in handy, because lots of time will be spent poring over hex dumps.

As you can tell, programming instruments requires a lot of patience. This is especially true if everything hangs or gets into a confused state. There is nothing you can do but cycle the power to everything and start over. This is usually accompanied by a banging of keyboards and the muttering of a few choice words.

Development Platforms or IDEs (Integrated Development Environment)

I have to mention development platforms as they can be useful, but also problematic. My favorite is Eclipse. Originating at IBM, Eclipse was supported by a consortium of software vendors. Eclipse has now become the Eclipse open source community, supported by the Eclipse Foundation.

Eclipse is a development platform for programmers comprised of extensible frameworks, tools and runtimes for building, deploying and managing software across the lifecycle. You can find plug-ins that will enable you to accomplish just about anything you want to do. A plug-in is an addition to the Eclipse platform that is not included in the base package, such as a memory manager or a tool for debugging a Tomcat servlet.

Sun offers NetBeans (“The only IDE you need.”). I used NetBeans a lot on the Mac. Previously, Sun offered StudioOne and Creator. I used StudioOne (on Unix) and Creator (on Linux). I haven’t worked with NetBeans lately because it is currently geared mostly toward Swing (GUI) development and is not fully JSF (JavaServer Faces) aware. NetBeans will make a template for JSF but doesn’t (as yet) provide an easy way to create a JSF interface.

There are two main problems with development platforms. For one, the learning curve is fairly steep. There are a lot of tutorials and examples available, but you still have to take the time to work through them.

The best way to use a development platform is to divide the work. One group does web content, one group does database, one group does middleware (the glue that holds everything together), etc. Each group or person can then become knowledgeable in their area and move on or absorb other areas as needed.

The second problem with these tools is that you are stuck with their developmental approach.

You have to do things a certain way and adhere to a certain structure. Flexibility can be a problem.

This is especially true of interface building. You are stuck with the code the tool generates and the files and file structures created. With most tools, you have to use that tool to access files that the tool created.

IDEs can be useful in that they will perform mundane coding tasks for you. For instance, given a database table, the IDE can use the table’s columns to generate web forms and the SQL queries driving those forms. You can then expand the simple framework or leave it as is.

Open Source/Free Software and Bioinformatics Libraries

There is a lot of good and not-so-good Open Source code out there for the Life Sciences.

There are several “gotchas” to look out for, including –

Is the code reliable? Are others using it? Are they having problems?

Will the code run on your architecture? What will it take to install it?

What kind of user support is available? What’s the response time?

Is there a mailing list available for the library, package, or project of interest?

There are several bioinformatics software libraries available for various languages. All of these libraries are Open Source/Free Software. Installing these libraries takes a little more than just downloading and uncompressing a package. There are “dependencies” (other libraries, modules, programs, and access to external sites) that must be resident or accessible before a complete build of these libraries is possible.

The following is a list of the most popular libraries and their respective dependencies.

BioPerl 1.6.1:

Required modules:
perl               => 5.6.1
IO::String         => 0
DB_File            => 0
Data::Stag         => 0.11
Scalar::Util       => 0
ExtUtils::Manifest => 1.52

Required modules for source build:
Test::More    => 0
Module::Build => 0.2805
Test::Harness => 2.62
CPAN          => 1.81

Recommended modules:  some of these have circular dependencies
Ace                       => 0
Algorithm::Munkres        => 0
Array::Compare            => 0
Bio::ASN1::EntrezGene     => 0
Clone                     => 0
Convert::Binary::C        => 0
Graph                     => 0
GraphViz                  => 0
HTML::Entities            => 0
HTML::HeadParser          => 3
HTTP::Request::Common     => 0
List::MoreUtils           => 0
LWP::UserAgent            => 0
Math::Random              => 0
PostScript::TextBlock     => 0
Set::Scalar               => 0
SOAP::Lite                => 0
Spreadsheet::ParseExcel   => 0
Spreadsheet::WriteExcel   => 0
Storable                  => 2.05
SVG                       => 2.26
SVG::Graph                => 0.01
Text::ParseWords          => 0
URI::Escape               => 0
XML::Parser               => 0
XML::Parser::PerlSAX      => 0
XML::SAX                  => 0.15
XML::SAX::Writer          => 0
XML::Simple               => 0
XML::Twig                 => 0
XML::Writer               => 0.4

Some of these modules, such as SOAP::Lite, depend upon many other modules.

BioPython 1.53:

Additional packages:
NumPy     (recommended)
ReportLab (optional)
MySQLdb   (optional)    Installed separately; not part of the core Python distribution.

BioRuby 1.4.0:

The base distribution is self-contained and uses the RubyGems installer.
Optional packages.

RubyForge:ActiveRecord and at least one driver (or adapter) from
   RubyForge:MySQL/Ruby, RubyForge:postgres-pr, or RubyForge:ActiveRecord
   Oracle enhanced adapter.
RubyForge:libxml-ruby (Ruby language bindings for the GNOME Libxml2 XML toolkit)

BioJava 1.7.1:

biojava-1.7.1-all.jar:  self-contained binary distribution with
  all dependencies included.

biojava-1.7.1.jar:  bare distribution that requires the following additional
  jar files.  These are required for building from source code.
  Most are from

bytecode.jar:                  required to run BioJava
commons-cli.jar:               used by some demos.
commons-collections-2.1.jar:   demos, BioSQL Access
commons-dbcp-1.1.jar:          legacy BioSQL access
commons-pool-1.1.jar:          legacy BioSQL access
jgraph-jdk1.5.jar:             NEXUS file parsing
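Before starting a build, it helps to check which dependencies are already present. As a rough sketch for the Python case, the snippet below reports which of the additional BioPython packages listed above cannot be found. The helper name is made up, and the import names (numpy, reportlab, MySQLdb) are the usual ones for those packages.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    optional = ["numpy", "reportlab", "MySQLdb"]
    missing = missing_modules(optional)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All optional packages found.")
```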

Don’t forget to sign up for the mailing list for the library or libraries of interest to get the latest news, problems, solutions, etc. for that library or just life science topics in general.

Software Hosting and Indexing Sites

There are several software hosting and indexing sites that serve as distribution points for bioinformatics software.

– Search on bioinformatics for a list of software available. Projects include: MIAMExpress

freshmeat – The Web’s largest index of Unix and cross-platform software

Bioinformatics Organization – The Open Access Institute

Open Bioinformatics Foundation (O|B|F) - Hosts Many Open Bioinformatics Projects

Public Domain Manifesto

In this time of curtailment of civil rights, the Public Domain Manifesto seems appropriate. Sign the petition while you’re there.

This is the end of Part 2. Part 3 will explore more software skills, project management, and other computational topics.

Effective Bioinformatics Programming - Part 1

The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte.

This article is a good survey that covers all the latest topics and mentions all the currently-popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.

I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.

They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”

Programming Concepts

Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.

To get a good handle on programming concepts, I recommend a book. This book, Structure and Interpretation of Computer Programs from MIT Press, is the basis for an intro to computer science at MIT. It’s called the Wizard Book or the Purple Book.

I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.

Best of all, the course (and the downloadable book) are freely available from MIT through the MIT OpenCourseWare website.

There’s also a blog entry that goes into further explanation about the course and the book.

And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for an engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”

According to MIT’s purple book: “In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”
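The purple book uses Scheme, but the same idea can be sketched in Python. In this made-up example, the formal parameter x of square exists only inside the body of the function and has nothing to do with the x defined outside it.

```python
x = 10  # a name defined outside any procedure


def square(x):
    # Here x is the formal parameter; its scope is this function body only.
    return x * x


def add_offset(value):
    offset = 5  # a local name, visible only inside add_offset
    return value + offset


result = square(3)

if __name__ == "__main__":
    print(result)  # computed from the parameter, not the outer x
    print(x)       # the outer x is unchanged by the call
```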

You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)

Programming Languages

The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.

These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”

Let me add the following. There are differences in programming languages. By that, I mean compiled vs scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, that is, machine code specific to the computer architecture the code is to run on. Compiled code has a definite speed advantage, but if the main program or any supporting module is changed, the affected code must be recompiled and the program relinked. Since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.

Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.
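Python lets you watch this happen with its standard dis module. The sketch below compiles one made-up line of source into byte code, lists the instruction names, and then has the virtual machine run it. The exact instruction names differ between Python versions, so none are shown here.

```python
import dis

source = "total = a + b"

# Step 1: translate the source text into a code object holding byte code.
code = compile(source, "<example>", "exec")

# Step 2: list the byte code instruction names (they vary by Python version).
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# Step 3: let the virtual machine interpret the byte code.
namespace = {"a": 2, "b": 3}
exec(code, namespace)

if __name__ == "__main__":
    print(opnames)
    print(namespace["total"])
```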

An interpreted program runs more slowly than a compiled program. Every line of an interpreted program must be analyzed as it is read. But the code isn’t tied to one machine architecture, making portability easier (provided the byte code interpreter is present). Since code is only interpreted at run time, extensions and modifications to the code base are easier, making these languages great for beginning programmers or rapid prototyping.

But let’s get back to memory management. This and processing speed will be a huge deal in next-gen data analysis and management.

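Perl’s automatic memory management has a problem with circularity, because Perl counts references. (The standard Python interpreter also counts references but adds a garbage collector that detects cycles. Ruby uses a tracing garbage collector rather than reference counts.)

If object 1 points to object 2 and object 2 points back to object 1, but nothing else in the program points to either of them (this is a circular reference), their reference counts never drop to zero and the objects don’t get destroyed. They remain in memory. If these objects get created again and again, it’s called a memory leak. The usual fix is to make one of the two links a weak reference, which does not add to the count.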
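Here is a small sketch of that situation in Python. It assumes the standard CPython interpreter, which counts references and also has a separate cycle detector in the gc module. The Node class and helper names are invented for the example. With the cycle detector switched off, the two objects survive even though nothing else points to them; running the detector by hand then cleans them up.

```python
import gc
import weakref


class Node:
    """A tiny object that can point at one partner."""

    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    """Create two objects that point at each other, then let go of them."""
    first = Node("object 1")
    second = Node("object 2")
    first.partner = second
    second.partner = first
    # A weak reference lets us watch 'first' without keeping it alive.
    return weakref.ref(first)


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # leave only reference counting in play
    try:
        probe = make_cycle()
        alive_with_refcounts_only = probe() is not None
        gc.collect()  # run the cycle detector by hand
        alive_after_cycle_collection = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_with_refcounts_only, alive_after_cycle_collection
```

The weak reference used as a probe is also the standard remedy: had second.partner been a weakref.ref(first), there would have been no cycle to leak.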

I also have to ask – What about C/C++, Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot of Fortran.


You can’t mention programming without mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.

There is an ebook, The Art of Debugging. It’s mainly Unix-based, C-centric, and a little dated. But good stuff never goes out of style.

Chapter 4, Debugging: Theory explains various debugging techniques. Chapter 5 – Profiling talks about profiling your code, or determining where your program is spending most of its time.
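The ebook is C-centric, but the same profiling idea can be tried in Python with the standard cProfile and pstats modules. In this sketch the function being profiled (count_kmers) and the wrapper (profile_call) are made up for illustration. The report lists the functions where the time went, sorted by cumulative time.

```python
import cProfile
import io
import pstats


def count_kmers(sequence, k=3):
    """Count every overlapping substring of length k in a sequence."""
    counts = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def profile_call(func, *args):
    """Run func(*args) under the profiler; return its result and a report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # top five entries
    return result, stream.getvalue()


if __name__ == "__main__":
    counts, report = profile_call(count_kmers, "ACGT" * 50000, 3)
    print(report)
```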

He also mentions core dumps. A core dump is what you get when your C/C++/Fortran program crashes in Unix/Linux. You can examine the core file to determine where your program went wrong. (It gives you a place to start.)

The Linux Foundation Developer Network has an on-line tutorial, Zen and the Art of Debugging C/C++ in Linux with GDB. They write a C program (incorporating a bug), create a makefile, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.

You can debug Perl by invoking it with the -d switch. When a Perl program dies, it usually reports the line number that caused the problem along with some explanation of what went wrong.

The -d option also turns on parser debugging output for Python.

Octal Dumps

One of the most useful utilities in Unix/Linux is od (octal dump). You can examine files in octal (default), hex, or ASCII characters.

od is very handy for examining data structures, finding hidden characters, and reverse engineering.

If you think your code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.
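If od isn’t at hand, the same kind of look at the data takes only a few lines of Python. This is a minimal sketch, not a replacement for od: the hexdump function and the sample record are made up. It prints an offset, the bytes in hex, and the printable characters, with a dot standing in for anything hidden such as tabs and carriage returns.

```python
def hexdump(data, width=16):
    """Format bytes as lines of: offset, hex values, printable characters."""
    pad = width * 3 - 1  # room for 'width' two-digit values plus spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text_part = "".join(
            chr(byte) if 32 <= byte < 127 else "." for byte in chunk
        )
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A record with a hidden tab and a Windows-style line ending.
    record = b"ACGT\tTTGA\r\n"
    print(hexdump(record))
```

In the output, 09 is the tab and 0d 0a is the carriage return and line feed pair, the sort of thing that quietly breaks a parser.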

That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.

Keep to Good Faith

I was going through a box of textbooks last week and stumbled upon a copy of the Enron Code of Ethics. I have another one stored away with a form, signed by Ken Lay, that states I have read and will comply with the Enron Code of Ethics.

I was employed at Enron from 2000 through 2002 and was there when the wheels came off. Our department was left intact. Otherwise, whole floors of the Enron building were vacated. It really was a shame, because Enron was a great place to work. Several friends and acquaintances lost most of what they had because of the malfeasance of a greedy few.

This had to be the most blatant example of unethical conduct in the workplace I have encountered. There were others that, while seemingly minor, ended up costing companies money and talent. Most of these losses were the result of mismanagement and not outright unethical behavior. But, then again, is mismanagement itself unethical?

A book I read recently, “A Small Treatise on the Great Virtues: The Uses of Philosophy in Everyday Life” by Andre Comte-Sponville (Metropolitan Books), talks about truth as “Good Faith”.

He states on page 196 that “at the very least that one speaks the truth about what one believes, and this truth, even if what one believes is false, is no less true for all that. Good faith, in this sense, is what we call sincerity (or truthfulness or candor) and is the opposite of mendacity, hypocrisy, and duplicity, in short, the opposite of bad faith in all its private and public forms.”

In my position at a major hardware/software developer I was told that I “didn’t need to know about a product to sell it.”

At another position, I found that a few fraudulent claims by a contractor caused a company to fork over three quarters of a million dollars for custom software when a fifteen thousand dollar piece of hardware would have enabled an already existing piece of commercial software to do the job. With more accurate accounting of the data, I might add.

The whole program was completely mismanaged, to the detriment of the company, not the contractor. In fact, he was ready for the next program, as he had one of his engineers hired in to head up that project: an engineer who didn’t have the slightest idea about our system, much less its theory. Thankfully, we got him transferred out of there and back to design where he belonged. The contractor was kicked out of the company.

These are straightforward examples of bad faith. The following are a little harder to classify.

Beware the ulterior motive, especially if the new system you are proposing will impose on someone’s fiefdom.

Data analysis for the existing program consisted of placing a request with a data analysis group and waiting up to 3 days for results. The system proposed (and later deployed) would give each and every engineer access to an analysis application that they could use to inspect the data one and a half hours after a particular test cycle was completed. A little training and they were ready to go.

Countless hours were spent in useless meetings defending the system. Everybody shut up when the system came up on day one and stayed up through months of testing.

This test/record/analysis cycle fits perfectly into the Laboratory Information Management Systems (LIMS) cycle of genomics research. A successful LIMS implementation in one lab aroused the ire of yet another lab attempting to develop their own solution. Let’s just say a lot of bad faith erupted.

The real loser in the above examples is the company. Money is wasted and talented people go elsewhere.

Biotechnology is a hot commodity right now. Stimulus funding is bringing fresh capital to many projects. Companies are leveraging existing corporate products by repackaging them as biotech ready.

National Instruments LabVIEW is one of these. I used it a lot in engineering. Now it’s a big player in the lab, incorporating interfaces for research lab instrumentation.

What is a LIMS (Laboratory Information Management System)? Is it an inventory management system? Is it a data pipeline? Can one size fit all?

Some companies have taken existing Inventory Management Systems and relabeled them as Laboratory Information Management Systems. (At least the acronym fits.) Most of these systems don’t distinguish between research and manufacturing environments. They also don’t support basic validation of the LIMS application for its intended purpose. No wonder some 80% of LIMS users are dissatisfied.

At a recent conference I talked with researchers from various pharmaceutical companies and they were thoroughly dissatisfied with their LIMS systems. One scientist stated that they had a problem with their LIMS. When they went to report the problem, they found the company was no longer in business.

The latest IT (Information Technology) trends – SaaS, Cloud computing – may work in a business environment, but they won’t translate well to a pharmaceutical research area where they want everything safe behind the firewall.

There are many, many factors that go into developing biotechnology applications. Getting the right people, controlling the political environment, finding or developing the right software – it’s a jungle out there.

Keep to Good Faith and please be careful.

Women In Technology

Today, one in ten engineers is a woman. In avionics, it’s fewer than that.

This is really a shame, because I find that women are extremely well suited for jobs in high tech careers.

Here’s a short list of reasons, along with explanations as to why I think each one is so.

  1. Women are more patient and determined
  2. Women can juggle a lot of tasks simultaneously
  3. Women can attend to small details and see the big picture at the same time
  4. Women don’t get derailed by the small stuff
  5. Women have a better support system
  6. Women are more sympathetic and understanding

I’ll stop at this group of six, although I could add a few more. They are not true of all women, but that’s probably because they haven’t had the experience.

Just take a look at what current society expects of women and I think you’ll see why I think women are more patient and determined! Case in point, I just got an email on “How to Create Perfect Eyes” through makeup application. Can you imagine a heterosexual male having the patience to take the time to apply all the goop we women have to put on our faces to be seen in public? Also, remember how determined we were to walk in high heels so we could pretend we were grown-ups?

Programming, system design and integration require patience and determination. It’s a step-by-step process. All the pieces have to work together to produce the correct outcome. It’s no different than making a food dish from a recipe, although in most cases you’ll have only your experience to formulate the list of ingredients and the right steps to finish the job.

Think about getting the family ready for school/work in the morning. How many things are you trying to do at once? Multi-tasking is standard operating procedure for most women, who can adapt to chaos in the blink of an eye.

I know chaos. Other than being the oldest of nine children (5 girls, 4 boys), I drove a school bus for about 4 years while I was attending college. I was given a long, country route that paid well and gave me enough hours to qualify for health insurance. After I had driven the route for about six weeks, my supervisor asked me how I was doing and what I thought of the kids. I said I thought I was doing okay and the kids were a little rowdy, but we got that under control. Otherwise, I said the kids were a bright bunch and generally inquisitive about everything. (“Miss Pam, what’s a hickey? Our teacher says it’s something you get in dominoes.”)

I found out later that these kids had been through 4 bus drivers in 4 weeks. The last day of that school year the kids on the route gave me a plaque that said “World’s Best School Bus Driver”. I was impressed, even though they misspelled my name.

I’ve discovered that women, as a whole, performed better on mission critical tasks that required a lot of concentration and coordination of several activities that had to occur simultaneously.

I couldn’t make a practice session for a particular field test, so the guys were going to fill in for me. I heard that it took them an extra long time to get started, because they couldn’t figure out how to calibrate the instrumentation. (They took the same training class that I did!) Let’s just say that they were more than happy to let me take over the operation after they were introduced to all the steps involved in the pre and post fly-over operations.

Lots of tasks mean lots of details to keep track of with almost no time to double-check anything. Women do this sort of thing all the time. Think about putting together a meal, folding clothes fresh from the dryer, putting on makeup. You don’t really think about it, you just do it. Juggling home, family, and career by itself is one big accomplishment.

We took two years to perfect all the pieces that made up the testing for the 727QF certification. We worked out the weather station in Roswell, NM. We took the acoustic analyzer to Moses Lake, WA to work out the routine we needed for testing. (Desert dust at 35 knots is no fun, but it can’t hold a candle to the volcanic ash from Mt. St. Helens that we ran into in Moses Lake. They got about a foot of ash from that explosion and the ash was dumped at the airport. Right where we were working!)

The only missing piece was the data download from the data logger on the meteorological (met) plane.

I sat under the wing of the small Cessna in the hot Texas August heat with a laptop atop my crossed legs, dodging fire ants, as I worked out the best method for our technician to save the data acquired after each run of the met plane. I got it down to a few steps, ran through it with him, and we had the met data canned.

All those pieces, met plane, weather station, acoustic analyzer and DAT (digital audio tape) data, were part of the big picture that was noise testing. The other parts were the group support systems – data download, availability, and analysis. There was so much data flowing through the pipeline, we held a meeting every morning to discuss who needed what, how much, how they wanted it, and what data could be taken to archive.

The next-gen sequencing efforts are producing an astronomical amount of raw data. Data that has to be stored, analyzed, and archived, creating one complex system. It’s a massive task and one I can sympathize with.

Women don’t get derailed by the small stuff.

Maybe this wasn’t so small, and sometimes it hit close to home, but a lot of the things I did got satirized via a cartoon or paste-up on bulletin boards all over the plant on the 727QF program.

For instance, I developed this relational database model that would store measurement information for the two aircraft we were testing.

One of the technicians had started his own local database, but he had no understanding of relational data concepts. So he had thermocoupleA and thermocoupleB, where A represented one aircraft and B represented the other. The thermocouple in question was the same on both aircraft, causing duplicate records for the same part info.
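The duplication described here is exactly what a relational design avoids. A minimal sketch in Python with the standard sqlite3 module follows; the table and column names are hypothetical and are not the schema used on the 727QF program. The part's basic attributes are stored once, and a second table records which aircraft carries it.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (
        part_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        units    TEXT NOT NULL
    );
    CREATE TABLE installation (
        aircraft TEXT NOT NULL,
        part_id  INTEGER NOT NULL REFERENCES part(part_id),
        location TEXT NOT NULL
    );
""")

# One row describes the thermocouple, no matter how many aircraft carry it.
conn.execute("INSERT INTO part VALUES (1, 'thermocouple', 'degC')")
conn.executemany(
    "INSERT INTO installation VALUES (?, ?, ?)",
    [("A", 1, "engine inlet"), ("B", 1, "engine inlet")],
)

part_rows = conn.execute("SELECT COUNT(*) FROM part").fetchone()[0]
installs = conn.execute(
    "SELECT i.aircraft, p.name FROM installation i "
    "JOIN part p ON p.part_id = i.part_id ORDER BY i.aircraft"
).fetchall()
```

Both aircraft show up in `installs`, but the part itself is described by a single row.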

At an informal meeting we were having in the instrumentation lab, I said that his database design was stupid because we didn’t need more than one copy of the part’s basic attributes. The next day there was a flyer on the bulletin boards with a picture of the tech with a bubble over his head that said, “I stupid.”

There was some other verbiage, “Coming soon son of stupid. When relational is not enough.”

Stupid Database Flyer


Since the technician was a friend, this was funny. There were others that weren’t so entertaining.

I think women are more sympathetic and understanding of other people. The problem is to not be so understanding that you are taken for a ride.

As a support system, we have probably the best weapon in the arsenal – we can cry. Not in public, not on the job, but we can go somewhere private and cry. Sometimes this is the only way to get it out of your system.

I put a lot of dents in a lot of old hardware and ran miles and miles, but, sometimes, even that did not cover it.

I will end by saying that I was pleasantly surprised at the number of women involved in the life sciences. By this, I mean as directors, P.I.’s, or other positions of power. However, men in the field still earn one-third more than the women.

Maybe one day, women will wield as much power in all branches of technology, and their paychecks will actually reflect this status.

BioCamp 2009 at Rice University

Bill and I attended BioCamp 2009 at Rice University on Saturday, Sept. 12. There were several presentations followed by lively question and answer sessions.

The attendees included entrepreneurs, those seeking guidance on turning their ideas and research into viable products, consultants searching for marketable products, and members of the legal profession offering advice on intellectual property, patents, trademarks, and the like.

The End of Bioinformatics?!

I read with some interest the announcement of Wolfram Alpha.  Wolfram intends it to be the be-all and end-all of data mining systems and, some say, it makes bioinformatics obsolete.

Wolfram’s basis is a formal Mathematica representation.  Its inference engine is a large number of hand-written scripts that access data that has been accumulated and curated.  The developers stress that the system is not Artificial Intelligence and is not aiming to be.  For instance, a sample query,

“List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes ‘muscle’”

could currently be executed with SQL, provided the underlying data is there. 
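To make the SQL point concrete, here is a rough sketch using Python's standard sqlite3 module. The three tables, their columns, the 0.05 cutoff for "significant", and the toy rows are all invented for illustration; a real schema would depend on how the selection tests and the GO/OMIM annotations were loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table each for genes, selection tests, annotations.
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, species TEXT);
    CREATE TABLE selection_test (gene_id INTEGER, lineage TEXT, p_value REAL);
    CREATE TABLE annotation (gene_id INTEGER, source TEXT, term TEXT);
""")
# Toy placeholder rows, not real genes or results.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1, "GENE_A", "Homo sapiens"),
    (2, "GENE_B", "Homo sapiens"),
    (3, "GENE_C", "Homo sapiens"),
])
conn.executemany("INSERT INTO selection_test VALUES (?, ?, ?)", [
    (1, "human", 0.001),
    (2, "human", 0.40),
    (3, "human", 0.002),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
    (1, "GO", "muscle contraction"),
    (2, "OMIM", "muscle weakness"),
    (3, "GO", "DNA repair"),
])

QUERY = """
    SELECT DISTINCT g.symbol
    FROM gene g
    JOIN selection_test s ON s.gene_id = g.gene_id
    JOIN annotation a ON a.gene_id = g.gene_id
    WHERE g.species = 'Homo sapiens'
      AND s.lineage = 'human'
      AND s.p_value < 0.05              -- assumed meaning of "significant"
      AND a.source IN ('GO', 'OMIM')
      AND a.term LIKE '%muscle%'
    ORDER BY g.symbol
"""
hits = [row[0] for row in conn.execute(QUERY)]
```

Only the gene that passes both the selection cutoff and the 'muscle' annotation filter comes back. The hard part is not the query; it is getting curated data into the tables in the first place.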

Wolfram won’t replace bioinformatics.  What it will do is make it easier for a neophyte to get answers to his or her questions because they can be asked in a simpler format.

I would guess Wolfram uses one or more of these scripts to address a specific data set in conjunction with a natural language parser.  These scripts would move this data to a common model that could then be presented on a web page.

But why not AI?  Why not replace all those “hand-written” scripts, etc., with a real inference engine?

I rode the first AI wave.  I was a member of the first group of 25 engineers selected to be a part of the McAir AI Initiative at McDonnell Aircraft Company.  (“There is AI in McAir”).  In all, 100 engineers were chosen from engineering departments to attend courses leading to a Certificate in Artificial Intelligence from Washington University in St. Louis.

One of the neat things about the course was the purchase of at least 30 workstations (maybe as many as 60) from a young company called Sun that were loaned to Washington University for the duration of the course.  Afterwards, we got a few Symbolics machines for our CADD project.

Other than Lisp and Prolog, the software we used was called KEE (Knowledge Engineering Environment).  Also, there was a DEC (Digital Equipment Corporation) language called OPS5.

The course was quite fast-paced but very extensive.  We had the best AI consultants available at the time lecture and give assignments in epistemology, interviewing techniques, and so on. I had a whole stack of books.

The only problem was that no money was budgeted (or so I was told) for AI development for the departments for the engineers when they returned from the course eager to AI everything.  A lot of people left.

Anyway, my group of three developed a “Battle Damage Repair” system that basically “patched up” the composite wing skins of combat aircraft.   Given the size and location of the damage, the system would certify whether the aircraft would be able to return to combat, and would output the size and substance of the patch if the damage wasn’t that bad.

One interesting tidbit:  We wanted to present our system at a conference in San Antonio and had a picture of a battle-damaged F-15 we wanted to use.  Well, we were told that the picture was classified and, as such, we couldn’t use it.  Well, about that same time, a glossy McAir brochure featuring our system and that photo was distributed at the AAAI (American Association for Artificial Intelligence) to thousands of people.

Another system I developed dealt with engineering schematics.  These schematics were layered.  Some layers and circuits were classified.   Still another system scheduled aircraft for painting and yet another charted a path for aircraft through hostile territory, activating electronic counter measures as necessary.

I guess the most sophisticated system I worked on was with the B-2 program.  The B-2 skin is a composite material.  This material has to be removed from a freezer, molded into a final shape and cooked in a huge autoclave before it completely thawed. 

We had to schedule materials, and the behavior of that material under various circumstances, as well as people and equipment.  The purpose was to avoid “bottlenecks” in people and equipment.  I was exposed to the Texas Instruments Explorer and Smalltalk-80 on an Apple.  I’ve been in love with Smalltalk ever since.

The system was developed, but it was never used.  The problem was that we had to rank workers by expertise.  That’s union workers and that wasn’t allowed. 

It was a nice system that integrated a lot of systems and worked well.  Our RFP (Request for Proposals) went out to people like Carnegie-Mellon.  We had certain performance and date requirements that we wanted to see in the final system.  We were told that the benchmarks would be difficult, if not impossible, to attain.  Well, we did it, on our own without their help.

We also had a neural net solution that inspected completed composite parts. The parts were submerged in water and bombarded with sound waves.  The echoes were used by the system to determine part quality.

AI promised the world, and then it couldn’t really deliver.  So it kind of went to the back burner.

One problem with the be-all and end-all:  it will only be as good as your model.  It will only be as good as the developers can determine the behavior of the parts and how they interact with the whole.  Currently, this is a moving target and is changing day to day.  Good luck.

Links -

Will Wolfram Make Bioinformatics Obsolete? -

Computer System Configurations

The most complex system I’ve configured was the airborne data acquisition and ground support systems.  However, not many people have to or want to do anything that large or complex.  Some labs will need info from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today’s data without a large cash outlay and that can be expanded at minimum cost to handle the data of tomorrow.

This week’s guest blogger, Bill Eaton, provides some guidelines for the configuration of a Database Server, a Web Server, and a Compute Node, the three most requested configurations.

(Bill Eaton)

General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware

  • A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
    • Linux:  for most kernels, programs are limited to 3 GB.  Physical memory can usually exceed 4 GB.
    • Windows:  The stock settings limit a program to 2 GB, and physical memory to 4 GB.
      The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
    • Other operating systems usually have a 2 or 3 GB program memory limit.
  • A 64-bit operating system removes these limits.  It also enables some additional CPU registers and instructions that may improve performance. Most will allow running older 32-bit program files.
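The 4 GB figure falls straight out of the pointer width: a 32-bit pointer can name 2^32 distinct byte addresses. A small sketch in Python (the function name is my own) does the arithmetic and also reports whether the running interpreter is a 32-bit or 64-bit build.

```python
import struct

GIB = 2 ** 30  # bytes in one binary gigabyte

def address_space_gib(pointer_bits):
    """Return the size of a flat address space, in GiB, for a pointer width."""
    return (2 ** pointer_bits) // GIB

# Width of a native pointer in the running interpreter: 32 or 64.
interpreter_bits = struct.calcsize("P") * 8

limit_32 = address_space_gib(32)  # the ceiling for any single 32-bit program
limit_64 = address_space_gib(64)  # far beyond any memory you can buy today
```

The operating system reserves part of that 32-bit space for itself, which is why the usable per-program limit ends up at 2 or 3 GB rather than the full 4.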

Database Server:
Biological databases are often large, 100 GB or more, often too large to fit on a single physical disk drive. A database system needs fast disk storage and a large memory to cache frequently-used data.  These systems tend to be I/O bound.

Disk storage:

  • Direct-attached storage:  disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
  • Network-attached storage:  disk array connected to one or more hosts by a standard network.  These may appear as network file systems using NFS, CIFS, or similar, or physical disks using iSCSI.
  • SAN:  includes above cases, multiple disk units sharing a network dedicated to disk I/O.  Fibre Channel is usually used for this.
  • Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.


  • Storage overhead:  data repositories may require several times the amount of disk space required by the raw data.  Adding an index to a table can double its size.  A test using a simple mostly numeric table with one index gave these overheads for some common databases.
    • MySQL using MyISAM 2.81
    • MySQL using InnoDB 3.28
    • Apache Derby       5.88
    • PostgreSQL         7.02
  • Data Integrity support:  The server and disk system should handle failures and power loss as cleanly as possible.  A UPS with clean shutdown support is recommended.
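For sizing purposes, the overhead figures above can be turned into a rough disk estimate. This sketch assumes each figure is a straight multiplier on the raw data size, and it uses the 100 GB database size mentioned earlier as the example input; your own tables and indexes will give different ratios.

```python
# Overhead factors from the test above (disk space used / raw data size).
OVERHEAD = {
    "MySQL (MyISAM)": 2.81,
    "MySQL (InnoDB)": 3.28,
    "Apache Derby": 5.88,
    "PostgreSQL": 7.02,
}

def estimated_disk_gb(raw_gb, overhead):
    """Estimate disk usage, assuming overhead is a simple multiplier."""
    return raw_gb * overhead

raw_gb = 100  # example raw data size in GB
estimates = {
    name: round(estimated_disk_gb(raw_gb, factor), 1)
    for name, factor in OVERHEAD.items()
}

for name, gigabytes in estimates.items():
    print(f"{name:16s} {gigabytes:7.1f} GB")
```

Treat the result as a floor when ordering disks; backups, logs, and future indexes all come on top of it.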

Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.

Web Service Software Considerations:

  • PHP:  Thread support still has problems.  PHP applications running on a Windows system under either Apache httpd or IIS may encounter these.  We have seen a case where WordPress running on Windows under either IIS or Apache httpd gave error messages, but worked without problems under Apache httpd on Linux.  IIS FastCGI made the problem worse. PHP acceleration systems may be needed to support large user bases.
  • Perl:  similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
  • Java-based containers:  (Apache Tomcat, JBoss, GlassFish, etc) These run on almost anything without problems, and usually scale quite well.

Compute nodes:
Requirements depend upon the expected usage.  Common biological applications tend to be memory-intensive.  A high-bandwidth network between the nodes is recommended, especially for large clusters.  Network attached storage is often used to provide a shared file system visible to all the nodes.

  • Classical “Beowulf” cluster:  used for parallel tasks that require frequent communication between nodes.  These usually use the MPI communication model, and often have a communication network tuned for this use such as Myrinet.  One master “head” node controls all the others, and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
  • Farm:  used where little inter-node communication is needed.  Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.
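The "farm" style of work is easy to picture in code: every task is independent, so nothing has to be exchanged between workers while they run. The sketch below is only a single-machine stand-in, with a thread pool playing the part of the nodes and a GC-content calculation as a made-up workload; a real farm would hand the tasks to a job scheduler instead.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_fraction(sequence):
    """Fraction of G and C bases in one DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Each sequence is an independent task; no worker needs another's result.
sequences = ["ATGC", "GGCC", "ATAT", "GATTACA"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, whichever worker finishes first.
    results = list(pool.map(gc_fraction, sequences))
```

A Beowulf-style job is the opposite case: the workers would need to trade intermediate results constantly, which is where MPI and a fast interconnect earn their keep.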

The most important thing is to have a plan from the beginning that addresses all the system’s needs for storage today and is scalable for tomorrow’s unknowns.

Data Stewardship (including Archiving)

Data Stewardship -
The Conducting, Supervising, and Management of Data

Next-gen sequencing promises to unload reams and reams of data on the world.  Pieces of that data will prove relevant to one or the other of specific research projects in your enterprise.  At the same time, your lab may produce more data by annotation or simple research.  How do you handle it all?

First, you should appoint a data steward.  This person must understand where the data comes from, how it is modeled, who uses what parts of it, and any results this data may produce, such as forms, etc. Most importantly, they must be able to verify the integrity of that data.
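One common way to verify integrity is to keep a checksum for every data file and re-check it on a schedule. The following is a minimal sketch using Python's standard hashlib; the manifest layout, the function names, and the demo file are my own choices for illustration, not a prescribed procedure.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest):
    """Return the paths whose current checksum differs from the manifest."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]

# Demonstration on a throwaway file.
workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "reads.fasta")
with open(data_file, "w") as handle:
    handle.write(">read1\nACGT\n")

manifest = {data_file: sha256_of(data_file)}  # recorded when data arrives
unchanged = verify(manifest)                  # nothing to report yet

with open(data_file, "a") as handle:          # simulate silent corruption
    handle.write("N")
changed = verify(manifest)                    # now the file is flagged
```

The manifest itself should live somewhere other than the disk it describes, or it is no help when that disk goes bad.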

Data, Data, Data

I’ve handled lots of engineering and bioinformatics data in my time…

In engineering, I had to be sure all instrumentation was calibrated correctly and production data was representative or correct.  Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, etc. focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on.   This data inventory session proved to be extremely useful, as terabytes of data swept through the system on a weekly basis.

For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data.  That data was then uploaded to a local database for access by various applications.  As the amount of available sequence data grew, culling the data, storage of this data, and archiving of the initial and final data became something of a headache.

My biggest bioinformatics problem was NCBI data, as that was how we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug. Grep became my friend.

I tried downloading complete GenBank reports from the NCBI ftp website but that took too much space.  I used keywords with the Entrez eutils, but the granularity wasn’t fine enough, and I ended up with way too much data.  Finally, I resorted to the NCBI Toolkit on NCBI ASN.1 binary files.
LARTS would have made this part so much easier. 
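For anyone who has not used the Entrez E-utilities, a keyword search is just an HTTP request to the esearch endpoint. The sketch below only builds the URL and makes no network call; the search term is an arbitrary example and the helper name is mine, while `db`, `term`, and `retmax` are the standard esearch parameters.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="nuccore", retmax=20):
    """Build an esearch request URL for a keyword query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

# Field tags such as [Organism] narrow the search, but only so far.
url = esearch_url("Homo sapiens[Organism] AND muscle[Title]")
print(url)
```

The granularity problem shows up immediately: the field tags work on whole records, so a query like this still returns every record that matches, not just the pieces you wanted.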

The Data Steward should also be familiar with data maintenance and storage strategies.

Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.

Bill Eaton: Data Backup and Archival Storage

  Backups are usually kept for a year or so, then the storage media is reused.
  Archives are kept forever.  Retrievals are usually infrequent for both.

Storage Technologies

Tape:  suitable for backup, not as good for archiving.

Pro:  Current tape cartridge capacities are around 800 GB uncompressed.
      Cost per bit is roughly the same as for hard disks.
Con:  Tape hardware compression is ineffective on already-compressed data.
      Tapes and tape drives wear out with use.
      Software is usually required to retrieve tape contents (tar, cpio, etc.).
      Tape technology changes frequently; formats have a short life.

Optical:  better for archiving than backup.

Pro:  DVD 8.5 GB, Blu-ray 50 GB.
      DVD contents can be a mountable file system, so that no special software is needed for retrieval.
      Unlimited reading, no media wear.
      Old formats are readable in new drives.
Con:  Limited number of write cycles.

Hard Disks:  could replace tape.

Pro:  Simple:  use removable hard disks as backup/archive devices.
      Disk interfaces are usually supported for several years.
Con:  Drives may need to be spun up every few months and contents rewritten every few years.

MAID:  Massive Array of Idle Disks.  A disk array in which most disks are powered down when not in active use.

Pro:  The array controller manages disk health, spinning up and copying disks as needed.
      The array usually appears as a file system.  Some can emulate a tape drive.
Con:  Expensive.

Classical:  the longest-life archival formats are those known to archaeologists.

Pro:  Symbols carved into a granite slab are often still readable after thousands of years.
Con:  Backing up large amounts of data this way could take hundreds of years.
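The capacities quoted above translate directly into how many pieces of media a data set needs. Here is a quick back-of-the-envelope sketch; the 1 TB data set is an arbitrary example, taken as 1000 GB, and compression is ignored.

```python
import math

# Uncompressed capacities quoted above, in GB.
CAPACITY_GB = {"tape": 800, "DVD": 8.5, "Blu-ray": 50}

def media_needed(data_gb, capacity_gb):
    """Number of whole cartridges or discs needed to hold data_gb."""
    return math.ceil(data_gb / capacity_gb)

data_gb = 1000  # example: 1 TB, taken as 1000 GB
counts = {name: media_needed(data_gb, cap) for name, cap in CAPACITY_GB.items()}

for name, count in counts.items():
    print(f"{name:8s} {count:4d}")
```

Two tapes against more than a hundred DVDs is a good reminder that "better for archiving" still has to survive contact with the size of your data.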


ASN.1 to XML: The Process


Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.

While working on the filters for the LARTS product, most specifically, the GenBank-like report, I realized how tightly-coupled the NCBI ASN.1/XML is to the toolkit. 

Basically, you’ve got to understand the toolkit code in order to translate what the XML is saying. The infinite extensibility and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the ASN.1 data structures supporting GenBank data - Bioseq-set. For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and translating protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.

After we developed LARTS, I wrote the GB filter as a Java object.  It was an interesting experience. 

NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on  the NCBI toolkit code for proper interpretation.  

The two most glaring examples are listed below.

Sequence Locations

Determining the location of sequence features for a GenBank data report is a prime example.  Here are a few simple examples:

primer_bind   order(complement(1..19), 332..350)
gene          complement(join(1560..2030, 3304..3321))
CDS           complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA          join(<1..156, 445..478, 1199..>1559)
              ...5231, 76582..76767, 77517..77720, 78409..78490))
primer_bind   order(complement(1..19), 1106..1124)

For Segmented-sequences:
CDS         join(162922:124..144; 162923: 647..889, 1298..1570)

CD region locations have frames, bonds have points (that can be packed), strand minus denotes a complement (reverse order), a set of sequence locations for a sequence feature (packed-seqint) denotes a join, and locations can be “order(”ed, or “one-of”, and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.
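To give a feel for what has to be untangled, here is a small Python sketch that reads the simpler location strings shown above. It is my own illustration and not the LARTS or NCBI toolkit code: it handles only complement(), join(), order(), and plain a..b ranges with < and > markers, flattens join and order into the same list, and does not attempt points, one-of, or the segmented-sequence form.

```python
import re
from typing import NamedTuple

class Interval(NamedTuple):
    start: int
    end: int
    strand: int          # +1 for plus strand, -1 for complement
    partial_start: bool  # '<' was present
    partial_end: bool    # '>' was present

_RANGE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")

def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts

def parse_location(text, strand=1):
    """Return the intervals of a location string in reading order."""
    text = text.strip()
    for operator in ("complement", "join", "order"):
        prefix = operator + "("
        if text.startswith(prefix) and text.endswith(")"):
            inner = text[len(prefix):-1]
            if operator == "complement":
                # Flip the strand and reverse the pieces (minus-strand order).
                return parse_location(inner, -strand)[::-1]
            intervals = []
            for part in _split_top_level(inner):
                intervals.extend(parse_location(part, strand))
            return intervals
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    lt, start, gt, end = match.groups()
    return [Interval(int(start), int(end), strand, lt == "<", gt == ">")]

gene = parse_location("complement(join(1560..2030, 3304..3321))")
```

Even this toy version needs recursion and strand bookkeeping, and it covers only a fraction of what the toolkit handles.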

Sequence Format

DNA sequences are stored in a packed 2-bit or 4-bit per letter format (ncbi2na and ncbi4na).  2na is used if the sequence does not contain ambiguity, otherwise 4na is the format of choice. The sequence must be unpacked to be useful. This takes a basic understanding of Hex(adecimal).
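Unpacking is mostly bit-shifting. The sketch below reflects my understanding of the two encodings: ncbi2na packs four bases per byte with A, C, G, T as 0 to 3, and ncbi4na packs two bases per byte using one bit each for A, C, G, and T so that ambiguity codes are combinations. In both, the first base sits in the high-order bits. The function names are mine, and the sequence length must be supplied separately because the last byte may be padded; check the tables against the toolkit before relying on them.

```python
NCBI2NA = "ACGT"
# Index is the 4-bit value: bit 1 = A, 2 = C, 4 = G, 8 = T, so 15 = N.
NCBI4NA = "-ACMGRSVTWYHKDBN"

def unpack_2na(packed, length):
    """Unpack 2-bit-per-base data; four bases per byte, high bits first."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(NCBI2NA[(byte >> shift) & 0b11])
    return "".join(bases[:length])  # drop padding in the final byte

def unpack_4na(packed, length):
    """Unpack 4-bit-per-base data; two bases per byte, high nibble first."""
    bases = []
    for byte in packed:
        bases.append(NCBI4NA[byte >> 4])
        bases.append(NCBI4NA[byte & 0x0F])
    return "".join(bases[:length])

# 0x1B is binary 00 01 10 11, which reads as A C G T.
two_bit = unpack_2na(bytes([0x1B]), 4)
# 0x12 0x48 0xF0 reads as A C, G T, N plus one padding nibble.
four_bit = unpack_4na(bytes([0x12, 0x48, 0xF0]), 5)
```

This is where the hex comes in: staring at 1B in a dump and seeing ACGT takes a little practice.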

The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file.  (The code is there, but you have to figure out how to compile it into an executable.)

We took the toolkit code and converted it to Java to produce the GenBank-style output format.  It differs from the actual NCBI GenBank Report in that the LARTS report lists a FASTA-formatted sequence instead of the columns of 10 base pairs that the NCBI GenBank Report produces.
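The difference between the two sequence layouts is easy to show. This is a Python sketch of both, not the LARTS Java filter: the FASTA line width of 70 and the GenBank-style layout of six lowercase blocks of ten bases with a right-justified position are my assumptions about typical output.

```python
def fasta(identifier, sequence, width=70):
    """Format a sequence as FASTA: a '>' header line, then fixed-width lines."""
    lines = [f">{identifier}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)

def genbank_origin(sequence, blocks_per_line=6, block=10):
    """Format a sequence in GenBank style: numbered lines of 10-base blocks."""
    sequence = sequence.lower()
    per_line = blocks_per_line * block
    lines = []
    for offset in range(0, len(sequence), per_line):
        chunk = sequence[offset:offset + per_line]
        groups = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # The number is the 1-based position of the first base on the line.
        lines.append(f"{offset + 1:>9} " + " ".join(groups))
    return "\n".join(lines)

demo = "ACGT" * 20  # 80 bases of filler sequence
fasta_text = fasta("demo", demo)
origin_text = genbank_origin(demo)
```

FASTA is the easier of the two to feed into other programs, which is why the LARTS report uses it.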

The Many Variations of LARTS is provided as an example with Stand-Alone LARTS.  The LARTS Reader enables the GenBank-style report.

Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.

A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.

Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database.  This includes sample code for  the BLOB and CLOB objects.
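The general shape of such a filter, parse the XML, pull out the fields you care about, insert them into a table, can be sketched in a few lines. The example below is Python with the standard library rather than the Java filters that ship with Stand-Alone LARTS, and the XML element names, table layout, and records are all invented placeholders.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented XML layout and placeholder records, for illustration only.
XML_DATA = """
<records>
  <record accession="EXAMPLE1">
    <definition>example one</definition>
    <sequence>ACGTACGT</sequence>
  </record>
  <record accession="EXAMPLE2">
    <definition>example two</definition>
    <sequence>GGCCTTAA</sequence>
  </record>
</records>
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence_record ("
    "accession TEXT PRIMARY KEY, definition TEXT, sequence TEXT)"
)

root = ET.fromstring(XML_DATA)
rows = [
    (rec.get("accession"), rec.findtext("definition"), rec.findtext("sequence"))
    for rec in root.iter("record")
]
# Parameterized inserts; the sequence column stands in for a CLOB.
conn.executemany("INSERT INTO sequence_record VALUES (?, ?, ?)", rows)

loaded = conn.execute(
    "SELECT accession, length(sequence) FROM sequence_record ORDER BY accession"
).fetchall()
```

With real GenBank-derived XML the extraction step is where the work is, since the fields of interest can sit several levels down in nested sets.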

The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.

These options provide ready access to NCBI data for your research.

Programming Practices to Live By

Programming Practices

I’ve been privy to all sorts of coding adventures.   I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure.  I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun.   I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.

There were days when I would have killed for a well-documented program.  Instead, I had to spend weeks in debug, poking around, learning how things worked.  In both cases, the developers of said projects were no longer available for consultation.

Here are a few programming practices that I’ve tried to adhere to when writing code.

Document, Document, Document

I am a big proponent of Javadoc.  Javadoc is a tool for generating API documentation in HTML format from doc comments in source code. It can be downloaded only as part of the Java 2 SDK. To see documentation generated by the Javadoc tool, go to the J2SE 1.5.0 API Documentation at
http://java.sun.com/j2se/1.5.0/docs/api/index.html. See “How to Write Doc Comments for the Javadoc Tool” for more information.

Other languages have similar markup languages for source code documentation.

Perl has perlpod
Python has pydoc
Ruby has RDoc

I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish.  It’s also helpful to list the version number of the development language.  List any dependencies such as support modules that are not part of the main install that were downloaded for the application.

Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
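As an illustration of both habits, here is what a header and a doc comment might look like in Python, where pydoc picks up the docstrings. The file name, the placeholder fields in the header, and the function are made up for the example.

```python
"""
gc_report.py - report the GC content of DNA sequences.

Author:       (developer name)
Date:         (date written)
Written for:  (requesting group or project)
Purpose:      Compute the fraction of G and C bases in a sequence.
Language:     Python 3
Dependencies: standard library only
"""

def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Parameters:
        sequence (str): DNA letters, upper or lower case; must not be empty.

    Returns:
        float: a value between 0.0 and 1.0.
    """
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

Running `python -m pydoc` on a module written this way prints the header and each function's description without anyone having to open the source.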

Coding practices

I think all code should be read like a book.  Otherwise -

- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.
I’ve mostly found that peer review can be a great time-waster.

There are several design methodologies, such as Extreme Programming, that are the flavor du jour.  None have been completely successful in producing perfect software.

To-Do Lists - Use them!

There are project management and other tools available for this, but a plain text To-Do file that lists system extensions, enhancements, and fixes is a good thing.  It’s simple; you don’t need access to, or have to learn how to navigate, a complex piece of software.

Find Out Who Knows What and GO ASK THEM

I worked on a project and shared a cube with a guy named Al.  Al was not the most pleasant person (he was the resident curmudgeon), but we got along.  Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written.  If I couldn’t understand something, a brief conversation with Al was all I needed. 

Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al.  However, I ended up as the “Go Ask Al” person.  I didn’t mind, as we became the top development group in that environment.

Use Code Repositories

The determination of which code repository to use - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue.  Each of these has its pros and cons.  JUST USE ONE!

Integrated Development Environments (IDE)

There are several of these available.  My favorite is Eclipse.  I’ve also used SunOne, which became Creator, which was rolled into NetBeans, which was out there all along.

There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment.  The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.

I’ve used Visual Studio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.

The problem with most IDEs is that they are complex and it takes a while for a programmer to become adept at using them.  You’re also locked into that development methodology, which may become inflexible, due to the applications under development.

We used EMACS on Linux for all LARTS development (LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi.  I can do things faster in vi, mainly because I’ve used it for so long. 

Which Language?

My favorite language of all time is Smalltalk.  If things had worked out, we would all be doing Smalltalk instead of Java. 

Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic.  Spot programming used infrequently is okay.  However, if everything you are doing is panic programming, your department needs to re-think its software development practices.

Lately, I’ve been working in Java.  Java is powerful, but it also has its drawbacks. 

We will always have C.  According to, most open source projects submitted in 2008 were in C, and this was by a very wide margin. C has a degenerative partner, C++.  C++ does not clean up after itself.  You have to use delete. Foo(); can either be a function declaration or a call to a constructor, depending on the type of Foo.

Fortran is another one that will always be around.  I’ve done a lot of Fortran programming.  It is used quite extensively in engineering, as is Assembly Language.  I have been called a “bit-twiddler” because I knew how to use assembler.  

Variable Names

This is a touchy subject.  I’ve been around programmers who have said code should be self documenting.  Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.

Let’s just say variable names should be descriptive.


Debugging

This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.

The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.
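In Python the same idea looks like this. The function and its `debug` switch are made up for the example; the point is simply to print the state at each step so you can see where it goes wrong.

```python
def running_total(values, debug=False):
    """Sum a list of numbers, optionally printing the state at each step."""
    total = 0
    for index, value in enumerate(values):
        total += value
        if debug:
            # Show the variables of interest; turn this off once the bug is found.
            print(f"step {index}: value={value!r} total={total!r}")
    return total

result = running_total([3, 4, 5], debug=True)
```

Using `!r` in the format prints the repr of each value, which makes it obvious when a number has quietly turned into a string.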

Some debugging efforts can become quite complex, such as debugging a process that utilizes several machines, or pieces of equipment such as an airplane.

Sun Solaris has an application truss that lets you debug things at the system level and lets you follow the system calls.  The Linux version is strace.

I am most familiar with gdb - The GNU Debugger.   I’ve also used dbx on Unix.

The IDEs offer a rich debugging experience and plug-ins let you debug almost anything.

Closing Thoughts

Always program for those coming behind you.  They will appreciate your effort.

It’s best to keep it simple.  Especially the user interface. 

Speaking of users, talk to them.  Get their feedback on everything you do to the user interface.  I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe. The moral:  keep your users happy.

Technology is wonderful, but technology for technology’s sake (“because we can!”) is usually overkill.
The guiding principle should be — Is it necessary?

By the way, going back to that certification project.  I found that instead of spending $15K for a disk drive, the company was conned into spending $750K for custom developed software that I had to throw in the trash, except for one tiny piece. Was it necessary? The point is that the best answer is not always the most expensive one.
