The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte (http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589).
This article is a good survey that covers all the latest topics and mentions all the currently popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.
I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.
They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”
Programming Concepts
Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.
To get a good handle on programming concepts, I recommend a book. This book, Structure and Interpretation of Computer Programs from MIT Press (http://mitpress.mit.edu/sicp/), is the basis for an intro to computer science at MIT. It’s called the Wizard Book or the Purple Book.
I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.
Best of all, the course (and the downloadable book) are freely available from MIT through the MIT OpenCourseWare website – http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-001Spring-2005/CourseHome/index.htm.
There’s a blog entry - http://onlamp.com/pub/wlg/8397 - that goes into further explanation about the course and the book.
And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for an engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”
According to MIT’s purple book – “In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”
You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)
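To make that definition concrete, here is a minimal sketch in Python (not the book's Scheme). The function name scale and its parameters are made up for illustration; the point is only that the formal parameters exist inside the body of the procedure and nowhere else.

```python
def scale(values, factor):
    # 'values' and 'factor' are formal parameters: their scope is this body.
    result = [v * factor for v in values]
    return result


scaled = scale([1, 2, 3], 10)
print(scaled)

# Outside the function body the name 'factor' is not bound at all.
try:
    factor
except NameError:
    factor_visible_outside = False
else:
    factor_visible_outside = True

print("factor visible outside scale():", factor_visible_outside)
```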
Programming Languages
The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.
These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”
Let me add the following. Programming languages differ in how they are executed. By that, I mean compiled vs. scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, the machine language specific to the computer architecture the code is to run on. Compiled code has a definite speed advantage, but if the main program or any supporting module is changed, the project must be rebuilt. Since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.
Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.
An interpreted program runs more slowly than a compiled program. Every line of an interpreted program must be analyzed as it is read. But the code isn’t tied to one machine architecture, which makes portability easier (provided the byte code interpreter is present). Since code is only interpreted at run time, extensions and modifications to the code base are easier, making these languages great for beginning programmers or rapid prototyping.
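If you are curious what byte code looks like, Python will show you. The sketch below uses the standard dis module on a throwaway function (add is just an illustrative name). The exact instruction names differ between Python versions, so treat the printed listing as an example rather than a reference.

```python
import dis


def add(a, b):
    return a + b


# The raw byte code that the Python virtual machine interprets.
raw_byte_code = add.__code__.co_code
print(raw_byte_code)

# The same byte code, decoded into instruction names.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```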
But let’s get back to memory management. This, along with processing speed, will be a huge deal in next gen data analysis and management.
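Perl’s automatic memory management has a problem with circularity, because Perl counts references. (CPython counts references too, but backs that up with a garbage collector that detects cycles; Ruby uses a tracing garbage collector rather than reference counts.)
If object 1 points to object 2 and object 2 points back to object 1, but nothing else in the program points to either object (this is a circular reference), reference counting alone never destroys them. They remain in memory. If such objects get created again and again, the result is a memory leak. Weak references (weaken from Scalar::Util in Perl) are the usual way to break such a cycle.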
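Here is a small sketch of the circular reference problem, written in Python and specific to CPython. CPython counts references but also ships a separate cycle collector (the gc module), so the sketch switches that collector off to show what reference counting alone does, then runs it by hand. The Node class and make_cycle function are invented for the illustration.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    a = Node("object 1")
    b = Node("object 2")
    a.other = b          # object 1 points to object 2
    b.other = a          # object 2 points back to object 1
    # A weak reference lets us watch object 1 without keeping it alive.
    return weakref.ref(a)


gc.disable()                       # reference counting only
probe = make_cycle()               # a and b are now unreachable from the program
alive_with_refcounts_only = probe() is not None

gc.collect()                       # run the cycle collector by hand
alive_after_collect = probe() is not None
gc.enable()

print("still in memory with reference counting only:", alive_with_refcounts_only)
print("still in memory after the cycle collector ran:", alive_after_collect)
```

In a language that relies on reference counts alone, the first situation is permanent, which is why such cycles leak.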
I also have to ask – What about C/C++, Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot of Fortran.
Debugging
You can’t mention programming without mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.
Here’s a link to an ebook, The Art of Debugging – http://www.circlemud.org/cdp/hacker/. It’s mainly Unix-based, C-centric and a little dated. But good stuff never goes out of style.
Chapter 4, Debugging: Theory explains various debugging techniques. Chapter 5 – Profiling talks about profiling your code, or determining where your program is spending most of its time.
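The ebook works in C, but the idea of profiling carries over to any language. As a rough sketch, here is how it looks with Python's standard cProfile and pstats modules. The count_kmers function and its input are made up purely to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def count_kmers(sequence, k):
    # Count every substring of length k (illustrative workload only).
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


profiler = cProfile.Profile()
profiler.enable()
kmer_counts = count_kmers("ACGT" * 5000, 3)
profiler.disable()

# Write the five most expensive entries, by cumulative time, into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists each function with its call count and the time spent in it, which tells you where to look first.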
He also mentions core dumps. A core dump is the file left behind when your C/C++/Fortran program crashes in Unix/Linux. You can examine this core file to determine where your program went wrong. (It gives you a place to start.)
The Linux Foundation Developer Network has an on-line tutorial – Zen and the Art of Debugging C/C++ in Linux with GDB – http://ldn.linuxfoundation.org/article/zen-and-art-debugging-cc-linux-with-gdb. They write a C program (incorporating a bug), create a make file, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.
You can debug Perl by invoking it with the -d switch. When a Perl program crashes, it usually reports the line number that caused the problem along with some explanation of what went wrong.
In Python, the -d option only turns on parser debugging output; to step through a script, run it under the pdb module (python -m pdb).
Octal Dumps
One of the most useful utilities in Unix/Linux is od (octal dump). You can examine files in octal (default), hex, or ASCII characters.
od is very handy for examining data structures, finding hidden characters, and reverse engineering.
If you think your code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.
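If od is not at hand, the same kind of look at the raw bytes takes only a few lines. The sketch below is not od itself, just a simplified hex dump in Python; the hex_dump function and the sample bytes are invented for the example. Note how the tab (09) and the Windows line ending (0d 0a) show up even though they are invisible in an editor.

```python
def hex_dump(data, width=16):
    """Return an od-style listing: offset, hex bytes, printable characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {text_part}")
    return "\n".join(lines)


sample = b"gene\tBRCA1\r\n"   # hypothetical input with hidden characters
dump = hex_dump(sample)
print(dump)
```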
That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.

Programming Practices
I’ve been privy to all sorts of coding adventures. I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure. I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun. I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.
There were days when I would have killed for a well-documented program. Instead, I had to spend weeks in debug, poking around, learning how things worked. In both cases, the developers of said projects were no longer available for consultation.
Here are a few programming practices that I’ve tried to adhere to when writing code.
Document, Document, Document
I am a big proponent of Javadoc. Javadoc is a tool for generating API documentation in HTML format from doc comments in source code. It can be downloaded only as part of the Java 2 SDK. To see documentation generated by the Javadoc tool, go to J2SE 1.5.0 API Documentation at
http://java.sun.com/j2se/1.5.0/docs/api/index.html. Go to “How to Write Doc Comments for the Javadoc Tool” at http://java.sun.com/j2se/javadoc/writingdoccomments for more information.
Other languages have similar markup languages for source code documentation.
Perl has perlpod - http://perldoc.perl.org/perlpod.html
Python has pydoc - http://docs.python.org/library/pydoc.html
Ruby has RDoc - http://rdoc.sourceforge.net
I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish. It’s also helpful to list the version number of the development language. List any dependencies such as support modules that are not part of the main install that were downloaded for the application.
Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
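As a rough sketch of what that can look like in Python, where pydoc reads docstrings: the module name, the header fields in angle brackets, and the gc_content function are all placeholders I made up for illustration, not a prescribed template.

```python
"""
gc_tools.py (hypothetical module name)

Author:       <who developed the program>
Date:         <when it was developed>
Purpose:      <why it was developed and what it should accomplish>
Written for:  <for whom it was developed>
Language:     Python 3 (list the exact version you developed with)
Dependencies: none outside the standard library
"""


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Parameters:
        sequence (str): DNA bases, upper or lower case.

    Returns:
        float: a value between 0.0 and 1.0; 0.0 for an empty sequence.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


print(gc_content("ATGC"))
```

With the file saved under that name, running pydoc on the module prints the header and the function description without opening the source.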
Coding practices
I think all code should read like a book. Beyond that -
- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.
I’ve mostly found that peer review can be a great time-waster.
There are several design methodologies, such as Extreme Programming, that are the flavor du jour. None have been completely successful in producing perfect software.
To-Do Lists - Use them!
There are project management and other tools available for this, but a plain text To-Do file that lists system extensions, enhancements, and fixes is a good thing. It’s simple: you don’t need access to, or have to learn how to navigate, a complex piece of software.
Find Out Who Knows What and GO ASK THEM
I worked on a project and shared a cube with a guy named Al. Al was not the most pleasant person (he was the resident curmudgeon), but we got along. Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written. If I couldn’t understand something, a brief conversation with Al was all I needed.
Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al. However, I ended up as the “Go Ask Al” person. I didn’t mind, as we became the top development group in that environment.
Use Code Repositories
The determination of which code repository to use - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue. Each of these has its pros and cons. JUST USE ONE!
Integrated Development Environments (IDE)
There are several of these available. My favorite is Eclipse (www.eclipse.org). I’ve also used SunOne, which became Creator, which was rolled into NetBeans, which was out there all along.
There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment. The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.
I’ve used Visual Studio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.
The problem with most IDEs is that they are complex and it takes a while for a programmer to become adept at using them. You’re also locked into that development methodology, which may become inflexible, depending on the applications under development.
We used EMACS on Linux for all LARTS development (LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi. I can do things faster in vi, mainly because I’ve used it for so long.
Which Language?
My favorite language of all time is Smalltalk. If things had worked out, we would all be doing Smalltalk instead of Java.
Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic. Spot programming used infrequently is okay. However, if everything you are doing is panic programming, your department needs to re-think its software development practices.
Lately, I’ve been working in Java. Java is powerful, but it also has its drawbacks.
We will always have C. According to slashdot.org, most open source projects submitted in 2008 were in C, and this was by a very wide margin. C has a degenerative partner, C++. C++ does not clean up after itself. You have to use delete. Foo(); can be either a function call or a call to a constructor, depending on what Foo is.
Fortran is another one that will always be around. I’ve done a lot of Fortran programming. It is used quite extensively in engineering, as is Assembly Language. I have been called a “bit-twiddler” because I knew how to use assembler.
Variable Names
This is a touchy subject. I’ve been around programmers who have said code should be self-documenting. Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.
Let’s just say variable names should be descriptive.
Debugging
This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.
The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.
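A minimal sketch of print debugging in Python follows. The mean_quality function and its scores are invented for the example. Writing the debug lines to standard error keeps them out of the program's real output.

```python
import sys


def mean_quality(scores):
    total = 0
    for index, score in enumerate(scores):
        total += score
        # Debug print: watch the running total at each step of the loop.
        print(f"DEBUG index={index} score={score} total={total}", file=sys.stderr)
    return total / len(scores)


average = mean_quality([30, 35, 40])
print(average)
```

Once the bug is found, take the debug prints back out or put them behind a flag.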
Some debugging efforts can become quite complex, such as debugging a process that utilizes several machines, or pieces of equipment such as an airplane.
Sun Solaris has an application, truss, that lets you debug things at the system level and follow the system calls. The Linux version is strace.
I am most familiar with gdb - The GNU Debugger. I’ve also used dbx on Unix.
The IDEs offer a rich debugging experience and plug-ins let you debug almost anything.
Closing Thoughts
Always program for those coming behind you. They will appreciate your effort.
It’s best to keep it simple. Especially the user interface.
Speaking of users, talk to them. Get their feedback on everything you do to the user interface. I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe. The moral: keep your users happy.
Technology is wonderful, but technology for technology’s sake (“because we can!”) is usually overkill.
The guiding principle should be — Is it necessary?
By the way, going back to that certification project: I found that instead of spending $15K for a disk drive, the company was conned into spending $750K for custom-developed software that I had to throw in the trash, except for one tiny piece. Was it necessary? The point is that the best answer is not always the most expensive one.