
Effective Bioinformatics Programming - Part 3

All Things Unix

Bioinformatics started with Unix. At the Human Genome Center, for a long time, I had the one and only PC. (We got a request from our users for a PC-based client for the Search Launcher.) Everything else was Solaris (Unix) or Mac, later joined by Linux.

Unix supports a number of nifty commands like grep, strings, df, du, ls, and so on. These commands are run inside the shell, the command-line interpreter for the operating system. A number of shells have appeared over the history of Unix development.

The bash shell (http://en.wikipedia.org/wiki/Bash) is the default shell on most Linux systems. It provides several handy capabilities beyond other shells. For instance, bash maintains a history buffer of commands: the “up” arrow recalls the previous command, the history command lets you view past commands, and the bang operator (!) reruns a previous command from the history buffer. (Which saves a lot of typing!)
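For example (a hypothetical session; the numbers come from your own history buffer):

    $ history | tail -3
      501  cd /data/genomes
      502  grep -c ">" sequences.fasta
      503  ls -lh
    $ !502
    grep -c ">" sequences.fasta
    1247

bash echoes the recalled command before running it, so you can see exactly what !502 expanded to.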

bash also enables a user to redirect program output. The pipeline feature allows the user to connect a series of commands: with the pipeline (“|”) operator, a chain of commands can be linked together where the output of one command is the input to the next command, and so forth.
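As a sketch, this pipeline reports the most common sequence identifiers in a FASTA file (sequences.fasta is an assumed file name), each command feeding the next:

    grep "^>" sequences.fasta | cut -d" " -f1 | sort | uniq -c | sort -rn | head

grep extracts the header lines, cut keeps the identifier field, sort and uniq -c tally the duplicates, sort -rn ranks them, and head shows the top ten.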

A shell script (http://en.wikipedia.org/wiki/Shell_script) is a script written for the shell, or command-line interpreter. Shell scripts enable batch processing. Together with the cron facility, these scripts can be set to run automatically at times when system usage is at its minimum.
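For instance, the following crontab entry (added with crontab -e; the script path is hypothetical) runs a cleanup script at 2:00 AM every Sunday, when usage is low:

    # minute hour day-of-month month day-of-week  command
    0 2 * * 0  /home/user/scripts/weekly_cleanup.sh >> /var/log/cleanup.log 2>&1

The 2>&1 redirection sends error output to the log as well, so unattended failures leave a trace.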

For general information about bash, go to the Bash Reference Manual at http://www.gnu.org/software/bash/manual/bashref.html.

A whole wealth of bash shell script examples is available at - http://tldp.org/LDP/abs/html/.

Unix on Other Platforms

Cygwin (http://www.cygwin.com/) is a Linux-like environment for Windows. The basic download installs a minimal environment, but you can add additional packages at any time. Go to http://cygwin.com/packages/ for a list of Cygwin packages available for download.

Apple’s OS X is based on Unix: apart from the Mach kernel, the OS is BSD-derived. Apple’s Java package is usually not the latest release, as Apple has to port Java itself to account for platform differences such as the graphics layer.

All Things Software – Documenting and Archiving

I’ve run into all sorts of approaches to program code documentation in my career. A lead engineer demanded that every line of assembler code be documented. A senior programmer insisted that code should be self-documenting.

In practice, that meant variable names such as save_the_file_to_the_home_directory, and so on. Debugging those programs was a real pain; the first thing you had to do was set up aliases for all the unwieldy names.

The FORTRAN programmers cried when variable names longer than six characters were allowed in VAX FORTRAN 77. Personally, I thought it was great. The same with IMPLICIT NONE.

In ancient times, FORTRAN variables beginning with the letters I through N were implicitly integers; variables beginning with other letters were implicitly reals. The IMPLICIT NONE directive told the compiler to shut that off and require every variable to be declared.

All FORTRAN code had to be in capital letters. But you could stuff strings into integer variables, which I found extremely useful. FORTRAN statement labels were numeric, usually starting at 10 and going up in increments of 10.

At one time Microsoft used Hungarian notation (http://en.wikipedia.org/wiki/Hungarian_notation) for variables in most of their documentation. In this scheme, a prefix in the variable’s name indicates its type; for example, lAccountNumber is a long integer.

The IDEs (Eclipse, NetBeans, and others) will automatically create the header comment with a list of variables; the user just adds the proper definitions. (If you’re using Java, the auto-generated comment is Javadoc-compatible.)

Otherwise, Java supports the Javadoc tool, Python has pydoc, and Ruby has RDoc.

Personally, I feel that software should read like a book, with documentation providing the footnotes: an overview of what the code in question does and definitions of the main variables, both input and output. Module/object documentation should also note who uses the function and why. Keep variable names short but descriptive, and make comments meaningful.

Keep code clean, but don’t go overboard. I worked with one programmer who stated, “My code is so clean you could eat off it.” I found that a little too obnoxious, not to mention overly optimistic as a number of bugs popped out as time went by.

Archiving Code

Version Control Systems (VCS) have evolved as source code projects became larger and more complex.

RCS (Revision Control System) meant that the days of keeping Emacs numbered files (e.g., foo.~1~) as backups were over. RCS used the diff concept: as its backup strategy, it kept just a list of the changes made to a file.

I found this unsuited for what I had to do – revert to an old version in a matter of seconds.

CVS was much, much better, and was in turn replaced by Subversion. But their centralized repository structure can create problems. You basically check out what you want to work on from a library and check it back in when you’re done, which can be a slow process depending on network usage and central server availability.

The current favorite is Git (http://git-scm.com/), a free, open source distributed version control system created by Linus Torvalds (of Linux fame).

Everyone on the project has a copy of all project files complete with revision histories and tracking capabilities. Permissions allow exchanges between users and merging to a central location is fast.
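A minimal working cycle looks something like this (repository URL and file names are hypothetical):

    git clone ssh://server/projects/larts.git
    cd larts
    # ... edit files ...
    git add parser.c
    git commit -m "Handle empty input records"   # commits locally, no network needed
    git pull origin master                       # merge everyone else's changes
    git push origin master                       # publish yours to the central copy

Because the commit is local, you can keep working (and committing) even when the network or the central server is down, which is exactly the weakness of CVS and Subversion noted above.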

The IDEs (Eclipse and NetBeans) come with CVS and Subversion plug-ins already configured for accessing those repositories. NetBeans also supports Mercurial. Plug-ins for the other versioning systems are available on the web; the Eclipse plug-in for Git is available at http://git.wiki.kernel.org/index.php/EclipsePlugin.

System Backup

Always have a plan B. My plan A had IT backing up my systems on a weekly to monthly basis, depending on usage. A natural disaster completely destroyed my systems. No problem, I thought, I have system backups. Imagine how I felt when I heard that IT had not archived a single one of my systems in over three years! Well, I had a plan B: a mirror of the most important stuff on an old machine and other media. We were back up almost immediately.

The early Tandem NonStop systems (now known as HP Integrity NonStop) automatically mirrored your system in real-time, so down time was not a problem.

Real-time backup is expensive and unless you’re a bank or airline, it’s not necessary.

Snapshot Backup on Linux with rsync

If you’re running Linux, Mac, Solaris, or any Unix-based system, you can use rsync for generating automatic rotating “snapshot” style back-ups. These systems generally have rsync already installed. If not, the source is available at – http://rsync.samba.org/.

This website - http://www.mikerubel.org/computers/rsync_snapshots/ will tell you everything you need to know to implement rsync based backups, complete with sample scripts.
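To give the flavor of the approach, here is a minimal sketch using rsync’s --link-dest option, a variant of the cp -al technique described on that page (all paths are hypothetical). Files unchanged since the previous snapshot are hard-linked rather than copied, so each snapshot costs only the space of what changed:

    #!/bin/bash
    # nightly_snapshot.sh - minimal rotating-snapshot sketch
    SRC=/home/
    DST=/backup/snapshots
    TODAY=$(date +%Y-%m-%d)

    # copy SRC, hard-linking anything unchanged since the last snapshot
    rsync -a --delete --link-dest="$DST/latest" "$SRC" "$DST/$TODAY/"

    # point "latest" at the snapshot just made
    ln -snf "$DST/$TODAY" "$DST/latest"

Run it from cron each night and prune old snapshot directories as disk space requires.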

Properly configured, the method can also protect against hard disk failure, root compromises, or even back up a network of heterogeneous desktops automatically.

Acknowledgment – Thanks, Bill!

I want to thank Bill Eaton for his assistance with these blog entries on Effective Bioinformatics Programming. He filled in a lot of the technical details, performed product analysis, and gave me direction in writing these blog entries.

To Be Continued - Part 4

Part 4 will cover relational database management systems (RDBMS), HPC (high-performance computing) - parallel processing, FPGAs, clusters, grids - and other topics.

Computer System Configurations

The most complex systems I’ve configured were airborne data acquisition and ground support systems. However, not many people have to, or want to, do anything that large or complex. Some labs will need input from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today’s data without a large cash outlay and that can be expanded at minimum cost to handle the data of tomorrow.

This week’s guest blogger, Bill Eaton, provides some guidelines for  the configuration  of a Database Server,  a Web Server, and a Compute Node, the three most requested configurations.

(Bill Eaton)

General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware

  • A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
    • Linux:  for most kernels, programs are limited to 3 GB.  Physical memory can usually exceed 4 GB.
    • Windows:  the stock settings limit a program to 2 GB and physical memory to 4 GB.  The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
    • Other operating systems usually have a 2 or 3 GB program memory limit.
  • A 64-bit operating system removes these limits.  It also enables some additional CPU registers and instructions that may improve performance.  Most will allow running older 32-bit program files.
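A few quick commands will tell you what a given Linux host is running (a sketch; exact output varies by distribution):

    getconf LONG_BIT   # prints 64 on a 64-bit system, 32 otherwise
    uname -m           # x86_64 indicates a 64-bit kernel
    free -g            # total and used physical memory, in gigabytes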

Database Server:
Biological databases are often large (100 GB or more), sometimes too large to fit on a single physical disk drive. A database system needs fast disk storage and a large memory to cache frequently-used data.  These systems tend to be I/O bound.

Disk storage:

  • Direct-attached storage:  disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
  • Network-attached storage:  disk array connected to one or more hosts by a standard network.  These may appear as network file systems using NFS, CIFS, or similar, or as physical disks using iSCSI.
  • SAN:  includes above cases, multiple disk units sharing a network dedicated to disk I/O.  Fibre Channel is usually used for this.
  • Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.

Databases:

  • Storage overhead:  data repositories may require several times the amount of disk space required by the raw data, and adding an index to a table can double its size.  A test using a simple, mostly numeric table with one index gave these overhead factors (on-disk size as a multiple of the raw data size) for some common databases (see the sketch after this list):
    • MySQL using MyISAM  2.81
    • MySQL using InnoDB  3.28
    • Apache Derby        5.88
    • PostgreSQL          7.02
  • Data Integrity support:  The server and disk system should handle failures and power loss as cleanly as possible.  A UPS with clean shutdown support is recommended.
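To reproduce the kind of overhead measurement above on a MySQL server, compare each table’s on-disk size (data plus indexes) to the size of the raw input; a hypothetical query (schema name mydb assumed):

    mysql -e "SELECT table_name,
                     ROUND((data_length + index_length)/1024/1024, 1) AS size_mb
              FROM information_schema.tables
              WHERE table_schema = 'mydb';"

Dividing size_mb by the raw data size gives the overhead factor.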

Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.

Web Service Software Considerations:

  • PHP:  thread support still has problems, and PHP applications running on a Windows system under either Apache httpd or IIS may encounter them.  We have seen a case where WordPress under IIS or Apache httpd on Windows gave error messages but worked without problems under Apache httpd on Linux; IIS FastCGI made the problem worse.  A PHP acceleration system may be needed to support large user bases.
  • Perl:  similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
  • Java-based containers:  (Apache Tomcat, JBoss, GlassFish, etc) These run on almost anything without problems, and usually scale quite well.

Compute nodes:
Requirements depend upon the expected usage.  Common biological applications tend to be memory-intensive.  A high-bandwidth network between the nodes is recommended, especially for large clusters.  Network attached storage is often used to provide a shared file system visible to all the nodes.

  • Classical “Beowulf” cluster:  used for parallel tasks that require frequent communication between nodes.  These usually use the MPI communication model, and often have a communication network tuned for this use such as Myrinet.  One master “head” node controls all the others, and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
  • Farm:  used where little inter-node communication is needed.  Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.
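On an MPI-based cluster, a job launch from the head node looks something like this (assuming an MPI implementation such as Open MPI; the host file and program name are hypothetical):

    mpirun -np 32 --hostfile nodes.txt /shared/bin/parallel_app input.dat

The shared file system lets every node see the same program binary and input data.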

The most important thing is to have a plan from the beginning that addresses all the system’s storage needs for today and is scalable for tomorrow’s unknowns.

Programming Practices to Live By

Programming Practices

I’ve been privy to all sorts of coding adventures.   I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure.  I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun.   I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.

There were days when I would have killed for a well-documented program.  Instead, I had to spend weeks in debug, poking around, learning how things worked.  In both cases, the developers of said projects were no longer available for consultation.

Here are few programming practices that I’ve tried to adhere to when writing code.

Document, Document, Document

I am a big proponent of Javadoc.  Javadoc is a tool for generating API documentation in HTML format from doc comments in source code. It is distributed as part of the Java 2 SDK. To see documentation generated by the Javadoc tool, go to the J2SE 1.5.0 API documentation at
http://java.sun.com/j2se/1.5.0/docs/api/index.html. See “How to Write Doc Comments for the Javadoc Tool” at http://java.sun.com/j2se/javadoc/writingdoccomments for more information.
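Generating the HTML is then a one-line command; for a source tree under src/, something like this (the package name is hypothetical):

    javadoc -d docs -sourcepath src -subpackages com.example.larts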

Other languages have similar markup languages for source code documentation.

Perl has perlpod - http://perldoc.perl.org/perlpod.html
Python has pydoc - http://docs.python.org/library/pydoc.html
Ruby has RDoc - http://rdoc.sourceforge.net

I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish.  It’s also helpful to list the version number of the development language, along with any dependencies, such as support modules that are not part of the main install and had to be downloaded for the application.

Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
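For a shell script, for instance, the header and a per-function doc comment might look like this (a sketch; all names are hypothetical):

    #!/bin/bash
    #
    # fetch_updates.sh - pull nightly sequence updates into the local mirror
    #
    # Author:   (your name), June 2009, for the XYZ lab pipeline
    # Purpose:  keep the local sequence mirror current
    # Requires: bash >= 3.2, rsync (not part of the minimal install)

    # sync_db NAME
    #   Input:  NAME - database directory to mirror
    #   Output: updates /data/mirror/NAME; returns rsync's exit status
    #   Used by the nightly cron job to refresh each configured database.
    sync_db() {
        rsync -a "rsync://mirror.example.org/$1/" "/data/mirror/$1/"
    }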

Coding practices

I think all code should read like a book.  To that end -

- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.
I’ve mostly found that peer review can be a great time-waster.

There are several design methodologies, such as Extreme Programming, that become the flavor du jour.  None has been completely successful in producing perfect software.

To-Do Lists - Use them!

There are project management and other tools available for this, but a plain-text To-Do file that lists system extensions, enhancements, and fixes is a good thing.  It’s simple: you don’t need access to, or have to learn how to navigate, a complex piece of software.

Find Out Who Knows What and GO ASK THEM

I worked on a project and shared a cube with a guy named Al.  Al was not the most pleasant person (he was the resident curmudgeon), but we got along.  Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written.  If I couldn’t understand something, a brief conversation with Al was all I needed. 

Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al.  However, I ended up as the “Go Ask Al” person.  I didn’t mind, as we became the top development group in that environment.

Use Code Repositories

The determination of which code repository to use - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue.  Each of these has its pros and cons.  JUST USE ONE!

Integrated Development Environments (IDE)

There are several of these available.  My favorite is Eclipse (www.eclipse.org).  I’ve also used Sun ONE Studio, which became Java Studio Creator, whose features were rolled into NetBeans, which was out there all along.

There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment.  The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.

I’ve used Visual Studio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.

The problem with most IDEs is that they are complex, and it takes a while for a programmer to become adept at using them.  You’re also locked into that development methodology, which may prove inflexible for the applications under development.

We used Emacs on Linux for all LARTS development (the LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi.  I can do things faster in vi, mainly because I’ve used it for so long.

Which Language?

My favorite language of all time is Smalltalk.  If things had worked out, we would all be doing Smalltalk instead of Java. 

Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic.  Spot programming used infrequently is okay.  However, if everything you are doing is panic programming, your department needs to re-think its software development practices.

Lately, I’ve been working in Java.  Java is powerful, but it also has its drawbacks. 

We will always have C.  According to slashdot.org, most open source projects submitted in 2008 were written in C, and by a very wide margin.  C has a degenerative partner, C++.  C++ does not clean up after itself: memory you allocate with new, you must release with delete.  And a line like Foo f(); looks like it constructs an object but is actually parsed as a function declaration, the infamous “most vexing parse.”

Fortran is another one that will always be around.  I’ve done a lot of Fortran programming.  It is used quite extensively in engineering, as is assembly language.  I have been called a “bit-twiddler” because I knew how to use assembler.

Variable Names

This is a touchy subject.  I’ve been around programmers who have said code should be self-documenting.  Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.

Let’s just say variable names should be descriptive.

Debugging 

This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.

The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.
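The shell-script equivalent (a sketch) is an echo to stderr, or bash’s built-in execution trace:

    set -x                                # trace each command as it runs
    count=$(grep -c "^>" "$infile")
    echo "DEBUG: count=$count" >&2        # classic print-statement debugging
    set +x                                # tracing back off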

Some debugging efforts can become quite complex, such as debugging a process that spans several machines, or pieces of equipment such as an airplane.

Sun Solaris has an application called truss that lets you debug at the system level by following a program’s system calls.  The Linux equivalent is strace.
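A typical strace invocation writes a program’s system calls to a log file, or filters for just the calls you care about (the program name is hypothetical):

    strace -f -o trace.log ./myprog        # -f follows child processes too
    strace -e trace=open,read ./myprog     # show only open() and read() calls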

I am most familiar with gdb - The GNU Debugger.   I’ve also used dbx on Unix.

The IDEs offer a rich debugging experience, and plug-ins let you debug almost anything.

Closing Thoughts

Always program for those coming behind you.  They will appreciate your effort.

It’s best to keep it simple.  Especially the user interface. 

Speaking of users, talk to them.  Get their feedback on everything you do to the user interface.  I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe.  The moral: keep your users happy.

Technology is wonderful, but technology for technology’s sake (“because we can!”) is usually overkill.
The guiding principle should be: Is it necessary?

By the way, going back to that certification project: I found that instead of spending $15K on a disk drive, the company was conned into spending $750K on custom-developed software that I had to throw in the trash, except for one tiny piece.  Was it necessary?  The point is that the best answer is not always the most expensive one.
