The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte (http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589).
This article is a good survey that covers all the latest topics and mentions all the currently popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.
I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.
They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”
Programming Concepts
Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.
To get a good handle on programming concepts, I recommend a book. This book, Structure and Interpretation of Computer Programs from MIT Press (http://mitpress.mit.edu/sicp/), is the basis for an intro to computer science at MIT. It’s called the Wizard Book or the Purple Book.
I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.
Best of all, the course (and the downloadable book) are freely available from MIT through the MIT OpenCourseWare website – http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-001Spring-2005/CourseHome/index.htm.
There’s a blog entry - http://onlamp.com/pub/wlg/8397 - that goes into further explanation about the course and the book.
And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for an engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”
According to MIT’s purple book – “In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”
You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)
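Here is a small sketch of what that definition means in practice, written in Python rather than Scheme. The function `scale` and its variable names are made up for illustration. The parameters `values` and `factor`, and the local `total`, exist only inside the body of the function.

```python
def scale(values, factor):
    # 'values' and 'factor' are the formal parameters of this procedure.
    # Their scope is the body of scale(), and so is the local name 'total'.
    total = 0
    for v in values:
        total += v * factor
    return total

# This module-level 'factor' is a different variable from the parameter above.
factor = 10
result = scale([1, 2, 3], 2)   # inside the call, 'factor' means 2, not 10

# Outside the function body, the local name 'total' does not exist at all.
try:
    total
    leaked = True
except NameError:
    leaked = False
```

After this runs, `result` is 12, the outer `factor` is still 10, and `leaked` is False.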
Programming Languages
The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.
These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”
Let me add the following. There are differences in programming languages. By that, I mean compiled vs scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, a machine language specific to the computer architecture the code is to run on. Compiled code has a definite speed advantage, but if the main module or any supporting module is changed, the project must be rebuilt. Since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.
Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.
An interpreted program runs more slowly than a compiled program. Every line of an interpreted program must be analyzed as it is read. But the code isn’t particularly tied to one machine architecture, making portability easier (provided the byte code interpreter is present). Since code is only interpreted at run time, extensions and modifications to the code base are easier, making these languages great for beginning programmers or rapid prototyping.
But let’s get back to memory management. This, and processing speed, will be a huge deal in next-gen data analysis and management.
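To make the byte code idea concrete, here is a minimal sketch using Python’s standard `dis` module. It assumes the CPython interpreter; the function `gc_fraction` is a made-up example, and the exact instruction names differ between Python versions.

```python
import dis

def gc_fraction(sequence):
    # Fraction of G and C bases in a DNA string (illustrative example only).
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)

# CPython has already compiled the function body to byte code.
# dis.get_instructions() lists the instructions the virtual machine will run.
opnames = [instruction.opname for instruction in dis.get_instructions(gc_fraction)]

value = gc_fraction("GGCCAATT")
```

Calling `dis.dis(gc_fraction)` prints the same instructions in a readable table. None of it is machine code for a particular processor, which is why the same script runs wherever the interpreter does.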
Perl’s automatic memory management has a problem with circularity, because Perl counts references. (CPython counts references too, but backs this up with a cycle-detecting collector; Ruby uses a tracing garbage collector instead.)
If object 1 points to object 2 and object 2 points back to object 1, but nothing else in the program points to either of them (this is a circular reference), a pure reference counter never destroys them. They remain in memory. If such objects get created again and again, the result is a memory leak.
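Here is a minimal sketch of that cycle in Python. It assumes CPython, which counts references and also ships a separate cycle collector (the `gc` module). The class `Node` and the helper `make_cycle` are made up for illustration. With the collector switched off, the pair survives after nothing else points to it, which is the leak; a manual `gc.collect()` then reclaims it.

```python
import gc
import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.other = None

def make_cycle():
    first = Node("object 1")
    second = Node("object 2")
    first.other = second      # object 1 points to object 2
    second.other = first      # object 2 points back to object 1
    # A weak reference lets us watch 'first' without keeping it alive.
    return weakref.ref(first)

gc.disable()                  # leave only reference counting in play
probe = make_cycle()          # no outside name refers to either object now
alive_with_refcounts_only = probe() is not None

gc.collect()                  # run the cycle collector by hand
alive_after_collect = probe() is not None
gc.enable()
```

Under CPython, `alive_with_refcounts_only` is True and `alive_after_collect` is False. In Perl, the usual remedy is to weaken one of the two links, for example with `weaken` from Scalar::Util.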
I also have to ask – What about C/C++, Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot of Fortran.
Debugging
You can’t mention programming without mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.
Here’s a link to an ebook, The Art of Debugging – http://www.circlemud.org/cdp/hacker/. It’s mainly Unix-based, C-centric and a little dated. But good stuff never goes out of style.
Chapter 4, Debugging: Theory explains various debugging techniques. Chapter 5 – Profiling talks about profiling your code, or determining where your program is spending most of its time.
He also mentions core dumps. A core file is what gets written when your C/C++/Fortran program crashes in Unix/Linux. You can examine this core to determine where your program went wrong. (It gives you a place to start.)
The Linux Foundation Developer Network has an on-line tutorial – Zen and the Art of Debugging C/C++ in Linux with GDB – http://ldn.linuxfoundation.org/article/zen-and-art-debugging-cc-linux-with-gdb. They write a C program (incorporating a bug), create a make file, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.
You can debug Perl by invoking it with the -d switch. When a Perl program fails, Perl usually reports the line number that caused the problem along with some explanation of what went wrong.
The -d option also turns on parser debugging output for Python.
Octal Dumps
One of the most useful utilities in Unix/Linux is od (octal dump). You can examine files in octal (default), hex, or ASCII characters.
od is very handy for examining data structures, finding hidden characters, and reverse engineering.
If you think your code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.
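On the command line, `od -c` shows a file as characters and `od -x` shows it in hex. If od isn’t handy, the same kind of look at the data can be had from a few lines of Python. This is only a sketch in the spirit of od, not a reimplementation of it; the function name `dump` and the sample bytes are made up.

```python
def dump(data: bytes, width: int = 16) -> str:
    """Return an offset / hex / printable-character view of some bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  |{text_part}|")
    return "\n".join(lines)

# A sequence line saved on Windows: note the carriage return before the newline.
sample = b"ACGT\r\n"
report = dump(sample)
```

The hex column of `report` reads `41 43 47 54 0d 0a`. The `0d` is the hidden carriage return that a text editor won’t show you and that can quietly break a parser expecting only `0a`.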
That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.

I was going through a box of textbooks last week and stumbled upon a copy of the Enron Code of Ethics. I have another one stored away with a form, signed by Ken Lay, that states I have read and will comply with the Enron Code of Ethics.
I was employed at Enron from 2000 through 2002 and was there when the wheels came off. Our department was left intact. Otherwise, whole floors of the Enron building were vacated. It really was a shame, because Enron was a great place to work. Several friends and acquaintances lost most of what they had because of the malfeasance of a greedy few.
This had to be the most blatant example of unethical conduct in the workplace I encountered. There were others that, though seemingly minor, ended up costing companies money and talent. Most of these losses were the result of mismanagement and not outright unethical behavior. But, then again, is mismanagement itself unethical?
A book I read recently entitled “A Small Treatise of the Great Virtues, The Uses of Philosophy in Everyday Life” by Andre Comte-Sponville (Metropolitan Books), talks about truth as “Good Faith”.
He states on page 196, that “at the very least that one speaks the truth about what one believes, and this truth, even if what one believes is false, is no less true for all that. Good faith, in this sense, is what we call sincerity (or truthfulness or candor) and is the opposite of mendacity, hypocrisy, and duplicity, in short, the opposite of bad faith in all its private and public forms.”
In my position at a major hardware/software developer I was told that I “didn’t need to know about a product to sell it.”
At another position, I found that a few fraudulent claims by a contractor caused a company to fork over three quarters of a million dollars for custom software when a fifteen thousand dollar piece of hardware would have enabled an already existing piece of commercial software to do the job. With a more accurate accountability of the data, I might add.
In fact, the whole program was completely mismanaged, to the detriment of the company, not the contractor. In fact, he was ready for the next program as he had one of his engineers hired in to head up that project. An engineer who didn’t have the slightest idea about our system, much less its theory. Thankfully, we got him transferred out of there and back to design where he belonged. The contractor was kicked out of the company.
These are straight-forward examples of bad faith. The following are a little harder to classify.
Beware the ulterior motive, especially if the new system you are proposing will impose on someone’s fiefdom.
Data analysis for the existing program consisted of placing a request with a data analysis group and waiting up to 3 days for results. The system proposed (and later deployed) would give each and every engineer access to an analysis application that they could use to inspect the data one and a half hours after a particular test cycle was completed. A little training and they were ready to go.
Countless hours were spent in useless meetings defending the system. Everybody shut up when the system came up on day one and stayed up through months of testing.
This test/record/analysis cycle fits perfectly into the Laboratory Information Management Systems (LIMS) cycle of genomics research. A successful LIMS implementation in one lab aroused the ire of yet another lab attempting to develop their own solution. Let’s just say a lot of bad faith erupted.
The real loser in the above examples is the company. Money is wasted and talented people go elsewhere.
Biotechnology is a hot commodity right now. Stimulus funding is bringing fresh capital to many projects. Companies are leveraging existing corporate products by repackaging them as biotech ready.
National Instruments LabVIEW is one of these. I used it a lot in engineering. Now it’s a big player in the lab, incorporating interfaces for research lab instrumentation.
What is a LIMS (Laboratory Information Management System)? Is it an inventory management system? Is it a data pipeline? Can one size fit all?
Some companies have taken existing Inventory Management Systems and relabeled them as Laboratory Information Management Systems. (At least the acronym fits.) Most of these systems don’t distinguish between research and manufacturing environments. They also don’t support basic validation of the LIMS application for its intended purpose. No wonder some 80% of LIMS users are dissatisfied.
At a recent conference I talked with researchers from various pharmaceutical companies and they were thoroughly dissatisfied with their LIMS systems. One scientist stated that they had a problem with their LIMS. When they went to report the problem, they found the company was no longer in business.
The latest IT (Information Technology) trends – SaaS, Cloud computing – may work in a business environment, but they won’t translate well to a pharmaceutical research area where they want everything safe behind the firewall.
There are many, many factors that go into developing biotechnology applications. Getting the right people, controlling the political environment, finding or developing the right software – it’s a jungle out there.
Keep to Good Faith and please be careful.
Today, one in ten engineers is a woman (http://www.dol.gov/wb/factsheets/hitech02.htm). In avionics, it’s fewer than that.
This is really a shame, because I find that women are extremely well suited for jobs in high tech careers.
Here’s a short list of reasons why I think this is true, followed by my explanations.
- Women are more patient and determined
- Women can juggle a lot of tasks simultaneously
- Women can attend to small details and see the big picture at the same time
- Women don’t get derailed by the small stuff
- Women have a better support system
- Women are more sympathetic and understanding
I’ll stop at this group of six, although I could add a few more. They are not true of all women, but that’s probably because they haven’t had the experience.
Just take a look at what current society expects of women and I think you’ll see why I think women are more patient and determined! Case in point, I just got an email on “How to Create Perfect Eyes” through makeup application. Can you imagine a heterosexual male having the patience to take the time to apply all the goop we women have to put on our faces to be seen in public? Also, remember how determined we were to walk in high heels so we could pretend we were grown-ups?
Programming, system design and integration require patience and determination. It’s a step-by-step process. All the pieces have to work together to produce the correct outcome. It’s no different than making a food dish from a recipe, although in most cases you’ll have only your experience to formulate the list of ingredients and right steps to finish the job.
Think about getting the family ready for school/work in the morning. How many things are you trying to do at once? Multi-tasking is standard operating procedure for most women, who can adapt to chaos in the blink of an eye.
I know chaos. Other than being the oldest of nine children (5 girls, 4 boys), I drove a school bus for about 4 years while I was attending college. I was given a long, country route that paid well and gave me enough hours to qualify for health insurance. After I had driven the route for about six weeks, my supervisor asked me how I was doing and what I thought of the kids. I said I thought I was doing okay and the kids were a little rowdy, but we got that under control. Otherwise, I said the kids were a bright bunch and generally inquisitive about everything. (“Miss Pam, what’s a hickey? Our teacher says it’s something you get in dominoes.”)
I found out later that these kids had been through 4 bus drivers in 4 weeks. The last day of that school year the kids on the route gave me a plaque that said “World’s Best School Bus Driver”. I was impressed, even though they misspelled my name.
I’ve discovered that women, as a whole, perform better on mission critical tasks that require a lot of concentration and coordination of several activities that have to occur simultaneously.
I couldn’t make a practice session for a particular field test, so the guys were going to fill in for me. I heard that it took them an extra long time to get started, because they couldn’t figure out how to calibrate the instrumentation. (They took the same training class that I did!) Let’s just say that they were more than happy to let me take over the operation after they were introduced to all the steps involved in the pre and post fly-over operations.
Lots of tasks mean lots of details to keep track of with almost no time to double-check anything. Women do this sort of thing all the time. Think about putting together a meal, folding clothes fresh from the dryer, putting on makeup. You don’t really think about it, you just do it. Juggling home, family, and career by itself is one big accomplishment.
We took two years to perfect all the pieces that made up the testing for the 727QF certification. We worked out the weather station in Roswell, NM. We took the acoustic analyzer to Moses Lake, WA, to work out the routine we needed for testing. (Desert dust at 35 knots is no fun, but it can’t hold a candle to the volcanic ash from Mount St. Helens that we ran into in Moses Lake. They got about a foot of ash from that explosion and the ash was dumped at the airport. Right where we were working!)
The only missing piece was the data download from the data logger on the meteorological (met) plane.
I sat under the wing of the small Cessna in the hot Texas August heat with a laptop atop my crossed legs, dodging fire ants, as I worked out the best method for our technician to save the data acquired after each run of the met plane. I got it down to a few steps, ran through it with him, and we had the met data canned.
All those pieces, met plane, weather station, acoustic analyzer and DAT (digital audio tape) data, were part of the big picture that was noise testing. The other parts were the group support systems – data download, availability, and analysis. There was so much data flowing through the pipeline, we held a meeting every morning to discuss who needed what, how much, how they wanted it, and what data could be taken to archive.
The next-gen sequencing efforts are producing an astronomical amount of raw data. Data that has to be stored, analyzed, and archived, creating one complex system. It’s a massive task and one I can sympathize with.
Women don’t get derailed by the small stuff.
Maybe this wasn’t so small, and sometimes it hit close to home, but a lot of the things I did got satirized via a cartoon or paste-up on bulletin boards all over the plant on the 727QF program.
For instance, I developed this relational database model that would store measurement information for the two aircraft we were testing.
One of the technicians had started his own local database, but he had no understanding of relational data concepts. So he had thermocoupleA and thermocoupleB, where A represented one aircraft and B represented the other. The thermocouple in question was the same on both aircraft, causing duplicate records for the same part info.
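For the curious, here is a minimal sketch of the relational alternative using Python’s built-in sqlite3 module. The table names, column names, and values are all made up for illustration and are not the actual model from the program. The part’s basic attributes are stored once, and a second table records which aircraft carries it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE part (
    part_id  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    units    TEXT NOT NULL
);
CREATE TABLE installation (
    aircraft TEXT NOT NULL,
    part_id  INTEGER NOT NULL REFERENCES part(part_id),
    location TEXT NOT NULL
);
""")

# One row describes the thermocouple itself...
conn.execute("INSERT INTO part VALUES (1, 'thermocouple', 'degC')")
# ...and one row per aircraft says where it is installed.
conn.executemany(
    "INSERT INTO installation VALUES (?, ?, ?)",
    [("A", 1, "engine 1 exhaust"), ("B", 1, "engine 1 exhaust")],
)

part_rows = conn.execute("SELECT COUNT(*) FROM part").fetchone()[0]
installs = conn.execute(
    "SELECT i.aircraft, p.name FROM installation i "
    "JOIN part p USING (part_id) ORDER BY i.aircraft"
).fetchall()
```

There is a single part record, yet the join still reports the thermocouple on both aircraft.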
At an informal meeting we were having in the instrumentation lab, I said that his database design was stupid because we didn’t need more than one copy of the part’s basic attributes. The next day there was a flyer on the bulletin boards with a picture of the tech with a bubble over his head that said, “I stupid.”
There was some other verbiage, “Coming soon son of stupid. When relational is not enough.”

Stupid Database Flyer
Since the technician was a friend, this was funny. There were others that weren’t so entertaining.
I think women are more sympathetic and understanding of other people. The problem is to not be so understanding that you are taken for a ride.
As a support system, we have probably the best weapon in the arsenal – we can cry. Not in public, not on the job, but we can go somewhere private and cry. Sometimes this is the only way to get it out of your system.
I put a lot of dents in a lot of old hardware and ran miles and miles, but sometimes even that did not cover it.
I will end by saying that I was pleasantly surprised at the number of women involved in the life sciences. By this, I mean as directors, P.I.’s, or other positions of power. However, men in the field still earn one-third more than the women.
Maybe one day, women will wield as much power in all branches of technology, and their paychecks will actually reflect this status.
BioCamp 2009 at Rice University
Bill and I attended BioCamp 2009 at Rice University on Saturday, Sept. 12. There were several presentations followed by lively question and answer sessions.
The attendees consisted of entrepreneurs, those seeking guidance on turning their ideas and research into viable products, consultants searching for marketable products, and members of the legal profession offering advice on intellectual property, patents, trademarks, and the like.
The End of Bioinformatics?!
I read with some interest the announcement of Wolfram Alpha. Wolfram intends it to be the be-all and end-all of data mining systems and, some say, it makes bioinformatics obsolete.
Wolfram’s basis is a formal Mathematica representation. Its inference engine is a large number of hand-written scripts that access data that has been accumulated and curated. The developers stress that the system is not Artificial Intelligence and is not aiming to be. For instance, a sample query,
“List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes ‘muscle’”
could currently be executed with SQL, provided the underlying data is there.
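As a rough sketch of what that SQL might look like, here is the query run against a toy in-memory SQLite database from Python. The schema (tables `gene`, `selection_evidence`, `annotation`), the lineage label, the 0.05 cutoff standing in for “significant”, and the placeholder rows are all assumptions made up for illustration. No real database is laid out this way as far as I know, and the gene names are deliberately fake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, species TEXT);
CREATE TABLE selection_evidence (gene_id INTEGER, lineage TEXT, p_value REAL);
CREATE TABLE annotation (gene_id INTEGER, source TEXT, description TEXT);
""")

# Placeholder rows only; these are not real genes or real results.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1, "GENE_A", "human"), (2, "GENE_B", "human"), (3, "GENE_C", "human"),
])
conn.executemany("INSERT INTO selection_evidence VALUES (?, ?, ?)", [
    (1, "human-since-chimp", 0.01),
    (2, "human-since-chimp", 0.20),
    (3, "human-since-chimp", 0.001),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
    (1, "GO", "muscle contraction"),
    (2, "OMIM", "muscle weakness"),
    (3, "GO", "lipid transport"),
])

query = """
SELECT DISTINCT g.symbol
FROM gene g
JOIN selection_evidence s ON s.gene_id = g.gene_id
JOIN annotation a ON a.gene_id = g.gene_id
WHERE g.species = 'human'
  AND s.lineage = 'human-since-chimp'
  AND s.p_value < 0.05
  AND a.source IN ('GO', 'OMIM')
  AND a.description LIKE '%muscle%'
ORDER BY g.symbol
"""
matches = [row[0] for row in conn.execute(query)]
```

Only GENE_A passes both the selection test and the annotation test. The hard part, of course, is not the query but getting curated data into tables like these in the first place.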
Wolfram won’t replace bioinformatics. What it will do is make it easier for a neophyte to get answers to his or her questions because they can be asked in a simpler format.
I would guess Wolfram uses one or more of these scripts to address a specific data set in conjunction with a natural language parser. These scripts would move this data to a common model that could then be presented on a web page.
But why not AI? Why not replace all those “hand-written” scripts, etc. with a real inference engine?
I rode the first AI wave. I was a member of the first group of 25 engineers selected to be a part of the McAir AI Initiative at McDonnell Aircraft Company. (“There is AI in McAir”). In all, 100 engineers were chosen from engineering departments to attend courses leading to a Certificate in Artificial Intelligence from Washington University in St. Louis.
One of the neat things about the course was the purchase of at least 30 workstations (maybe as many as 60) from a young company called Sun that were loaned to Washington University for the duration of the course. Afterwards, we got a few Symbolics machines for our CADD project.
Other than Lisp and Prolog, the software we used was called KEE (Knowledge Engineering Environment). Also, there was a DEC (Digital Equipment Corporation) language called OPS5.
The course was quite fast-paced but very extensive. We had the best AI consultants available at the time lecture and give assignments in epistemology, interviewing techniques, and so on. I had a whole stack of books.
The only problem was that no money was budgeted (or so I was told) for AI development for the departments for the engineers when they returned from the course eager to AI everything. A lot of people left.
Anyway, my group of three developed a “Battle Damage Repair” system that basically “patched up” the composite wing skins of combat aircraft. Given the size and location of the damage, the system would certify whether the aircraft would be able to return to combat, and would output the size and substance of the patch if the damage wasn’t that bad.
One interesting tidbit: We wanted to present our system at a conference in San Antonio and had a picture of a battle-damaged F-15 we wanted to use. Well, we were told that the picture was classified and, as such, we couldn’t use it. Well, about that same time, a glossy McAir brochure featuring our system and that photo was distributed at the AAAI (American Association for Artificial Intelligence) conference to thousands of people.
Another system I developed dealt with engineering schematics. These schematics were layered. Some layers and circuits were classified. Still another system scheduled aircraft for painting and yet another charted a path for aircraft through hostile territory, activating electronic counter measures as necessary.
I guess the most sophisticated system I worked on was with the B-2 program. The B-2 skin is a composite material. This material has to be removed from a freezer, molded into a final shape and cooked in a huge autoclave before it completely thawed.
We had to schedule materials, and the behavior of that material under various circumstances, as well as people and equipment. The purpose was to avoid “bottlenecks” in people and equipment. I was exposed to the Texas Instruments Explorer and Smalltalk-80 on an Apple. I’ve been in love with Smalltalk ever since.
The system was developed, but it was never used. The problem was that we had to rank workers by expertise. That’s union workers and that wasn’t allowed.
It was a nice system that integrated a lot of systems and worked well. Our RFP (Request for Proposals) went out to people like Carnegie-Mellon. We had certain performance and date requirements that we wanted to see in the final system. We were told that the benchmarks would be difficult, if not impossible, to attain. Well, we did it, on our own without their help.
We also had a neural net solution that inspected completed composite parts. The parts were submerged in water and bombarded with sound waves. The echoes were used by the system to determine part quality.
AI promised the world, and then it couldn’t really deliver. So it kind of went to the back burner.
One problem with the be-all and end-all: it will only be as good as your model. It will only be as good as the developers’ ability to determine the behavior of the parts and how they interact with the whole. Currently, this is a moving target and is changing day to day. Good luck.
Links -
Will Wolfram Make Bioinformatics Obsolete? - http://johnhawks.net/weblog/reviews/genomics/bioinformatics/wolfram-alpha-bioinformatics-2009.html
The most complex system I’ve configured was the airborne data acquisition and ground support systems. However, not many people have to or want to do anything that large or complex. Some labs will need info from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today’s data without a large cash outlay and that can be expanded at minimum cost to handle the data of tomorrow.
This week’s guest blogger, Bill Eaton, provides some guidelines for the configuration of a Database Server, a Web Server, and a Compute Node, the three most requested configurations.
(Bill Eaton)
General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware
- A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
- Linux: for most kernels, programs are limited to 3 GB. Physical memory can usually exceed 4 GB.
- Windows: The stock settings limit a program to 2 GB, and physical memory to 4 GB.
The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
Other operating systems usually have a 2 or 3 GB program memory limit.
- A 64-bit operating system removes these limits. It also enables some additional CPU registers and instructions that may improve performance. Most will allow running older 32-bit program files.
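The 4 GB figure above is just pointer arithmetic, and a few lines of Python show where it comes from. This is a sketch only: the function name is made up, and treating the gap between 4 GB and the per-program limit as address space reserved for the operating system kernel is the usual explanation rather than something stated in the list above.

```python
GIB = 2 ** 30  # bytes in one gigabyte (binary)

def address_space_gib(pointer_bits):
    """Size of the address range reachable with pointers of the given width."""
    return 2 ** pointer_bits // GIB

limit_32 = address_space_gib(32)
limit_64 = address_space_gib(64)

# Per-program limits quoted in the list above, in GB.
user_space_gib = {
    "Linux (most 32-bit kernels)": 3,
    "Windows (stock settings)": 2,
    "Windows server with /3GB": 3,
}
# Whatever is left of the 32-bit range is not available to the program.
reserved_gib = {name: limit_32 - user for name, user in user_space_gib.items()}
```

Here `limit_32` comes out to 4 and `limit_64` to 17179869184, far beyond the memory anyone can install, which is why a 64-bit system effectively removes the limit.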
Database Server:
Biological databases are often large, 100 GB or more, often too large to fit on a single physical disk drive. A database system needs fast disk storage and a large memory to cache frequently-used data. These systems tend to be I/O bound.
Disk storage:
- Direct-attached storage: disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
- Network-attached storage: disk array connected to one or more hosts by a standard network. These may appear as network file systems using NFS, CIFS, or similar, or physical disks using iSCSI.
- SAN: includes above cases, multiple disk units sharing a network dedicated to disk I/O. Fibre Channel is usually used for this.
- Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.
Databases:
- Storage overhead: data repositories may require several times the amount of disk space required by the raw data. Adding an index to a table can double its size. A test using a simple mostly numeric table with one index gave these overheads for some common databases.
- MySQL using MyISAM 2.81
- MySQL using InnoDB 3.28
- Apache Derby 5.88
- PostgreSQL 7.02
- Data Integrity support: The server and disk system should handle failures and power loss as cleanly as possible. A UPS with clean shutdown support is recommended.
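The overhead figures above can be turned into a rough disk-sizing estimate. The sketch below assumes each figure is a multiplier, disk space used divided by raw data size; that reading is an assumption. The function name and the 100 GB example size are illustrative. The figures came from a single test table with one index, so treat the output as a planning number, not a prediction.

```python
# Overhead factors copied from the list above. Interpreting them as
# "disk space used / raw data size" is an assumption.
OVERHEAD = {
    "MySQL (MyISAM)": 2.81,
    "MySQL (InnoDB)": 3.28,
    "Apache Derby": 5.88,
    "PostgreSQL": 7.02,
}

def estimated_disk_gb(raw_gb, engine):
    """Rough disk space needed to hold raw_gb of data in the given engine."""
    return round(raw_gb * OVERHEAD[engine], 2)

# Example: a 100 GB raw data set.
estimates = {engine: estimated_disk_gb(100, engine) for engine in OVERHEAD}
```

For 100 GB of raw data this gives about 281 GB for MyISAM and about 702 GB for PostgreSQL, before any allowance for growth, logs, or backups.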
Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.
Web Service Software Considerations:
- PHP: Thread support still has problems. PHP applications running on a Windows system under either Apache httpd or IIS may encounter these. We have seen a case where WordPress running on Windows under both IIS and Apache httpd gave error messages, but worked without problems under Apache httpd on Linux. IIS FastCGI made the problem worse. PHP acceleration systems may be needed to support large user bases.
- Perl: similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
- Java-based containers: (Apache Tomcat, JBoss, GlassFish, etc) These run on almost anything without problems, and usually scale quite well.
Compute nodes:
Requirements depend upon the expected usage. Common biological applications tend to be memory-intensive. A high-bandwidth network between the nodes is recommended, especially for large clusters. Network attached storage is often used to provide a shared file system visible to all the nodes.
- Classical “Beowulf” cluster: used for parallel tasks that require frequent communication between nodes. These usually use the MPI communication model, and often have a communication network tuned for this use such as Myrinet. One master “head” node controls all the others, and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
- Farm: used where little inter-node communication is needed. Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.
The most important thing is to have a plan from the beginning that addresses all the system’s needs for storage today and is scalable for tomorrow’s unknowns.
Data Stewardship -
The Conducting, Supervising, and Management of Data
Next-gen sequencing promises to unload reams and reams of data on the world. Pieces of that data will prove relevant to one or the other of specific research projects in your enterprise. At the same time, your lab may produce more data by annotation or simple research. How do you handle it all?
First, you should appoint a data steward. This person must understand where the data comes from, how it is modeled, who uses what parts of it, and any results this data may produce, such as forms, etc. Most importantly, they must be able to verify the integrity of that data.
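One simple way a data steward can check integrity is to record a checksum when a file arrives and compare it later. Here is a minimal sketch using Python’s standard hashlib module. The function names are made up, and the choice of SHA-256 is mine, not something prescribed here.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_unchanged(path, recorded_checksum):
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of_file(path) == recorded_checksum
```

Record the value from `sha256_of_file` alongside the file when it is first stored. Run `is_unchanged` after every copy, restore from backup, or retrieval from archive.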
Data, Data, Data
I’ve handled lots of engineering and bioinformatics data in my time…
In engineering, I had to be sure all instrumentation was calibrated correctly and production data was representative or correct. Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, etc. focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on. This data inventory session proved to be extremely useful as there were terabytes of data swept through the system on a weekly basis.
For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data. That data was then uploaded to a local database for access by various applications. As the amount of available sequence data grew, culling the data, storage of this data, and archiving of the initial and final data became something of a headache.
My biggest bioinformatics problem was NCBI data, as that was how we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug. Grep became my friend.
I tried downloading complete GenBank reports from the NCBI ftp website but that took too much space. I used keywords with the Entrez eutils, but the granularity wasn’t fine enough, and I ended up with way too much data. Finally, I resorted to the NCBI Toolkit on NCBI ASN.1 binary files.
LARTS would have made this part so much easier.
The Data Steward should also be familiar with data maintenance and storage strategies.
Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.
Bill Eaton: Data Backup and Archival Storage
Backups are usually kept for a year or so, then the storage media is reused.
Archives are kept forever. Retrievals are usually infrequent for both.
Storage Technologies
Tape: suitable for backup, not as good for archiving.
Pro: Current tape cartridge capacities are around 800 GB uncompressed.
Cost per bit is roughly the same as for hard disks.
Con: Tape hardware compression is ineffective on already-compressed data.
Tapes and tape drives wear out with use.
Software is usually required to retrieve tape contents. (tar, cpio, etc)
Tape technology changes frequently, formats have a short life.
Optical: better for archiving than backup
Pro: DVD 8.5 GB, Blu-Ray 50 GB
DVD contents can be a mountable file system, so that no special software is needed for retrieval.
Unlimited reading, no media wear.
Old formats are readable in new drives.
Con: Limited number of write cycles.
Hard Disks: could replace tape
Pro: Simple: Use removable hard disks as backup/archive devices.
Disk interfaces are usually supported for several years.
Con: Drives may need to be spun up every few months and contents rewritten every few years.
MAID: Massive Array of Idle Disks
Disk array in which most disks are powered down when not in active use.
Pro: The array controller manages disk health, spinning up and copying disks as needed.
The array usually appears as a file system. Some can emulate a tape drive.
Con: Expensive.
Classical: the longest-life archival formats are those known to archaeologists.
Pro: Symbols carved into a granite slab are often still readable after thousands of years.
Con: Backing up large amounts of data this way could take hundreds of years.
asn2xml
Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.
While working on the filters for the LARTS product, most specifically, the GenBank-like report, I realized how tightly-coupled the NCBI ASN.1/XML is to the toolkit.
Basically, you’ve got to understand the toolkit code in order to translate what the XML is saying. The infinite extensibility and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the ASN.1 data structures supporting GenBank data - Bioseq-set. For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and its translated protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.
After we developed LARTS, I wrote the GB filter as a Java object. It was an interesting experience.
NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on the NCBI toolkit code for proper interpretation.
The two most glaring examples are listed below.
Sequence Locations
Determining the location of sequence features for a GenBank data report is a prime example. Here are a few simple examples:
primer_bind order(complement(1..19), 332..350)
gene complement(join(1560..2030, 3304..3321))
CDS complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA join(<1..156, 445..478, 1199..>1559)
5231, 76582..76767, 77517..77720, 78409..78490))
primer_bind order(complement(1..19), 1106..1124)
For Segmented-sequences:
CDS join(162922:124..144; 162923: 647..889, 1298..1570)
CD region locations have frames; bonds have points (which can be packed); a minus strand denotes a complement (reverse order); a set of sequence locations for a sequence feature (packed-seqint) denotes a join; locations can be “order(”ed or “one-of”; and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.
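To give a feel for the nesting involved, here is a rough Python sketch that reads the report-style location strings shown above back into a flat list of intervals. It is not the toolkit's logic or the LARTS filter. The names `Interval` and `parse_location` are made up for illustration. It handles only complement, join, order, plain intervals, single points, and the "<" and ">" fuzz markers. It ignores segmented-sequence forms with accession prefixes and "one-of", and it does not keep the join/order distinction.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    start: int
    end: int
    strand: str          # "+" or "-"
    partial_start: bool  # True when the start carries "<"
    partial_end: bool    # True when the end carries ">"


_INTERVAL = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")


def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def parse_location(text, strand="+"):
    """Return a flat list of Interval tuples for a simple location string."""
    text = text.strip()
    for operator in ("complement", "join", "order"):
        if text.startswith(operator + "(") and text.endswith(")"):
            inner = text[len(operator) + 1:-1]
            if operator == "complement":
                flipped = "-" if strand == "+" else "+"
                # Design choice: list the pieces in reverse order on the other strand.
                return parse_location(inner, flipped)[::-1]
            intervals = []
            for part in _split_top_level(inner):
                intervals.extend(parse_location(part, strand))
            return intervals
    match = _INTERVAL.match(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return [Interval(start, end, strand, match.group(1) == "<", match.group(3) == ">")]
```

Calling `parse_location("complement(join(1560..2030, 3304..3321))")` gives two minus-strand intervals, listed last piece first. Going the other way, from the ASN.1 structures to these strings, is where the toolkit knowledge comes in.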
Sequence Format
DNA sequences are stored in a packed 2-bit or 4-bit per letter format (ncbi2na and ncbi4na). 2na is used if the sequence does not contain ambiguity, otherwise 4na is the format of choice. The sequence must be unpacked to be useful. This takes a basic understanding of Hex(adecimal).
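As an illustration of the unpacking step, here is a short Python sketch. It assumes the usual code tables (A, C, G, T as 0 through 3 for 2na; a 16-entry ambiguity table for 4na with 0 as a gap) and that the first base sits in the high-order bits of each byte. The function names are made up, and the tables should be checked against the NCBI specification before relying on them.

```python
NCBI2NA = "ACGT"              # 2 bits per base: A=0, C=1, G=2, T=3
NCBI4NA = "-ACMGRSVTWYHKDBN"  # 4 bits per base: A=1, C=2, G=4, T=8, N=15


def unpack_2na(packed, length):
    """Unpack bytes holding four bases each; length trims the padding."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(NCBI2NA[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def unpack_4na(packed, length):
    """Unpack bytes holding two bases each (high nibble first)."""
    bases = []
    for byte in packed:
        bases.append(NCBI4NA[byte >> 4])
        bases.append(NCBI4NA[byte & 0x0F])
    return "".join(bases[:length])
```

For example, the single byte 0x1B is 00 01 10 11 in binary, so `unpack_2na(bytes([0x1B]), 4)` yields "ACGT". The sequence length has to come from elsewhere in the record, because the last byte may be padded.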
Toolkit
The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file. (The code is there, but you have to figure out how to compile it into an executable.)
We took the toolkit code and converted it to Java to produce the GenBank-style output format. It differs from the actual NCBI GenBank Report in that the LARTS report lists a FASTA-formatted sequence instead of the 10-base pairs per column that the NCBI GenBank Report produces.
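The difference between the two sequence layouts is easy to see in code. This Python sketch is not the LARTS filter. The function names, the 70-letter FASTA line width, and the exact spacing of the numbered layout are assumptions chosen for illustration.

```python
def format_fasta(header, sequence, width=70):
    """Return a FASTA record: a '>' header line, then fixed-width sequence lines."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)


def format_origin(sequence, block=10, per_line=60):
    """Return GenBank-style numbered lines with the bases in blocks of ten."""
    seq = sequence.lower()
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # The leading number is the 1-based position of the first base on the line.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return "\n".join(lines)
```

The FASTA form is the one most downstream programs accept directly, which is a practical reason to prefer it in a report meant for further processing.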
The Many Variations of LARTS
GenBankReportFilter.java is provided as an example with Stand-Alone LARTS. The LARTS Reader enables the GenBank-style report.
Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.
A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.
Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database. This includes sample code for the BLOB and CLOB objects.
The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.
These options provide ready access to NCBI data for your research.
Programming Practices
I’ve been privy to all sorts of coding adventures. I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure. I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun. I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.
There were days when I would have killed for a well-documented program. Instead, I had to spend weeks in debug, poking around, learning how things worked. In both cases, the developers of said projects were no longer available for consultation.
Here are a few programming practices that I’ve tried to adhere to when writing code.
Document, Document, Document
I am a big proponent of Javadoc. Javadoc is a tool for generating API documentation in HTML format from doc comments in source code. It can be downloaded only as part of the Java 2 SDK. To see documentation generated by the Javadoc tool, go to J2SE 1.5.0 API Documentation at
http://java.sun.com/j2se/1.5.0/docs/api/index.html. Go to “How to Write Doc Comments for the Javadoc Tool” at http://java.sun.com/j2se/javadoc/writingdoccomments for more information.
Other languages have similar markup languages for source code documentation.
Perl has perlpod - http://perldoc.perl.org/perlpod.html
Python has pydoc - http://docs.python.org/library/pydoc.html
Ruby has RDoc - http://rdoc.sourceforge.net
I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish. It’s also helpful to list the version number of the development language. List any dependencies such as support modules that are not part of the main install that were downloaded for the application.
Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
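Here is what that can look like in practice, sketched in Python with docstrings that pydoc can render. The file name, the placeholder header fields, and the `gc_content` function are all hypothetical, made up for this example.

```python
"""
gc_content.py

Author:       (your name)
Created:      (date)
Written for:  (lab or project)
Purpose:      Report the GC fraction of a DNA sequence.
Language:     Python 3
Dependencies: none beyond the standard library
"""


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Parameters:
        sequence (str): DNA letters, upper or lower case.

    Returns:
        float: a value from 0.0 to 1.0; 0.0 for an empty sequence.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

With a file like this saved as gc_content.py, running `python -m pydoc gc_content` in the same directory prints the header and the function description without anyone having to open the source.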
Coding practices
I think all code should read like a book. Beyond that -
- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.
I’ve mostly found that peer review can be a great time-waster.
There are several design methodologies, such as Extreme Programming, that are the flavor du jour. None have been completely successful in producing perfect software.
To-Do Lists - Use them!
There are project management and other tools available for this, but a plain text To-Do file that lists system extensions, enhancements, and fixes is a good thing. It’s simple: you don’t need access to, or have to learn how to navigate, a complex piece of software.
Find Out Who Knows What and GO ASK THEM
I worked on a project and shared a cube with a guy named Al. Al was not the most pleasant person (he was the resident curmudgeon), but we got along. Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written. If I couldn’t understand something, a brief conversation with Al was all I needed.
Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al. However, I ended up as the “Go Ask Al” person. I didn’t mind, as we became the top development group in that environment.
Use Code Repositories
The choice of code repository - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue. Each of these has its pros and cons. JUST USE ONE!
Integrated Development Environments (IDE)
There are several of these available. My favorite is Eclipse (www.eclipse.org). I’ve also used Sun ONE, which became Creator, which was rolled into NetBeans, which was out there all along.
There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment. The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.
I’ve used Visual Studio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.
The problem with most IDEs is that they are complex and it takes a while for a programmer to become adept at using them. You’re also locked into that development methodology, which may prove inflexible for the applications under development.
We used EMACS on Linux for all LARTS development (LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi. I can do things faster in vi, mainly because I’ve used it for so long.
Which Language?
My favorite language of all time is Smalltalk. If things had worked out, we would all be doing Smalltalk instead of Java.
Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic. Spot programming used infrequently is okay. However, if everything you are doing is panic programming, your department needs to re-think its software development practices.
Lately, I’ve been working in Java. Java is powerful, but it also has its drawbacks.
We will always have C. According to slashdot.org, most open source projects submitted in 2008 were in C, and this was by a very wide margin. C has a degenerative partner, C++. C++ does not clean up after itself. You have to use delete. Foo(); can be either a function call or a call to a constructor, depending on what Foo is.
Fortran is another one that will always be around. I’ve done a lot of Fortran programming. It is used quite extensively in engineering, as is Assembly Language. I have been called a “bit-twiddler” because I knew how to use assembler.
Variable Names
This is a touchy subject. I’ve been around programmers who have said code should be self-documenting. Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.
Let’s just say variable names should be descriptive.
Debugging
This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.
The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.
Some debugging efforts can become quite complex, such as debugging a process that utilizes several machines, or pieces of equipment such as an airplane.
Sun Solaris has an application, truss, that lets you debug things at the system level by following the system calls. The Linux equivalent is strace.
I am most familiar with gdb - The GNU Debugger. I’ve also used dbx on Unix.
The IDEs offer a rich debugging experience and plug-ins let you debug almost anything.
Closing Thoughts
Always program for those coming behind you. They will appreciate your effort.
It’s best to keep it simple. Especially the user interface.
Speaking of users, talk to them. Get their feedback on everything you do to the user interface. I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe. The moral: keep your users happy.
Technology is wonderful, but technology for technology’s sake (“because we can!”) is usually overkill.
The guiding principle should be — Is it necessary?
By the way, going back to that certification project: I found that instead of spending $15K for a disk drive, the company was conned into spending $750K for custom-developed software that I had to throw in the trash, except for one tiny piece. Was it necessary? The point is that the best answer is not always the most expensive one.
As promised, given below is a list of bioinformatics “horror stories”. These are a few of the situations and people I have encountered through the years. Names have been changed to protect the innocent.
People Issues
The following are “people” problems. (Just about every discipline has these denizens.)
1) Make sure the person selected for the task has the skills required for the task, or at least the desire to learn those skills.
We were sent a programmer, Joseph, who was underwritten by another lab, but who would be using our equipment. We were to assist him in supporting a research lab’s website. His duties consisted of capturing genomics data from various sites and data resulting from analysis of the research lab’s data. This data was then to be displayed on the lab’s web site.
He was given a workstation and access to our support literature and introduced to another programmer, Jay, that he could turn to for assistance.
I realized he was in trouble a couple of days later when Jay came to me and said that Joseph was having problems. He had referred him to the books and manuals we had in the lab that gave him exact examples of what he was trying to do, but Joseph said he couldn’t understand them and that they were basically a waste of time. Jay even worked out the code that Joseph needed, but Joseph said it didn’t work. The code consisted of about 10 lines of pretty straight-forward Java.
It didn’t work because Joseph had several “typos” when he typed in the code and tried to compile it.
I tried to help him with batching URL retrievals, but he didn’t understand even after he said he had read the Perl books we had. I ended up writing the five lines of Perl code needed for the simple HTTP retrieval because the job had to get done.
I talked to the P.I. of Joseph’s home lab. Seems that Joseph had a little experience programming on the Windows platform and wanted to learn more about bioinformatics. The P.I. did acknowledge that Joseph really did have a tendency to get other people to “help” him do his tasks.
Joseph was sent back to his home lab.
2) Find out everything you can about the person you are considering
My P.I. hired a programmer who had sterling references from previous labs. It didn’t take long to find out that the labs were giving out glowing references because they wanted to dump this person, Jane, who, they said, couldn’t program, among other things. It seems this was the case in all of the other labs she had been with.
I asked my P.I. why, why?? He said that he thought we would be ones that would finally be able to develop her skills.
I said that I didn’t think so. She didn’t know the basics, and she tried to cover by saying the other programmers in the lab were out to undermine her. Consequently, this caused a lot of unrest in the lab.
She was given a support role and eventually went to another lab.
Project Management
1) Set goals
There was this five-year grant that was in its last year. The first four years had been spent working out the program design; very little coding had been accomplished other than a demo or two. I got involved in the final year because they needed to produce something that could be referred to as a product.
The group was situated in a small area of some four offices and three cubicles. Every wall had been covered with whiteboard for my arrival, so that we could design the final model!
Design is good, but know when to say enough is enough.
The project was never really finished. The grant was not renewed.
2) Let those who can help you, help you
I got involved in a project whose purpose was to accumulate data from various sources for storage in an Oracle database.
After determining the data required and gathering that data, I generated the six tables in UML (Unified Modeling Language) and subsequently the SQL that could handle the data. One table was a sequence identification table, or a table that held the id associated with that sequence in various databases such as GenBank and ENSEMBL.
One project member, a P.I. of another lab involved in the project, stated that she had read a book on SQL and she knew what to do.
Needless to say, she didn’t understand relational databases at all.
Instead of six tables, the database finally evolved into over 200 tables under her oversight. Most of these tables were of two entries — an index and a sequence identification tag.
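For contrast, a single cross-reference table can hold every external identifier for every sequence. The sketch below uses Python's built-in sqlite3 module in place of Oracle. The table names, column names, and identifiers are hypothetical placeholders, not the project's actual schema.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with one cross-reference table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sequence (
            sequence_id INTEGER PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE sequence_xref (
            sequence_id INTEGER NOT NULL REFERENCES sequence(sequence_id),
            source_db   TEXT NOT NULL,   -- e.g. 'GenBank' or 'ENSEMBL'
            external_id TEXT NOT NULL,
            PRIMARY KEY (source_db, external_id)
        );
    """)
    conn.execute("INSERT INTO sequence VALUES (1, 'example sequence')")
    conn.executemany(
        "INSERT INTO sequence_xref VALUES (?, ?, ?)",
        [(1, "GenBank", "PLACEHOLDER-GB-1"), (1, "ENSEMBL", "PLACEHOLDER-ENS-1")],
    )
    return conn


def lookup_sequence_id(conn, source_db, external_id):
    """Return the local sequence_id for an external identifier, or None."""
    row = conn.execute(
        "SELECT sequence_id FROM sequence_xref WHERE source_db = ? AND external_id = ?",
        (source_db, external_id),
    ).fetchone()
    return row[0] if row else None
```

Adding a new source database then means adding rows, not tables.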
3) Ask around, someone might have a better way
I was asked to help by a lab that was having trouble with some code developed by a programmer who had moved on. The lab technician who used the software said that it took 19 hours to assemble the data required to define the wells on a microarray plate.
I took a look at the code. By using the NCBI toolkit, several Perl scripts, and a database, I was able to reduce 19 hours to about 20 minutes.
The previous programmer used this elaborate system of indexed GenBank reports. By using the toolkit I was able to process the NCBI ASN.1 files directly.
Software Issues
1) Software has its limits
One lab was using FileMaker Pro for data storage. This was okay at first, but with 500 files growing beyond a 2 GB file limit, FileMaker was struggling.
Data access proved much faster once the data was ported to an Oracle database.
2) Read the Manual
A sequence is a string of letters. As such, there is only so much you can do in searching strings. The word size of the search is limited.
One lab was analyzing a sequence against the entire genome of a selected organism using open source software. This software wasn’t intended to search the entire genome, just short pieces of it.
After the process took some 5 days to partly analyze just one sequence, the lab technician decided that this widely utilized open source program had to be rewritten.
The request was declined.
3) Document your code
We were called in to save a pharmacology database developed in Access. The original developer used Access because he “sort of” knew how to develop input screens in Access. The lab ran into trouble when the developer left. No one in the lab was able to take over the application, and everyone else they asked to look at the project left shaking their heads. There was no documentation of record.
The data was ported to an Oracle database with web-enabled user input and reporting functions.
4) Verify that the process completed
One research group created a process that was to automatically archive the day’s research data to backup. They assumed everything was okay until they lost a hard drive and found out that the automatic nightly backup had never happened, because the filename, which explicitly listed the physical location of the data file, was too long for the archiving software. The backup failed with an error message, but no one ever checked.
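A wrapper that checks the result takes only a few lines. This Python sketch assumes the backup tool signals failure through a non-zero exit status, which is an assumption about tools in general and not a known fact about the software in this story. The function name and messages are made up for illustration.

```python
import os
import subprocess


def run_backup(command, archive_path=None):
    """Run a backup command and report whether it really succeeded.

    Returns (ok, message); the caller decides how to alert a human.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return False, f"backup failed (exit {result.returncode}): {result.stderr.strip()}"
    if archive_path is not None:
        # A zero exit status is not proof: confirm the archive exists and has content.
        if not os.path.isfile(archive_path) or os.path.getsize(archive_path) == 0:
            return False, f"backup reported success but {archive_path} is missing or empty"
    return True, "backup completed"
```

The important part is what happens with the returned message: it has to go somewhere a person will see it the next morning, such as an email or a status page.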
Some things you just can’t help
One morning I arrived at the lab and found everyone on the floor waiting for me. They couldn’t access the server to read mail, etc.
I opened the lab, looked in the direction of the server and found an electrical plug pulled out of a socket.
It seems that the nightly housekeeping needed an electrical outlet for the vacuum cleaner and the one that was used by the server was the handiest.
One More…
Our lab paid the institution’s IT department for a monthly back-up of our computers.
One morning, I came in, and everything was dead. I called our lab sys admin, told him to investigate.
Well, turns out IT hadn’t really done a back-up of our system in over 3 years. Apparently, they tried over the weekend. (Our lab sys admin wasn’t involved in the process, as he was subsidized by our department and not IT.)
At the start of the process, the date command produced the proper output. At the end of the process, the date command produced the output: no date command found - anywhere.
I forget exactly what got deleted or screwed up, but everything had to be rebuilt.
Luckily, I had used one of the seldom-used machines to mirror our data, etc. on a daily basis. So, once the machines were back (2 days), we were okay and didn’t lose much.
At this time, the average life span of a sys admin in IT was around 6 months.
These are just a few of my encounters in the field of the life sciences. I won’t go into the ones from engineering, but I’ve got some beauts, especially as a woman in engineering.
(I’m delaying the “horror stories” until next week, because I want to fully document them all.)
I ran across the phrase “computer science wild” at a recent conference. I’ve got my own thoughts, especially since the list of the top 25 coding errors was released yesterday. The link to the article is - http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9125678&source=NLT_SIT&nlid=91.
I think any programmer should have the opportunity to write software that might kill someone, blow up an extremely expensive piece of equipment, or cause a waste of thousands of dollars because the system is down. Maybe then they would think, write better code, and debug the software thoroughly before they released it into the wild.
The Ariane 5 rocket blew up shortly after take-off because the software didn’t contain an exception handler for an overflow! (In that case, a 64-bit floating-point value was converted to a 16-bit signed integer that was too small to hold it, and nothing caught the resulting error. It is the numeric cousin of an array or buffer overflow, where the usual device is to transfer data capture to another buffer while the full buffer is written to i/o.)
The excuse for the disaster was that the specifications didn’t spell out the need for that programming mechanism. An exception handler is a very basic mechanism for catching and correcting errors. There is no excuse for this oversight.
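To show how little code such a handler takes, here is a Python sketch built around one kind of overflow: a number too large to fit in a 16-bit signed integer. It uses the standard struct module, which raises struct.error when a value is out of range. The function names and the decision to clamp to the nearest legal value are assumptions for illustration. A real flight system would need its own, carefully specified recovery policy.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value):
    """Pack a finite number as a big-endian 16-bit signed integer; may raise."""
    return struct.pack(">h", int(value))


def to_int16_guarded(value):
    """Same conversion, but catch the overflow and clamp instead of failing."""
    try:
        return struct.pack(">h", int(value))
    except struct.error:
        # Overflow handler: fall back to the nearest representable value.
        clamped = max(INT16_MIN, min(INT16_MAX, int(value)))
        return struct.pack(">h", clamped)
```

The unchecked version stops the program on a value like 40000.0, while the guarded version keeps running with 32767. Whether clamping, logging, or switching to a backup is the right response is exactly the kind of thing a specification should spell out.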
One major project I worked on was acoustical (noise) testing of aircraft engines. Our crew would go to some really great places like Roswell, NM, Moses Lake, WA, or Uvalde, TX. We would record and analyze the noise of the engines as the aircraft flew over at different altitudes with variable loads in various approach patterns.
There were several pieces of software that had to work in tandem. The airborne system, the ground-based weather station, the meteorological (met) plane, the acoustic data analyzer, and the analysis station all had to work together to get the required results.
There was no room for error. Measurements had to be exact, even out to 16 places after the decimal point.
Modeling techniques, programming languages and IDEs (Integrated Development Environments) have become very sophisticated and complex. A programmer today can “gee whiz” just about anything.
“Because we can” has become the norm.
This is great, but I’ve run into lab techs, etc. who were just this side of computer illiterate. Like my dad, they adhere to a limited number of computer applications, accessed by a few key strokes or mouse clicks they have memorized.
And don’t think that engineers are immune. They had to be dragged “kicking and screaming” away from their slide rules.
I’m for simple to start. You can always add more “bells and whistles” as the system (and its users) matures.