
A Valuable Lesson

There once was a web portal called the Search Launcher to which I dedicated 4 years of my life.

It sort of fell into my lap. The only orientation I got was the basic directory structure. The supporting csh scripts, programs, and databases (and the programs that created those databases) I had to discover on my own.

It took me about a year to get it all under control. I think I actually made things better. I was used to dealing with tons of data, being on call at all hours, and minor things like distributed and parallel processing. I had the front-end server off-loading compute-intensive processes to a second, faster machine, as NFS was proving too flaky. I had huge databases distributed across many disks according to their size. I had website client programs on the Mac and PC that a researcher could run from a local machine.

I would work late, and on weekends, during the off-hours when the load was lighter. If I had something really crucial to do, I telecommuted because I didn’t want any distractions.

Things ran smoothly, but there was the occasional hiccup. Mostly I monitored things and planned for improvements.

People would come by my cubicle and remark on how it looked as if I had nothing to do. If only you knew, I thought to myself.

One day, out of the blue, I was asked to interview someone. I was told he would be joining our group.

Okay, so I interviewed him. His main exposure to computing was Windows PCs. He had little to no Unix/Linux experience, much less experience with programming, websites, huge amounts of data, or distributed anything. He did, however, have “wet lab” experience.

After he was hired, he told me that he was going to “rework things from top to bottom and make it easy for me.”

As he tried to “rework things”, he decided that everything had to be on one machine, on one disk. He said this was to make things easier, but I knew the truth: he didn’t understand the current configuration.

I decided it was time to bow out. I didn’t want to be held responsible when his “improvements” came crashing down.

So I left. Before I did, however, he had me move everything to the one disk and then asked me to create a tarball. Well, tar wouldn’t work: the directory structure had become so convoluted, and the path names so long, that tar couldn’t cope. Not to mention the size of what had to be tarred!

This was a warning of things to come, as the department decided to invest in a Linux farm at his suggestion. Needless to say, the system was improperly configured and crashed two to three times a week.

Remember those long directory names that tar couldn’t comprehend? Well, the Linux farm was configured to continue that wonderful tradition, with the result that everybody thought tar was archiving the data when it really wasn’t. Nobody was reading the error files generated by the tar process, so nobody paid attention until sometime later. Then they hired a sysadmin whose sole duty was to watch the tar archive and make sure it took.

I was off to other venues, one of which was Enron. Yes, I was there when it all came down, but the 200 people in my department were not affected. It was a circus getting to work for a while. Reporters were hiding under the stairways in the parking garage, hoping for news tidbits.

I heard that my old bioinformatics department had hired an “efficiency expert” to assess the problems they were encountering and advise them on how to fix them.

They got tired of hearing how everything was broken, so they let the expert go. Business continued as usual.

Next they hired a really good sysadmin to take care of things, but he spent most of his time keeping the farm going by scrounging parts from the backup system. He left after a while.

The moral of the story is this: know what you are hiring. Don’t give people the power to re-engineer something they don’t know about or understand. That will upset the people who do know what they are doing, and that valuable experience will move on.

A lot of good people, with a much better plan than a misconfigured Linux farm, moved on to greener pastures, putting the department in the position of trying to recover what was lost.

With all the data now forthcoming, and the latest news that most sequences are wrongly annotated, it’s time for experts.

As an aside: on Thursday, Jan. 6th, Chris Mason, Assistant Professor at the Institute for Computational Biomedicine at Cornell University, gave a fascinating talk on deep transcriptome analysis that listed the following observations next-gen sequencing is bringing to light.

Some of the most interesting points from Mason’s talk were:

  • A large fraction of the existing genome annotation is wrong.
  • We have far more than 30,000 genes, perhaps as many as 88,000.
  • About ten thousand genes use more than six different sites for polyadenylation.
  • 98% of all genes are alternatively spliced.
  • Several thousand genes are transcribed from the “anti-sense” strand.
  • Lots of genes don’t code for proteins. In fact, most genes don’t code for proteins.

Mason also described the discovery of 26,187 new genes that were present in at least two different tissue types.


Genetics and biology are concrete sciences. Computer science and engineering entail a lot of abstract thinking, and that thinking is desperately needed to build the underlying structure that will support analysis of the masses of sequence data now amassing.

Get the right people for the job and you won’t find trouble.

Effective Bioinformatics Programming - Part 1

The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte.

This article is a good survey that covers all the latest topics and mentions all the currently popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.

I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.

They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”

Programming Concepts

Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.

To get a good handle on programming concepts, I recommend a book: Structure and Interpretation of Computer Programs from MIT Press, the basis for an intro to computer science course at MIT. It’s called the Wizard Book or the Purple Book.

I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.

Best of all, the course (and the downloadable book) are freely available from MIT through the MIT OpenCourseWare website.

There’s a blog entry that goes into further explanation of the course and the book.

And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for an engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”

According to MIT’s purple book: “In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”

You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)
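To make the definition concrete, here is a minimal Python sketch (my own example, not the book’s):

    def reverse_complement(seq):
        # 'seq' is a formal parameter; per the book's definition, its
        # scope is the body of this procedure and nothing outside it.
        complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
        return "".join(complement[base] for base in reversed(seq))

    print(reverse_complement("ATGC"))   # prints GCAT
    # print(seq)  # NameError: 'seq' is not bound outside the procedure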

Programming Languages

The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.

These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”

Let me add the following. There are differences among programming languages; by that, I mean compiled vs. scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, a language specific to the computer architecture the code will run on. Compiled code has a definite speed advantage, but if the main module or any supporting module is changed, the code must be recompiled and relinked before the change takes effect. And since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.

Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.

An interpreted program runs more slowly than a compiled program, because every line must be analyzed as it is read. But the code isn’t particularly tied to one machine architecture, making portability easier (provided the byte code interpreter is present). And since the code is only interpreted at run time, extensions and modifications to the code base are easier, making these languages great for beginning programmers or for rapid prototyping.
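You can even look at that byte code. Here is a small Python sketch (my own illustration; the gc_content function is made up for the purpose) using the standard dis module:

    import dis

    # A made-up example function: fraction of G and C bases in a DNA string.
    def gc_content(seq):
        return sum(base in "GC" for base in seq) / len(seq)

    # Show the byte code the CPython virtual machine interprets at run
    # time; no machine code for your particular CPU is ever produced.
    dis.dis(gc_content)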

But let’s get back to memory management. This, along with processing speed, will be a huge deal in next-gen data analysis and management.

Perl’s automatic memory management has a problem with circularity, because Perl (like Python) counts references. (Ruby, for what it’s worth, uses a mark-and-sweep garbage collector instead, so it escapes this particular trap.)

If object 1 points to object 2 and object 2 points back to object 1, but nothing else in the program points to either of them, the pair forms a reference cycle: their reference counts never drop to zero, so these objects don’t get destroyed. They remain in memory. If such objects get created again and again, the result is a memory leak. (The usual cure is a weak reference, one that doesn’t count toward keeping its target alive.)
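Here’s a minimal Python sketch of such a cycle. One caveat: modern CPython layers a cycle detector on top of its reference counting, so it can eventually reclaim pairs like this; Perl’s pure reference counting cannot, which is why Perl offers Scalar::Util::weaken.

    import gc
    import weakref

    class Node:
        def __init__(self, name):
            self.name = name
            self.partner = None

    a = Node("object 1")
    b = Node("object 2")
    a.partner = b              # object 1 points to object 2
    b.partner = a              # object 2 points back to object 1

    watcher = weakref.ref(a)   # observe object 1 without owning it
    del a, b                   # drop the only outside references

    # Reference counting alone never frees the pair: each object still
    # holds a reference to the other, so both linger in memory.
    print(watcher() is not None)   # True: still alive

    gc.collect()                   # CPython's cycle detector steps in
    print(watcher())               # None: the cycle has been reclaimed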

I also have to ask: what about C/C++, Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot of Fortran.


You can’t mention programming without mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.

Here’s a link to an ebook, The Art of Debugging. It’s mainly Unix-based and C-centric, and a little dated. But good stuff never goes out of style.

Chapter 4, “Debugging: Theory,” explains various debugging techniques. Chapter 5, “Profiling,” talks about profiling your code, or determining where your program is spending most of its time.

He also mentions core dumps. A core file is what Unix/Linux writes when your C/C++/Fortran program crashes. You can examine this core to determine where your program went wrong. (It gives you a place to start.)
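A typical session looks something like this (myprog is a stand-in name, and core-file naming varies by distribution):

    $ ulimit -c unlimited     # allow the shell to write core files
    $ ./myprog                # stand-in for your crashing program
    Segmentation fault (core dumped)
    $ gdb ./myprog core
    (gdb) bt                  # backtrace: the call stack at the moment of death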

The Linux Foundation Developer Network has an online tutorial, Zen and the Art of Debugging C/C++ in Linux with GDB. They write a C program (incorporating a bug), create a makefile, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.

You can debug Perl by invoking it with the -d switch, which runs your script under Perl’s interactive source-level debugger. Even without it, a crashing Perl script usually reports the line number that caused the problem and some explanation of what went wrong.

Python also accepts a -d option, but that merely turns on parser debugging output; for day-to-day debugging, run your script under the standard pdb module instead.
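In both cases the invocation is a one-liner (myscript is a stand-in name):

    $ perl -d myscript.pl         # Perl's interactive source-level debugger
    $ python -m pdb myscript.py   # Python's interactive debugger, pdb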

Octal Dumps

One of the most useful utilities in Unix/Linux is od (octal dump). You can examine files as octal values (the default), hex values, or ASCII characters.

od is very handy for examining data structures, finding hidden characters, and reverse engineering.

If you think your code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.
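For example, a data file that arrived from a Windows machine may hide carriage returns that break line-oriented parsers; od -c makes them visible (a made-up example):

    $ printf 'chr1\t1000\r\n' > probe.txt
    $ od -c probe.txt
    0000000   c   h   r   1  \t   1   0   0   0  \r  \n
    0000013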

That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.
