LifeFormulae Blog » Posts for tag 'MIT'

What is Old Is New Again - The Cloud

Cloud computing is the current IT rage, said to cure all information management ills.

Cloud computing is just a new name for timeshare, a system in which various entities shared a centralized computing facility. A giant piece or two of big iron and floors of tape decks provided information processing and storage capabilities for a price.

The user was connected to the mainframe by a dumb terminal and, later on, by PCs. The advantage (said the sales jargon) was that the user didn’t need to buy any additional hardware, or worry about software upgrades or data backup and recovery. They would only pay for the time and space their processes required. Resources would be pooled and connected by a high speed network and could be accessed on demand. The user wouldn’t really know what computing resources were in use; they just got results. Everything depended on the network communications between the user and the centralized computing source.

What is New

Cloud computing is more powerful today because the communications network is the Internet. Some Cloud platforms also offer Web access to the tools (programming language, database, web utilities) needed to create the cloud application.

The most important aspect I believe the Cloud offers is instant elasticity. A process can be upgraded almost instantaneously to use more nodes and obtain more computing power.

There are quite a few blog entries out there concerning the “elastic” cloud. For thoughts on “spin up” and “spin down” elasticity see For thoughts on “how elasticity could make you go broke, or On-demand IT overspending” see

And finally, an article that spawned the “elasticity is a myth” connotation or “over-subscription and over-capacity are two different things”, see –

A good article that covers elasticity, hypervisors, and cloud security in general is located at The site is maintained by the Association for Computing Machinery. There are lots of articles on all sorts of computing topics including, “Why Cloud Computing Will Never Be Free” (

The Clouds

The most notable Clouds are Amazon’s Elastic Compute Cloud (EC2), Google’s App Engine, and Microsoft’s Azure.

The three Cloud delivery models include:

    • Software as a service (SaaS), applications running on a cloud are accessed via a web browser

    • Platform as a service (PaaS), a hosted platform (programming language, database, web utilities) on which users develop and run their own cloud applications

    • Infrastructure as a service (IaaS), provides computing resources to users on an as-needed basis

Pros and Cons

There are pros and cons for Cloud Computing. Microsoft’s Steve Ballmer is a proponent of Cloud computing.

In a recent email to Microsoft’s employees, Ballmer makes the following case for Cloud Computing. He advises his employees to watch a video in which he makes the following points.

In my speech, I outlined the five dimensions that define the way people use and realize value in the cloud:

  • The cloud creates opportunities and responsibilities

  • The cloud learns and helps you learn, decide and take action

  • The cloud enhances your social and professional interactions

  • The cloud wants smarter devices

  • The cloud drives server advances that drive the cloud

Some very notable people are anti-cloud.

Richard Stallman, GNU software founder, said in a recent interview for the London Guardian that Cloud computing is a trap.

“The Web-based programs like Google’s Gmail will force people to buy into locked, proprietary systems that will cost more and more over time, according to the free software campaigner.

‘It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign,’ he told The Guardian. ‘Somebody is saying this is inevitable — and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.’”

Aside from all that, what should a potential user be wary of in the Cloud? I’ll try to answer that below.

Security in the Cloud

Security in the cloud is a major concern. Hackers are salivating because everything, applications and data alike, is in the same place.

How do you know whether the node your process is accessing is real or virtual? The Hypervisor (in Linux, a special version of the kernel) owns the hardware and spawns virtual nodes. If the Hypervisor is hacked, the hacker owns all the nodes created by it. http://www.linux-kvm.org has further explanations and discussions of virtual node creators/creations.

Data separation is a big concern. Could your data become contaminated by data in other environments in the cloud? What access restrictions are in place to protect sensitive data?

Can a user in another cloud environment inadvertently or intentionally get access to your data?

Data interoperability is another question mark. A company cannot easily transfer data from a public cloud provider, such as Amazon, Microsoft, or Google, put it in a private IaaS that a private cloud provider develops for the company, and then copy that data from its private cloud to another cloud provider, public or private. This is difficult because there are no standards for operating in this hybrid environment.

Data Ownership

Who is the custodian and who controls data if your company uses cloud providers, public and private?

Ownership concerns have not been resolved by the cloud computing industry. At the same time, the industry has no idea when a standard will emerge to handle information exchanges.

The W3C is sponsoring workshops and publishing proposals concerning standards for the Cloud. You can subscribe to their weekly newsletter and stay up on all sorts of web-based technologies.

Also, the Distributed Management Task Force Inc. is a consortium of IT companies focusing on “Developing management standards & promoting interoperability for enterprise & Internet environments”.

The DMTF Open Cloud Standards Incubator was launched to address management interoperability for Cloud Systems. The DMTF leadership board currently includes AMD, CA Technologies, Cisco, Citrix Systems, EMC, Fujitsu, HP, Hitachi, IBM, Intel, Microsoft, Novell, Rackspace, Red Hat, Savvis, SunGard, Sun Microsystems, and VMware.

Working with the Cloud

Working with the Cloud can be intimidating. One suggestion is to build a private cloud in-house before moving on to the public cloud.

However, even that has its difficulties. Not to worry, there are several tools available to ease the transition.

There is a Cloud programming language – Bloom, developed at UC Berkeley by Dr. Joseph Hellerstein. HPC In The Cloud has published an interview with Dr. Hellerstein at

Bloom is based on Hadoop, which is open source software for High Performance Computing (HPC) from Apache.

For ease of interconnectivity, Apache has released Apache libcloud, a standard client library written in Python for many popular cloud providers. But libcloud doesn’t cover data standards, just connectivity.

MIT StarCluster is an open source utility for creating and managing general purpose computing clusters hosted on Amazon’s Elastic Compute Cloud (EC2). StarCluster minimizes the administrative overhead associated with obtaining, configuring, and managing a traditional computing cluster used in research labs or for general distributed computing applications.

All that’s needed to get started with your own personal computing cluster on EC2 is an Amazon AWS account and StarCluster.

HPC presents use cases as a means to understanding cloud computing.

BMC Bioinformatics has a new methodology article, Cloud Computing for Comparative Genomics, that includes a cost analysis of using the cloud. Download the .pdf at

I hope this will get you started.  Once again, a big thanks to Bill for his assistance.

Effective Bioinformatics Programming - Part 1

The PLOS Computational Biology website recently published “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel T. Dudley and Atul J. Butte.

This article is a good survey that covers all the latest topics and mentions all the currently popular buzzwords circulating above, around, and through the computing ionosphere. It’s a good article, but I can envision readers’ eyes glazing over about page 3. It’s a lot of computer-speak in a little space.

I’ll add in a few things they skipped or merely skimmed over to give a better overview of what’s out there and how it pertains to bioinformatics.

They state that a biologist should put together a Technology Toolbox. They continue, “The most fundamental and versatile tools in your technology toolbox are programming languages.”

Programming Concepts

Programming languages are important, but I think that Programming Concepts are way, way more important. A good grasp of programming concepts will enable you to understand any programming language.

To get a good handle on programming concepts, I recommend a book. This book, Structure and Interpretation of Computer Programs from MIT Press, is the basis for an intro to computer science at MIT. It’s called the Wizard Book or the Purple Book.

I got the 1984 version of the book which used the LISP language. The current 1996 version is based on LISP/Scheme. Scheme is basically a cleaned-up LISP, in case you’re interested.

Best of all, the course (and the downloadable book) are freely available from MIT through the MIT OpenCourseWare website.

There’s a blog entry that goes into further explanation about the course and the book.

And just because you can program, it doesn’t mean you know (or even need to know) all the concepts. For instance, my partner for an engineering education extension course was an electrical engineer who was programming microprocessors. When the instructor mentioned the term “scope” in reference to some topic, he turned to me and asked, “What’s scope?”

According to MIT’s Purple Book, “In a procedure definition, the bound variables declared as the formal parameters of the procedure have the body of the procedure as their scope.”
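To make that definition concrete, here is a minimal sketch in Python (the Purple Book itself uses Scheme, so this is a translation of the idea, not the book's example). The function name scale and its parameters are made up for illustration.

```python
def scale(values, factor):
    # 'values' and 'factor' are the formal parameters. Their scope is
    # the body of this function and nothing else.
    scaled_values = [v * factor for v in values]
    return scaled_values

# A separate, module-level variable that merely shares the name 'factor'.
factor = 10

# Inside the call, 'factor' is bound to 2; the outer 'factor' is untouched.
scaled = scale([1, 2, 3], 2)

# Names defined in the function body are not visible out here.
try:
    scaled_values
except NameError:
    local_visible_outside = False
else:
    local_visible_outside = True
```

After this runs, the outer factor is still 10, and the name scaled_values does not exist outside the function body. That is all "scope" means.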

You don’t need to know about scope to program in assembler, because everything you need is right there. (In case you’re wondering, I consider assembler programmers to be among the programming elites.)

Programming Languages

The article mentions Perl, Python, and Ruby as the “preferred and most prudent choices” in which to seek mastery for bioinformatics.

These languages are selected because “they simplify the programming process by obviating the need to manage many lower level details of program execution (e.g. memory management), affording the programmer the ability to focus foremost on application logic…”

Let me add the following. There are differences in programming languages. By that, I mean compiled vs scripted. Languages such as C, C++, and Fortran are compiled. Program instructions written in these languages are parsed and translated into object code, a language specific to the computer architecture the code is to run on. Compiled code has a definite speed advantage, but if the main program or any supporting module is changed, the project must be rebuilt (at a minimum, the changed modules recompiled and the program relinked). Since the program is compiled into the machine code of a specific computer architecture, portability of the code is limited.

Perl, Python, and Ruby are examples of scripted or interpreted languages. These languages are translated into byte code which is optimized and compressed, but is not machine code. This byte code is then interpreted by a virtual machine (or byte code interpreter) usually written in C.

An interpreted program runs more slowly than a compiled program. Every line of an interpreted program must be analyzed as it is read. But the code isn’t particularly tied to one machine architecture, making portability easier (provided the byte code interpreter is present). Since code is only interpreted at run time, extensions and modifications to the code base are easier, making these languages great for beginning programmers or rapid prototyping.
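You can watch the byte code step happen in Python using only the standard library. This is a small sketch that assumes CPython; the one-line source string and the sequence "GGCATTA" are made up for illustration, and the exact instruction names differ between Python versions.

```python
import dis

# One line of source code, held as plain text.
source = "gc_count = seq.count('G') + seq.count('C')"

# Step 1: translate the source into a code object holding byte code.
code_obj = compile(source, "<example>", "exec")
raw_bytecode = code_obj.co_code  # bytes, not machine code

# Step 2: list the byte code instructions the virtual machine will run.
instructions = [ins.opname for ins in dis.get_instructions(code_obj)]

# Step 3: hand the byte code to the interpreter along with some data.
namespace = {"seq": "GGCATTA"}
exec(code_obj, namespace)
gc_count = namespace["gc_count"]
```

Printing instructions shows the virtual machine's operations rather than anything specific to your CPU, which is why the same script runs wherever the interpreter does.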

But let’s get back to memory management. This, along with processing speed, will be a huge deal in next gen data analysis and management.

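Perl’s automatic memory management has a problem with circularity, because Perl counts references. (The standard Python interpreter counts references too, but adds a collector that detects cycles. Ruby uses a tracing garbage collector and is not affected in the same way.)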

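If object 1 points to object 2 and object 2 points back to object 1, but nothing else in the program points to either of them (this is a circular reference), the reference counts never drop to zero and the objects don’t get destroyed. They remain in memory. If such objects get created again and again, the result is a memory leak. The usual remedy is to make one of the two links a weak reference, which does not count toward keeping an object alive.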
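Here is a rough demonstration of two objects pointing at each other, sketched in Python. It assumes CPython, which counts references and also runs a separate cycle collector; the collector is switched off here so that only reference counting is at work. The Node class and make_pair function are made up for illustration.

```python
import gc
import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None

def make_pair(weak_back_link):
    one = Node("object 1")
    two = Node("object 2")
    one.partner = two
    # The back link is either an ordinary (strong) reference or a weak one.
    two.partner = weakref.ref(one) if weak_back_link else one
    # Return a weak reference as a probe: it does not keep 'one' alive.
    return weakref.ref(one)

gc.disable()  # leave only reference counting in play

probe = make_pair(weak_back_link=False)
leaked_before_collect = probe() is not None  # the cycle keeps both alive

gc.collect()  # CPython's cycle collector finds and frees the pair
leaked_after_collect = probe() is not None

probe = make_pair(weak_back_link=True)
weak_version_leaked = probe() is not None  # freed by counting alone

gc.enable()
```

With the strong back link, the pair survives until the cycle collector runs. With the weak back link, plain reference counting frees both objects as soon as the function returns.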

I also have to ask: what about C/C++, Fortran, and even Turbo Pascal? The NCBI Toolkit is written in C/C++. If you work with foreign scientists, you will probably see a lot of Fortran.


You can’t mention programming without mentioning debugging. I consider the act of debugging code an art form any serious programmer should doggedly pursue.

Here’s a link to an ebook, The Art of Debugging. It’s mainly Unix-based, C-centric and a little dated. But good stuff never goes out of style.

Chapter 4, Debugging: Theory explains various debugging techniques. Chapter 5 – Profiling talks about profiling your code, or determining where your program is spending most of its time.
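The ebook is C-centric, but the profiling idea carries over to scripting languages. As a small sketch, here is Python's standard cProfile module timing a toy k-mer counter. The function count_kmers and the test sequence are made up for illustration and are not from the ebook.

```python
import cProfile
import io
import pstats

def count_kmers(seq, k):
    # Count every overlapping substring of length k.
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

profiler = cProfile.Profile()
profiler.enable()
counts = count_kmers("ACGT" * 5000, 3)
profiler.disable()

# Write the report to a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
profile_text = report.getvalue()
```

Printing profile_text lists each function called, how many times it was called, and how much time was spent in it, which tells you where to start optimizing.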

He also mentions core dumps. A core file is what gets written when your C/C++/Fortran program crashes in Unix/Linux. You can examine this core to determine where your program went wrong. (It gives you a place to start.)

The Linux Foundation Developer Network has an on-line tutorial, Zen and the Art of Debugging C/C++ in Linux with GDB. They write a C program (incorporating a bug), create a makefile, compile, and then use gdb to find the problem. You are also introduced to several Unix/Linux commands in the process.

You can debug Perl by invoking it with the -d switch. When a Perl program crashes, it usually reports the line number that caused the problem along with some explanation of what went wrong.

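In Python, the -d option only turns on parser debugging output. To step through a Python script the way perl -d does, run it under the standard pdb module (python -m pdb).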

Object Dumps

One of the most useful utilities in Unix/Linux is od (octal dump). You can examine files in octal (default), hex, or ASCII characters.

od is very handy for examining data structures, finding hidden characters, and reverse engineering.

If you think your code is right, the problem may be in what you are trying to read. Use od to get a good look at the input data.
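If od isn't available, the same kind of look at the raw bytes can be roughed out in a few lines of Python. This sketch only imitates the general idea of a hex-plus-characters dump. The function dump_bytes and its output layout are made up for illustration and do not match od's format.

```python
def dump_bytes(data, width=8):
    # Return one text line per 'width' bytes: offset, hex values, and
    # printable characters (anything unprintable is shown as a dot).
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:06x}  {hex_part:<{width * 3}} {text_part}")
    return lines

# Sequence data with a Windows line ending and a stray tab hiding in it.
sample = b"ACGT\r\nTTGA\t\n"
dump = dump_bytes(sample)
```

Printing the lines of dump shows the carriage return as 0d and the tab as 09, the sort of hidden characters that make a parser choke on input that looks fine in an editor.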

That’s it for Part 1. Part 2 will cover Open Source, project management, archiving source code and other topics.
