
Effective Bioinformatics Programming - Part 5

First, a little irony. In the late ’90s I interviewed with BMC Software in Houston. At that time, BMC was a supporter of big iron, providing report facilities, etc.

When asked what software I currently used, I replied with “GNU software”. The interviewer asked, “What is GNU? I’ve never heard of it.”

I explained that it was free software that you could download from the web, etc. But they weren’t really interested.

Anyway, eWEEK.com had a feature this week – “MindTouch Names 20 Most Powerful Open-Source Voices of 2010.” The first name mentioned was William Hurley, the chief architect of open source strategy at BMC (http://www.eweek.com/c/a/IT-Management/OSBC-Names-20-Most-Powerful-Open-Source-Voices-of-2010-758420/?kc=EWKNLEDP03232010A).

I guess they’re interested now.

Data Standards

There are any number of sequence data formats. This page at EBI – http://www.ebi.ac.uk/2can/tutorials/formats.html – describes several.

What is really astounding is that most of these formats have remained the same over the years. Tab-delimited and CSV (comma-separated values) formats are as prolific as ever, as is the GenBank report.

And equally astonishing is the fact that manipulating the data (e.g., parsing GenBank reports) is still done the same way.

True, the Bio libraries such as BioPerl, BioJava, and BioRuby now provide modules that make this easier (if you can install them), but it is still the same old download-and-parse.
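For example, here is what that download-and-parse cycle looks like with Biopython’s SeqIO module (a sibling of the Bio libraries mentioned above). This is only a minimal sketch; it assumes Biopython is installed and that “example.gb” is a GenBank file you have already downloaded.

```python
# Minimal sketch: parse a local GenBank file with Biopython's SeqIO module.
# Assumes Biopython is installed and "example.gb" is a GenBank-format file
# you have already downloaded (the filename is illustrative).
from Bio import SeqIO

for record in SeqIO.parse("example.gb", "genbank"):
    print(record.id, len(record.seq))
    # Each annotation is exposed as a SeqFeature with a type and a location.
    for feature in record.features:
        if feature.type == "CDS":
            print("  CDS at", feature.location)
```

Convenient, but conceptually it is the same fetch-a-report-and-walk-its-features routine we were writing by hand years ago.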

There are also several groups trying to standardize sequence data. The SO (Sequence Ontology) group (http://www.sequenceontology.org) is trying to do for sequence annotations what GO (Gene Ontology - http://www.geneontology.org) did for genes and gene product attributes.

MIGS (the Minimum Information About a Genome Sequence spec, at http://nora.nerc.ac.uk/5548/) is following the course of the MAGE MIAME standard (Minimum Information About a Microarray Experiment, at http://www.mged.org/Workgroups/MIAME/miame.html). Good luck with that, as many scientists have openly voiced objections to that standard.

XML and the Web

XML (eXtensible Markup Language) and WSDL (Web Services Description Language) are one method of easing the interchange of data. Links: http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/Web_Services_Description_Language.

There are a number of drawbacks to this setup.

Not all of the sequence data is available in XML or well-formed XML.

Some XML, such as NCBI XML, needs further interpretation. For example, the sequence feature (annotation) locations must be “translated” for further use (a small parsing sketch follows this list).

XSLT has performance issues and is size-limited. We tried processing LARTS-converted NCBI ASN.1 GenBank XML data with XSLT and found there were definite size limitations.

Using WSDL means exposing yourself to the world via the web.

JavaScript has too many security questions to be considered seriously.
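Going back to the NCBI XML point above, here is a rough sketch of the kind of “translation” step involved, using Python’s standard xml.etree.ElementTree. The element names, the filename, and the zero-to-one-based coordinate shift are illustrative assumptions, not the actual NCBI schema.

```python
# Sketch only: the element names below are placeholders, not the real NCBI
# schema; the point is that raw XML locations still need translating (e.g.
# from zero-based offsets to the one-based coordinates most tools expect).
import xml.etree.ElementTree as ET

tree = ET.parse("entry.xml")           # illustrative filename
for feat in tree.iter("feature"):      # placeholder element name
    start = int(feat.findtext("from", default="0"))
    end = int(feat.findtext("to", default="0"))
    # Hypothetical translation step: convert to one-based, inclusive coordinates.
    print(feat.get("type"), start + 1, end + 1)
```

Every consumer of the data ends up writing some variation of this glue, which is exactly the problem the standards efforts are supposed to solve.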

Software Development

Software development takes time and the right people. True, there is a lot of open source software out there, but I’ve mentioned the perils of that method in a previous blog.

A scientist with a grant to produce results dependent on computer analysis is only going to write code that is just good enough to back up those findings (or find someone – read: a post-doc – who can create that code very cheaply).

Has the code been extensively tested? Are the results produced by the code valid? Can the code be used by future projects? Is the software portable? Is it robust? Can it be ported to different hardware environments?

There is a great article, “Are we taking supercomputing code seriously?”, at http://www.zdnet.co.uk/news/it-strategy/2010/01/28/are-we-taking-supercomputing-code-seriously-40004192/. This article, in turn, has links to other articles on methods and algorithms, error behavior, and so on. This one on scientific software considers how multiprocessing has influenced algorithm development and the problem of different multiprocessors co-existing on the same machine: http://www.scientific-computing.com/features/feature.php?feature_id=262.

Its author states that, in the rush to do science, scientists fail to spot software for what it is: the analogue of the experimental instrument. Therefore the software must be treated with the same respect that a physical experiment would be.

When I started my career, I worked on a totally integrated database system for hospitals. It was one of those systems that was so far ahead of its time (mid-’80s) that a corporation bought the product and squashed it.

Anyway, our Systems and Extensions group supported the 6 compilers that comprised the system software that made the system function. The tailoring group wrote the code that created the screens that drove the system.

At the inception of the system, a decision had to be made about the makeup of the tailoring group: should they be programmers who would be taught medical jargon and terminology, or medical personnel – doctors, nurses, techs – who would be taught programming?

The decision was to go with medical personnel, as it was surmised they would understand hospitals better.

At the same time, limiting the number of screens a hospital could request (called tailoring) to 500 was discussed. The decision was to let each hospital have however many screens it wanted.

The tailoring group got their training and set to programming. After a period of time, it was realized that the group had, in essence, created one bad program and copied it thousands of times.

It was so bad, we did two things. First, we created a program profiler that produced a performance summary of the programming aspects of a given program. (The tailoring group immediately asked us to remove it, as it was too confusing.) Second, we created an automated programming module that would generate the code from the display widgets on the screen designed by the tailoring group.

This approach was helping, but people were abandoning ship as talk of an acquisition was surfacing. Our junior programmer went from new-hire to senior team member in 30 days.

I think we would have done a lot better with programmers learning medical terms.

As for the hospital screen limit, we had hospitals with 10,000 individual screens. We should have stuck with 500.

One last thing. When looking at any piece of scientific programming, please realize that the author accreditation usually starts with the PI. The people who did the actual work are generally listed at the end of the line. The PI may have had the idea, but as likely as not could not code it.

Effective Bioinformatics Programming Part 4

All Things HPC

Traditionally, High Performance Computing (HPC) means using high-end hardware like super computers to perform complex computational tasks.

A newer definition of HPC (“High Productivity Computing”) encompasses the entire processing and data-handling infrastructure. This includes software tools, platforms (computer hardware and operating systems), and data management software.

Parallel or Multicore Processing

I think just about everybody has performed some sort of parallel programming. Starting two processes at once on the same machine is parallelism. If a program runs by itself and doesn’t need input from another program or produce output for another program to use, it’s loosely coupled. It’s tightly coupled if one program feeds another.

PC architecture today supports multicore processors. A two-core CPU is, in essence, two CPUs on the same chip. These cores may share memory cache (tightly coupled) or not (loosely coupled). They may implement a method of message passing – intercore communication.
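For a rough illustration of the loosely coupled case on a multicore machine, here is a sketch using Python’s standard multiprocessing module (this is not Cilk, just the simplest standard-library analogue; the gc_content worker is an illustrative stand-in for whatever per-item work you actually need).

```python
# Loosely coupled parallelism: independent tasks farmed out across cores.
# Pure standard library; gc_content is an illustrative stand-in for
# whatever per-sequence work you actually need to do.
from multiprocessing import Pool

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    sequences = ["ATGCGC", "TTATAA", "GGGCCC"]
    with Pool() as pool:                    # one worker per core by default
        results = pool.map(gc_content, sequences)
    print(results)
```

A tightly coupled version would instead have the workers feed each other intermediate results, for instance over a multiprocessing.Queue or Pipe, which is where the intercore communication mentioned above comes in.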

Cilk is a language for multi-threaded parallel processing based on ANSI C. MIT was the initial developer of the Cilk technology. The link to their page is at – http://supertech.csail.mit.edu/cilk/.

MIT licensed Cilk to Cilk Arts, Inc. Cilk Arts added support for C++, parallel loops, and interoperability with serial interfaces. The product has since been acquired by Intel and will be incorporated into the Intel C++ compiler. The Intel page is at http://software.intel.com/en-us/articles/intel-cilk/.

Cilk++ makes multicore processing easy. Cilk++ uses keywords to adapt existing C++ code to multicore processing. (You will need a multicore processor.)

Cilk++ is currently in a technical preview state. This means they want you to use it and give them feedback. Download the Intel Cilk++ SDK at http://software.intel.com/en-us/articles/download-intel-cilk-sdk/. You will need to sign a license agreement.

The page also presents download links for 32-bit and 64-bit Linux Cilk++. (You will need an Intel processor for the Linux apps.)

There is an e-book on multicore programming available from Intel. The link is http://software.intel.com/en-us/articles/e-book-on-multicore-programming/.

The book contains a lot of information on multicore programming: parallelism, scheduling theory, shared-memory hardware, concurrency platforms, race conditions, divide-and-conquer recurrences, and more.

Grid computing is distributed, large-scale cluster computing. Two of the most famous grid projects are SETI@home and Folding@home (http://folding.stanford.edu).

SETI@home (the Search for Extra-Terrestrial Intelligence) uses internet-connected computers and is hosted by the Space Sciences Laboratory at UC Berkeley. Folding@home focuses on how proteins (biology’s workhorses) fold, or assemble themselves, to carry out important functions.

Other, lesser-known grids are Einstein@Home (http://www.einsteinathome.org – “Grab a wave from Space”), processing data from gravitational wave detectors, and MilkyWay@home (http://milkyway.cs.rpi.edu/milkyway), creating a highly accurate 3-D model of the Milky Way Galaxy.

Communication

The clusters mentioned above use the internet to exchange messages. If fast messaging is not required, plain old Ethernet should be sufficient for your messaging needs. The problem with Ethernet is latency: it takes a long time to set up and get that first message out there. After that, it’s solid.

But if you’re looking for constant speed, try InfiniBand (http://en.wikipedia.org/wiki/Infiniband), Myrinet (www.myri.com), or QsNet (http://en.wikipedia.org/wiki/QsNet).

Gamers

Oh, those gamers. Without their demand for faster, bigger, better, where would we be?

For example, do not overlook the gaming console. NCSA (National Center for Supercomputing Applications) has a cluster of Sony PlayStations. The PlayStation 3 runs Yellow Dog Linux. The average PS3 retails for around $600. The Folding@home grid runs on PS3s and PCs.

Then we come to the GPU (Graphics Processing Unit). GPU computing means using the GPU to do general purpose scientific and engineering computing. The model for GPU computing couples a CPU with a GPU, with the GPU performing the heavy processing. (http://www.nvidia.com/object/GPU_Computing.html)

One of the hottest GPUs is the NVIDIA Tesla, which is based on the CUDA GPU architecture code-named “Fermi”.

FPGAs (Field Programmable Gate Arrays)

Technological devices keep getting smaller and smaller, and the machinery gets buried under tons of software burdened with the menu systems connected to the development environment from hell.

FPGAs take you back to the schematic level. (I was known as a “bit-twiddler” at IBM.)

My old friends at National Instruments (http://www.ni.com/fpga/) have NI LabView FPGA. LabView FPGA provides graphical programming of FPGAs.

Their video on FPGA technology is a good intro to FPGAs (http://www.ni.com/fpga_technology/). Several other videos available at this same site go into further detail. For more info on FPGA hardware, see http://en.wikipedia.org/wiki/FPGA.

(I still haven’t forgiven NI for nuking my data acquisition PC with their demo. I lost a lot of stuff. All was backed up, but re-installing was not fun.)

FYI – the industry is desperately seeking parallel and FPGA programmers.

Data Representation in Database Design

The most recent programming languages are object-oriented. However, the most efficient databases are relational. There are object-oriented database systems, but for the most part they are very expensive and very, very slow. Postgres is an RDBMS (Relational Database Management System) that does implement a form of inheritance, where one table may extend (inherit) another table.
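As a small sketch of that inheritance feature, here is PostgreSQL’s INHERITS clause driven from Python with psycopg2. The database name, table names, and columns are all illustrative, not a recommended schema.

```python
# Sketch of PostgreSQL table inheritance, driven from Python via psycopg2.
# The connection parameters, table names, and columns are illustrative.
import psycopg2

conn = psycopg2.connect(dbname="annotations_db")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE features (
            id        serial PRIMARY KEY,
            seq_id    text NOT NULL,
            seq_start integer,
            seq_stop  integer
        )
    """)
    # The child table inherits every column of features and adds its own.
    cur.execute("""
        CREATE TABLE cds_features (
            frame integer
        ) INHERITS (features)
    """)
conn.close()
```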

Then you have XML. XML Schemas are adding another dimension to this complexity. XML is popular for communication (SOAP) and representation (XSLT). Data comes from an RDBMS, gets stuffed into objects, translated to XML and sent on one end, then translated back to objects and stored in an RDBMS at the other end.

The difficulty of mapping objects to an RDBMS is known as the object-relational (O/R) impedance mismatch. See this link for a discussion of software development processes (http://www.agiledata.org/) and a link to a recent book on database techniques for mapping objects to relational databases – http://www.agiledata.org/essays/mappingObjects.html.

But beware: these ORMs (object-relational mappers) sometimes produce a schema that isn’t completely relational and therefore suffers in performance. Also, the SQL produced by ORMs may not be optimal.
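To see both halves of that complaint, here is a minimal ORM sketch using SQLAlchemy (one ORM among many; the Gene class and the SQLite URL are illustrative, and SQLAlchemy 1.4+ style is assumed). With echo=True you can inspect the SQL the ORM actually emits and judge it for yourself.

```python
# Small ORM sketch with SQLAlchemy: a class mapped to a relational table.
# echo=True logs the SQL the ORM generates, which is the part worth auditing.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Gene(Base):                        # illustrative mapped class
    __tablename__ = "genes"
    id = Column(Integer, primary_key=True)
    symbol = Column(String, nullable=False)
    description = Column(String)

engine = create_engine("sqlite:///genes.db", echo=True)
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Gene(symbol="TP53", description="tumor protein p53"))
    session.commit()                     # watch the INSERT the ORM emits
```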

To effectively design and develop a relational database, learn UML (Unified Modeling Language). The Objects By Design web site (http://www.objectsbydesign.com) covers UML and a lot of other object-oriented topics and is worth a look.

Rational Rose is the UML design tool that I use. It has now been purchased by IBM. Rational uses what is known as the Rational Unified Process.

Speaking of XML, some of the UML design tools can now output XML directly from the data record definitions.

See this link for a list of current UML products – http://www.objectsbydesign.com/tools/umltools_byCompany.html.

The End of SQL

The Computerworld blog site has an interesting three-part series entitled “The End of SQL and relational databases?”

Part 1 covers relational methodology and SQL. The link to Part 1 is here – http://blogs.computerworld.com/15510/the_end_of_sql_and_relational_databases_part_1_of_3.

Part 2 is a list of current NoSQL databases. The link to part 2 is here – http://blogs.computerworld.com/15556/the_end_of_sql_and_relational_databases_part_2_of_3

Part 3 is a list of links to NoSQL sites, articles, and blog posts. The link to part 3 is here - http://blogs.computerworld.com/15641/the_end_of_sql_and_relational_databases_part_3_of_3

In short, the “NoSQL” (http://en.wikipedia.org/wiki/NoSQL) movement and cloud-based data stores are striving to completely remove developers from a reliance on SQL and relational databases.

In a post-relational world, they argue, a distributed, context-free key-value store is probably the way to go. This makes sense when there can be thousands of sequence searchers but only one updater. A transactional database would be overkill.
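As a toy illustration of that read-mostly, key-value pattern, here is a sketch using Python’s standard shelve module as a stand-in for a real distributed store (Redis, Riak, and the like); the accession and sequence are illustrative.

```python
# Toy key-value store: many readers, one writer, no transactions needed.
# shelve is a standard-library stand-in for a real distributed store;
# the accession and sequence below are illustrative.
import shelve

# The single updater writes sequences keyed by accession.
with shelve.open("sequences") as db:
    db["NM_000546"] = "ATGGAGGAGCCGCAGTCAGAT..."

# Any number of searchers then look records up by key, no SQL involved.
with shelve.open("sequences", flag="r") as db:
    print(db.get("NM_000546", "not found"))
```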

Part 5 of Effective Bioinformatics Programming coming soon.
