Effective Bioinformatics Programming Part 4
All Things HPC
Traditionally, High Performance Computing (HPC) has meant using high-end hardware like supercomputers to perform complex computational tasks.
A newer definition of HPC (“High Productivity Computing”) covers the entire processing and data-handling infrastructure. This includes software tools, platforms (computer hardware and operating systems), and data management software.
Parallel or Multicore Processing
I think just about everybody has performed some sort of parallel programming. Starting two processes at once on the same machine is parallelism. If a program runs by itself and doesn’t need input from another program or produce output for another program to use, it’s loosely coupled. It’s tightly coupled if one program feeds another.
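The distinction can be sketched in a few lines of Python (the worker functions here are made up for illustration): two processes that ignore each other are loosely coupled, while a producer feeding a consumer through a pipe is tightly coupled.

```python
# Loosely vs. tightly coupled parallelism with Python's multiprocessing module.
from multiprocessing import Process, Pipe

def double_range(n):
    """The 'work' both examples perform."""
    return [2 * i for i in range(n)]

def independent(n):
    # Loosely coupled: no input from, and no output to, any other process.
    double_range(n)

def producer(conn, n):
    # Tightly coupled: this process feeds its result to another one.
    conn.send(double_range(n))
    conn.close()

if __name__ == "__main__":
    # Loosely coupled: two processes started at once, ignoring each other.
    procs = [Process(target=independent, args=(1000,)) for _ in range(2)]
    for p in procs: p.start()
    for p in procs: p.join()

    # Tightly coupled: the producer's output becomes this process's input.
    parent_end, child_end = Pipe()
    p = Process(target=producer, args=(child_end, 5))
    p.start()
    print(parent_end.recv())   # [0, 2, 4, 6, 8]
    p.join()
```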
PC architecture today supports multicore processors. A two-core CPU is, in essence, two CPUs on the same chip. These cores may share a memory cache (tightly coupled) or not (loosely coupled), and they may implement a method of message passing for intercore communication.
Cilk is a language for multithreaded parallel processing based on ANSI C. MIT was the initial developer of the Cilk technology; their page is at http://supertech.csail.mit.edu/cilk/.
MIT licensed Cilk to Cilk Arts, Inc., which added support for C++, parallel loops, and interoperability with serial interfaces. The product has since been acquired by Intel and will be incorporated into the Intel C++ compiler. The Intel page is at http://software.intel.com/en-us/articles/intel-cilk/.
Cilk++ makes multicore processing easy. It uses keywords to adapt existing C++ code to multicore processing. (You will need a multicore processor.)
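This isn’t Cilk++ itself, but as a rough Python analogy (the function name is mine), a thread pool can mimic the fork-join pattern that Cilk++ expresses with its cilk_spawn and cilk_sync keywords:

```python
# Fork-join sketch: "spawn" half the work to another thread, keep working
# in the parent, then "sync" by waiting for the spawned half to finish.
from concurrent.futures import ThreadPoolExecutor

def psum(data, pool):
    mid = len(data) // 2
    left = pool.submit(sum, data[:mid])   # ~ cilk_spawn: child runs in parallel
    right = sum(data[mid:])               # the parent keeps working meanwhile
    return left.result() + right          # ~ cilk_sync: wait for the child

if __name__ == "__main__":
    with ThreadPoolExecutor() as pool:
        print(psum(list(range(10_000)), pool))  # 49995000
```

In Cilk++ the same structure stays ordinary C++ code; removing the keywords gives you back the serial program.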
Cilk++ is currently in a technical-preview state, which means Intel wants you to use it and give them feedback. Download the Intel Cilk++ SDK at http://software.intel.com/en-us/articles/download-intel-cilk-sdk/. You will need to sign a license agreement.
The page also presents download links for 32-bit and 64-bit Linux Cilk++. (You will need an Intel processor for the Linux apps.)
There is an e-book on multicore programming available from Intel. The link is http://software.intel.com/en-us/articles/e-book-on-multicore-programming/.
The book contains a lot of information on multicore programming, parallelism, scheduling theory, shared-memory hardware, concurrency platforms, race conditions, divide-and-conquer recurrences, and more.
Grid Computing
Grid computing is distributed, large-scale cluster computing. Two of the most famous grid projects are SETI@home and Folding@home (http://folding.stanford.edu).
SETI@home (the Search for Extra-Terrestrial Intelligence) uses internet-connected computers and is hosted by the Space Sciences Laboratory at UC Berkeley. Folding@home focuses on how proteins (biology’s workhorses) fold, or assemble themselves, to carry out important functions.
Other, lesser-known grids are Einstein@Home (http://www.einsteinathome.org – “Grab a wave from Space”), which processes data from gravitational-wave detectors, and MilkyWay@home (http://milkyway.cs.rpi.edu/milkyway), which is creating a highly accurate 3-D model of the Milky Way galaxy.
The grids mentioned above use the internet to exchange messages. If fast messaging is not required, plain old Ethernet should be sufficient for your messaging needs. The problem with Ethernet is latency: it takes a long time to set up a connection and get that first message out there. After that, it’s solid.
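You can see the shape of this locally (the numbers are illustrative only; loopback greatly understates real network latency): the first message pays for the TCP handshake, while later messages on the same connection are cheap.

```python
# Time the first message (connection setup included) vs. steady-state
# round trips over an already-open local TCP connection.
import socket, threading, time

def echo_server(sock):
    conn, _ = sock.accept()
    while data := conn.recv(64):
        conn.sendall(data)          # echo everything back
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # any free port
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

t0 = time.perf_counter()
cli = socket.create_connection(srv.getsockname())  # handshake: the slow part
cli.sendall(b"x"); cli.recv(64)
setup = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):                # reusing the connection
    cli.sendall(b"x"); cli.recv(64)
steady = (time.perf_counter() - t0) / 100

print(f"first message: {setup*1e6:.0f} us, steady state: {steady*1e6:.0f} us")
cli.close()
```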
Oh, those gamers. Without their demand for faster, bigger, better, where would we be?
For example, do not overlook the gaming console. NCSA (the National Center for Supercomputing Applications) has a cluster of Sony PlayStations. The PlayStation 3 runs Yellow Dog Linux, and the average PS3 retails for around $600. The Folding@home grid runs on PS3s and PCs.
Then we come to the GPU (Graphics Processing Unit). GPU computing means using the GPU to do general purpose scientific and engineering computing. The model for GPU computing couples a CPU with a GPU, with the GPU performing the heavy processing. (http://www.nvidia.com/object/GPU_Computing.html)
One of the hottest GPUs is the NVIDIA Tesla, which is based on the CUDA GPU architecture code-named “Fermi”.
FPGAs (Field Programmable Gate Arrays)
Technological devices keep getting smaller and smaller, and the machinery gets buried under tons of software burdened with menu systems connected to the development environment from hell.
FPGAs take you back to the schematic level. (I was known as a “bit-twiddler” at IBM.)
My old friends at National Instruments (http://www.ni.com/fpga/) have NI LabView FPGA. LabView FPGA provides graphical programming of FPGAs.
Their video on FPGA technology is a good intro (http://www.ni.com/fpga_technology/), and several other videos available at the same site go into further detail. For more info on the FPGA hardware, see http://en.wikipedia.org/wiki/FPGA.
(I still haven’t forgiven NI for nuking my data acquisition PC with their demo. I lost a lot of stuff. All was backed up, but re-installing was not fun.)
FYI – the industry is desperately seeking parallel and FPGA programmers.
Data Representation in Database Design
The most recent programming languages are object-oriented, yet the most efficient databases are relational. There are object-oriented database systems, but for the most part they are very expensive and very, very slow. Postgres is an RDBMS (Relational Database Management System) that does implement a form of inheritance, where one table may extend (inherit from) another table.
Then you have XML, and XML Schemas are adding another dimension to this complexity. XML is popular for communication (SOAP) and representation (XSLT). Data comes out of an RDBMS, gets stuffed into objects, and is translated to XML and sent from one end; at the other end it is translated back into objects and stored in an RDBMS.
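The middle of that pipeline, object to XML and back, looks something like this sketch (the record type and its fields are invented for illustration):

```python
# Serialize an object to XML for transport, then parse it back into an
# object on the receiving end.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class SeqRecord:
    accession: str
    organism: str

def to_xml(rec):
    root = ET.Element("sequence")
    ET.SubElement(root, "accession").text = rec.accession
    ET.SubElement(root, "organism").text = rec.organism
    return ET.tostring(root, encoding="unicode")   # what travels on the wire

def from_xml(text):
    root = ET.fromstring(text)
    return SeqRecord(root.findtext("accession"), root.findtext("organism"))

rec = SeqRecord("NM_000546", "Homo sapiens")
wire = to_xml(rec)
assert from_xml(wire) == rec    # the round trip loses nothing
```

Every hop in the full pipeline is another translation like this one, which is exactly where the complexity (and the overhead) piles up.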
The difficulty of mapping objects to an RDBMS is known as the object-relational (O/R) impedance mismatch. See this link for a discussion (http://www.agiledata.org/) of software development processes, and a link to a recent book on database techniques for mapping objects to relational databases – http://www.agiledata.org/essays/mappingObjects.html.
But beware: these ORMs (Object-Relational Mappers) sometimes produce a schema that isn’t completely relational and therefore suffers in performance. Also, the SQL produced by ORMs may not be optimal.
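At its core, what an O/R mapper automates is simple; this hand-rolled sketch (the table and class are hypothetical) maps one class to one table and one instance to one row. Real ORMs generate this SQL for you, which is exactly where suboptimal schemas and queries can sneak in unseen.

```python
# A minimal hand-rolled object-relational mapping over SQLite.
import sqlite3
from dataclasses import dataclass

@dataclass
class Gene:
    symbol: str
    chromosome: str

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT)")

def save(g):        # object -> row
    db.execute("INSERT INTO gene VALUES (?, ?)", (g.symbol, g.chromosome))

def load(symbol):   # row -> object
    row = db.execute("SELECT symbol, chromosome FROM gene WHERE symbol = ?",
                     (symbol,)).fetchone()
    return Gene(*row) if row else None

save(Gene("TP53", "17"))
print(load("TP53"))   # Gene(symbol='TP53', chromosome='17')
```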
To effectively design and develop for an RDBMS, learn UML (the Unified Modeling Language). The Objects By Design web site (http://www.objectsbydesign.com) covers UML and a lot of other object-oriented topics and is worth a look.
Rational Rose is the UML design tool that I use; Rational has since been purchased by IBM. Rational uses what is known as the Rational Unified Process.
Speaking of XML, some of the UML design tools can now output XML directly from the data record definitions.
See this link for a list of current UML products – http://www.objectsbydesign.com/tools/umltools_byCompany.html.
The End of SQL
The ComputerWorld blog site has an interesting three-part series entitled “The End of SQL and relational databases?”
Part 1 covers relational methodology and SQL. The link to part 1 is here – http://blogs.computerworld.com/15510/the_end_of_sql_and_relational_databases_part_1_of_3.
Part 2 is a list of current NoSQL databases. The link to part 2 is here – http://blogs.computerworld.com/15556/the_end_of_sql_and_relational_databases_part_2_of_3
Part 3 is a list of links to NoSQL sites, articles, and blog posts. The link to part 3 is here - http://blogs.computerworld.com/15641/the_end_of_sql_and_relational_databases_part_3_of_3
In short, the “NoSQL” (http://en.wikipedia.org/wiki/NoSQL) movement and cloud-based data stores are striving to completely remove developers from a reliance on SQL and relational databases.
In a post-relational world, they argue, a distributed, context-free key-value store is probably the way to go. This makes sense when there can be thousands of sequence searchers but only one updater; a transactional database would be overkill.
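That many-readers, one-writer pattern can be sketched as a toy in-process key-value store (the class and its API are invented here; real stores distribute the dictionary across machines):

```python
# A toy key-value store: one writer takes a lock, many readers just look up.
import threading

class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()   # only the single writer needs it

    def put(self, key, value):          # the one updater
        with self._lock:
            self._data[key] = value

    def get(self, key):                 # the thousands of searchers
        return self._data.get(key)

store = KVStore()
store.put("NM_000546", "TP53 mRNA")

# Many concurrent readers, no transactions, no SQL.
readers = [threading.Thread(target=store.get, args=("NM_000546",))
           for _ in range(8)]
for t in readers: t.start()
for t in readers: t.join()
print(store.get("NM_000546"))   # TP53 mRNA
```

With no joins, no schema, and no multi-row transactions to coordinate, a store like this shards across machines far more easily than a relational database does.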
Part 5 of Effective Bioinformatics Programming is coming soon.