The End of Bioinformatics?!
I read with some interest the announcement of Wolfram Alpha. Wolfram intends it to be the be-all and end-all of data mining systems and, some say, it makes bioinformatics obsolete.
Wolfram's basis is a formal Mathematica representation. Its inference engine is a large number of hand-written scripts that access data that has been accumulated and curated. The developers stress that the system is not Artificial Intelligence and is not aiming to be. For instance, a sample query,
“List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes ‘muscle’”
could currently be executed with SQL, provided the underlying data is there.
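To make that concrete, here is a minimal sketch (mine, not Wolfram's) of how such a query reduces to a couple of joins. The schema, column names, and significance cutoff are all hypothetical, invented purely for illustration.

```python
# Hypothetical schema: genes with a pre-computed positive-selection p-value,
# plus GO categories and OMIM entries keyed by gene_id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes    (gene_id TEXT PRIMARY KEY, symbol TEXT, selection_p REAL);
CREATE TABLE go_terms (gene_id TEXT, category TEXT);
CREATE TABLE omim     (gene_id TEXT, entry TEXT);
""")

query = """
SELECT DISTINCT g.symbol
FROM genes g
LEFT JOIN go_terms t ON t.gene_id = g.gene_id
LEFT JOIN omim o     ON o.gene_id = g.gene_id
WHERE g.selection_p < 0.05                                   -- "significant" positive selection
  AND (t.category LIKE '%muscle%' OR o.entry LIKE '%muscle%')
"""

for (symbol,) in conn.execute(query):
    print(symbol)
```

The hard part, of course, is not the SQL; it is accumulating and curating the underlying data.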
Wolfram won’t replace bioinformatics. What it will do is make it easier for a neophyte to get answers to his or her questions because they can be asked in a simpler format.
I would guess Wolfram uses one or more of these scripts to address a specific data set in conjunction with a natural language parser. These scripts would map the data to a common model that could then be rendered on a web page.
But why not AI? Why not replace all those "hand-written" scripts with a real inference engine?
I rode the first AI wave. I was a member of the first group of 25 engineers selected to be part of the McAir AI Initiative at McDonnell Aircraft Company ("There is AI in McAir"). In all, 100 engineers were chosen from engineering departments to attend courses leading to a Certificate in Artificial Intelligence from Washington University in St. Louis.
One of the neat things about the course was the purchase of at least 30 workstations (maybe as many as 60) from a young company called Sun, which were loaned to Washington University for the duration of the course. Afterwards, we got a few Symbolics machines for our CADD project.
Other than Lisp and Prolog, the software we used was called KEE (Knowledge Engineering Environment). There was also a DEC (Digital Equipment Corporation) language called OPS5.
The course was fast-paced but very extensive. We had the best AI consultants available at the time lecturing and giving assignments in epistemology, interviewing techniques, and so on. I had a whole stack of books.
The only problem was that no money was budgeted (or so I was told) for AI development in the departments when the engineers returned from the course eager to AI everything. A lot of people left.
Anyway, my group of three developed a “Battle Damage Repair” system that basically “patched up” the composite wing skins of combat aircraft. Given the size and location of the damage, the system would certify whether the aircraft would be able to return to combat, and would output the size and substance of the patch if the damage wasn’t that bad.
One interesting tidbit: We wanted to present our system at a conference in San Antonio and had a picture of a battle-damaged F-15 we wanted to use. We were told that the picture was classified and, as such, we couldn't use it. Yet about that same time, a glossy McAir brochure featuring our system and that photo was distributed to thousands of people at the AAAI (American Association for Artificial Intelligence) conference.
Another system I developed dealt with engineering schematics. These schematics were layered. Some layers and circuits were classified. Still another system scheduled aircraft for painting, and yet another charted a path for aircraft through hostile territory, activating electronic countermeasures as necessary.
I guess the most sophisticated system I worked on was with the B-2 program. The B-2 skin is a composite material. This material has to be removed from a freezer, molded into a final shape, and cooked in a huge autoclave before it completely thaws.
We had to schedule materials, accounting for the behavior of that material under various conditions, as well as people and equipment. The purpose was to avoid bottlenecks in people and equipment. I was exposed to the Texas Instruments Explorer and Smalltalk-80 on an Apple. I've been in love with Smalltalk ever since.
The system was developed, but it was never used. The problem was that it had to rank workers by expertise. These were union workers, and that wasn't allowed.
It was a nice system that integrated a lot of subsystems and worked well. Our RFP (Request for Proposals) went out to people like Carnegie Mellon. We had certain performance and schedule requirements that we wanted to see in the final system. We were told that the benchmarks would be difficult, if not impossible, to attain. Well, we did it, on our own, without their help.
We also had a neural net solution that inspected completed composite parts. The parts were submerged in water and bombarded with sound waves. The echoes were used by the system to determine part quality.
AI promised the world, and then it couldn’t really deliver. So it kind of went to the back burner.
One problem with being the be-all and end-all: the system will only be as good as your model. It will only be as good as the developers' ability to determine the behavior of the parts and how they interact with the whole. Currently, that is a moving target, changing day to day. Good luck.
Links -
Will Wolfram Make Bioinformatics Obsolete? - http://johnhawks.net/weblog/reviews/genomics/bioinformatics/wolfram-alpha-bioinformatics-2009.html
The most complex system I've configured was the airborne data acquisition and ground support system. However, not many people have to, or want to, do anything that large or complex. Some labs will need information from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today's data without a large cash outlay and that can be expanded at minimum cost to handle the data of tomorrow.
This week’s guest blogger, Bill Eaton, provides some guidelines for the configuration of a Database Server, a Web Server, and a Compute Node, the three most requested configurations.
(Bill Eaton)
General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware
- A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
- Linux: for most kernels, programs are limited to 3 GB. Physical memory can usually exceed 4 GB.
- Windows: the stock settings limit a program to 2 GB and physical memory to 4 GB. The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
- Other operating systems usually have a 2 or 3 GB program memory limit.
- A 64-bit operating system removes these limits. It also enables some additional CPU registers and instructions that may improve performance. Most will allow running older 32-bit program files.
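As a small aside (an illustrative sketch, not part of the original configuration notes), you can check whether the interpreter you are running is a 32-bit or 64-bit build from the size of a pointer:

```python
# Report whether this Python build is 32-bit or 64-bit, and on what platform.
import struct
import platform

pointer_bits = struct.calcsize("P") * 8   # size of a C pointer, in bits
print(f"{pointer_bits}-bit Python on {platform.machine()} / {platform.system()}")
```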
Database Server:
Biological databases are often large, 100 GB or more, often too large to fit on a single physical disk drive. A database system needs fast disk storage and a large memory to cache frequently-used data. These systems tend to be I/O bound.
Disk storage:
- Direct-attached storage: disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
- Network-attached storage: disk array connected to one or more hosts by a standard network. These may appear as network file systems using NFS, CIFS, or similar, or physical disks using iSCSI.
- SAN (storage area network): includes the above cases; multiple disk units share a network dedicated to disk I/O, usually Fibre Channel.
- Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.
Databases:
- Storage overhead: data repositories may require several times the disk space required by the raw data. Adding an index to a table can double its size. A test using a simple, mostly numeric table with one index gave the following overhead factors (on-disk size divided by raw data size) for some common databases; a sketch of how such a measurement can be made follows this list.
- MySQL using MyISAM: 2.81
- MySQL using InnoDB: 3.28
- Apache Derby: 5.88
- PostgreSQL: 7.02
- Data Integrity support: The server and disk system should handle failures and power loss as cleanly as possible. A UPS with clean shutdown support is recommended.
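Here is a sketch of the measurement method, using SQLite from the Python standard library purely for illustration. The figures above came from MySQL, Derby, and PostgreSQL, not from this script, and the row count and schema here are arbitrary.

```python
# Load a mostly numeric table plus one index, then compare the on-disk
# database size with the size of the raw data written as plain text.
import os
import random
import sqlite3
import tempfile

raw_rows = [(i, random.random(), random.random()) for i in range(100_000)]
raw_bytes = sum(len(",".join(map(str, r))) + 1 for r in raw_rows)  # raw data as CSV-ish text

path = os.path.join(tempfile.mkdtemp(), "overhead_test.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a REAL, b REAL)")
db.executemany("INSERT INTO t VALUES (?, ?, ?)", raw_rows)
db.execute("CREATE INDEX t_a ON t (a)")
db.commit()
db.close()

print("overhead factor:", round(os.path.getsize(path) / raw_bytes, 2))
```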
Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.
Web Service Software Considerations:
- PHP: Thread support still has problems. PHP applications running on a Windows system under either Apache httpd or IIS may encounter these. We saw a case where WordPress running under IIS or Apache httpd on Windows gave error messages but worked without problems under Apache httpd on Linux. IIS FastCGI made the problem worse. PHP acceleration systems may be needed to support large user bases.
- Perl: similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
- Java-based containers: (Apache Tomcat, JBoss, GlassFish, etc) These run on almost anything without problems, and usually scale quite well.
Compute nodes:
Requirements depend upon the expected usage. Common biological applications tend to be memory-intensive. A high-bandwidth network between the nodes is recommended, especially for large clusters. Network attached storage is often used to provide a shared file system visible to all the nodes.
- Classical “Beowulf” cluster: used for parallel tasks that require frequent communication between nodes. These usually use the MPI communication model (a minimal sketch follows this list), and often have a communication network tuned for this use, such as Myrinet. One master “head” node controls all the others and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
- Farm: used where little inter-node communication is needed. Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.
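For the curious, here is a minimal sketch of the MPI model mentioned above. It assumes the mpi4py binding, which is just one convenient choice; any MPI implementation would look much the same.

```python
# Run with something like: mpirun -np 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the job
size = comm.Get_size()   # total number of cooperating processes

# The "head" node (rank 0) broadcasts a work description to all the others.
work = {"task": "align chunk of sequence data"} if rank == 0 else None
work = comm.bcast(work, root=0)

print(f"rank {rank} of {size} received: {work}")
```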
The most important thing is to have a plan from the beginning that addresses all the system's needs for storage today and is scalable for tomorrow's unknowns.
Data Stewardship -
The Conducting, Supervising, and Management of Data
Next-gen sequencing promises to unload reams and reams of data on the world. Pieces of that data will prove relevant to one or another of the specific research projects in your enterprise. At the same time, your lab may produce more data through annotation or its own research. How do you handle it all?
First, you should appoint a data steward. This person must understand where the data comes from, how it is modeled, who uses what parts of it, and any results this data may produce, such as forms, etc. Most importantly, they must be able to verify the integrity of that data.
Data, Data, Data
I’ve handled lots of engineering and bioinformatics data in my time…
In engineering, I had to be sure all instrumentation was calibrated correctly and that production data was representative and correct. Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, and others, focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on. This data inventory session proved extremely useful, as terabytes of data swept through the system on a weekly basis.
For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data. That data was then uploaded to a local database for access by various applications. As the amount of available sequence data grew, culling the data, storage of this data, and archiving of the initial and final data became something of a headache.
My biggest bioinformatics problem was NCBI data, as that was where we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug. Grep became my friend.
I tried downloading complete GenBank reports from the NCBI FTP site, but that took too much space. I used keywords with the Entrez eutils, but the granularity wasn't fine enough, and I ended up with far too much data. Finally, I resorted to running the NCBI Toolkit against NCBI ASN.1 binary files.
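For reference, the eutils approach boils down to an esearch call followed by an efetch, roughly as sketched below. The search term, retmax, and crude regex "parsing" are illustrative only, and NCBI asks that heavy use be throttled and identify itself with tool/email parameters.

```python
# Keyword search against Entrez, then fetch the matching GenBank records.
import re
import urllib.parse
import urllib.request

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# 1. esearch: get identifiers for records matching a keyword query.
params = {"db": "nucleotide",
          "term": "Homo sapiens[Organism] AND muscle[Title]",
          "retmax": 5}
xml = urllib.request.urlopen(
    f"{BASE}/esearch.fcgi?" + urllib.parse.urlencode(params)).read().decode()
ids = re.findall(r"<Id>(\d+)</Id>", xml)

# 2. efetch: retrieve the full GenBank flat-file records for those ids.
params = {"db": "nucleotide", "id": ",".join(ids),
          "rettype": "gb", "retmode": "text"}
gb = urllib.request.urlopen(
    f"{BASE}/efetch.fcgi?" + urllib.parse.urlencode(params)).read().decode()
print(gb[:500])
```

The trouble, as noted above, is that keyword granularity is coarse: you still end up pulling far more than you need.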
LARTS would have made this part so much easier.
The Data Steward should also be familiar with data maintenance and storage strategies.
Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.
Bill Eaton: Data Backup and Archival Storage
Backups are usually kept for a year or so, then the storage media is reused.
Archives are kept forever. Retrievals are usually infrequent for both.
Storage Technologies
Tape: suitable for backup, not as good for archiving.
Pro: Current tape cartridge capacities are around 800 GB uncompressed.
Cost per bit is roughly the same as for hard disks.
Con: Tape hardware compression is ineffective on already-compressed data.
Tapes and tape drives wear out with use.
Software is usually required to retrieve tape contents. (tar, cpio, etc)
Tape technology changes frequently, formats have a short life.
Optical: better for archiving than backup
Pro: DVD 8.5 GB, Blu-Ray 50 GB
DVD contents can be a mountable file system, so that no special software is needed for retrieval.
Unlimited reading, no media wear.
Old formats are readable in new drives.
Con: Limited number of write cycles.
Hard Disks: could replace tape
Pro: Simple: Use removable hard disks as backup/archive devices.
Disk interfaces are usually supported for several years.
Con: Drives may need to be spun up every few months and contents rewritten every few years.
MAID (Massive Array of Idle Disks): disk array in which most disks are powered down when not in active use.
Pro: The array controller manages disk health, spinning up and copying disks as needed.
The array usually appears as a file system. Some can emulate a tape drive.
Con: Expensive.
Classical: the longest-life archival formats are those known to archaeologists.
Pro: Symbols carved into a granite slab are often still readable after thousands of years.
Con: Backing up large amounts of data this way could take hundreds of years.
asn2xml
Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.
While working on the filters for the LARTS product, most specifically the GenBank-like report, I realized how tightly coupled the NCBI ASN.1/XML is to the toolkit.
Basically, you've got to understand the toolkit code in order to translate what the XML is saying. The infinite extensibility and recursive structure of the ASN.1 data model is another conundrum. This is especially true of the ASN.1 data structures supporting GenBank data, the Bioseq-set. For example, a phy-set (phylogeny set) can include Bioseq-sets nested to several levels. Most Bioseq-sets are the usual nuc-prot (DNA and translating protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.
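As a sketch of what that nesting looks like in practice, here is a small recursive walk over Bioseq-set XML. The element names follow the NCBI DTDs as I recall them, and the sample document is invented for illustration; only the recursion itself is the point.

```python
# Walk arbitrarily nested Bioseq-set XML and print the set/sequence hierarchy.
import xml.etree.ElementTree as ET

def walk(elem, depth=0):
    """Recursively print Bioseq-set and Bioseq elements, however deeply nested."""
    if elem.tag in ("Bioseq-set", "Bioseq"):
        print("  " * depth + elem.tag)
        depth += 1
    for child in elem:
        walk(child, depth)

root = ET.fromstring("""
<Bioseq-set>
  <Bioseq-set_seq-set>
    <Seq-entry><Seq-entry_set>
      <Bioseq-set>
        <Bioseq-set_seq-set>
          <Seq-entry><Seq-entry_seq><Bioseq/></Seq-entry_seq></Seq-entry>
          <Seq-entry><Seq-entry_seq><Bioseq/></Seq-entry_seq></Seq-entry>
        </Bioseq-set_seq-set>
      </Bioseq-set>
    </Seq-entry_set></Seq-entry>
  </Bioseq-set_seq-set>
</Bioseq-set>
""")
walk(root)
```

Walking the tree is easy; knowing what each nested set means still requires the toolkit (or the ASN.1 specification) at your elbow.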
After we developed LARTS, I wrote the GB filter as a Java object. It was an interesting experience.
NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on the NCBI toolkit code for proper interpretation.
The two most glaring examples are listed below.
Sequence Locations
Determining the location of sequence features for a GenBank data report is a prime example. Here are a few simple examples:
primer_bind order(complement(1..19), 332..350)
gene complement(join(1560..2030, 3304..3321))
CDS complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA join(<1..156, 445..478, 1199..>1559)
primer_bind order(complement(1..19), 1106..1124)
For Segmented-sequences:
CDS join(162922:124..144; 162923: 647..889, 1298..1570)
CDS locations have frames, bonds have points (which can be packed), a minus strand denotes a complement (reverse order), a set of sequence locations for a sequence feature (packed-seqint) denotes a join, locations can be "order"ed or "one-of", and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.
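As a hedged illustration (not NCBI's code), here is roughly what interpreting the simplest of those location strings involves. Fuzzy ends (&lt; and &gt;), order(), points, frames, and segmented references are deliberately ignored; only plain ranges, join(), and complement() are handled.

```python
# Minimal interpretation of flat-file feature locations: extract the
# intervals, then pull and (if needed) reverse-complement the subsequence.
import re

def intervals(location: str):
    """Return (start, end, strand) triples, 1-based inclusive."""
    strand = -1 if location.startswith("complement(") else +1
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(a), int(b), strand) for a, b in pairs]

def extract(seq: str, location: str) -> str:
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    joined = "".join(seq[a - 1:b] for a, b, _ in intervals(location))
    if location.startswith("complement("):
        joined = joined.translate(comp)[::-1]   # reverse complement
    return joined

print(intervals("complement(join(1560..2030, 3304..3321))"))
print(extract("atgc" * 10, "join(1..4, 9..12)"))
```

The real cases above (fuzz, order, segmented CDS spanning multiple entries) are exactly where the toolkit's knowledge becomes indispensable.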
Sequence Format
DNA sequences are stored in a packed 2-bit or 4-bit per letter format (ncbi2na and ncbi4na). ncbi2na is used if the sequence contains no ambiguity codes; otherwise ncbi4na is the format of choice. The sequence must be unpacked to be useful, which takes a basic understanding of hexadecimal.
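A sketch of the unpacking step for ncbi2na, assuming the usual encoding as I understand it (0=A, 1=C, 2=G, 3=T, four bases per byte, high-order bits first); the byte values below are made up for illustration.

```python
# Unpack 2-bit-per-base ncbi2na data into a plain string of bases.
NCBI2NA = "ACGT"   # index 0..3 maps to A, C, G, T

def unpack_ncbi2na(data: bytes, length: int) -> str:
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):                # high-order bits come first
            bases.append(NCBI2NA[(byte >> shift) & 0b11])
    return "".join(bases[:length])                # trailing bits are padding

# 0x1B = 00 01 10 11 -> A C G T; 0xE4 = 11 10 01 00 -> T G C A
print(unpack_ncbi2na(bytes([0x1B, 0xE4]), 7))     # -> "ACGTTGC"
```

ncbi4na works the same way with four bits per letter, which leaves room for the IUPAC ambiguity codes.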
Toolkit
The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file. (The code is there, but you have to figure out how to compile it into an executable.)
We took the toolkit code and converted it to Java to produce the GenBank-style output format. It differs from the actual NCBI GenBank report in that the LARTS report lists a FASTA-formatted sequence instead of the 10-base-pairs-per-column layout that the NCBI GenBank report produces.
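To show the difference, here is a small sketch (not LARTS code) that renders the same sequence both ways: plain FASTA lines versus a GenBank-style ORIGIN block with 60 bases per line in groups of 10.

```python
# Format a sequence as FASTA and as a GenBank-style ORIGIN block.
def as_fasta(header: str, seq: str, width: int = 60) -> str:
    lines = [f">{header}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)

def as_genbank_origin(seq: str) -> str:
    lines = ["ORIGIN"]
    for i in range(0, len(seq), 60):
        chunk = seq[i:i + 60]
        groups = " ".join(chunk[j:j + 10] for j in range(0, len(chunk), 10))
        lines.append(f"{i + 1:>9} {groups}")      # position of the first base
    return "\n".join(lines) + "\n//"

demo = "acgt" * 40
print(as_fasta("demo_sequence", demo))
print(as_genbank_origin(demo))
```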
The Many Variations of LARTS
GenBankReportFilter.java is provided as an example with Stand-Alone LARTS. The LARTS Reader enables the GenBank-style report.
Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.
A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format. This data is then downloaded to a local machine via the Thick Client option. Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.
Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database. This includes sample code for the BLOB and CLOB objects.
The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.
These options provide ready access to NCBI data for your research.
The line "you can't do bioinformatics if you haven't worked in a wet lab" has been used as the basis for the "you need to know where the data comes from" argument time and time again. I actually saw this in print in a slide presentation at the Next-Generation Sequencing Data Analysis conference in Providence, RI, in September 2008.
I can sympathize with this viewpoint, but I don’t agree with it. For instance, I designed the data system, compiled the data, and did the field testing that certified a re-engined aircraft, but I can’t pilot a plane. I did do a lot of field laboratory work and it was “wet” - if snow, sleet, and rain count, along with desert dust and volcanic ash.
Knowing where the data comes from is very important, but what matters more is whether the data is actually measuring what it is supposed to measure — data validity (are your instruments correctly calibrated, and is the sampling rate sufficient?), the format of the data, the size of the data, and the sort of analysis to which the data will be subjected.
If the lab experience is so very important, a simple systems analysis is a very good tool to use. As I’ve done it, the observer/programmer/engineer would “live” in the lab for a period of time — usually two to four weeks, or until they have a good grasp of the processes involved, taking copious notes and asking lots of questions. That person may actually perform some of the work involved if desired.
This person should have some understanding of molecular biology, etc. to fully appreciate the lab experience.
This activity has the potential to illuminate bottlenecks or methods that may need modification or fine-tuning. If more than one site is involved, so much the better, as discrepancies in processes will be made obvious.
My biological wet lab experience got me a “you have excellent lab technique” and a job offer, which I declined.
Bioinformatics training also comes into question. Many courses just help the student determine which internet site to go to for information, or how to construct a FASTA-formatted sequence, or parse a BLAST output or a GenBank report. They can’t do much except offer a survey of things “bioinformatic”. Not much time is spent on information management or engineering approaches.
I jumped from engineering to bioinformatics in the early 90's. The object-oriented data model I presented apparently found an audience. I did some reading up on genetics, etc. before the interview, but most of the knowledge used to answer interview questions such as "what are the four basic building blocks of life" came from watching X-Files. Things have gotten a lot more complicated (the textbooks have gotten heavier), and keeping up with new discoveries can become quite a task.
Next week I will offer a series of “horror stories”, or some of my experiences in the bioinformatics arena.
I found an article in the December 2008 issue of Nature Methods to be of particular interest, not least because I personally know the authors.
The article, under CORRESPONDENCE on page 991, surveyed a series of papers from the 2007 issues of 20 journals. The purpose was to refute the Nature Methods editorial of March 2008, which asserted that the deposition of supporting raw microarray datasets is "routine". Data cited in the papers was compared to that currently available in public databases. They found that the rate of deposition of datasets was less than 50%; only half of the discovery data that was the basis of the articles was available to the public.
They further cited the MIAME (Minimum Information About a Microarray Experiment) standard as part of the problem, asserting that microarray data, "owing to their highly contextual nature, have a more complex metadata structure than sequence data."
The MIAME standard was forged by the MGED (Microarray Gene Expression Data) Society and published in Nature Genetics 29, 365-371 (2001). MGED also houses the MicroArray Gene Expression (MAGE) Object Model, which defines the entire environment of the experiment (e.g., organism, array design, etc.). MIAME is the standard; MAGE adheres to the MIAME standard and suggests formats for representation and submission of microarray data.
The premier microarray data repository is ArrayExpress located at http://www.ebi.ac.uk/microarray-as/ae/.
ArrayExpress is a public repository for transcriptomics data, aimed at storing MIAME-compliant data in accordance with MGED recommendations (http://www.mged.org/recommendations) and MINSEQE-compliant data for high-throughput sequencing (http://www.mged.org/minseqe/). The ArrayExpress Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository.
Other sites are GEO (Gene Expression Omnibus - http://www.ncbi.nlm.nih.gov/geo/) and CIBEX (Center for Information Biology gene EXpression database - http://cibex.nig.ac.jp/index.jsp).
MIAMExpress (a MIAME-compliant microarray data submission tool) is currently available at http://sourceforge.net/projects/miamexpress/ and is the submission tool for microarray experiments. It is downloaded to your local system and must be built (compiled) on your system in order to use it. A local installation of the MySQL database is required, as well as the Perl programming language.
The MIAME/MAGE meta-data model is described in UML (Unified Modeling Language).
They suggest mark-up languages for data submissions. They provide MAGE-ML, which is defined by an XML DTD. In addition, a spreadsheet-like tabular format (MAGE-TAB) has just been announced.
This data model is difficult to interpret. Fitting your data to this model can be a real trick. I know, I’ve tried. And I’ve got years of work with formal data specifications behind me. For the average lab tech it is almost impossible to interpret. A bioinformatics programmer with exposure to MS Word and MS Excel (which I have read are the two most important requirements to succeed in bioinformatics (!)) would be in the same boat.
I have nothing against models and standards. Standards bring order to chaos — if they are simple enough to interpret and implement.
The article goes on to call for the representation of microarray data in the GenBank format.
Just about everybody in the biosciences field is familiar with this format. More importantly, they know how to submit data that will be interpreted as GenBank data.
GenBank data is stored internally at NCBI in ASN.1. The ASN.1 format is extensively used in telecommunications and other areas. After years of working with ASN.1 and especially NCBI ASN.1, I have to say that it is ideal for the storage of sequence and other data.
ASN.1 is infinitely extensible through its recursive abilities. This is great in that it can encompass all the data for a particular data object. However, the nesting nature of the ASN.1 construct can cause one to literally pull out one’s hair.
ASN.1 doesn't gracefully translate into SQL. It is possible, but not very pretty, and the queries are ridiculously complex.
Using NCBI toolkit code to access ASN.1 data works if one knows C/C++ and has lots of experience working with suites of large complex software.
Our product (LARTS) was developed to make working with NCBI ASN.1 data a little easier and to create a new paradigm of searching NCBI ASN.1 data.
NCBI ASN.1 was distilled into a grammar that is parsed much like a programming language, or the way a sentence is parsed for English class. That grammar translates the ASN.1 into XML Schema. This XML can then be filtered for specific values or formatted for specific output, such as a GenBank-like report.
The new paradigm means that the serious user should become somewhat familiar with the NCBI ASN.1 data structures. By serious, I mean someone who wants to go beyond the currently offered output formats.
Our ncbixref link (http://www.lifeformulae.com/lartsonline/docs/ncbixref/NCBI-Seqset.html#Bioseq-set) provides a way to traverse these structures, starting with the top-level Bioseq-set.
In some instances, the ASN.1 data structure names don’t really describe the data they define. For example, the ASN.1 data structure for dbSNP is ExchangeSet (http://www.lifeformulae.com/lartsonline/docs/ncbixref/Docsum-3-0.html#ExchangeSet).
Yet Another Standard
The Genomics Standards Consortium has a suggested format for next-generation sequencing experiments called MIGS (http://gensc.org/gc_wiki/index.php/Main_Page), or Minimum Information about a Genome Sequence. Its extension is MIMS, Minimum Information about a Metagenomic Sequence. The MIGS/MIMS data models are expressed in GCDML, the Genomic Contextual Data Markup Language (http://gensc.org/gc_wiki/index.php/GCDML). GCDML is implemented using XML Schema.
Let’s hope the meta-data is kept to that “minimum”, but looking at http://www.nature.com/nbt/journal/v26/n5/box/nbt1360_BX1.html, it doesn’t seem so.
At any rate, the move toward XML Schema is a good thing and fits in well with our thinking.
Events of particular note this week –
The HSEMB Conference –
The 26th Annual Houston Conference on Biomedical Engineering Research (http://www.hsemb.org), 19-20 March 2009 at the University of Houston Hilton Hotel and Convention Center.
HSEMB has established the John Halter Award for Professional Achievement in Bioinformatics and Computational Biology. The late Dr. John Halter is the founder of LifeFormulae, LLC. http://www.lifeformulae.com/pages/about_jah_memorial.aspx is the link to our memorial to John.
Super Computing 2008
SC08 - Super Computing 2008, the International Conference for High Performance Computing, Networking, Storage and Analysis. November 15-21, Austin Convention Center, Austin, Texas. http://sc08.supercomputing.org/
And GenBank Release 168.0 –
GenBank Release 168.0 flat files require roughly 387.1 GB for the sequence files only, or 396 GB if you include the 'short directory', 'index', and *.txt files. The ASN.1 data files require approximately 338 GB.
Recent statistics for non-WGS (Whole Genome Sequence), non-CON (Contig) sequences are given below.
Release   Date       Base Pairs          Entries
167       Aug 2008   95,033,791,652      92,748,599
168       Oct 2008   97,381,682,336      96,400,790
Recent statistics for WGS sequences:
Release   Date       Base Pairs          Entries
167       Aug 2008   118,593,509,342     40,214,247
168       Oct 2008   136,085,973,423     46,108,952
During the 69 days between the close dates for GenBank Releases 167.0 and 168.0, the non-WGS/non-CON portion of GenBank grew by 2,347,890,684 base pairs and by 3,652,191 sequence records.
During that same period, 1,111,311 records were updated. An average of about 69,036 non-WGS/non-CON records were added and/or updated per day.
Between releases 167.0 and 168.0, the WGS component of GenBank grew by 17,492,464,081 base pairs and by 5,894,705 records.
The combined WGS/non-WGS single-release increase of 19.84 Gbp for Release 168.0 is the largest that GenBank has experienced to date.
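A quick arithmetic check (mine, not part of the release notes) confirms that the growth figures quoted above follow from the two tables:

```python
# Verify the per-release growth figures from the non-WGS and WGS tables.
non_wgs = {"bp": (95_033_791_652, 97_381_682_336),
           "entries": (92_748_599, 96_400_790)}
wgs     = {"bp": (118_593_509_342, 136_085_973_423),
           "entries": (40_214_247, 46_108_952)}

bp_growth     = non_wgs["bp"][1] - non_wgs["bp"][0]             # 2,347,890,684
entry_growth  = non_wgs["entries"][1] - non_wgs["entries"][0]   # 3,652,191
wgs_bp_growth = wgs["bp"][1] - wgs["bp"][0]                     # 17,492,464,081
combined_gbp  = (bp_growth + wgs_bp_growth) / 1e9               # ~19.84 Gbp
per_day       = (entry_growth + 1_111_311) / 69                 # ~69,036 records/day

print(bp_growth, entry_growth, wgs_bp_growth,
      round(combined_gbp, 2), round(per_day))
```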
That's a lot of data. It's a long, long way from the set of CDs that came out four times a year back in the late 90's. Somewhere there are drawers and drawers of old Entrez CDs! (Entrez is the engine used to search NCBI Life Sciences data.)
GenBank is considered an archive of information about sequences. The nine-digit GI number, once the unique sequence identifier, has been supplanted by the Accession Number.
Speaking of NCBI data, we now have the complete set of human (Homo sapiens) data from NCBI's dbSNP available through our LARTS product. Currently, the files are not searchable by keyword or keyword/element path. This capability should be available to you early next week.
Which brings me to a question: Is GenBank data as important today as it was, say, five years ago? If not GenBank, what NCBI data is considered critical to your current research and bioinformatics methods, and, if I might also ask, what are you doing with it?