
Data Project Development Pointers

A recent Bioinform (www.bioinform.com) poll asked, “What are the biggest informatics challenges for next-generation sequencing data?” The poll results were as follows: 57% Functional Interpretation; 24% Data Management; 9% Assembly and Alignment; 4% Variant Calling; and 4% Storage.

As a former Data Engineer entrusted throughout my career with obscene amounts of various kinds of data, I am appalled that data management and storage ranked so low. Where’s your Functional Interpretation without the data?

I’ve worked with all sorts of data. Data that, in some instances, was obtained under adverse conditions and could not be duplicated had to be protected, more or less, with my very skin (or so I was threatened).

Next-gen sequencing is producing files of short-read data that amplify the errors inherent in first-gen sequence data. These next-gen files are being produced at a phenomenal rate, with total volumes sometimes surpassing a petabyte.

Data managers can be thankful that data storage has been developed that provides a lot of bang for the buck: 4TB drives are just about standard, and the 2GB file-size limit has been eliminated.

Having worked in the field with the first Compaq laptops and, later, a Zenith with a 40MB hard drive, I find this very heartening news.

I’ve put together a list of data pointers that anyone attempting to work with data of any kind needs to read.

The most fundamental question: what are you trying to measure or analyze?

Close on the heels of this one is: how will you acquire the data? Is there a system in place that can produce the necessary data stream? If not, is there a system that can be modified to produce the data you need? If not, what will it take, in hardware and software, to produce what you want?

This data acquisition phase can be extremely costly if you don’t have an overall idea of the complete system – acquisition, storage, and analysis.

Next question – How much data are we talking about? Is it limited to a file, a system, or a cluster of devices?

Where are we going to store this data? Do we have the storage equipment at hand? If we do have the equipment, can we add on what we need without reinventing the wheel?

The reader will probably instantly think of the “cloud.” However, as of late (i.e., Amazon’s EC2 cloud outage), tech blogs are stating that a cloud hack is just a thought away (http://tech.blorge.com/2011/04/28/data-security-in-the-cloud-sucks-as-witness-sony-psn-hack/).

Another question: will the data be stored in its raw format, or will it need massaging? Raw data vs. manipulated or converted data (i.e., binary converted to engineering units or text) can easily quadruple your storage needs and costs.
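To make that concrete, here’s a minimal sketch of the conversion step in Python. The record layout, file name, and calibration constants are all hypothetical – stand-ins for whatever your instrument actually produces:

    import struct

    # Hypothetical record layout: a 4-byte sample counter followed by
    # two 16-bit raw ADC counts, all little-endian (8 bytes total).
    RECORD_FORMAT = "<IHH"
    RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

    # Assumed calibration: raw counts -> engineering units (volts).
    SCALE = 5.0 / 65535.0   # full-scale range over ADC resolution
    OFFSET = -2.5           # sensor zero point

    def convert_record(raw):
        """Unpack one raw record and convert its counts to volts."""
        counter, ch1, ch2 = struct.unpack(RECORD_FORMAT, raw)
        return counter, ch1 * SCALE + OFFSET, ch2 * SCALE + OFFSET

    with open("run_0042.raw", "rb") as f:   # hypothetical raw capture file
        while chunk := f.read(RECORD_SIZE):
            if len(chunk) == RECORD_SIZE:
                print(convert_record(chunk))

Note that each 8-byte binary record becomes 30-odd characters once printed as text – which is exactly where that multiplication of storage needs comes from.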

Will data from various hardware sources need integration into the data stream? How will this integration occur? Will additional software be necessary? Is a data model required? In some instances, more than one data model may be necessary. Is a database reflecting these models needed? Who will develop the data model and administer the database?
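To show what even a minimal data model buys you, here’s a sketch of a two-table schema in SQLite (part of Python’s standard library). The table and column names are illustrative assumptions, not a prescription:

    import sqlite3

    conn = sqlite3.connect("acquisition.db")  # hypothetical database file
    conn.executescript("""
        -- One row per acquisition run: where the data came from.
        CREATE TABLE IF NOT EXISTS run (
            run_id      INTEGER PRIMARY KEY,
            instrument  TEXT NOT NULL,
            started_utc TEXT NOT NULL
        );
        -- One row per converted sample, tied back to its run.
        CREATE TABLE IF NOT EXISTS sample (
            run_id  INTEGER NOT NULL REFERENCES run(run_id),
            counter INTEGER NOT NULL,
            volts   REAL NOT NULL
        );
    """)
    conn.commit()

The point of the reference between the tables is traceability: every converted sample can be tied back to the run, instrument, and time that produced it.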

And, while we’re talking about it, how easy or difficult would it be to take archived data and have it available for processing – a few minutes, a day, a week?

If it’s stored in binary or another basic (raw) form, how long will it take to pull that data from the archive, convert it, and have it available for analysis?

How are you going to certify that the raw data is correct and that the conversion utility produced a true conversion of that raw data?
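One common approach – sketched below, assuming your record format unpacks and repacks losslessly – is to fingerprint the raw file with a cryptographic hash and round-trip records through the converter:

    import hashlib
    import struct

    def sha256_of(path):
        """Fingerprint a file so an archived copy can be re-verified later."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    # Round-trip check for the hypothetical record layout used above:
    # unpacking and repacking a record must reproduce the original bytes.
    RECORD_FORMAT = "<IHH"

    def round_trips(raw):
        fields = struct.unpack(RECORD_FORMAT, raw)
        return struct.pack(RECORD_FORMAT, *fields) == raw

Record the digest alongside the archive. If the restored file hashes to the same value, the raw bytes survived intact; the round-trip check adds some confidence that the unpacking step isn’t silently altering values, though it says nothing about whether the calibration math itself is right.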

Just the term “archived data” has its own implications. What do you mean by “archived” vs. “active” data? What raises the flag that says this active data can now be archived? Are there several phases in archiving that data? How long will it take?
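In my experience, that flag should be an explicit written policy, not someone’s memory. Here’s a minimal sketch of an age-based rule in Python – the 90-day threshold and directory layout are assumptions you’d replace with your own policy:

    import time
    from pathlib import Path

    ACTIVE_DIR = Path("data/active")   # hypothetical directory layout
    ARCHIVE_AGE_DAYS = 90              # assumed policy threshold

    def ready_to_archive(active_dir=ACTIVE_DIR, max_age_days=ARCHIVE_AGE_DAYS):
        """Yield files whose last modification is older than the policy allows."""
        cutoff = time.time() - max_age_days * 86400
        for path in active_dir.rglob("*"):
            if path.is_file() and path.stat().st_mtime < cutoff:
                yield path

    for candidate in ready_to_archive():
        print("archive candidate:", candidate)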

Some of the tests I’ve been involved with required acquiring live data in the field and performing spot analysis of the data as it was acquired. This live data was subsequently saved to digital tape or hard drive for further detailed analysis.

A three-and-a-half-week field test turned into three to four months of analysis at home base. The archived data had to perfectly mirror the live data and the data analysis obtained in the field.

Could you do this with your data? Rerunning a field test is an expensive proposition – many thousands of dollars could be involved.

Speaking of analysis – who will be analyzing the data? What hardware and software do they have or need? Will further software development be in the picture, along with hardware upgrades?

Are different platforms involved? Is the data representation on each platform consistent?

Little-endian to big-endian conversion was a major problem at one time, followed by 32- vs. 64-bit system representations. Ask the end users questions and don’t be blindsided by system differences.
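The cure is to never write platform-native binary across machine boundaries – pin the byte order and field widths explicitly. A short sketch using Python’s struct module (the field values are arbitrary, and the layout is the same hypothetical one as above):

    import struct

    fields = (7, 1024, 2048)

    # "<" pins little-endian, ">" pins big-endian; "=" or no prefix would
    # silently use whatever byte order the writing machine happens to have.
    little = struct.pack("<IHH", *fields)
    big = struct.pack(">IHH", *fields)

    assert little != big                             # same values, different bytes
    assert struct.unpack("<IHH", little) == fields   # each order round-trips
    assert struct.unpack(">IHH", big) == fields

With an explicit “<” or “>” prefix, struct also uses standard field sizes, so an “I” is 4 bytes whether the machine is 32- or 64-bit – which covers the second half of the problem.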

Another analysis question concerns subsets of data. Can you subset your data store? (I hope you’ve developed data models to support the effort.)
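With a schema like the one sketched earlier, subsetting becomes a parameterized query instead of a custom file-parsing job. The run ID and voltage cutoff below are, again, hypothetical:

    import sqlite3

    conn = sqlite3.connect("acquisition.db")  # assumes the schema sketched above
    # Pull only the samples for one run that exceed a threshold.
    rows = conn.execute(
        "SELECT counter, volts FROM sample WHERE run_id = ? AND volts > ?",
        (42, 1.5),   # hypothetical run ID and cutoff
    ).fetchall()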

A final question concerns manpower and experience. Do you have staff with the experience to support the endeavor? Saying you know SQL because you read a text defining SQL isn’t going to cut it.

I can’t stress enough how important the proper, experienced staff can be. The hardest staff position to fill is that of project manager. A really good project manager should come equipped with a CV replete with incremental project management experience. You will probably have to pay through the nose for a good one, but it will be worth it in the end.

If I had to choose between a person with a biology background and little to no programming experience versus one with a background in computer science, mathematics, or engineering, I’d choose the latter. They can pick up the biology. Of course, this depends on the person under consideration.

The first question I ask myself is: could this person help me get a plane off the ground? Can they handle stress? Do they think on their feet? How organized are they? How do they do in ill-defined environments? Do they fit in? Will their personality get in the way?

In any case, look beyond that paper resume and the list of provided references. You don’t want someone whose only experience consists of “Perl Scripts Done in a Panic”.

There is a lot to consider in the development of a system that turns on a piece of data. Ask questions. No matter how naive they may sound, I guarantee you will save time, and time means money.

For a little humor regarding software development check out – http://davidlongstreet.wordpress.com/category/software-development/humor/.

You may need it.
