Data Stewardship (including Archiving)
Data Stewardship: The Conducting, Supervising, and Management of Data
Next-gen sequencing promises to unload reams and reams of data on the world. Pieces of that data will prove relevant to one or another of the specific research projects in your enterprise. At the same time, your lab may produce more data through annotation or simple research. How do you handle it all?
First, you should appoint a data steward. This person must understand where the data comes from, how it is modeled, who uses which parts of it, and what results the data produces, such as forms. Most importantly, the steward must be able to verify the integrity of that data.
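One common way to verify integrity is to record a checksum for every file and recheck it later. The following is only a minimal sketch under that assumption: the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A steward might build the manifest when data first arrives, store it apart from the data, and run `changed_files` before and after each archive or restore. An empty list means every recorded file is still present and unchanged.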
Data, Data, Data
I’ve handled lots of engineering and bioinformatics data in my time…
In engineering, I had to be sure all instrumentation was calibrated correctly and production data was representative or correct. Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, and others. We focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on. This data inventory session proved extremely useful, because terabytes of data swept through the system every week.
For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data. That data was then uploaded to a local database for access by various applications. As the amount of available sequence data grew, culling the data, storing it, and archiving the initial and final data became something of a headache.
My biggest bioinformatics problem was NCBI data, as that was where we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug. Grep became my friend.
I tried downloading complete GenBank reports from the NCBI FTP site, but that took too much space. I used keywords with the Entrez eutils, but the granularity wasn’t fine enough, and I ended up with way too much data. Finally, I resorted to running the NCBI Toolkit against NCBI ASN.1 binary files.
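For readers who have not used the Entrez eutils, here is a rough sketch of how a keyword search is formed. It only builds the ESearch URL and does not contact NCBI. The helper name `esearch_url`, the database, and the field-qualified terms are illustrative assumptions, not the queries I actually ran.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, terms: dict[str, str], retmax: int = 20) -> str:
    """Build an ESearch URL from field-qualified terms joined with AND.

    `terms` maps an Entrez field name to a value, for example
    {"Organism": "Homo sapiens"} becomes "Homo sapiens[Organism]".
    """
    query = " AND ".join(f"{value}[{field}]" for field, value in terms.items())
    params = urlencode({"db": db, "term": query, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{params}"


# Hypothetical example query: one gene in one organism.
url = esearch_url("nucleotide", {"Organism": "Homo sapiens", "Gene Name": "BRCA1"})
print(url)
```

Field qualifiers narrow a search somewhat, but they remain keyword matches over whole records, so a query like this can still return far more than a project needs.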
LARTS would have made this part so much easier.
The Data Steward should also be familiar with data maintenance and storage strategies.
Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.
Bill Eaton: Data Backup and Archival Storage
Backups are usually kept for a year or so, then the storage media is reused.
Archives are kept forever. Retrievals are usually infrequent for both.
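The retention rule above can be written down as a tiny policy check. This is only a sketch: the 365-day figure is an assumption standing in for "a year or so", and the function name `reusable_on` is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "a year or so" is modeled as exactly 365 days.
BACKUP_RETENTION = timedelta(days=365)


def reusable_on(kind: str, written: date) -> Optional[date]:
    """Return the date the media may be reused, or None if it never may."""
    if kind == "archive":
        return None  # archives are kept forever
    if kind == "backup":
        return written + BACKUP_RETENTION
    raise ValueError(f"unknown media kind: {kind!r}")
```

Returning `None` for archives keeps "never reuse" distinct from any real date, so a scheduling script cannot mistake an archive for an expired backup.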
Storage Technologies
Tape: suitable for backup, not as good for archiving.
Pro: Current tape cartridge capacities are around 800 GB uncompressed.
Cost per bit is roughly the same as for hard disks.
Con: Tape hardware compression is ineffective on already-compressed data.
Tapes and tape drives wear out with use.
Software (tar, cpio, etc.) is usually required to retrieve tape contents.
Tape technology changes frequently, so formats have a short life.
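The point about already-compressed data is easy to check in software. The sketch below uses zlib as a stand-in for a tape drive's hardware compression, which is an assumption, since drives use their own algorithms. The sample data is a made-up random sequence.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Hypothetical sample: 200,000 random bases, seeded so runs are repeatable.
rng = random.Random(0)
raw = "".join(rng.choices("ACGT", k=200_000)).encode("ascii")
already_compressed = zlib.compress(raw)

first_pass = compression_ratio(raw)
second_pass = compression_ratio(already_compressed)
print(f"raw sequence text:       {first_pass:.2f}")
print(f"already-compressed data: {second_pass:.2f}")
```

The first ratio should come out well below 1.0 and the second close to 1.0. In other words, if files are gzipped before they go to tape, plan on the uncompressed cartridge capacity.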
Optical: better for archiving than backup
Pro: Capacity: dual-layer DVD 8.5 GB, dual-layer Blu-ray 50 GB.
DVD contents can be a mountable file system, so that no special software is needed for retrieval.
Unlimited reading, no media wear.
Old formats are readable in new drives.
Con: Limited number of write cycles.
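To put the tape and optical capacities quoted above in perspective, here is a small back-of-the-envelope calculation of how many pieces of media a data set would need. It is a sketch that ignores compression, file-system overhead, and redundancy. The name `media_needed` and the 10 TB example are hypothetical.

```python
import math

# Capacities in GB as quoted in the lists above (decimal gigabytes assumed).
CAPACITY_GB = {"tape": 800, "dvd": 8.5, "blu-ray": 50}


def media_needed(dataset_gb: float) -> dict[str, int]:
    """Return how many cartridges or discs are needed to hold dataset_gb."""
    return {
        name: math.ceil(dataset_gb / capacity)
        for name, capacity in CAPACITY_GB.items()
    }


# Hypothetical example: a 10 TB (10,000 GB) data set.
print(media_needed(10_000))
```

For 10 TB this works out to 13 tapes, 1,177 DVDs, or 200 Blu-ray discs, which suggests why optical media suits smaller, long-lived archives better than routine full backups.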
Hard Disks: could replace tape
Pro: Simple: Use removable hard disks as backup/archive devices.
Disk interfaces are usually supported for several years.
Con: Drives may need to be spun up every few months and contents rewritten every few years.
MAID: Massive Array of Idle Disks
Disk array in which most disks are powered down when not in active use.
Pro: The array controller manages disk health, spinning up and copying disks as needed.
The array usually appears as a file system. Some can emulate a tape drive.
Con: Expensive.
Classical: the longest-life archival formats are those known to archaeologists.
Pro: Symbols carved into a granite slab are often still readable after thousands of years.
Con: Backing up large amounts of data this way could take hundreds of years.