LifeFormulae Blog » Archive of 'Mar, 2009'

Computer System Configurations

The most complex systems I’ve configured were the airborne data acquisition and ground support systems.  However, not many people have to or want to do anything that large or complex.  Some labs will need info from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today’s data without a large cash outlay and that can be expanded at minimum cost to handle the data of tomorrow.

This week’s guest blogger, Bill Eaton, provides some guidelines for the configuration of a Database Server, a Web Server, and a Compute Node, the three most requested configurations.

(Bill Eaton)

General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware

  • A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
    • Linux:  for most kernels, programs are limited to 3 GB.  Physical memory can usually exceed 4 GB.
    • Windows:  the stock settings limit a program to 2 GB, and physical memory to 4 GB.  The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
    • Other operating systems usually have a 2 or 3 GB program memory limit.
  • A 64-bit operating system removes these limits.  It also enables some additional CPU registers and instructions that may improve performance.  Most will also run older 32-bit programs.
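
As a quick check of which side of this divide a given program falls on, the sketch below reports the pointer width of the running Python interpreter and the theoretical address space for a given width. The function names are made up for this illustration, and the result describes the interpreter process, which can be 32-bit even on a 64-bit operating system.

```python
import struct
import sys


def pointer_bits() -> int:
    """Return the pointer width, in bits, of the running interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


def address_space_gib(bits: int) -> float:
    """Theoretical address space for a pointer width, in GiB."""
    return 2 ** bits / 2 ** 30


if __name__ == "__main__":
    print(f"This interpreter is a {pointer_bits()}-bit process")
    print(f"32-bit address space: {address_space_gib(32):.0f} GiB")
    # sys.maxsize is larger than 2**32 only in a 64-bit interpreter.
    print(f"64-bit interpreter: {sys.maxsize > 2 ** 32}")
```

The 4 GB ceiling quoted above is just 2^32 bytes; the 2 GB and 3 GB per-program limits come from the operating system reserving part of that range for itself.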

Database Server:
Biological databases are often large, 100 GB or more, and may be too large to fit on a single physical disk drive.  A database system needs fast disk storage and a large memory to cache frequently-used data.  These systems tend to be I/O bound.

Disk storage:

  • Direct-attached storage:  disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
  • Network-attached storage:  disk array connected to one or more hosts by a standard network.  These may appear as network file systems using NFS, CIFS, or similar, or physical disks using iSCSI.
  • SAN (storage area network):  multiple disk units sharing a network dedicated to disk I/O; this can include the cases above.  Fibre Channel is usually used for this.
  • Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.

Databases:

  • Storage overhead:  data repositories may require several times the amount of disk space required by the raw data.  Adding an index to a table can double its size.  A test using a simple mostly numeric table with one index gave these overheads for some common databases.
    • MySQL using MyISAM 2.81
    • MySQL using InnoDB 3.28
    • Apache Derby       5.88
    • PostgreSQL         7.02
  • Data Integrity support:  The server and disk system should handle failures and power loss as cleanly as possible.  A UPS with clean shutdown support is recommended.
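
For rough capacity planning, the overhead figures above can be turned into a disk-space estimate. The sketch below assumes each figure is the ratio of disk space used to raw data size, which the test description suggests but does not state outright. The function and dictionary names are made up for this illustration, and real overheads will vary with schema and indexing.

```python
# Overhead figures quoted above, read here as
# (disk space used) / (raw data size). That reading is an assumption.
OVERHEAD = {
    "MySQL (MyISAM)": 2.81,
    "MySQL (InnoDB)": 3.28,
    "Apache Derby": 5.88,
    "PostgreSQL": 7.02,
}


def estimated_disk_gb(raw_gb: float, overhead: float) -> float:
    """Estimate on-disk size from raw data size and an overhead factor."""
    return raw_gb * overhead


def plan(raw_gb: float) -> dict:
    """Return an estimated on-disk size in GB for each database listed."""
    return {name: estimated_disk_gb(raw_gb, factor)
            for name, factor in OVERHEAD.items()}


if __name__ == "__main__":
    for name, size in plan(100).items():
        print(f"{name:16s} {size:7.1f} GB")
```

Under that assumption, 100 GB of raw data would need a few hundred GB of disk in every case, which is worth knowing before sizing the array.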

Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.

Web Service Software Considerations:

  • PHP:  Thread support still has problems.  PHP applications running on a Windows system under either Apache httpd or IIS may encounter these.  We have seen a case where WordPress gave error messages when run on Windows under both IIS and Apache httpd, but worked without problems under Apache httpd on Linux.  IIS FastCGI made the problem worse.  PHP acceleration systems may be needed to support large user bases.
  • Perl:  similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
  • Java-based containers (Apache Tomcat, JBoss, GlassFish, etc.):  these run on almost anything without problems, and usually scale quite well.

Compute nodes:
Requirements depend upon the expected usage.  Common biological applications tend to be memory-intensive.  A high-bandwidth network between the nodes is recommended, especially for large clusters.  Network attached storage is often used to provide a shared file system visible to all the nodes.

  • Classical “Beowulf” cluster:  used for parallel tasks that require frequent communication between nodes.  These usually use the MPI communication model, and often have a communication network tuned for this use such as Myrinet.  One master “head” node controls all the others, and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
  • Farm:  used where little inter-node communication is needed.  Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.
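
The farm model suits work that splits into independent tasks. The sketch below is a single-machine analogy, not a cluster implementation: a thread pool stands in for the nodes, and each task (GC content of one sequence, chosen only as an example) runs without talking to any other task. The function names are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in one sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_farm(sequences, workers: int = 4) -> list:
    """Process every sequence independently; results keep input order."""
    # Each worker stands in for one farm node. No task needs data
    # from any other task, so no inter-node communication is modeled.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


if __name__ == "__main__":
    print(run_farm(["GGCC", "ATAT", "ATGC"]))
```

A job that needed partial results exchanged between workers at every step would be a better fit for the Beowulf/MPI model described above.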

The most important thing is to have a plan from the beginning that addresses all the system’s needs for storage today and is scalable for tomorrow’s unknowns.

Data Stewardship (including Archiving)

Data Stewardship -
The Conducting, Supervising, and Management of Data

Next-gen sequencing promises to unload reams and reams of data on the world.  Pieces of that data will prove relevant to one or another of the specific research projects in your enterprise.  At the same time, your lab may produce more data by annotation or simple research.  How do you handle it all?

First, you should appoint a data steward.  This person must understand where the data comes from, how it is modeled, who uses which parts of it, and what results it produces, such as forms.  Most importantly, the steward must be able to verify the integrity of that data.
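
One common way to verify integrity is to record a checksum when a file is received and compare it again later. The sketch below uses SHA-256 from Python's standard library. The function names and the idea of keeping the expected digest alongside the data are assumptions for illustration, not a description of any particular lab's procedure.

```python
import hashlib


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat for very large files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex: str) -> bool:
    """True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_hex.lower()
```

A checksum only shows that the bytes are unchanged; whether the data was correct in the first place still depends on the steward knowing its source.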

Data, Data, Data

I’ve handled lots of engineering and bioinformatics data in my time…

In engineering, I had to be sure all instrumentation was calibrated correctly and production data was representative or correct.  Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, etc., focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on.  This data inventory session proved to be extremely useful, as terabytes of data swept through the system on a weekly basis.

For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data.  That data was then uploaded to a local database for access by various applications.  As the amount of available sequence data grew, culling, storing, and archiving the initial and final data became something of a headache.

My biggest bioinformatics problem was NCBI data, as that was how we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug. Grep became my friend.

I tried downloading complete GenBank reports from the NCBI FTP site, but that took too much space.  I used keywords with the Entrez eutils, but the granularity wasn’t fine enough, and I ended up with way too much data.  Finally, I resorted to running the NCBI Toolkit against NCBI ASN.1 binary files.
LARTS would have made this part so much easier. 
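
For readers who have not used the Entrez eutils, a keyword search is an HTTP request to the `esearch` endpoint with `db`, `term`, and `retmax` parameters. The sketch below only builds such a URL and makes no network call. The query term is a hypothetical example rather than one of the searches described above, and the helper name is made up for this illustration.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL; the caller decides how to fetch it."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: field tags narrow a plain keyword search.
    print(esearch_url("nucleotide", "Homo sapiens[Organism] AND BRCA1[Gene Name]"))
```

Field tags in the term are the main lever for narrowing results, though as noted above even a narrowed search can return far more records than are wanted.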

The Data Steward should also be familiar with data maintenance and storage strategies.

Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.

Bill Eaton: Data Backup and Archival Storage

  Backups are usually kept for a year or so, then the storage media are reused.
  Archives are kept forever.  Retrievals are usually infrequent for both.

Storage Technologies

Tape:  suitable for backup, not as good for archiving.

Pro:  Current tape cartridge capacities are around 800 GB uncompressed.
      Cost per bit is roughly the same as for hard disks.
Con:  Tape hardware compression is ineffective on already-compressed data.
      Tapes and tape drives wear out with use.
      Software is usually required to retrieve tape contents (tar, cpio, etc.).
      Tape technology changes frequently, formats have a short life.
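
The point about already-compressed data can be seen with any general-purpose compressor. The sketch below uses zlib in software as a stand-in for a tape drive's hardware compression, which is an analogy only, and compresses a pseudo-random DNA-like sequence once and then a second time. The function names and the sample data are made up for this illustration.

```python
import random
import zlib


def sample_sequence(n: int = 200_000, seed: int = 0) -> bytes:
    """Pseudo-random sequence over A, C, G, T (illustrative data only)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n)).encode("ascii")


def compression_ratios(data: bytes):
    """Return (size after 1st pass / raw size, size after 2nd pass / 1st)."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)  # compress the already-compressed bytes
    return len(once) / len(data), len(twice) / len(once)


if __name__ == "__main__":
    first, second = compression_ratios(sample_sequence())
    print(f"first pass:  {first:.2f} of original size")
    print(f"second pass: {second:.2f} of first-pass size")
```

The first pass shrinks the data substantially; the second pass gains essentially nothing. That is why a drive's quoted "compressed" capacity should not be counted on for files that are already gzipped.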

Optical:  better for archiving than backup

Pro:  Capacity: dual-layer DVD 8.5 GB, dual-layer Blu-ray 50 GB.
      DVD contents can be a mountable file system, so that no special software is needed for retrieval.
      Unlimited reading, no media wear.
      Old formats are readable in new drives.
Con:  Limited number of write cycles.
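
The capacities quoted for tape and optical media translate directly into a media count for a given data set. The sketch below is simple ceiling division using those figures. The 1000 GB data set size and the function name are made-up examples, and it ignores file system overhead and any compression.

```python
import math

# Capacities in GB as quoted in the tape and optical sections.
CAPACITY_GB = {
    "tape cartridge": 800,
    "DVD": 8.5,
    "Blu-ray": 50,
}


def media_needed(dataset_gb: float, capacity_gb: float) -> int:
    """Number of whole media units needed to hold the data set."""
    return math.ceil(dataset_gb / capacity_gb)


if __name__ == "__main__":
    dataset_gb = 1000  # hypothetical data set size
    for name, cap in CAPACITY_GB.items():
        print(f"{name:15s} {media_needed(dataset_gb, cap):4d}")
```

For a terabyte-scale archive the difference is two cartridges against well over a hundred DVDs, which matters as much for handling and cataloguing as for cost.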
     

Hard Disks:  could replace tape

Pro:  Simple:  use removable hard disks as backup/archive devices.
      Disk interfaces are usually supported for several years.
Con:  Drives may need to be spun up every few months and contents rewritten every few years.

MAID:  Massive Array of Idle Disks
        Disk array in which most disks are powered down when
        not in active use.

Pro: The array controller manages disk health,
        spinning up and copying disks as needed.
        The array usually appears as a file system. Some can emulate a tape drive.

Con: Expensive.

Classical:  the longest-life archival formats are those known
      to archaeologists. 

Pro:  Symbols carved into a granite slab are
      often still readable after thousands of years.
Con: Backing up large amounts of data this way could take hundreds of years.

 
