LifeFormulae Blog » Posts in 'Educational' category

The End of Bioinformatics?!

I read with some interest the announcement of Wolfram Alpha.  Wolfram Alpha intends to be the be-all and end-all of data mining systems and, some say, will make bioinformatics obsolete.

Wolfram's basis is a formal Mathematica representation.  Its inference engine is a large number of hand-written scripts that access data that has been accumulated and curated.  The developers stress that the system is not Artificial Intelligence and is not aiming to be.  For instance, a sample query,

“List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes ‘muscle’”

could currently be executed with SQL, provided the underlying data is there. 
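
To make that concrete, here is a rough sketch of such a query over a hypothetical warehouse schema, issued through JDBC.  The table and column names (genes, positive_selection, go_annotations, omim_entries) and the connection details are invented for illustration only.

    import java.sql.*;

    public class MuscleSelectionQuery {
        // Hypothetical schema; none of these table or column names refer to a real database.
        private static final String QUERY =
            "SELECT DISTINCT g.symbol " +
            "FROM genes g " +
            "JOIN positive_selection ps ON ps.gene_id = g.id AND ps.significant = 1 " +
            "LEFT JOIN go_annotations go_a ON go_a.gene_id = g.id " +
            "LEFT JOIN omim_entries om ON om.gene_id = g.id " +
            "WHERE go_a.category LIKE '%muscle%' OR om.entry_text LIKE '%muscle%'";

        public static void main(String[] args) throws SQLException {
            // Placeholder connection URL and credentials.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/genome", "user", "password");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(QUERY)) {
                while (rs.next()) {
                    System.out.println(rs.getString("symbol"));
                }
            }
        }
    }

The point is not this particular schema, but that the question reduces to joins and filters once the underlying data has been curated into tables.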

Wolfram won't replace bioinformatics.  What it will do is make it easier for a neophyte to get answers, because questions can be asked in a simpler format.

I would guess Wolfram uses one or more of these scripts to address a specific data set, in conjunction with a natural language parser.  These scripts would move the data into a common model that could then be rendered on a web page.

But why not AI?  Why not replace all those hand-written scripts with a real inference engine?

I rode the first AI wave.  I was a member of the first group of 25 engineers selected to be a part of the McAir AI Initiative at McDonnell Aircraft Company ("There is AI in McAir").  In all, 100 engineers were chosen from engineering departments to attend courses leading to a Certificate in Artificial Intelligence from Washington University in St. Louis.

One of the neat things about the course was the purchase of at least 30 workstations (maybe as many as 60) from a young company called Sun; these were loaned to Washington University for the duration of the course.  Afterwards, we got a few Symbolics machines for our CADD project.

Other than Lisp and Prolog, the software we used was called KEE (Knowledge Engineering Environment).  There was also a DEC (Digital Equipment Corporation) language called OPS5.

The course was quite fast-paced but very extensive.  We had the best AI consultants available at the time lecture and give assignments in epistemology, interviewing techniques, and so on. I had a whole stack of books.

The only problem was that no money had been budgeted (or so I was told) for AI development in the departments the engineers returned to, eager to AI everything.  A lot of people left.

Anyway, my group of three developed a “Battle Damage Repair” system that basically “patched up” the composite wing skins of combat aircraft.   Given the size and location of the damage, the system would certify whether the aircraft would be able to return to combat, and would output the size and substance of the patch if the damage wasn’t that bad.

One interesting tidbit:  We wanted to present our system at a conference in San Antonio and had a picture of a battle-damaged F-15 we wanted to use.  We were told that the picture was classified and, as such, we couldn't use it.  Well, about that same time, a glossy McAir brochure featuring our system and that photo was distributed to thousands of people at the AAAI (American Association for Artificial Intelligence) conference.

Another system I developed dealt with engineering schematics.  These schematics were layered.  Some layers and circuits were classified.   Still another system scheduled aircraft for painting and yet another charted a path for aircraft through hostile territory, activating electronic counter measures as necessary.

I guess the most sophisticated system I worked on was with the B-2 program.  The B-2 skin is a composite material.  This material has to be removed from a freezer, molded into its final shape, and cooked in a huge autoclave before it completely thaws.

We had to schedule materials (and account for the behavior of that material under various conditions) as well as people and equipment.  The purpose was to avoid "bottlenecks" in people and equipment.  I was exposed to the Texas Instruments Explorer and Smalltalk-80 on an Apple.  I've been in love with Smalltalk ever since.

The system was developed, but it was never used.  The problem was that we had to rank workers by expertise.  Those were union workers, and that wasn't allowed.

It was a nice system that integrated a lot of subsystems and worked well.  Our RFP (Request for Proposals) went out to people like Carnegie Mellon.  We had certain performance and date requirements that we wanted to see in the final system.  We were told that the benchmarks would be difficult, if not impossible, to attain.  Well, we did it, on our own without their help.

We also had a neural net solution that inspected completed composite parts. The parts were submerged in water and bombarded with sound waves.  The echoes were used by the system to determine part quality.

AI promised the world, and then it couldn’t really deliver.  So it kind of went to the back burner.

One problem with any be-all and end-all:  it will only be as good as your model.  It will only be as good as the developers' understanding of how the parts behave and how they interact with the whole.  Currently, that is a moving target, changing day to day.  Good luck.

Links -

Will Wolfram Make Bioinformatics Obsolete? - http://johnhawks.net/weblog/reviews/genomics/bioinformatics/wolfram-alpha-bioinformatics-2009.html

Computer System Configurations

The most complex systems I've configured were the airborne data acquisition and ground support systems.  However, not many people have to, or want to, do anything that large or complex.  Some labs will need data from thermocouples, strain gauges, or other instrumentation, but most of you will be satisfied with a well-configured system that can handle today's data without a large cash outlay and can be expanded at minimum cost to handle the data of tomorrow.

This week’s guest blogger, Bill Eaton, provides some guidelines for  the configuration  of a Database Server,  a Web Server, and a Compute Node, the three most requested configurations.

(Bill Eaton)

General Considerations
Choice of 32-bit or 64-bit Operating System on standard PC hardware

  • A 32-bit operating system limits the maximum memory usage of a program to 4 GB or less, and may limit maximum physical memory.
    • Linux:  for most kernels, programs are limited to 3 GB.  Physical memory can usually exceed 4 GB.
    • Windows:  the stock settings limit a program to 2 GB and physical memory to 4 GB.  The server versions have a /3GB boot flag to allow 3 GB programs and a /PAE flag to enable more than 4 GB of physical memory.
    • Other operating systems usually have a 2 or 3 GB program memory limit.
  • A 64-bit operating system removes these limits.  It also enables some additional CPU registers and instructions that may improve performance.  Most will allow running older 32-bit program files.  (A quick way to check what an installed Java runtime sees is sketched below.)
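
A quick way to see which of these limits applies on a given host is to ask the runtime itself.  Below is a minimal Java sketch; note that the sun.arch.data.model property is specific to Sun/HotSpot JVMs, so treat it as a hint rather than a guarantee.

    public class MemoryLimits {
        public static void main(String[] args) {
            // Architecture reported by the OS and (on Sun/HotSpot JVMs) the JVM's pointer width
            System.out.println("os.arch: " + System.getProperty("os.arch"));
            System.out.println("JVM data model: "
                    + System.getProperty("sun.arch.data.model", "unknown") + "-bit");

            // Maximum heap this JVM will try to use; a 32-bit JVM cannot
            // approach the limits a 64-bit JVM allows.
            long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("Max heap: " + maxHeapMb + " MB");
        }
    }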

Database Server:
Biological databases are often large, 100 GB or more, often too large to fit on a single physical disk drive. A database system needs fast disk storage and a large memory to cache frequently-used data.  These systems tend to be I/O bound.

Disk storage:

  • Direct-attached storage:  disk array that appears as one or more physical disk drives, usually connected using a standard disk interface such as SCSI.
  • Network-attached storage:  disk array connected to one or more hosts by a standard network.  These may appear as network file systems using NFS, CIFS, or similar, or physical disks using iSCSI.
  • SAN:  includes above cases, multiple disk units sharing a network dedicated to disk I/O.  Fibre Channel is usually used for this.
  • Disk arrays for large databases need high I/O bandwidth, and must properly handle flush-to-disk requests.

Databases:

  • Storage overhead:  data repositories may require several times the amount of disk space required by the raw data.  Adding an index to a table can double its size.  A test using a simple, mostly numeric table with one index gave these overhead factors (on-disk size relative to the raw data) for some common databases.  (One way to reproduce this kind of measurement is sketched after this list.)
    • MySQL using MyISAM  2.81
    • MySQL using InnoDB  3.28
    • Apache Derby        5.88
    • PostgreSQL          7.02
  • Data Integrity support:  The server and disk system should handle failures and power loss as cleanly as possible.  A UPS with clean shutdown support is recommended.
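
The overhead factors above can be reproduced by comparing the size of the raw input against what the database reports for the loaded table.  A minimal JDBC sketch for MySQL follows; the schema name, table name, credentials, and raw size are placeholders, and the MySQL Connector/J driver is assumed to be on the classpath.

    import java.sql.*;

    public class TableOverhead {
        public static void main(String[] args) throws Exception {
            long rawBytes = 100_000_000L;  // size of the raw input data, measured separately

            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/information_schema", "user", "password");
                 PreparedStatement ps = c.prepareStatement(
                     "SELECT data_length + index_length FROM tables " +
                     "WHERE table_schema = ? AND table_name = ?")) {
                ps.setString(1, "test");          // placeholder schema
                ps.setString(2, "measurements");  // placeholder table
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        long onDisk = rs.getLong(1);
                        System.out.printf("Overhead factor: %.2f%n", (double) onDisk / rawBytes);
                    }
                }
            }
        }
    }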

Web Server and middleware hosts:
A web server needs high network bandwidth, and should have a large memory to cache frequently-used content.

Web Service Software Considerations:

  • PHP:  Thread support still has problems.  PHP applications running on a Windows system under either Apache httpd or IIS may encounter these.  We have seen a case where WordPress running under either IIS or Apache httpd on Windows gave error messages, but worked without problems under Apache httpd on Linux.  IIS FastCGI made the problem worse.  PHP acceleration systems may be needed to support large user bases.
  • Perl:  similar thread support and scaling issues may be present. For large user bases, use of mod_perl or FastCGI can help.
  • Java-based containers:  (Apache Tomcat, JBoss, GlassFish, etc) These run on almost anything without problems, and usually scale quite well.

Compute nodes:
Requirements depend upon the expected usage.  Common biological applications tend to be memory-intensive.  A high-bandwidth network between the nodes is recommended, especially for large clusters.  Network attached storage is often used to provide a shared file system visible to all the nodes.

  • Classical “Beowulf” cluster:  used for parallel tasks that require frequent communication between nodes.  These usually use the MPI communication model, and often have a communication network tuned for this use such as Myrinet.  One master “head” node controls all the others, and is usually the only one connected to the outside world. The cluster may have a private internal Ethernet network as well.
  • Farm:  used where little inter-node communication is needed.  Nodes usually just attach to a conventional Ethernet network, and may be visible to the outside world.

The most important thing is to have a plan from the beginning that addresses all the system's needs for storage today and is scalable for tomorrow's unknowns.

Data Stewardship (including Archiving)

Data Stewardship -
The Conducting, Supervising, and Management of Data

Next-gen sequencing promises to unload reams and reams of data on the world.  Pieces of that data will prove relevant to one or another of the research projects in your enterprise.  At the same time, your lab may produce more data through annotation or its own research.  How do you handle it all?

First, you should appoint a data steward.  This person must understand where the data comes from, how it is modeled, who uses what parts of it, and any results this data may produce, such as forms, etc. Most importantly, they must be able to verify the integrity of that data.

Data, Data, Data

I’ve handled lots of engineering and bioinformatics data in my time…

In engineering, I had to be sure all instrumentation was calibrated correctly and that production data was representative and correct.  Every morning at 7 a.m., I held a meeting with data analysts, system administrators, database representatives, etc., focused on who was doing what to which data, what data could be archived, what data should be recovered from archive, and so on.  This data inventory session proved to be extremely useful, as terabytes of data swept through the system on a weekly basis.

For bioinformatics, I had to locate and merge data from disparate sources into one whole and run that result against several analysis programs to isolate the relevant data.  That data was then uploaded to a local database for access by various applications.  As the amount of available sequence data grew, culling the data, storage of this data, and archiving of the initial and final data became something of a headache.

My biggest bioinformatics problem was NCBI data, as that was where we got most of our data.
I spent weeks/months/years plowing through the NCBI toolkit, mostly in debug.  Grep became my friend.

I tried downloading complete GenBank reports from the NCBI ftp website but that took too much space.  I used keywords with the Entrez eutils, but the granularity wasn’t fine enough, and I ended up with way too much data.  Finally, I resorted to the NCBI Toolkit on NCBI ASN.1 binary files.
LARTS would have made this part so much easier. 

The Data Steward should also be familiar with data maintenance and storage strategies.

Our guest blogger, Bill Eaton, explains the difference between backup and archiving of data, and lists the pros and cons of various storage technologies.

Bill Eaton: Data Backup and Archival Storage

  Backups are usually kept for a year or so, then the storage media is reused.
  Archives are kept forever.  Retrievals are usually infrequent for both.

Storage Technologies

Tape:  suitable for backup, not as good for archiving.

Pro:  Current tape cartridge capacities are around 800 GB uncompressed.
      Cost per bit is roughly the same as for hard disks.

Con:  Tape hardware compression is ineffective on already-compressed data.
      Tapes and tape drives wear out with use.
      Software is usually required to retrieve tape contents (tar, cpio, etc).
      Tape technology changes frequently; formats have a short life.

Optical:  better for archiving than backup.

Pro:  DVD 8.5 GB, Blu-ray 50 GB.
      DVD contents can be a mountable file system, so that no special software is needed for retrieval.
      Unlimited reading, no media wear.
      Old formats are readable in new drives.

Con:  Limited number of write cycles.

Hard Disks:  could replace tape.

Pro:  Simple:  use removable hard disks as backup/archive devices.
      Disk interfaces are usually supported for several years.

Con:  Drives may need to be spun up every few months and contents rewritten every few years.

MAID (Massive Array of Idle Disks):  a disk array in which most disks are powered down when not in active use.

Pro:  The array controller manages disk health, spinning up and copying disks as needed.
      The array usually appears as a file system.  Some can emulate a tape drive.

Con:  Expensive.

Classical:  the longest-life archival formats are those known to archaeologists.

Pro:  Symbols carved into a granite slab are often still readable after thousands of years.

Con:  Backing up large amounts of data this way could take hundreds of years.

 

Women in Flight Test

I spent some 10-plus years in engineering.  As a woman in engineering, it was daunting.  As a woman in Avionics Flight Test, it was even more so.

I was working as a systems programmer at McDonnell-Douglas (now Boeing) in St. Louis, Missouri.  Our project consisted of 6 compilers that supported a completely integrated database system for hospitals.  A woman on our team was married to a section chief who worked at McDonnell Aircraft.  McAir (as it was commonly called) manufactured the F-4, F-18, and F-15 aircraft.  Since our project was in trouble, she passed along word from her husband that Flight Test was looking for someone who could develop database systems and do other programming for the Ground Support Systems (GSS) unit.

At my interview, I was told, “We manufacture high-tech war machines that might kill innocent women and children.  So, I don’t want any wimps or pacifists working for me.”

I wondered at the time if my interviewer was wearing cammo underwear.

I was accepted.  At the time I was one of the first (if not the first) professional women working at McAir Flight Test.

My desk was on the fourth floor on the west end of a hangar.  Flight Test had offices and labs on the east and west ends of the hangar.  Since our unit was the first set of desks one encountered when entering our area, I was assumed to be a secretary and asked all sorts of questions.  The fix was to turn my desk around so everybody was looking toward the interior and my desk was facing the window.  I got an up-close look at the planes taking off and landing, as they were just clearing the building.  Takeoffs were bad because the fumes from the jet fuel were overwhelming.  I would have to ask the person I was talking to on the phone to hold because I couldn't hear them over the noise.

The hush houses were less than 100 yards away.  You couldn't hear the noise (that's what hush houses do), but the thud-thud-thud vibration of the engines became a bit much at times.  Oh yeah, our hangar sat right on top of the Flight Test fuel dump.

The first project was to develop a database system to track the equipment used by Flight Test.  The original system consisted of a 6-ft. by 20-ft. rack of index cards in pull-down trays. Each piece of equipment had a card which stated what it was, where it was, etc.

Every time a Flight Test program was scheduled, all the parts for that program had to be tagged and their index cards updated.  This would take anywhere from 3 days to a week under the old card system.

The new system did it in a few hours and produced timely reports on all parts and their locations.

Next system was a database system for Flight Test electronics like Vishay resistors, etc.

Another project was providing the documentation and training aids for the Digital Data Acquisition System for the F-18.  This system connected directly to the plane's computer system, uploaded mission information and, subsequently, downloaded mission results.

The old system was called the “taco wagon”.  It was a large roll-about cart.  These carts cost about $300K and used a card reader to upload mission info.

Our system replaced the “taco wagon” with an early Compaq laptop that cost about $3K.

Then Flight Test submitted my name as one of two people for a special program.  The other person from Flight Test was an ex-Air Force major who flew F-15s.

I went through the program and upon returning to Flight Test was asked to make a presentation to our executives.

The major and I made our presentation and opened it up to questions.  Our VP asked the major a question that started with the phrase, “As Flight Test’s designated expert in this area…”.

Later, I told my section chief what happened.  He said, and I quote, "Flight Test is not ready for a woman to be expert at anything."

These are two of the most glaring examples.  There were lots of others.

However, when I left, the VP said, “We will tell your story around the campfires.”

I took that as the highest compliment.

After Flight Test, I worked on AI projects before returning to Flight Test.

The Dee Howard Company in San Antonio ran an ad in Aviation Week for Flight Test Engineers.  I answered the ad.  Alenia (the world's largest aerospace company, headquartered in Italy) had a stake in Dee Howard.  They were taking on a new project, the UPS 727QF.  The FAA had mandated that all cargo aircraft had to cut their engine noise levels.  UPS decided to re-engine its 727 aircraft with new, quieter Rolls-Royce Tay 650 engines.  Dee Howard was to do the work and conduct the testing.

At the time of the contract, the count of planes to be re-engined was given as 60-plus.  The number actually re-engined has been variously given as 44 and 48.

Previously, Dee Howard was known for customizing aircraft interiors.  The interior of the 747 that NASA uses to ferry the space shuttle was done by the company.  They also fitted an emir's 747 with a movie studio, solid gold toilet fixtures, and a complete operating suite.  The tale was that the emir, who had a bad heart, had a living heart donor traveling with him at all times.  Anyway, it makes a nice story.

I was hired in and proceeded to work on the system for the new program.

We were the first to replace all three engines on the 727.  Previously, only the two external engines had been replaced; the tail engine was left as is.

We were to have two planes.  The critical plane was to have a new data acquisition system.  The other plane was to use a system from Boeing - ADAS.  Originally designed in 1965, ADAS had 64K of memory, filled half a good-sized room, used 8-inch diskettes, and its measurements were programmed by way of EEPROM.

The new acquisition system was better.  We bought a ruggedized cabinet and started adding boards.   PC-on-a-chip wasn’t quite there, but we did have PC-on-a-board and we could set things up via a PC interface.  To analyze the PCM data stream I used BBN/Probe instead of the custom software that was used on the previous system.

First flight came.  The system came up and stayed up.  Except for the one time the flight engineer turned the system on before power was switched from the APU to the aircraft (the 8-mm tape recorder died), it worked every time.

On the fighters, flight test equipment was mounted on pallets in the bomb bay.  It was neat to ride on the plane during a test flight.  An airplane, with all the seats and padding removed, is your basic tin can.

I always got along with the technicians.  They are the ones who do the real work.  They make the engineer’s design come to life or markedly point out the error of his ways.

It was really nice to ask for such and such a cable or gadget and have it brought to my desk. The best (and worst) part of the program was field testing.  I got to go to lively places like Roswell, NM, Moses Lake, WA, and Uvalde, TX with 35 guys.  The length of the test depended on flying conditions.  We were usually stuck there for 3-4 weeks.

We also did some testing at home.  For one ground test, we taped tiny microphones to different places on the engines.  The microphones were connected to the acoustic analyzer and DAT recorders.  The engines were then run at various levels.  I ran the acoustic analyzer for a few seconds for one set of mikes, then flipped to record another mike set for a few seconds more.  I had to wear headphones for the noise.  We had to yell anyway.  It was really hot, because this was San Antonio in July.  We were on an unused runway at the airport, next to a well-traveled road.  The test took several hours.  The guys took the door off the head (which was right across from the open front access door) so I could watch the traffic when I used it.  (Did I tell you I had to clean the head when the test was over?)

As the revs got higher, the airplane moaned and groaned.  One engine finally belched.  We were lucky it didn’t catch fire!

Other testing conditions were just as much fun.  Roswell has the desert.  Desert dust at 35 knots is awful.  Tumbling weeds have nasty stickers.  Moses Lake had volcanic ash.  Mt. St. Helens dumped about a foot of ash at the Moses Lake airport.  Airport officials dumped the collected ash on a spot at the airport that they thought was unused.  One of our trucks got stuck in it.

At Uvalde, we had heat and gnats.  You inhaled them and they flew down your throat.

A local asked where the women went to the bathroom because there weren’t any trees.

Other than the conditions, there was the schedule.  The equipment had to be set up, calibrated, and ready to go at sun-up.  If conditions were good, we worked all day with a break for lunch and put everything away after dark.

Wake-up was 3 or 4 a.m.  We usually got back to the motel at dark, after we prepped the plane.  It got to the point of choosing between going out to get dinner or getting an extra hour's sleep.

The testing was fun, too.  The plan was to fly over at different altitudes carrying varying weights.  (We had to unload 14 thousand pounds of ballast at one point, consisting of 50-lb. round lead weights with handles on each side.  I took my place in line with the guys.  Same thing with the car batteries for the transponders and loading and unloading the generator from the truck.)

The locals thought we might be flying in drugs, so they called the law, and the local sheriff came to call.

The testing, when in progress, was intense.  After set-up, the microphones were calibrated.  We had mikes at the center line and other mikes on the periphery.  I ran the acoustic analyzer.

I set the analyzer to trigger on a signal from the aircraft and turn on the tape recorders.  After the fly-over I had to download the data, pass it off to an engineer who analyzed it via a curve-fit program, and reset everything for the next fly-over.

The fly-overs came one after the other about 5-7 minutes apart.  We had to re-calibrate the mikes after a few, so we got an extra 5-10 minutes.  We got a break for lunch (with the gnats).

It was hard, dirty work.  But it was fun - and dangerous.  One test consisted of engine stalls on a 30-year-old aircraft at 19,000 ft. (it was too turbulent down below).  Another test had the aircraft stall during takeoff with different loads.  I was on board and loving it.

My supervisor said that “field testing separates the men and women from the boys and girls.” He was right.

One day we had a visitor in the lab.  One of the techs was working on something and let loose a string of expletives.  The visitor said the tech should be quiet because there was a lady present.  The tech looked at the visitor and said, "That's no lady, that's Pam!"  (You had to know the guys.  I took it as a compliment.)

If you haven’t guessed, as a woman working in this environment, you have to have a thick skin.  You have to work really hard because you have to be really, really good.   But I think it’s all worth it.

It’s too easy to become a “he-she” or a “shim” (dress and act like the guys), but I didn’t.  I wore my hair longish and always wore make-up.  Even in the field.  I laid out my clothes the night before.  I could be up and ready to go in 5 minutes, complete with mascara, eye liner, sunscreen and blush.  I always had my knitting nearby.

It was hard work, but it all paid off.  The UPS plane was certified the latter part of 1992.

I’ve got some memories, some good, some not, but I know I made the grade.

Addendum -

The group working on this project was international in scope, and I worked closely with most of them.  We had several people from the British Isles representing Rolls-Royce, including J., who spoke with a fine Scottish accent.

I worked with two engineers from Alenia on the acoustics aspect of the program.     They hailed from Naples, Italy.   M. spoke excellent English.  E.  didn’t, but engineering is universal, so we were able to make it work.  Another acoustic team member was a Russian, L.  His English was also excellent.

I can honestly say that I know Fortran in Russian and Italian.   I had to grab whatever Fortran text I could find in a pinch and the Russian or Italian text was usually the closest.

We communicated with facilities in Italy, England, and France on an almost daily basis. The time difference was the only snag.

It was interesting to see how our American ways are interpreted by other cultures.

ASN.1 to XML: The Process

asn2xml

Jim Ostell, speaking at the observance of the 25th anniversary of NCBI, stated something along the lines of, “then they wanted XML, but nah..”.

While working on the filters for the LARTS product, most specifically, the GenBank-like report, I realized how tightly-coupled the NCBI ASN.1/XML is to the toolkit. 

Basically, you've got to understand the toolkit code in order to translate what the XML is saying.  The infinite extendability and recursive structure of the ASN.1 data model is another conundrum.  This is especially true of the ASN.1 data structures supporting GenBank data - Bioseq-set.  For example, a phy-set (phylogeny set) can include sets of Bioseq-sets nested to several levels.  Most Bioseq-sets are the usual nuc-prot (DNA and translating protein), but others are pop-sets, eco-sets, segmented sequences with sets of sequence parts, etc.

After we developed LARTS, I wrote the GB filter as a Java object.  It was an interesting experience. 

NCBI ASN.1 rendered as XML, either our version or the NCBI asn2xml version, is very dependent on  the NCBI toolkit code for proper interpretation.  

The two most glaring examples are listed below.

Sequence Locations

Determining the location of sequence features for a GenBank data report is a prime example.  Here are a few simple examples:

primer_bind   order(complement(1..19), 332..350)
gene                complement(join(1560..2030, 3304..3321))
CDS               complement(join(3492..3593, 3941..4104, 4203..4364, 4457..4553, 4655..4792))
rRNA  join(<1..156, 445..478, 1199..>1559)
primer_bind   order(complement(1..19), 1106..1124)

For Segmented-sequences:
CDS         join(162922:124..144; 162923: 647..889, 1298..1570)

CDS (coding region) locations have frames, bonds have points (which can be packed), a minus strand denotes a complement (reverse order), a set of sequence locations for a sequence feature (packed-seqint) denotes a join, locations can be "order("ed or "one-of", and fuzz-from and fuzz-to have to be taken into account for points and sequence intervals.  A simplified sketch of flattening such a location string is given below.
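
For a feel of what the toolkit has to do, here is a minimal Java sketch (not LARTS code) that flattens a simple location string such as complement(join(1560..2030, 3304..3321)) into intervals plus a strand flag.  It ignores order(), points, segmented-sequence references, and fuzz beyond stripping the < and > markers, all of which the real toolkit handles.

    import java.util.*;
    import java.util.regex.*;

    public class SimpleLocationParser {
        public static void main(String[] args) {
            String loc = "complement(join(1560..2030, 3304..3321))";
            boolean minusStrand = loc.contains("complement(");

            // Collect every from..to pair, stripping fuzz markers (< and >).
            List<long[]> intervals = new ArrayList<long[]>();
            Matcher m = Pattern.compile("[<>]?(\\d+)\\.\\.[<>]?(\\d+)").matcher(loc);
            while (m.find()) {
                intervals.add(new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) });
            }

            // On the minus strand the joined intervals are read in reverse order.
            if (minusStrand) {
                Collections.reverse(intervals);
            }
            System.out.println("strand: " + (minusStrand ? "minus" : "plus"));
            for (long[] iv : intervals) {
                System.out.println(iv[0] + ".." + iv[1]);
            }
        }
    }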

Sequence Format

DNA sequences are stored in a packed 2-bit or 4-bit per letter format (ncbi2na and ncbi4na).  2na is used if the sequence does not contain ambiguity, otherwise 4na is the format of choice. The sequence must be unpacked to be useful. This takes a basic understanding of Hex(adecimal).
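
As an illustration only (the toolkit has its own unpacking routines), here is a sketch of what unpacking ncbi2na amounts to:  each byte holds four bases at two bits each, with the first base in the high-order bits, and the two-bit values map to A, C, G, T.  The sample input bytes below are made up.

    public class Ncbi2naDecoder {
        private static final char[] BASES = { 'A', 'C', 'G', 'T' };

        // Unpack 2-bit-per-base data: four bases per byte, first base in the high bits.
        static String unpack(byte[] packed, int sequenceLength) {
            StringBuilder sb = new StringBuilder(sequenceLength);
            for (int i = 0; i < sequenceLength; i++) {
                int b = packed[i / 4] & 0xFF;          // byte holding this base
                int shift = 6 - 2 * (i % 4);           // 6, 4, 2, 0 within the byte
                sb.append(BASES[(b >> shift) & 0x3]);  // two bits become one letter
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // 0x1B = 00 01 10 11, which decodes to A C G T; one more base comes from the next byte.
            byte[] packed = { (byte) 0x1B, (byte) 0xC0 };
            System.out.println(unpack(packed, 5));     // prints ACGTT
        }
    }
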
Toolkit

The NCBI Toolkit contains all of the code necessary to render a GenBank report from the ASN.1 binary or ASCII data file.  (The code is there, but you have to figure out how to compile it into an executable.)

We took the toolkit code and converted  it to Java to produce the GenBank-style output format.  It differs from the actual NCBI GenBank Report in that the LARTS report lists a FASTA-formatted sequence instead of the 10-base pairs per column that the NCBI GenBank Report produces.

The Many Variations of LARTS

GenBankReportFilter.java is provided as an example with Stand-Alone LARTS.  The LARTS Reader enables the GenBank-style report.

Using LARTS Online, the user can select the GenBank-style report as the desired Output Format.

A third option would entail using LARTS Online to obtain the keyword or keyword/element-path data wanted in XML format.  This data is then downloaded to a local machine via the Thick Client option.  Finally, Stand-Alone LARTS would process the downloaded XML data into a GenBank-style report.

Stand-Alone LARTS provides example filters and SQL for processing XML and loading the relevant data into a local SQL database.  This includes sample code for  the BLOB and CLOB objects.

The filter for FASTA-formatting sequence data is also available as an example with Stand-Alone LARTS.

These options provide ready access to NCBI data for your research.

Programming Practices to Live By

Programming Practices

I’ve been privy to all sorts of coding adventures.   I’ve had a website and supporting components dropped in my lap with little overview other than the directory structure.  I’ve had to plow through the methodology and software written for one aircraft certification program to determine if any of it was relevant for the next certification project.
In either case, it wasn’t a lot of fun.   I spent lots of time reading code and debugging applications in addition to talking to vendors, customer support, technical staff, etc.

There were days when I would have killed for a well-documented program.  Instead, I had to spend weeks in debug, poking around, learning how things worked.  In both cases, the developers of said projects were no longer available for consultation.

Here are few programming practices that I’ve tried to adhere to when writing code.

Document, Document, Document

I am a big proponent of Javadoc.  Javadoc is a tool for generating API documentation in HTML format from doc comments in source code.  It is distributed as part of the Java 2 SDK.  To see documentation generated by the Javadoc tool, go to the J2SE 1.5.0 API Documentation at
http://java.sun.com/j2se/1.5.0/docs/api/index.html.  See "How to Write Doc Comments for the Javadoc Tool" at http://java.sun.com/j2se/javadoc/writingdoccomments for more information.

Other languages have similar markup languages for source code documentation.

Perl has perlpod - http://perldoc.perl.org/perlpod.html
Python has pydoc - http://docs.python.org/library/pydoc.html
Ruby has RDoc - http://rdoc.sourceforge.net

I usually start each program I write with a comment header that lists: who developed the program, when it was developed, why it was developed, for whom it was developed, and what the program is supposed to accomplish.  It's also helpful to list the version number of the development language, and any dependencies, such as support modules that are not part of the main install and were downloaded for the application.

Each class, method, module (etc.) should be headed by a short doc comment containing a description followed by definitions of the parameters utilized by the entity, both input and output.
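
Here is a minimal sketch of that kind of header and doc comment, in Javadoc style.  The project details, class, and method are placeholders invented for illustration.

    /**
     * SequenceCleaner - strips gap characters from FASTA-style sequence strings.
     *
     * Developed: 2009, for a (hypothetical) annotation pipeline.
     * Author:    P. Example
     * Language:  Java 1.5
     * Depends:   nothing beyond the standard library
     */
    public class SequenceCleaner {

        /**
         * Removes gap characters from a nucleotide sequence.
         *
         * @param sequence the raw sequence, possibly containing '-' gap characters
         * @return the sequence with all gap characters removed
         */
        public static String removeGaps(String sequence) {
            return sequence.replace("-", "");
        }
    }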

Coding practices

I think all code should read like a book.  Otherwise -

- Code should flow from a good design.
- The design should be evolutionary.
- Code should be modular.
- Re-use should be a main concern.
- Each stage of development should be functional.
- Review code on a daily or at least a weekly basis.  (I've mostly found that peer review can be a great time-waster.)

There are several design methodologies, such as Extreme Programming, that are the flavor du jour.  None has been completely successful in producing perfect software.

To-Do Lists - Use them!

There are project management and other tools available for this, but a plain-text To-Do file that lists system extensions, enhancements, and fixes is a good thing.  It's simple:  you don't need access to, or have to learn how to navigate, a complex piece of software.

Find Out Who Knows What and GO ASK THEM

I worked on a project and shared a cube with a guy named Al.  Al was not the most pleasant person (he was the resident curmudgeon), but we got along.  Al had been working on the huge CAD (Computer Aided Design) project since the first line of code was written.  If I couldn’t understand something, a brief conversation with Al was all I needed. 

Every time a programmer on that project complained to me about not understanding something, I told them to go ask Al.  However, I ended up as the “Go Ask Al” person.  I didn’t mind, as we became the top development group in that environment.

Use Code Repositories

The determination of which code repository to use - SVN (Subversion), Mercurial, and Git are the big three - has become something of a religious issue.  Each of them has its pros and cons.  JUST USE ONE!

Integrated Development Environments (IDE)

There are several of these available.  My favorite is Eclipse (www.eclipse.org).  I've also used SunONE, which became Creator, which was rolled into NetBeans, which was out there all along.

There are a lot of plug-ins available for Eclipse and NetBeans that enable you to develop for almost any environment.  The ability to map an SQL data record directly to a web page and automatically generate the SELECT statement to populate that page has to be at the top of my list.

I've used VisualStudio for C++ development in a Windows environment for a few applications, but most of my development has been on Unix, Linux, and Mac platforms.

The problem with most IDEs is that they are complex, and it takes a while for a programmer to become adept at using them.  You're also locked into that development methodology, which may prove inflexible for the applications under development.

We used EMACS on Linux for all LARTS development (LifeFormulae ASN.1 Reader Tool Set, my current project), although my favorite editor is vim/vi.  I can do things faster in vi, mainly because I’ve used it for so long. 

Which Language?

My favorite language of all time is Smalltalk.  If things had worked out, we would all be doing Smalltalk instead of Java. 

Perl is a good scripting language for text manipulation. It’s the language of choice for spot programming or Perl in a panic.  Spot programming used infrequently is okay.  However, if everything you are doing is panic programming, your department needs to re-think its software development practices.

Lately, I’ve been working in Java.  Java is powerful, but it also has its drawbacks. 

We will always have C.  According to slashdot.org, most open source projects submitted in 2008 were in C, and this was by a very wide margin.  C has a degenerative partner, C++.  C++ does not clean up after itself; you have to use delete.  And a line like Foo x(); looks like a call to the default constructor but is actually parsed as a function declaration.

Fortran is another one that will always be around.  I’ve done a lot of Fortran programming.  It is used quite extensively in engineering, as is Assembly Language.  I have been called a “bit-twiddler” because I knew how to use assembler.  

Variable Names

This is a touchy subject.  I've been around programmers who have said code should be self-documenting.  Okay, but long variable names can be a headache, especially if one is trying to debug an application with a lot of long, unwieldy variable names.

Let’s just say variable names should be descriptive.

Debugging 

This should probably go under the IDE section, but I’ll make my case for debugging here.
Every programmer should know how to debug an application.

The simplest form of debugging is the print() or println() statement to examine variables at a particular stage of the application.

Some debugging efforts can become quite complex, such as debugging a process that utilizes several machines, or pieces of equipment such as an airplane.

Sun Solaris has an application, truss, that lets you debug things at the system level by following the system calls.  The Linux equivalent is strace.

I am most familiar with gdb - The GNU Debugger.   I’ve also used dbx on Unix.

The IDEs offer a rich debugging experience, and plug-ins let you debug almost anything.

Closing Thoughts

Always program for those coming behind you.  They will appreciate your effort.

It’s best to keep it simple.  Especially the user interface. 

Speaking of users, talk to them.  Get their feedback on everything you do to the user interface.  I spent my lunch hour teaching the Flight Test Tool Crib folks (for whom I was creating a database inventory system) how to dance the Cotton-Eyed Joe.  The moral:  keep your users happy.

Technology is wonderful, but technology for technology's sake ("because we can!") is usually overkill.
The guiding principle should be — Is it necessary?

By the way, going back to that certification project:  I found that instead of spending $15K for a disk drive, the company had been conned into spending $750K for custom-developed software that I had to throw in the trash, except for one tiny piece.  Was it necessary?  The point is that the best answer is not always the most expensive one.

Common Sense to Live By

As promised, given below is a list of bioinformatics “horror stories”.    These are a few of the situations and people I have encountered through the years.   Names have been changed to protect the innocent.

 

People Issues

 

The following are “people” problems.  (Just about every discipline has these denizens.)

 

1) Make sure the person selected for the task has the skills required for the task, or at least the desire to learn those skills.

 

We were sent a programmer, Joseph, who was underwritten by another lab, but who would be using our equipment.   We were to assist him in supporting a research lab’s website.  His duties consisted of capturing genomics data from various sites and data resulting from analysis of the research lab’s data.  This data was then to be displayed on the lab’s web site. 

He was given a workstation and access to our support literature and introduced to another programmer, Jay,  that he could turn to for assistance.  

I realized he was in trouble a couple of days later when Jay came to me and said that Joseph was having problems.  He had referred him to the books and manuals we had in the lab that gave him exact examples of what he was trying to do, but Joseph said he couldn’t understand them and that they were basically a waste of time.  Jay even worked out the code that Joseph needed, but Joseph said it didn’t work.  The code consisted of about 10 lines of pretty straight-forward Java.

It didn’t work because Joseph had several “typos” when he typed in the code and tried to compile it.

I tried to help him with batching URL retrievals, but he didn't understand even after he said he had read the Perl books we had.  I ended up writing the five lines of Perl code needed for the simple HTTP retrieval because the job had to get done.

I talked to the P.I. of Joseph’s home lab.  Seems that Joseph had a little experience programming on the Windows platform and wanted to learn more about bioinformatics.  The P.I. did acknowledge that Joseph really did have a tendency to get other people to “help” him do his tasks.

Joseph was sent back to his home lab.

 

2) Find out everything you can about the person you are considering

 

My P.I. hired a programmer who had sterling references from a previous lab.  It didn't take long to find out that the lab had given glowing references because they wanted to dump this person, Jane; they said she couldn't program, among other things.  It seems this had been the case in all of the other labs she had been with.

I asked my P.I. why, why??  He said that he thought we would be the ones who would finally be able to develop her skills.

I said that I didn't think so.  She didn't know the basics, and she tried to cover by saying the other programmers in the lab were out to undermine her.  Consequently, this caused a lot of unrest in the lab.

She was given a support role and eventually went to another lab.

 

Project Management

 

1)  Set goals

 

There was this five-year grant that was in its last year.  The first four years had been spent working out the program design; very little coding had been accomplished other than a demo or two.  I got involved in the final year because they needed to produce something that could be referred to as a product.

The group was situated in a small area of some four offices and three cubicles.  Every wall had been covered with whiteboard for my arrival, so that we could design the final model!

Design is good, but know when to say enough is enough.

The project was never really finished. The grant was not renewed.

 

2) Let those who can help you, help you

 

I got involved in a project whose purpose was to accumulate data from various sources for storage in an Oracle database.

After determining the data required and gathering that data, I generated the six tables in UML (Unified Modeling Language) and subsequently the SQL that could handle the data.  One table was a sequence identification table, that is, a table that held the ID associated with each sequence in various databases such as GenBank and ENSEMBL.

One project member, a P.I. of another lab involved in the project, stated that she had read a book on SQL and she knew what to do.

Needless to say, she didn’t understand relational databases at all.

Instead of six tables, the database finally evolved into over 200 tables under her oversight.  Most of these tables held just two entries - an index and a sequence identification tag.

 

3) Ask around, someone might have a better way

 

I was asked to help by a lab that was having trouble with some code developed by a programmer who had moved on.  The lab technician who used the software said that it took 19 hours to assemble the data required to define the wells on a microarray plate.

I took a look at the code.  By using the NCBI toolkit, several Perl scripts, and a database,  I was able to reduce 19 hours to about 20 minutes.

The previous programmer  used this elaborate system of indexed GenBank reports.  By using the toolkit I was able to process the NCBI ASN.1 files directly.

 

Software Issues

 

1) Software has its limits

 

One lab was using FileMaker Pro for data storage.  This was okay at first, but with 500 files growing toward the 2-GB file limit, FileMaker was struggling.

Data access proved much more timely once the data was ported to an Oracle database.

 

 2) Read the Manual

 

A sequence is a string of letters.  As such, there is only so much you can do in searching strings.  The word size of the search is limited.

One lab was analyzing a sequence against the entire genome of a selected organism using open source software.  This software wasn’t intended to search the entire genome, just short pieces of it. 

After the process took some 5 days to partly analyze just one sequence, the lab technician decided that this widely utilized open source  program had to be rewritten.

The request was declined.

 

3) Document your code

 

We were called in to save a pharmacology database developed in Access.  The original developer used Access because he "sort of" knew how to develop input screens in Access.  The lab ran into trouble when the developer left.  No one in the lab was able to take over the application, and everyone else they asked to look at the project left shaking their heads.  There was no documentation of record.

The data was ported to an Oracle database with web-enabled user input and reporting functions.

 

4) Verify that the process completed

 

One research group created a process that was to automatically archive the day's research data to backup.  They assumed everything was okay, until they lost a hard drive and found out that the automatic nightly backup never happened because the filename, which explicitly listed the physical location of the data file, was too long for the archiving software.  The backup failed with an error message, but no one ever checked.
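
A minimal sketch of the kind of check that would have caught this, assuming the backup is driven by an external command such as tar (the command and paths here are placeholders):

    import java.io.File;

    public class NightlyBackup {
        public static void main(String[] args) throws Exception {
            // Placeholder command and paths; substitute the real backup job.
            ProcessBuilder pb = new ProcessBuilder(
                    "tar", "-czf", "/backups/lab-data.tar.gz", "/data/research");
            pb.inheritIO();                       // let errors show up in the job's log
            int exitCode = pb.start().waitFor();

            File archive = new File("/backups/lab-data.tar.gz");
            if (exitCode != 0 || !archive.exists() || archive.length() == 0) {
                // Don't assume success; fail loudly so someone actually looks at it.
                System.err.println("BACKUP FAILED: exit=" + exitCode
                        + ", size=" + (archive.exists() ? archive.length() : -1));
                System.exit(1);
            }
            System.out.println("Backup OK: " + archive.length() + " bytes");
        }
    }

The specifics matter less than the habit:  check the exit status and verify that the thing you meant to create actually exists.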

 

Some things you just can’t help

 

One morning I arrived at the lab and found everyone on the floor waiting for me.  They couldn’t access the server to read mail, etc.   

I opened the lab, looked in the direction of the server and found an electrical plug pulled out of a socket. 

It seems that the nightly housekeeping crew needed an electrical outlet for the vacuum cleaner, and the one used by the server was the handiest.

 

One More…

 

Our lab paid the institution’s IT department for a monthly back-up of our computers. 

One morning, I came in, and everything was dead.  I called our lab sys admin, told him to investigate.

Well, turns out IT hadn’t really done a back-up of our system in over 3 years.  Apparently, they tried over the weekend.  (Our lab sys admin wasn’t involved in the process, as he was subsidized by our department and not IT.)

At the start of the process, the date command produced the proper output.  At the end of the process, the date command produced the output "no date command found" - anywhere.

I forget exactly what got deleted or screwed up, but everything had to be rebuilt.

Luckily, I had used one of the seldom-used machines to mirror our data, etc. on a daily basis. So, once the machines were back (2 days), we were okay and didn’t lose much.

 At this time, the average life span of a sys admin in IT was around 6 months.

These are just a few of my encounters in the field of the life sciences.  I won't go into the ones from engineering, but I've got some beauts - especially as a woman in engineering.

Computer Science Wild

(I’m delaying the “horror stories” until next week, because I want to fully document them all.)

 

I ran across the phrase "computer science wild" at a recent conference.  I've got my own thoughts, especially since the list of the top 25 coding errors was released yesterday.  The link to the article is - http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9125678&source=NLT_SIT&nlid=91.

 

I think any programmer should have the opportunity to write software that might kill someone, blow up an extremely expensive piece of equipment, or cause a waste of thousands of dollars because the system is down.  Maybe then they would think, write better code, and debug the software thoroughly before they released it into the wild.

 

The Ariane 5 rocket blew up on take-off because the software didn't contain an exception handler for buffer overflow!  (This translates to something like an array overflow.  An overflow would trigger a programming mechanism that would write out the buffer contents and clear it.  The usual device is to transfer data capture to another buffer while the full buffer is written to I/O.)
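
As a hypothetical illustration of the double-buffering device described above (a sketch, not flight software):  when the active buffer fills, swap in the spare and flush the full one, instead of letting the overflow go unhandled.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    public class DoubleBufferedCapture {
        private byte[] active;
        private byte[] spare;
        private int used = 0;
        private final OutputStream sink;

        DoubleBufferedCapture(int capacity, OutputStream sink) {
            this.active = new byte[capacity];
            this.spare = new byte[capacity];
            this.sink = sink;
        }

        // Capture one sample; on overflow, swap buffers and write out the full one.
        void capture(byte sample) throws IOException {
            if (used == active.length) {
                byte[] full = active;
                active = spare;           // keep capturing into the spare buffer
                spare = full;
                used = 0;
                sink.write(full);         // in a real system this write would be asynchronous
            }
            active[used++] = sample;
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DoubleBufferedCapture capture = new DoubleBufferedCapture(4, out);
            for (int i = 0; i < 10; i++) {
                capture.capture((byte) i);    // overflow is handled, never fatal
            }
            System.out.println("bytes flushed so far: " + out.size());  // prints 8
        }
    }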

 

The excuse for the disaster was that the specifications didn’t spell out the need for that programming mechanism.   An exception handler is a very basic mechanism for catching and correcting errors.  There is no excuse for this oversight.

 

One major project I worked on was acoustical (noise) testing of aircraft engines.  Our crew would go to some really great places like Roswell, NM, Moses Lake, WA, or Uvalde, TX.  We would record and analyze the noise of the engines as the aircraft flew over at different altitudes with variable loads in various approach patterns.

 

There were several pieces of software that had to work in tandem.  The airborne system, the ground-based weather station, the meteorological (met) plane, the acoustic data analyzer, and the analysis station all had to work together to get the required results.

 

There was no room for error.  Measurements had to be exact, even out to 16 places after the decimal point.

 

Modeling techniques, programming languages, and IDEs (Integrated Development Environments) have become very sophisticated and complex.  A programmer today can "gee whiz" just about anything.

“Because we can” has become the norm.

 

This is great, but I’ve run into lab techs, etc. who were just this side of computer illiterate.  Like my dad, they adhere to a limited number of computer applications, accessed by a few key strokes or mouse clicks they have memorized.

 

And don't think that engineers are immune.  They had to be drawn "screaming and kicking" away from their slide rules.

 

I’m for simple to start.  You can always add more “bells and whistles” as the system (and its users) matures.

 

 

 
