This blog entry is trying to describe a moving target. Every time I get ready to cut it loose, some other wing nut comes out of the woodwork. Anyway, here goes.
When Bill Nye, the Science Guy, has a video on YouTube about the current state of science teaching in our public schools, you just have to say something.
Add to that the current conventional wisdom that says we have a shortage of scientists and engineers and will lose our nation’s competitiveness if we don’t do something.
None other than Bill Gates testified to this before Congress in order to increase the number of H-1B visas available to bring foreign engineers and software developers into the country.
Follow this up with the political firestorm of the recent presidential election, where the GOP repeatedly demonstrated that they are anti-science or cited dubious research to back their claims.
No less than Florida Senator Marco Rubio stated that the world was 6,000 years old because the Bible says so.
He has since walked back this claim, saying that the world is some 4 billion plus years old but that teachers can teach the 6,000-year date if they so choose.
This is really disturbing, as he sits on the Senate Commerce, Science, and Transportation Committee.
Over in the House, we have the Republican Representative from Georgia who has stated that Evolution and the Big Bang Theory are “lies straight from the pit of hell.”
Hanging in the balance are funds for scientific research and development which could be drastically cut. Some research, such as stem cell, will be drastically curtailed or even eliminated.
Even agencies such as the EPA (Environmental Protection Agency) and NOAA (National Oceanic and Atmospheric Administration) could be eliminated.
Birth control and abortion, even in cases of rape or incest, could be done away with.
The latest on this front is from Oklahoma. The state is seriously considering passing a law stating that a woman who becomes pregnant from a rape cannot abort the fetus, because the fetus itself is evidence of the rape.
The most insidious occurrence has to do with the science education in our schools.
Some states, such as Louisiana, Florida, and Texas, want creationism as explained by the Bible to be taught as a basic fundamental fact, and evolution discussed as a “theory”.
The Louisiana schools are adopting texts from Bob Jones University. These texts are filled with such statements as, “the Ku Klux Klan was a benevolent organization.”
How much remedial teaching will be required before these students can even attempt college? Unless they go to Bob Jones University, of course.
Here are some direct quotes from texts, including books from Bob Jones University Press.
- Only ten percent of Africans can read or write, because Christian mission schools have been shut down by communists.
- The Ku Klux Klan in some areas of the country tried to be a means of reform, fighting the decline in morality and using the symbol of the cross…In some communities it achieved a certain respectability as it worked with politicians.
- God used the ‘Trail of Tears’ to bring many Indians to Christ.
- [It] cannot be shown scientifically that man-made pollutants will one day drastically reduce the depth of the atmosphere’s ozone layer.
- God has provided certain ‘checks and balances’ in creation to prevent many of the global upsets that have been predicted by environmentalists.
- Unions have always been plagued by socialists and anarchists who use laborers to destroy the free-enterprise system that hardworking Americans have created.
To bring the matter closer to home, there was a fascinating documentary on TV. Scott Thurman’s “The Revisionaries” aired as a PBS “Independent Lens” episode on Monday, Jan. 28, 2013.
The documentary pointed out in vivid detail how the “revisionists” are setting the standards for Texas public school textbooks on science, history, social studies, and so on. If possible, by all means, watch the show. The Texas Freedom Network has been following this process and has to be lauded for its efforts in helping the “truth” survive. It is countering the efforts of the religious right to inflict their beliefs on public education in Texas. According to one person interviewed, there has been a concentrated effort in Texas for approximately 18 years on the part of the revisionists to influence Texas school children in grades 3 through 12.
This is very, very frightening news, as Texas is the largest market for school textbooks in the nation. As goes Texas, so goes the nation.
Every school should take a serious look at eBooks on the iPad, Kindle, Nook, or other such device. The teacher would then, hopefully, have the freedom to choose from a number of textbooks that are anathema in Texas: books that were not dictated by extreme right-wing adherents such as Don McLeroy, a dentist, who is determined that “intelligent design” be taught versus evolution, which he considers a theory, not fact.
Another encouraging sign: there is an effort underway to bring Texas back from the far, far right. People are becoming aware of what is afoot. They may be late to the dance, but at least they are listening now.
Children in Texas and elsewhere (if the Texas textbooks are adopted) will be graduating from high school with theories that are not accepted by most of the world. This does not bode well for a scientific viewpoint.
As it is, we are falling fast. We went from first to 15th in an Innovation Index in one year’s time. Are we in a race to the bottom?
The most bloodcurdling statement in the entire documentary came from McLeroy. When asked by a commentator whether Texas can come back to the 21st century in light of the recently revised texts, he answered, “Long after my grandchildren are gone. And I am proud of that fact.”
The face of our state will be changed for decades. The current standards will not be reviewed again until 2020. So very, very sad for our children and Texas.
To get back to Bill Nye, his message is this. “Parents, if you want to believe in creationism, even though evolution is accepted throughout most of the world, don’t force this belief on your children. We need scientists with open minds, and engineers who can make things.”
I agree. We need to do everything we can to ensure that science teaching is based on proven, accepted facts, not revisionist history or a questionable source such as the Bible.
I currently live in a part of Texas that is pretty much smack in the middle of the Eagle Ford Shale area.
The former quiet of this small town has been shattered by the almost non-stop noise of heavy traffic at all hours of the day and night. It doesn’t help that Main Street through the downtown area is also a state highway that is a main artery to communities, oil rigs, and processing facilities in the area.
The motels in the area are all booked, there are RVs (recreational vehicles) and campers everywhere, and rent houses are going for something like $1K a month (if you’re lucky enough to find one).
Some derricks that previously dotted the skyline have been replaced by “Christmas trees,” which are the sign of a successful well, and gas flares light up the night sky.
If you own land and the mineral rights that go with it, you can lease the right to access those minerals for a pretty penny. All land in the Eagle Ford Shale domain is considered, even graveyards. And, it’s even better if they find oil.
Money is flooding into town. Prices have risen on just about everything as merchants want to grab just as much of that green as they can.
Single motel rooms are $125 a night. Gas is 30 to 75 cents more than the state average.
Restaurant prices have doubled if not tripled. Food prices at the grocery store are along the lines of those at a convenience store.
Prices are the downside of the whole affair. What if you don’t own land that could possibly harbor oil or gas? What if you are on a fixed income?
The closest town is 14 miles away and prices are just about the same as they are here.
The businessman wants to make a buck. So what is the problem?
The word “gouge” comes to mind.
The definition according to the Merriam-Webster website (www.m-w.com): gouge (verb) – to charge (someone) too much for goods or services. Synonyms: overcharge, soak, sting, surcharge. Related words: cheat, defraud, stick; clip, fleece, skin; mischarge.
That said, let’s look at Information Technology.
Consultants are in a position to do the same to their clients. They can gouge the unsuspecting by deception - providing false information about current IT situations and suggesting remedies that will provide the maximum result - to themselves, not the client.
That brings up some other definitions, also courtesy of www.m-w.com.
Deceive – to mislead by deliberate misrepresentation or lies; to be false to or dishonest with. Synonyms: cozen, delude, lead on, betray, victimize, chisel, dupe, trick, bamboozle.
Double-cross – implies the betrayal of a confidence or the willful breaking of a pledge.
That said, what happens when the boom or the project is over?
I remember a previous boom, when a company that manufactured rigs for the oil field stated it had a $4 billion backlog of orders.
Well, the bust came, the orders evaporated, and the company did just that: it busted.
People will remember. Maybe it won’t matter much in this smaller town, because the alternative is to drive to the next one, but they will remember, and they will probably make that trip to the other town or they won’t buy as much. A losing situation for the local merchant either way.
Those hotels and the RVs will go vacant, the stores will drop their prices, and things will eventually return to normal. But the rift will remain.
Unsatisfied customers don’t make good references, and that can be a bad, bad thing in this economy. What goes around comes around.
It may be legal, but it is just not moral.
I made a recent life change that contained a silver lining. It revealed that the simple way is usually the best way. It also revived a few latent talents and I discovered one I didn’t know I had.
First, let me explain what happened. My parents are elderly, living alone, and needed some looking after. I talked to my business partners, and we agreed that I could do what needed to be done with our product over the web, provided we got together on a bi-weekly or monthly basis.
So my husband and I moved. We squeezed into a smaller living situation, big city to very small town. I grew up in a small town, but after 30+ years away, the return was a little traumatic.
There were several handicaps about living in this small town that we were aware of, but since we were usually “just visiting,” we didn’t pay them much mind.
There is only one grocery store that carries a fairly complete line of items.
There are 4 gasoline stations, but 3 are owned by the same person.
Other than DSL, there is no high-speed Internet.
There is only one doctor, one clinic, and one drug store.
The only soft goods, such as clothing, cosmetics, and grooming essentials, are carried by a cut-rate supplier, and the selection is very limited.
There are fewer than 10 food establishments, including the local drive-ins, and they all serve the same type of Texas fried or Tex-Mex food.
There is no cable TV. A dish is the only alternative, unless you don’t mind a very limited selection of channels – about 6, provided reception is good.
There is no dry cleaner.
The atmosphere is decidedly rural, with a good dose of subsistence living, a healthy gun culture, a fair share of intolerance, and everybody knows everybody.
Boredom can be a problem.
Unless you want to drive 14 miles in one direction, or 25 in the other, you are stuck with what you can find. Even at that the selection is still pretty limited.
On the other hand –
There is only one stoplight.
Dietary decisions become simpler, because there isn’t that much to choose from.
There is less traffic. You can get to where you want to be in about 5 minutes.
There are people who have retired from a stressful, urban job to a small town. I call them “jewels”. Find them.
The people look out for each other.
You learn to entertain yourself instead of looking elsewhere.
You learn to make do.
My sister asked if I would come help her out at the local newspaper on a part-time basis.
I said okay. Although I grew up in a small-town newspaper publishing family, it has been a long, long time since I did that.
Although the newspaper was not printed at the local site, articles had to be written, notices and pictures taken, and advertising sold.
Computers and digital cameras have made the whole process a lot easier, but articles, notices, etc., still have to be typeset, whether by a scanner equipped with optical character recognition or by hand. Advertising still has to be sold over the phone or in person.
I discovered that I had a knack for taking photographs, or so I was told. I also found I can still write a pretty fair article.
I never cared much for taking photographs, but I think I may have stumbled upon a new hobby.
As to living in a small town: it is different. But the slow pace can be wonderful after the go-go survival mode of the urban landscape.
I have learned to take things a little slower and think more simply.
Break down a task into a few simple steps.
Keep an eye on what you are trying to accomplish.
Factor in the parameters that will help to achieve that effect.
Trust yourself enough to make the best decision about what you are trying to achieve.
Search for the “jewels” who will enable that decision. A conversation with someone about anything just might shake loose the information you need.
A recent Bioinform (www.bioinform.com) poll asked, “What are the biggest informatics challenges for next-generation sequencing data?” The poll results are as follows: 57% Functional Interpretation; 24% Data Management; 9% Assembly and Alignment; 4% Variant Calling; and 4% Storage.
As a former Data Engineer entrusted throughout my career with obscene amounts of various kinds of data, I am appalled that data management and storage ranked so low. Where’s your Functional Interpretation without the data?
I’ve worked with all sorts of data. Data, that in some instances, was obtained under adverse conditions and could not be duplicated had to be protected, more or less, by my very skin (or so I was threatened).
Next-gen sequencing is producing files of short-read data that amplify the errors inherent in first-gen sequence data. These next-gen files are being produced at a phenomenal rate, sometimes surpassing the petabyte count.
Data managers can be thankful that data storage has developed to provide a lot of bang for the buck, as 4TB drives are just about standard and the 2GB file limit has been eliminated.
Having worked in the field with the first Compaq laptops and the later Zenith with a 40MB hard drive, I find this very heartening news.
I’ve put together a list of data pointers that anyone attempting to work with data of any kind needs to read.
The most fundamental question is: what are you trying to measure or analyze?
Close on the heels of this one is: how will you acquire the data? Is there a system in place that can produce the necessary data stream? If not, is there a system that can be modified to produce the data you need? If not, what will it take, in hardware and software, to produce what you want?
This data acquisition phase can be extremely costly if you don’t have an overall idea of the complete system – acquisition, storage, and analysis.
Next question – How much data are we talking about? Is it limited to a file, a system, or a cluster of devices?
Where are we going to store this data? Do we have the storage equipment at hand? If we do have the equipment, can we add on what we need without reinventing the wheel?
The reader will probably instantly think of the “cloud.” However, as of late (i.e., Amazon’s EC2 cloud outage), tech blogs are stating that a cloud hack is just a thought away (http://tech.blorge.com/Structure:%20/2011/04/28/data-security-in-the-cloud-sucks-as-witness-sony-psn-hack/).
Another question: will the data be stored in its raw format, or will it need massaging? Raw data vs. manipulated or converted data (i.e., binary converted to engineering units or text) can easily quadruple your storage needs and costs.
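The arithmetic is worth sketching. A back-of-the-envelope estimate, where the sample count and word sizes are hypothetical, not from any particular project:

```python
# Hypothetical: one trillion samples of 16-bit raw ADC counts,
# converted to 64-bit IEEE-754 doubles in engineering units.
samples = 1_000_000_000_000

raw_bytes = samples * 2        # 16-bit raw counts
converted_bytes = samples * 8  # 64-bit doubles after conversion

# The converted copy alone is 4x the raw footprint...
ratio = converted_bytes // raw_bytes        # ratio == 4
# ...and keeping both raw and converted copies is 5x the raw size.
total_bytes = raw_bytes + converted_bytes   # 10 TB total here
```

Swap in your own sample counts and word sizes; the multiplier is what matters, not the absolute numbers.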
Will data from various hardware sources need integration into the data stream? How will this integration occur? Will additional software be necessary? Is a data model required? In some instances, more than one data model may be necessary. Is a database reflecting these models needed? Who will develop the data model and administer the database?
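To make the data-model question concrete, here is a minimal sketch of what one record in an integrated stream might look like. The field names and units are purely hypothetical, not from any real system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    """One converted sample from one hardware source in the integrated stream."""
    source_id: str    # which instrument produced the sample
    timestamp: float  # acquisition time, seconds since epoch
    channel: str      # measurement channel, e.g. "temperature"
    raw_count: int    # original ADC count, retained for re-verification
    value: float      # converted value in engineering units
    units: str        # e.g. "degC", "microstrain"

# Keeping the raw count next to the converted value means a conversion
# can be re-checked at any time without going back to the archive.
r = Reading("rig-07", 1300000000.0, "temperature", 2048, 25.3, "degC")
```

Even a toy model like this forces the integration questions into the open: every source must supply these fields, or you must decide what to do when one can’t.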
And, while we’re talking about it: how easy or difficult would it be to take archived data and have it available for processing – a few minutes, a day, a week?
If it’s stored in binary or another basic (raw) form, how long will it take to pull that data from the archive, convert it, and have it available for analysis?
How are you going to certify that the raw data is correct and that the conversion utility created a true conversion of that raw data?
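One way to approach both certifications (a sketch of my own, not a prescribed method) is a streaming checksum for the raw bytes plus a round-trip spot check for the conversion. The calibration functions passed in are hypothetical placeholders for your real ones:

```python
import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so even huge raw files can be
    fingerprinted without loading them into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

def round_trip_ok(raw_count, to_units, to_counts):
    """Spot-check a conversion by inverting it back to the raw count."""
    return to_counts(to_units(raw_count)) == raw_count
```

Record the fingerprint at acquisition time and re-check it after every copy or archive pull; if the hex digests match, the bytes are the same.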
Just the term “archived data” has its own implications. What do you mean by “archived” vs. “active” data? What raises the flag that says this active data can now be archived? Are there several phases in archiving that data? How long will it take?
Some of the tests I’ve been involved with required acquiring live data in the field and performing spot analysis of the data as it was acquired. This live data was subsequently saved to digital tape or hard drive for further detailed analysis.
A three and a half week field test turned into 3 to 4 months of analysis at home base. The archived data had to perfectly mirror the live data and data analysis obtained in the field.
Could you do this with your data? Rerunning a field test is an expensive proposition – many thousands of dollars could be involved.
Speaking of analysis: who will be analyzing the data? What hardware and software do they have or need? Will further software development be in the picture, along with hardware upgrades?
Are different platforms involved? Is the data representation on each platform consistent?
Little-endian vs. big-endian was a major problem at one time, followed by 32- and 64-bit system representations. Ask the end users questions and don’t be blindsided by system differences.
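A small illustration of the endianness trap, using Python’s struct module:

```python
import struct

value = 0x12345678  # the same 32-bit integer on every platform

little = struct.pack("<I", value)  # b'\x78\x56\x34\x12'
big = struct.pack(">I", value)     # b'\x12\x34\x56\x78'

# Reading big-endian bytes as if they were little-endian produces
# no error at all -- just a silently wrong number.
wrong = struct.unpack("<I", big)[0]  # 0x78563412, not 0x12345678
```

The danger is exactly that silence: nothing crashes, the numbers are simply wrong, which is why byte order has to be pinned down in the data format itself.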
Another analysis question concerns subsets of data. Can you subset your data store? (I hope you’ve developed data models to support this effort.)
A final question concerns manpower and experience. Do you have staff with the experience to support the endeavor? Saying you know SQL because you read a text defining SQL isn’t going to cut it.
I can’t stress enough how important the proper, experienced staff can be. The hardest staff position to fill is that of project manager. A really good project manager should come equipped with a CV replete with a list of incremental project management experience. You will probably have to pay through the nose for a good one, but it will be worth it in the end.
If I had to choose between a person with a biology background and little to no programming and one with a background in computer science, mathematics, or engineering, I’d choose the latter. They can pick up the biology. Of course, this depends on the person under consideration.
The first question I ask myself is: could this person help me get a plane off the ground? Can they handle stress? Do they think on their feet? How organized are they? How do they do in ill-defined environments? Do they fit in? Will their personality get in the way?
In any case, look beyond that paper resume and the list of provided references. You don’t want someone whose only experience consists of “Perl Scripts Done in a Panic”.
There is a lot to consider in the development of a system that turns on a piece of data. Ask questions. No matter how naive they may sound, I guarantee you will save time, and time means money.
For a little humor regarding software development check out – http://davidlongstreet.wordpress.com/category/software-development/humor/.
You may need it.
The Volume 29, Number 1, January 2011 issue of Nature Biotechnology (www.nature.com/naturebiotechnology) finally puts in print what I’ve been recommending all along. The feature article on computational biology, “Trends in computational biology – 2010,” on page 45 states, “Interviews with leading scientists highlight several notable breakthroughs in computational biology from the past year and suggest areas where computation may drive biological discovery.”
The researchers were asked to nominate papers of particular interest published in the previous year that have influenced the direction of their research.
The article is good, but what was really interesting was Box 2 – Cross-functional individuals on page 49. To quote, “Our analysis…suggests that researchers of a particular type are driving much of cutting-edge computational biology. Read on to find out what characterizes them.”
I’m going to reprint Box 2, Cross-functional individuals, in its entirety, since it’s short and the message is so very important.
Box 2 Cross-functional individuals
In the course of compiling this survey, several investigators remarked that it tends to be easier for computer scientists to learn biology than for biologists to learn computer science. Even so, it is hard to believe that learning the central dogma and the Krebs cycle will enable your typical programmer-turned-computational biologist to stumble upon a project that yields important novel biological insights. So what characterizes successful computational biologists?
George Church, whose laboratory at Harvard Medical School (Cambridge, MA, USA) has a history of producing bleeding-edge research in many cross-disciplinary domains, including computational biology, says, “Individuals in my lab tend to be curious and somewhat dissatisfied with the way things are. They are comfortable in two domains simultaneously. This has allowed us to go after problems in the space between traditional research projects.”
A former Church lab member, Greg Porreca, articulates this idea further, “I’ve found that many advances in computational biology start with simple solutions written by cross-functional individuals to accomplish simple tasks. Bigger problems are hard to address with those rudimentary algorithms, so folks with classical training in computer science step in and devise highly optimized solutions that are faster and more flexible.”
An overarching theme that also emerges from this survey suggests that tools for computational analysis permeate biological research in three stages: first, a cross-functional individual sees a problem and devises a solution good enough to demonstrate the feasibility of a type of analysis; second, robust tools are created, often utilizing the specialized knowledge of formally trained computer scientists; and third, the tools reach biologists focused on understanding specific phenomena, who incorporate the tools into everyday use. These stages echo existing broader literature on disruptive innovations1 and technology-adoption life cycles2,3, which may suggest how breakthroughs in computational biology can be nurtured.
1. Christensen, C.M. & Bower, J.L. Disruptive technologies: catching the wave. Harvard Business Review (1995).
2. Moore, G.A. Crossing the Chasm: Marketing and Selling High-Tech Products to Mainstream Customers (Harvard Business, 1999).
3. Rogers, E.M. Diffusion of Innovations (Free Press, 2003).
Biologists must become aware of what the disciplines of computer science and engineering can offer computational biology. Until this happens, forward progress in computational biological innovation and discovery will be unnecessarily hampered by a number of superfluous factors, not the least of which is complacency.
There once was a web portal called the Search Launcher to which I dedicated 4 years of my life.
It sort of fell into my lap. The only orientation I got was the basic directory structure. The supporting csh scripts, programs, and databases (and the programs that created those databases) I had to discover on my own.
It took me about a year to get it all under control. I think I actually made things better. I was used to dealing with tons of data, being on call at all hours, and minor things like distributed and parallel processing. I had the welcoming server off-loading compute-intensive processes to another, faster server, as NFS was proving too flaky. I had huge databases distributed across many disks according to their size. I had website client programs on the Mac and PC that a researcher could run from a local machine.
I would work late, and late on weekends, during the off-hours when the load was lighter. If I had something really crucial to do, I telecommuted because I didn’t want any distractions.
Things ran smoothly, but there was the occasional hiccup. Mostly I monitored things and planned for improvements.
People would come by my cubicle and remark on how it looked as if I had nothing to do. If only you knew, I thought to myself.
One day, out of the blue, I was asked to interview someone. I was told he would be joining our group.
Okay, so I interviewed him. His main exposure to computing was Windows PC based. He had little to no Unix/Linux experience, much less experience with programming, websites, huge amounts of data, or anything distributed. He did, however, have “wet lab” experience.
After he was hired, he told me that he was going to “rework things from top to bottom and make it easy for me.”
As he tried to “rework things”, he decided that everything had to be on one machine on one disk. He said this was to make things easier, but I really knew he didn’t understand the current configuration.
I decided it was time to bow out. I didn’t want to be held responsible when his “improvements” came crashing down.
So I left. Before I did, however, he had me move everything to the one disk and then asked me to create a tarball. Well, tar wouldn’t work, because the directory structure had become so convoluted and the directory names so long that tar couldn’t cope. Not to mention the size of what had to be tar’d!
This was a warning of future activities, as the department decided to invest in a Linux farm at his suggestion. Needless to say, the system was improperly configured and crashed about two to three times a week.
Remember those long directory names that tar couldn’t comprehend? Well, the Linux farm was configured to continue this wonderful tradition, with the result that everybody thought tar was archiving the data when it really wasn’t. Nobody was reading the error files generated by the tar process, so nobody paid attention until sometime later. Then they hired a sysadmin whose sole duty was to watch the tar archive and make sure it took.
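The general lesson is to never assume an archive “took” – verify it against the source. A minimal sketch of that idea in Python (nothing here is specific to that department’s setup; it is just one way to close the loop):

```python
import os
import tarfile

def archive_and_verify(src_dir, tar_path):
    """Create a gzipped tar archive of src_dir, then re-open it and
    confirm every file actually made it in. Returns the file count."""
    expected = set()
    with tarfile.open(tar_path, "w:gz") as tf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                arcname = os.path.relpath(path, src_dir).replace(os.sep, "/")
                tf.add(path, arcname=arcname)
                expected.add(arcname)
    # Re-open the archive and compare against what we meant to store.
    with tarfile.open(tar_path, "r:gz") as tf:
        actual = {m.name for m in tf.getmembers() if m.isfile()}
    missing = expected - actual
    if missing:
        raise RuntimeError(f"archive is missing {len(missing)} file(s)")
    return len(actual)
```

Had anything like this run after each tar pass, the silent failures would have surfaced the same day instead of months later.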
I was off to other venues, one of which was Enron. Yes, I was there when it all came down, but the 200 people in my department were not affected. It was a circus getting to work for a while. Reporters were hiding under the stairways in the parking garage hoping for news tidbits.
I heard that my old bioinformatics department had hired an “efficiency expert” to advise them on the problems they were encountering and on how to fix them.
They got tired of hearing how everything was broken, so they let the expert go. Business continued as usual.
Next they hired a really good sysadmin to take care of things, but he spent most of his time keeping the farm going by scrounging parts from the backup system. He left after a while.
The moral of the story is this: know whom you are hiring. Don’t give them power to implement something they don’t know about or understand. That will upset the people who know what they are doing, and that valuable experience will move on.
A lot of good people with a much better plan than a misconfigured Linux farm went on to better pastures, putting the department in the position of trying to recover what was lost.
With all the data now forthcoming, and the latest news that most sequences are wrongly annotated, it’s time for experts.
As an aside, a fascinating talk on deep transcriptome analysis, given on Thursday, Jan. 6th by Chris Mason, Assistant Professor at the Institute for Computational Biomedicine at Cornell University, listed the following observations that next-gen sequencing is bringing to light.
Some of the most interesting points from Mason’s talk were:
- A large fraction of the existing genome annotation is wrong.
- We have far more than 30,000 genes, perhaps as many as 88,000.
- About ten thousand genes use over 6 different sites for polyadenylation.
- 98% of all genes are alternatively spliced.
- Several thousand genes are transcribed from the “anti-sense” strand.
- Lots of genes don’t code for proteins. In fact, most genes don’t code for proteins.
Mason also described the discovery of 26,187 new genes that were present in at least two different tissue types.
For more, see – http://scienceblogs.com/digitalbio/2011/01/next_gene_sequencing_results_a.php.
Genetics and biology are concrete sciences. Computer science and engineering entail a lot of abstract thinking, which is desperately needed to build the underlying structure that supports analysis of the masses of sequence data currently amassing.
Get the right people for the job and you won’t find trouble.
First, we had the guy from Harvard try to explain that women aren’t interested in science because there is an intrinsic aptitude for things scientific based on gender. Guess which gender is deemed as more scientific?
Now, we have a new observation brought to us by Wray Herbert (http://www.huffingtonpost.com/wray-herbert/women-science_b_652858.html).
According to Miami University psychological scientist Amanda Diekman, there is a new explanation citing a difference in values rather than ability. It seems, according to the new theory, that women reject science, engineering, and math because they view these fields as too ego and power driven for their tastes.
The unambiguous results of the study found that young women did see science and engineering careers as isolated and individualistic, and, what’s more, as obstacles to finding meaning in their lives.
The article goes on to state that it seems to be a perception thing. I would agree that it could very well be a perception thing, but I think there is a little more to it than that.
A Little Background
My higher education endeavors began with a trip down the road that would merit approval from the study group quoted above. I got an undergraduate degree in Social and Behavioral Sciences and was just a few hours away from a graduate degree when I discovered I was bored to death. Something was missing. There was no challenge.
I tried the MBA path. Nothing doing.
I had taken an intro to computers course as part of my undergraduate course work and a ton of statistics courses, but neither appealed. It wasn’t until I ran into my first “micro-computer” (as they were then known) that I realized this little machine was really going to change things. I even got a Heathkit catalog, ordered the H-89 kit, and put it together.
The closest thing to a computer science degree my university offered was a degree in mathematical sciences. I signed up for that.
Believe me, it wasn’t easy. I had already gotten the required courses out of the way, so for three semesters every class I had was either math or computer science. But it was interesting and definitely challenging.
The isolated and individualistic scientist, engineer, computer scientist as cited by the study does not exist in the real world.
My first post graduation gig was at the Health Services Division of a major aerospace company as a compiler developer. I was part of the Systems Enhancements and Extensions Group. From there, I transferred to the aircraft company in that same corporation. I was part of the Flight Test Research and Development Group. I went to another aircraft company and the Instrumentation Group. And so on. You were always a member of a group. A group that together designed, developed, and produced things – computer software, digital data acquisition systems, aircraft manufacturing scheduling systems, etc.
When I moved over to biotechnology, it was the same – you were a member of a group. A lab group, a bioinformatics group developing LIMS systems, sequence analysis and imaging recognition software, and so on.
However, I did find that scientists, more than engineers, were power/ego driven. I think this is because of funding issues. Although both areas receive the majority of their funds from the government, the basis of the awards is different.
The individual scientist, as P.I., applies for the grant, writes the proposal, and receives the funding – almost a personal assessment of that scientist’s capabilities. Furthermore, I feel that the letters “PhD” carry a lot of baggage.
For most engineers, the company applies for the grant, writes the proposal (after the engineers have okayed the design), and receives the funding. The engineer is associated with the program for which that proposal was submitted. The engineer isn’t as personally involved.
What I’ve Encountered
In the military industrial complex I encountered bored ex-military who used weekly status reports to declare war on some other part of the division. These attacks were mostly diversions and never amounted to much. They could be construed as power plays, but I list them as “play”, period.
Believe me, there were some good ones – stopping just short of an exchange of blows. It’s also amazing how far echoes carry in an aircraft hangar.
The following examples are situations I encountered along the way. They are mostly examples of misdirected intentions, but a few border on outright criminality.
There were approximately 8 databases that all held the same information, but for 8 different divisions. The electronic parts – transducers, potentiometers, strain gauges, resistors, etc. – in each of the databases were exactly the same. However, the nomenclature varied by division. We tried to standardize on one database system with one naming standard, but ran straight into a brick wall. Not one division was willing to cede to another. It was only after word came down from on high that additional funding would not be forthcoming that everybody finally sat down to talk.
Insane Budgeting Exercises
One division needed to get a new system but was offered an old, barely breathing system with exorbitant maintenance costs. The division was instructed to budget for and use the old system for the current fiscal year. For the next budget cycle, the department was to state that a new system (the one originally requested) would save X amount of dollars over last year’s budget. The new system would then be given the green light.
A director was undercutting his yearly budget to emphasize cost savings. Consequently, his budget was always cut to that amount for the next year. It was pointed out that he should overrun this year’s budget by the amount he wanted for next year. Then he would (and did) get the additional funding.
A Simple Name Change can Work Wonders
It was ascertained that, for less than the amount the department was paying IT for storage of design data, a new system, software, and personnel could be purchased and hired. The department was notified that requesting a “computer system” would not meet with budgeting approval. Only after the system was termed a “data multiplexer” to be administered by “data design personnel” was the department able to proceed with the system purchase.
One Size Does Not Fit All
IT sent down a list of “acceptable” software. The so-called acceptable software was specifically IT oriented and would not work in an engineering environment. Division engineers took up a collection and purchased the needed software themselves.
Vast amounts of money, time, and manpower were spent developing a manufacturing scheduling system for aircraft manufacture. The system rated manufacturing personnel in terms of ability. The system was deemed a major success – avoiding bottlenecks, shortening completion times, etc. It was never deployed, due to union demands that manufacturing personnel could not be rated in terms of ability.
A decode system for data acquisition decode and analysis ($150K) was purchased without an installed hard drive for data storage ($15K). It was determined the system could use the in-house data farm to store data. However, the decode system required confirmation that contiguous data storage space was available before it would store data.
The transfer mechanism did not provide this info, so the decode system would not store data on the data farm. The contractor told department officials that the system software on the decode system and the in-house data farm were incompatible. The contractor then sold the department customized software for $750K to replace the decode system.
A Meaningful Life
I’ve never considered my career in engineering and biotechnology as isolated and individualistic. Sure, you have individual work, but it is as part of a team.
As far as letting the ego and power driven become obstacles, I have to admit that my behavioral sciences background provided one of the most important career tools I have yet to encounter. My “Advanced Abnormal Psychology” course taught me how to observe and analyze people.
To find meaning in one’s life entails one heck of a lot more than a career. Perhaps observing and analyzing one’s misconceptions in one area will enhance our conceptions of life in general.
Cloud computing is the current IT rage, said to cure all information management ills.
Cloud computing is just a new name for timesharing, a system in which various entities shared a centralized computing facility. A giant piece or two of big iron and floors of tape decks provided information processing and storage capabilities – for a price.
The user was connected to the mainframe by a dumb terminal and later on by PCs. The advantage (said the sales jargon) was that the user didn’t need to buy any additional hardware, or worry about software upgrades or data backup and recovery. They would pay only for the time and space their processes required. Resources would be pooled, connected by a high speed network, and accessed on demand. The user wouldn’t really know what computing resources were in use; they just got results. Everything depended on the network communications between the user and the centralized computing source.
What is New
Cloud computing is more powerful today because the communications network is the Internet. Some Cloud platforms also offer Web access to the tools – programming language, database, web utilities – needed to create the cloud application.
The most important aspect I believe the Cloud offers is instant elasticity. A process can be scaled up almost instantaneously to use more nodes and obtain more computing power.
There are quite a few blog entries out there concerning the “elastic” cloud. For thoughts on “spin up” and “spin down” elasticity see http://timothyfitz.wordpress.com/2009/02/14/cloud-elasticity/. For thoughts on “how elasticity could make you go broke, or On-demand IT overspending” see http://blogs.gartner.com/daryl_plummer/2009/03/11/cloud-elasticity-could-make-you-go-broke/.
And finally, for the article that spawned the “elasticity is a myth” meme (or “over-subscription and over-capacity are two different things”), see – http://www.rationalsurvivability.com/blog/?p=1672&cpage=1#comment-35881.
A good article that covers elasticity, hypervisors, and cloud security in general is located at http://queue.acm.org/detail.cfm?id=1794516. The queue.acm.org site is maintained by the Association for Computing Machinery. There are lots of articles on all sorts of computing topics including, “Why Cloud Computing Will Never Be Free” (http://queue.acm.org/detail.cfm?id=1772130).
The most notable Clouds are Amazon’s Elastic Compute Cloud (EC2), Google’s App Engine, and Microsoft’s Azure.
The three Cloud delivery models include:
Software as a service (SaaS), applications running on a cloud are accessed via a web browser
Platform as a service (PaaS), a cloud-hosted development platform on which users build and run their own applications
Infrastructure as a service (IaaS), provides computing resources to users on an as-needed basis
Pros and Cons
There are pros and cons for Cloud Computing. Microsoft’s Steve Ballmer is a proponent of Cloud computing.
In a recent email (http://blog.seattlepi.com/microsoft/archives/196793.asp) to Microsoft’s employees, Ballmer makes the case for Cloud Computing. He advises employees to watch a video (http://www.microsoft.com/presspass/presskits/cloud/videogallery.aspx) in which he makes the following points:
In my speech, I outlined the five dimensions that define the way people use and realize value in the cloud:
The cloud creates opportunities and responsibilities
The cloud learns and helps you learn, decide and take action
The cloud enhances your social and professional interactions
The cloud wants smarter devices
The cloud drives server advances that drive the cloud
Some very notable people are anti-cloud.
Richard Stallman, GNU software founder, said in recent interview for the London Guardian (http://www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman) that Cloud computing is a trap.
The Web-based programs like Google’s Gmail will force people to buy into locked, proprietary systems that will cost more and more over time, according to the free software campaigner.
‘It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign,’ he told The Guardian. ‘Somebody is saying this is inevitable — and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.’
Aside from all that, what should a potential user be wary of in the Cloud? I’ll try to answer that below.
Security in the Cloud
Security in the cloud is a major concern. Hackers are salivating because everything – applications and data – is in the same place.
How do you know whether the node your process is accessing is real or virtual? The Hypervisor (in Linux, a special version of the kernel) owns the hardware and spawns virtual nodes. If the Hypervisor is hacked, the hacker owns all the nodes created by it. http://www.linux-kvm.org has further explanations and discussions of virtual node creators/creations.
Data separation is a big concern. Could your data become contaminated by data in other environments in the cloud? What access restrictions are in place to protect sensitive data?
Can a user in another cloud environment inadvertently or intentionally get access to your data?
Data interoperability is another question mark. A company cannot take data from a public cloud provider, such as Amazon, Microsoft, or Google, move it into a private IaaS that a cloud provider develops for the company, and then copy that data from its private cloud to another cloud provider, public or private. This is difficult because there are no standards for operating in such a hybrid environment.
Who is the custodian and who controls data if your company uses cloud providers, public and private?
Ownership concerns have not been resolved by the cloud computing industry. At the same time, the industry has no idea when a standard will emerge to handle information exchanges.
W3C – http://www.w3.org/, is sponsoring workshops and publishing proposals concerning standards for the Cloud. You can subscribe to their weekly newsletter and stay up on all sorts of web-based technologies.
Also, the Distributed Management Task Force, Inc. (http://www.dmtf.org/home) is a consortium of IT companies focused on “Developing management standards & promoting interoperability for enterprise & Internet environments”.
The DMTF Open Cloud Standards Incubator was launched to address management interoperability for Cloud Systems (http://www.dmtf.org/about/cloud-incubator). The DMTF leadership board currently includes AMD, CA Technologies, Cisco, Citrix Systems, EMC, Fujitsu, HP, Hitachi, IBM, Intel, Microsoft, Novell, Rackspace, Red Hat, Savvis, SunGard, Sun Microsystems, and VMware.
Working with the Cloud
Working with the Cloud can be intimidating. One suggestion is to build a private cloud in-house before moving on to the public cloud.
However, even that has its difficulties. Not to worry, there are several tools available to ease the transition.
There is a Cloud programming language – Bloom, developed at UC Berkeley by Dr. Joseph Hellerstein. HPC In The Cloud has published an interview with Dr. Hellerstein at http://www.hpcinthecloud.com/features/Clouds-New-Language-Set-to-Bloom-92130384.html?viewAll=y.
Bloom is based on Hadoop (http://hadoop.apache.org), Apache’s open source software for High Performance Computing (HPC).
For ease of interconnectivity, Apache has released Apache libcloud, a standard client library written in Python for many popular cloud providers – http://incubator.apache.org/libcloud/index.html. But libcloud only covers connectivity, not data standards.
MIT StarCluster – http://web.mit.edu/stardev/cluster – is an open source utility for creating and managing general purpose computing clusters hosted on Amazon’s Elastic Compute Cloud (EC2). StarCluster minimizes the administrative overhead associated with obtaining, configuring, and managing a traditional computing cluster of the sort used in research labs or for general distributed computing applications.
All that’s needed to get started with your own personal computing cluster on EC2 is an Amazon AWS account and StarCluster.
HPC In The Cloud presents use cases as a means to understanding cloud computing: http://www.hpcinthecloud.com/features/25-Sources-for-In-Depth-HPC-Cloud-Use-Cases-93886489.html.
BMC Bioinformatics has a new methodology article – Cloud Computing for Comparative Genomics – that includes a cost analysis of using the cloud. Download the .pdf at http://www.biomedcentral.com/1471-2105/11/259/abstract.
I hope this will get you started. Once again, a big thanks to Bill for his assistance.
First, a little irony. In the late ’90s I interviewed with BMC Software in Houston. At that time, BMC was a supporter of big iron, providing report facilities, etc.
When asked what software I currently used, I replied with “GNU software”. The interviewer asked, “What is GNU? I’ve never heard of it.”
I explained that it was free software that you could download from the web, etc. But they weren’t really interested.
Anyway, eWEEK.com had a feature this week – ‘MindTouch Names 20 Most Powerful Open-Source Voices of 2010’. The first name mentioned was William Hurley, chief architect of open source strategy at BMC. (http://www.eweek.com/c/a/IT-Management/OSBC-Names-20-Most-Powerful-Open-Source-Voices-of-2010-758420/?kc=EWKNLEDP03232010A)
I guess they’re interested now.
There are any number of sequence data formats. This link at EBI – http://www.ebi.ac.uk/2can/tutorials/formats.html describes several.
What is really astounding is that most of these formats have remained the same over the years. The tab-delimited and CSV (comma-separated values) formats are as prolific as ever, as is the GenBank report.
And equally astonishing is the fact that manipulating the data (e.g. parsing GenBank reports) is still the same.
True, the Bio libraries such as BioPerl, BioJava, and BioRuby now provide modules that make this easier (if you can install them), but it is still the same old download and parse.
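That same-old download-and-parse step needs nothing beyond the standard library. A minimal sketch of parsing a tab-delimited annotation file (the column layout and sequence lengths here are invented for illustration):

```python
import csv
import io

# A hypothetical tab-delimited annotation file: accession, gene, start, end.
raw = "NM_000518\tHBB\t1\t628\nNM_000517\tHBA2\t1\t576\n"

def parse_annotations(handle):
    """Parse tab-delimited rows into dicts -- the same old parse step."""
    fields = ["accession", "gene", "start", "end"]
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        record = dict(zip(fields, row))
        # Coordinates arrive as text; convert them once, here.
        record["start"], record["end"] = int(record["start"]), int(record["end"])
        yield record

records = list(parse_annotations(io.StringIO(raw)))
```

Every lab ends up with a drawer full of little parsers like this, which is exactly why the Bio libraries bundle them as modules.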
There are also several groups trying to standardize sequence data. The SO (Sequence Ontology) group (http://www.sequenceontology.org) is trying to do for sequence annotations what GO (Gene Ontology - http://www.geneontology.org) did for genes and gene product attributes.
MIGS (Minimum Information About A Genome Sequence spec at http://nora.nerc.ac.uk/5548/) is following the course of the MAGE MIAME Standard (Minimum Information About a Microarray Experiment at http://www.mged.org/Workgroups/MIAME/miame.html). Good luck with that, as many scientists have openly voiced objections to that standard.
XML and the Web
XML (eXtensible Markup Language) and WSDL (Web Services Description Language) are one method of easing the interchange of data. Links at – http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/Web_Services_Description_Language.
There are a number of drawbacks to this setup.
Not all of the sequence data is available in XML or well-formed XML.
Some XML, such as NCBI XML, needs further interpretation. For example, the sequence feature (annotation) locations must be “translated” for further use.
XSLT has performance issues and is size-limited. We tried processing LARTS-converted NCBI ASN.1 GenBank XML data with XSLT and found there were definite size limitations.
Using WSDL means exposing yourself to the world via the web.
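The feature-location “translation” mentioned above is the fiddly part. A sketch of pulling a GenBank-style location string out of XML and turning it into usable coordinates (the XML layout here is invented for illustration, not the actual NCBI schema):

```python
import re
import xml.etree.ElementTree as ET

# A made-up sequence record; real NCBI XML is far more deeply nested.
doc = """<record>
  <accession>X00001</accession>
  <feature type="CDS" location="join(100..200,300..400)"/>
</record>"""

def parse_location(loc):
    """Translate a GenBank-style location string into (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", loc)]

root = ET.fromstring(doc)
spans = parse_location(root.find("feature").get("location"))
```

Even in well-formed XML, the location string is opaque text that still needs its own parser, which is the drawback being described.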
Software development takes time and the right people. True, there is a lot of open source software out there, but I’ve mentioned the perils of that method in a previous blog.
A scientist with a grant to produce results dependent on computer analysis is only going to write code that is just good enough to back up those findings (or find someone – read: a post-doc – who can write that code very cheaply).
Has the code been extensively tested? Are the results produced by the code valid? Can the code be used by future projects? Is the software portable? Is it robust? Can it be ported to different hardware environments?
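Even a handful of assertions goes a long way toward answering the first two of those questions. A sketch, using a made-up analysis function:

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

# Sanity checks: minutes to write, and they catch silent errors for years.
assert gc_content("GGCC") == 1.0
assert gc_content("atat") == 0.0
assert abs(gc_content("ATGC") - 0.5) < 1e-9
```

The point is not the function, it is the habit: code whose results feed a publication deserves at least the checks a physical instrument gets during calibration.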
There is a great article – “Are we taking supercomputing code seriously?” at (http://www.zdnet.co.uk/news/it-strategy/2010/01/28/are-we-taking-supercomputing-code-seriously-40004192/). This article, in turn, has links to other articles on methods and algorithms, and error behavior, for example. This one on scientific software considers how multi-processing has influenced algorithm development and the problem of different multi-processors co-existing on the same machine (http://www.scientific-computing.com/features/feature.php?feature_id=262).
The author states that in the rush to do science, scientists fail to see software for what it is: the analogue of the experimental instrument. Therefore the software must be treated with the same respect that a physical experiment would be.
When I started my career, I worked on a system that was a totally integrated database system for hospitals. It was one of those systems that was so very ahead of its time (mid-80’s), that a corporation bought the product and squashed it.
Anyway, our Systems and Extensions group supported the 6 compilers that comprised the system software that made the system function. The tailoring group wrote the code that created the screens that drove the system.
At the inception of the system, a decision was to be made over the make up of the tailoring group: should they be programmers that would be taught medical jargon, terms, etc; or should they be medical personnel – doctors, nurses, techs, that would be taught programming?
The decision was to go with medical personnel, as it was surmised they would understand hospitals better.
At the same time, a decision to limit the number of screens a hospital could request (called tailoring) to 500 was discussed. The decision was to let the hospital have however many screens it wanted.
The tailoring group got their training and set to programming. After a period of time, it was realized that the group had, in essence, created one bad program and copied it thousands of times.
It was so bad, we did two things. First, we created a program profiler that produced a performance summary of the programming aspects of a given program. (We were immediately asked to remove it by the tailoring group, as it was too confusing.) Second, we created an automated programming module that would create the code from the display widgets on the screen designed by the tailoring group.
This approach was helping, but people were abandoning ship as talk of an acquisition was surfacing. Our junior programmer went from new-hire to senior team member in 30 days.
I think we would have done a lot better with programmers learning medical terms.
As for the hospital screen limit, we had hospitals with 10,000 individual screens. We should have stuck with 500.
One last thing. When looking at any piece of scientific programming, please realize that the author credits usually start with the PI. The people who did the actual work are generally listed at the end of the line. The PI may have had the idea but, likely as not, could not code it.
All Things HPC
Traditionally, High Performance Computing (HPC) means using high-end hardware like supercomputers to perform complex computational tasks.
A new definition of HPC (“High Productivity Computing”) means the entire processing and data handling infrastructure. This includes software tools, platforms (computer hardware and operating systems), and data management software.
Parallel or Multicore Processing
I think just about everybody has performed some sort of parallel programming. Starting two processes at once on the same machine is parallelism. If a program runs by itself and doesn’t need input from another program or produce output for another program to use, it’s loosely coupled. It’s tightly coupled if one program feeds another.
PC architecture today supports multicore processors. A two-core CPU is, in essence, two CPUs on the same chip. These cores may share memory cache (tightly coupled) or not (loosely coupled). They may implement a method of message passing – intercore communication.
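The loosely coupled case is easy to sketch in Python with the standard library. Here each worker computes an independent chunk and only the results are combined at the end. (Threads share memory, and CPython’s GIL means a true multicore speedup needs ProcessPoolExecutor instead, but the decomposition pattern is identical.)

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """An independent chunk of work -- loosely coupled, no shared state."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    """Split 0..n-1 into chunks, run them concurrently, combine the results."""
    step = max(1, n // workers)
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

For example, `parallel_sum(1000)` gives the same answer as `sum(range(1000))`; the only design decision was how to carve the range into independent pieces.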
Cilk is a language for multi-threaded parallel processing based on ANSI C. MIT was the initial developer of the Cilk technology. The link to their page is at – http://supertech.csail.mit.edu/cilk/.
MIT licensed Cilk to Cilk Arts, Inc. Cilk Arts added support for C++, parallel loops, and interoperability with serial interfaces. The product has since been acquired by Intel and will be incorporated into the Intel C++ compiler. The Intel page is at http://software.intel.com/en-us/articles/intel-cilk/.
Cilk++ makes multicore processing easy. Cilk++ uses keywords to adapt existing C++ code to multicore processing. (You will need a multicore processor.)
Cilk++ is currently in a technical preview state. This means they want you to use it and give them feedback. Download the Intel Cilk++ SDK at http://software.intel.com/en-us/articles/download-intel-cilk-sdk/. You will need to sign a license agreement.
The page also presents download links for 32-bit and 64-bit Linux Cilk++. (You will need an Intel processor for the Linux apps.)
There is an e-book on Multi-Processor programming available from Intel. The link is - http://software.intel.com/en-us/articles/e-book-on-multicore-programming/.
The book contains a lot of information on multicore programming, parallelism, scheduling theory, shared memory hardware, concurrency platforms, race conditions, divide and conquer recurrences, and others.
Grid computing is distributed, large scale, cluster computing. Two of the most famous grid projects are SETI@home and Folding@home (http://folding.stanford.edu).
SETI@home (the Search for Extra-Terrestrial Intelligence) uses internet-connected computers hosted by the Space Sciences Laboratory at UC Berkeley. Folding@home focuses on how proteins (biology’s workhorses) fold or assemble themselves to carry out important functions.
Other lesser known grids are Einstein@Home (http://www.einsteinathome.org – “Grab a wave from Space”), processing data from gravitational wave detectors, and MilkyWay@home (http://milkyway.cs.rpi.edu/milkyway), creating a highly accurate 3-D model of the Milky Way Galaxy.
The clusters mentioned above use the internet to exchange messages. If fast messaging is not required, plain old ethernet should be sufficient for your messaging needs. The problem with ethernet is latency: it takes a long time to set up and get that first message out. After that, it’s solid.
But if you’re looking for consistent speed, try Infiniband (http://en.wikipedia.org/wiki/Infiniband), Myrinet (www.myri.com), or QsNet (http://en.wikipedia.org/wiki/QsNet).
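You can see the setup-versus-steady-state distinction for yourself by timing one round trip over a loopback socket. This is a toy measurement, not a benchmark, and loopback latency is far below real ethernet latency, but the shape of the experiment is the same:

```python
import socket
import threading
import time

def echo_server(server_sock):
    """Accept one connection and echo everything back."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
start = time.perf_counter()     # time one message round trip: latency, not bandwidth
client.sendall(b"ping")
reply = client.recv(1024)
rtt = time.perf_counter() - start
client.close()
```

Send a thousand small messages and the first round trip dominates far less; that amortization is why ethernet is “solid” after the setup cost is paid.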
Oh, those gamers. Without their demand for faster, bigger, better, where would we be?
For example, do not overlook the gaming console. NCSA (National Center for Supercomputing Applications) has a cluster of Sony PlayStations. The PlayStation 3 runs Yellow Dog Linux. The average PS3 retails for around $600. The Folding@home grid runs on PS3s and PCs.
Then we come to the GPU (Graphics Processing Unit). GPU computing means using the GPU to do general purpose scientific and engineering computing. The model for GPU computing couples a CPU with a GPU, with the GPU performing the heavy processing. (http://www.nvidia.com/object/GPU_Computing.html)
One of the hottest GPUs is the NVIDIA Tesla GPU, which is based on the CUDA GPU architecture code-named “Fermi”.
FPGAs (Field Programmable Gate Arrays)
Technological devices keep getting smaller and smaller, and the machinery gets buried under tons of software burdened with the menu systems connected to the development environment from hell.
FPGAs take you back to the schematic level. (I was known as a “bit-twiddler” at IBM.)
My old friends at National Instruments (http://www.ni.com/fpga/) have NI LabView FPGA. LabView FPGA provides graphical programming of FPGAs.
Their video on FPGA Technology is a good intro to FPGAs (http://www.ni.com/fpga_technology/). Several other videos available at the same site go into further detail. For more info on the FPGA hardware see http://en.wikipedia.org/wiki/FPGA.
(I still haven’t forgiven NI for nuking my data acquisition PC with their demo. I lost a lot of stuff. All was backed up, but re-installing was not fun.)
FYI -The industry is desperately seeking parallel and FPGA programmers.
Data Representation in Database Design
The most recent programming languages are object-oriented. However, the most efficient databases are relational. There are object-oriented database systems, but for the most part they are very expensive and very, very slow. Postgres is an RDBMS (Relational Database Management System) that does implement a form of inheritance, where one table may extend (inherit from) another table.
Then you have XML. XML Schemas are adding another dimension to this complexity. XML is popular for communication (SOAP) and representation (XSLT). Data comes from an RDBMS, gets stuffed into objects, and is translated to XML and sent on one end; on the other end it is translated back to objects and stored in an RDBMS.
The mapping of objects to an RDBMS is known as the object-relational (O/R) impedance mismatch. See this link for a discussion of software development processes (http://www.agiledata.org/) and a link to a recent book on database techniques for mapping objects to relational databases – http://www.agiledata.org/essays/mappingObjects.html.
But beware: most ORMs (Object-to-Relational Mappers) sometimes produce a schema that isn’t completely relational and therefore suffers in performance. Also, the SQL produced by ORMs may not be optimal.
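A hand-rolled version of that object-to-relational round trip shows where the mismatch lives. A minimal sketch using the standard library’s sqlite3 (the table, class, and part names are invented for illustration):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Part:
    """An in-memory object that must survive a trip through a relational table."""
    name: str
    resistance_ohms: float

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (name TEXT PRIMARY KEY, resistance_ohms REAL)")

def save(part):
    # Object -> row: the impedance mismatch is all in this flattening step.
    conn.execute("INSERT INTO part VALUES (?, ?)",
                 (part.name, part.resistance_ohms))

def load(name):
    # Row -> object: rebuild the object from a flat tuple.
    row = conn.execute(
        "SELECT name, resistance_ohms FROM part WHERE name = ?",
        (name,)).fetchone()
    return Part(*row) if row else None

save(Part("R-100", 4700.0))
```

With one flat class the mapping is trivial; add inheritance, collections, or object graphs and you get exactly the schema distortions and suboptimal SQL that ORMs are criticized for.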
To effectively design and develop an RDBMS, learn UML (Unified Modeling Language). The Objects By Design web site (http://www.objectsbydesign.com) covers UML and a lot of other object-oriented topics and is worth a look.
Rational Rose is the UML design tool that I use. Rational has since been purchased by IBM. Rational uses what is known as the Rational Unified Process.
Speaking of XML, some of the UML design tools can now output XML directly from the data record definitions.
See this link for a list of current UML products – http://www.objectsbydesign.com/tools/umltools_byCompany.html.
The End of SQL
The ComputerWorld blog site has an interesting 3-part series entitled – The End of SQL and relational databases?
Part 1 covers Relational Methodology and SQL. The link to part 1 is here – http://blogs.computerworld.com/15510/the_end_of_sql_and_relational_databases_part_1_of_3.
Part 2 is a list of current NoSQL databases. The link to part 2 is here – http://blogs.computerworld.com/15556/the_end_of_sql_and_relational_databases_part_2_of_3
Part 3 is a list of links to NoSQL sites, articles, and blog posts. The link to part 3 is here - http://blogs.computerworld.com/15641/the_end_of_sql_and_relational_databases_part_3_of_3
In short, the “NoSQL” (http://en.wikipedia.org/wiki/NoSQL) movement and cloud-based data stores are striving to completely remove developers from a reliance on SQL and relational databases.
In a post-relational world, they argue, a distributed, context-free key-value store is probably the way to go. This makes sense when there can be thousands of sequence searchers but only one updater. A transactional database would be overkill.
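That one-writer, many-readers pattern is easy to sketch as a toy in-process key-value store (a stand-in for a real distributed store, nothing more):

```python
import threading

class KVStore:
    """Toy key-value store: one writer at a time, any number of readers."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # serializes the (single) writer

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        # Reads take no lock: in CPython a dict lookup is atomic, and a
        # reader racing the writer just sees the old or the new value.
        return self._data.get(key, default)

store = KVStore()
store.put("NM_000518", "HBB sequence data")
```

There are no joins, no schema, and no transactions – which is precisely the trade the NoSQL camp is making for lookup-heavy workloads like sequence searching.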
Part 5 of Effective Bioinformatics Programming coming soon.