Last week I went along to a seminar given by Mike Vella at the Behavioural and Clinical Neuroscience Institute here in Cambridge. Mike spoke about the Open Worm project, which aims to simulate a whole C. elegans organism in silico. It’s also a great example of Open Science.

[Image: an adult C. elegans]

C. elegans is a nematode worm, about 1 mm long and transparent, which lives in the soil and feeds on decaying vegetable matter. It is probably the single best documented organism in biology. Websites such as WormBase and WormAtlas are brimful of information, from its genome to the position and function of every one of its roughly 1000 cells. Just over 300 of those cells are neurons, which is enough for C. elegans to exhibit some interesting behaviour: it can recognise and move towards food, it will detect and move away from chemical toxins, it can learn some very simple behaviours, and it even ‘sleeps’. C. elegans is the only species for which a full connectome has been published: all the neuronal connections – neuron to neuron, and neuron to other cell types such as muscle – have been mapped. The full cell lineage is also known, that is, how the roughly 1000 cells in an adult are related to and derived from the original egg through the development of the embryo. This abundance of data suggests that if we want to model a whole organism computationally, C. elegans is a great place to start.
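
To make the idea of a connectome concrete, here is a minimal sketch (in Python) of how such wiring data might be represented as a weighted, directed graph. The cell names follow C. elegans naming conventions, but the synapse counts are illustrative, not taken from the published wiring diagram.

```python
# Minimal sketch: part of a connectome as a weighted, directed graph.
# Cell names follow C. elegans conventions (e.g. AVAL is an interneuron,
# DA01 a motor neuron, MDL08 a body-wall muscle cell); the synapse
# counts here are illustrative, not real data.
connectome = {
    # (pre-synaptic cell, post-synaptic cell): number of synapses
    ("AVAL", "AVAR"): 3,   # interneuron to interneuron
    ("AVAL", "DA01"): 5,   # interneuron to motor neuron
    ("DA01", "MDL08"): 2,  # motor neuron to muscle cell
}

def targets(cell):
    """All cells that `cell` synapses onto, with synapse counts."""
    return {post: n for (pre, post), n in connectome.items() if pre == cell}

print(targets("AVAL"))  # {'AVAR': 3, 'DA01': 5}
```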

The Open Worm project aims to do just that. It is a loose, widely distributed collaboration involving computational neuroscientists, professional software engineers and biologists in the US, UK, Italy and Russia, and they’re keen to recruit new project members. It’s not currently funded by any research grant but is run by the participants in their “spare” time. All source code, data and publications are Openly available, and they make great use of online tools such as Google Plus, GitHub and Amazon Web Services.

So far they’ve built a 3D visualisation using WebGL – the Open Worm browser – which shows the various structures in the worm down to the cellular level. They also have a mechanical model which simulates the physical motion of the worm, and they have simulated the simultaneous firing of the 302 neurons constituting the “brain” of C. elegans. The next step is to model the connection between a neuron and a muscle cell.
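
Open Worm’s actual neuron models are richer than this, but a minimal leaky integrate-and-fire sketch conveys the basic principle of simulating neuron firing: step a membrane-potential equation forward in time and record a spike whenever a threshold is crossed. All parameter values below are generic textbook numbers, not taken from the project.

```python
# A minimal sketch of simulating neuron firing: one leaky
# integrate-and-fire neuron under constant input current.
dt, tau = 0.1, 10.0                 # time step and membrane time constant (ms)
v_rest, v_thresh, v_reset = -65.0, -50.0, -65.0   # potentials (mV)

v, spikes = v_rest, []
for step in range(1000):            # 100 ms of simulated time
    i_input = 20.0                  # constant injected current (arbitrary units)
    v += dt * (-(v - v_rest) + i_input) / tau     # leaky integration
    if v >= v_thresh:               # threshold crossed: record spike, reset
        spikes.append(step * dt)
        v = v_reset

print(f"{len(spikes)} spikes in 100 ms")
```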

Why bother constructing a worm in silico? After all, it’s pretty easy to work with real-life C. elegans in the lab. Well, one reason for constructing a computer model is that it helps organise what we know about the subject. There’s still an awful lot we don’t know about C. elegans, and attempting to simulate it helps identify those gaps. More ambitiously, it’s a tentative first step towards simulating more complex organisms, perhaps eventually the human brain.

People have tried and failed in the past to simulate C. elegans. The Open Worm project may or may not succeed, but I think that adopting an Open approach tips the balance in their favour, giving them an edge that previous efforts lacked.

 

Last week I attended a meeting on Semantic Physical Science organised by Peter Murray-Rust and colleagues. The stand-out talk for me was given by Cameron Neylon, on why good software engineering practice is so important in science and how the scientific publishing market is in dire need of change.

He started by listing a few reasons for scientists to take good software engineering seriously.

  • Firstly, one of our major purposes in academia is to produce a trained workforce, many of whom will move into industry, where good software engineering skills are both an asset and a requirement.
  • Secondly, it simply makes our lives easier. A bare minimum of effort invested in version control and good documentation, for example, is repaid many times over when we share our code with others or come back to it ourselves at a later date.
  • Thirdly, practices such as unit testing and continuous integration are thoroughly compatible with, and supportive of, the repeatability and consistency we expect of good practice in science. Good software engineering can also inform and translate into the experimental domain (see the sketch after this list).
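
As a concrete illustration of that third point, here is a minimal sketch of a unit test using Python’s standard unittest module. The function under test is a made-up stand-in for the kind of small routine a lab’s analysis scripts typically contain.

```python
# Minimal unit-test sketch with Python's standard unittest module.
# `normalise` is a hypothetical stand-in for a typical analysis helper.
import unittest

def normalise(values):
    """Scale a list of numbers so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values summing to zero")
    return [v / total for v in values]

class TestNormalise(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalise([1, 2, 3])), 1.0)

    def test_rejects_zero_total(self):
        with self.assertRaises(ValueError):
            normalise([0, 0])

if __name__ == "__main__":
    unittest.main()
```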

Unfortunately, science (perhaps especially in the long tail) is plagued by badly written code, bad habits and a consequent inability or unwillingness to share code and data alongside publications. A large part of the problem is that the current incentive system in academia, centred as it is on the journal publication and measures such as citation counts, does not reward the production of high-quality software that others might use.

Cameron’s answer is to “hack the system” – take the existing measures and play with them a little. He has recently established a new journal, Open Research Computation, under the aegis of the open access publisher BioMed Central. ORC will take submissions focussing on software developed for use in any area of science: “algorithms, useful code snippets, as well as large applications or web services, and libraries”. As Cameron pointed out, if ORC publishes 100 papers a year, of which perhaps 5-10 describe software with a substantial user base, it stands a good chance of gaining a respectable impact factor.
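
For readers unfamiliar with the metric, the standard two-year impact factor is just an average citation rate, which is why a handful of heavily cited software papers can lift a small journal considerably. A sketch of the formula, with purely illustrative numbers:

```latex
% Standard two-year impact factor for year Y:
\[
  \mathrm{IF}_Y =
  \frac{\text{citations received in year } Y \text{ to items published in years } Y{-}1, Y{-}2}
       {\text{number of citable items published in years } Y{-}1, Y{-}2}
\]
% Illustrative only: 200 papers over two years, 10 of them widely used
% software papers at 40 citations each, the remaining 190 averaging 2:
%   (10*40 + 190*2) / 200 = 780 / 200 = 3.9
```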

On the scientific publishing industry in general: the current problems with the proposed Research Works Act (not to mention SOPA and PIPA) in the US have been well documented elsewhere. Cameron’s view is that the market is now simply broken. Distributing paper copies of journals used to be the main service provided by publishers; now that everybody reads journals online, the cost of distribution is essentially zero, and all the costs lie in generating the first copy of an article. In fact, the main service publishers now provide is arranging the peer review of articles. This is a service worth paying for – so why not re-configure the market to recognise this?

Of course, this won’t happen overnight. The efforts of open access publishers such as the Public Library of Science and BioMed Central (both of which charge authors a publication fee rather than charging readers) are a great step in the right direction.

Another great place to start is to encourage funders, research institutions and publishers to recognise the importance of quality software engineering in science. Educational resources such as Software Carpentry deserve our support.

My own view is that the UK eScience programme recognised the importance of good software engineering early on. The most successful eScience projects, among them RealityGrid and myGrid, employed full-time software engineers, and the annual All Hands meeting regularly featured papers focussed on new software tools and methods. One of the recommendations of the RCUK review of eScience in 2009 was that “professional software engineers or informatics specialists who build reliable production-grade systems” in academia need better defined roles, reward structures and career progression. My hope is that efforts like Open Research Computation will help bring this to fruition.

 

Hello and welcome. The title of this blog, “Long tail science”, is a term coined by Cambridge colleagues Jim Downing and Peter Murray-Rust to describe the work done by the hundreds of thousands of small groups and individuals who make up the vast majority of working scientists. This is in contrast to “big science”, such as the Large Hadron Collider, characterised by large sums of money and a large, highly co-ordinated international community. I’ve spent the past 10 years working on various projects in grid computing, largely supported by the UK eScience programme; one of our major successes was building the distributed computing infrastructure for the LHC. Grid computing did a lot for big science but, I feel, very little for everybody else in the long tail. I’m hoping to help change that.

The once-fashionable field of grid computing has evolved to encompass some new buzzwords: cloud and utility computing. Commercial providers such as Amazon will sell you compute time by the hour, with options for high-performance and GPU-assisted number crunching. There are legions of entrepreneurs setting up their own businesses with nothing more than a laptop and an internet connection, outsourcing everything they can to Amazon and to other application providers in the cloud such as Salesforce and Liquid Accounts. Ian Foster, Godfather of the Grid, has posed the question: if it is now possible to run a small online business from a coffee shop, why can we not run a science lab from that same coffee shop? Why not, indeed. It’s all about removing barriers to entry, including cost, for the individual with a bright idea they wish to explore – or, as Foster puts it, “outsourcing the mundane”.
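
As a minimal sketch of what pay-by-the-hour compute looks like in practice, here is how one might request a GPU-equipped instance using Amazon’s boto3 Python library (which post-dates this post; the equivalent at the time was boto). The machine image ID, region and instance type below are placeholders, not recommendations.

```python
# Sketch: renting pay-by-the-hour compute from Amazon EC2 via boto3.
# The AMI ID is a placeholder; a real one must be supplied.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder machine image
    InstanceType="p2.xlarge",         # an example GPU-equipped instance type
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])  # ID of the launched instance
```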

Much science is laboratory-based, of course, requiring access to expensive specialist equipment. However, many such facilities can be run as a remotely accessible, pay-as-you-go service, just like Amazon’s. Locally, the Eastern Sequence and Informatics Hub and the Cambridge Advanced Imaging Centre are doing just that for DNA sequencing and microscopy respectively.

Other areas of interest I’ll write about include Open science, open access publishing and citizen science. Enjoy.
