Chapter 5. Biotechnology Must Head for the Cloud

Sajith Wickramasekara

It’s clear that the rate of hardware innovation in life science is staggering. Illumina recently made headlines by announcing it had dropped the cost of sequencing one human genome to a mere $1,000. While this figure excludes the much greater costs of data analysis and interpretation, it’s still a remarkable milestone considering that the first human genome cost $3 billion to sequence just over a decade ago. That’s a 3,000,000x improvement.

In contrast, the state of software in life science is unacceptable, and it is improving much more slowly. How many scientists do you know who use spreadsheets to organize DNA? Or who collaborate by emailing files around? Or who can’t actually search their colleagues’ sequence data? If synthetic biology is going to reimagine genetic engineering, it won’t be on the foundation of archaic software tools.

We need a cloud-based platform for scientific research, designed from the ground up for collaboration. Legacy desktop software has compounded systemic problems in science: poor scientific reproducibility, delayed access to new computational techniques, and rampant IT overhead. These issues are a thorn in the side of all scientists, and it’s our responsibility to fix them if we want to accelerate science.

Reproducibility

The reproducibility of peer-reviewed research is currently under fire. Scientists at Amgen tried to reproduce the findings of 53 landmark cancer studies, only to find that just 6 of them could be confirmed. Many journals still do not have strict guidelines requiring publication of all datasets associated with a project.

Computer scientists care about peer-reviewed research just as life scientists do, but they also have a powerful prestige economy around creating and maintaining open source software. If you release broken software, people will call out the problems and contribute fixes. This feedback loop is broken in biology: it can take months for journals to accept corrections, and spotting flaws is incredibly difficult without access to all of the project materials.

Replicating this prestige economy around practical, usable output requires a culture shift, but that doesn’t mean we can’t facilitate the process. Preparing a manuscript for publication is already tedious and time-consuming, so there’s little incentive for scientists to expend additional effort preparing and hosting project materials online after the fact: files need to be wrangled from collaborators, data needs to be hosted and maintained indefinitely, and code used in data analysis needs to be documented. This process of “open-sourcing” a project must become as frictionless as flipping a switch.

The solution lies upstream with the software we use to actually do the work. Imagine that you plan your experiments in silico, using specialized applications for working with each type of biological data that talk to a central project repository in the cloud. Each bit of incremental progress is tracked and versioned so at any point you can jump back in time and replay your work. As you run your experiments, the hardware talks to the cloud applications in which the experiments were designed, enabling easier analysis of the results and passive note-taking that is far more comprehensive and accurate than using a lab notebook. You give your collaborators access to the online project from day one, allowing them to quickly offer suggestions and speed up research cycles. When it comes time to publish, you simply mark your project as public instead of private, and the community immediately has access to all of the work that led to your results.
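To make the versioning idea concrete, here is a minimal sketch in Python of how a project repository with an append-only history and a single public/private switch might be modeled. The class names, fields, and the example sequence are hypothetical illustrations of the idea, not any existing product’s API.

# A minimal, illustrative sketch of the versioned-project idea described above.
# All names here are hypothetical, not a real product's API.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a project's files and notes."""
    number: int
    timestamp: str
    message: str
    files: dict  # filename -> contents (sequences, analysis scripts, notes)


@dataclass
class Project:
    """Append-only project history with a single public/private switch."""
    name: str
    is_public: bool = False
    history: list = field(default_factory=list)

    def commit(self, message: str, files: dict) -> Version:
        # Every increment of work is recorded; nothing is overwritten.
        version = Version(
            number=len(self.history) + 1,
            timestamp=datetime.now(timezone.utc).isoformat(),
            message=message,
            files=dict(files),
        )
        self.history.append(version)
        return version

    def replay(self, number: int) -> Version:
        # "Jump back in time" to any earlier state of the project.
        return self.history[number - 1]

    def publish(self) -> None:
        # Open-sourcing the project is a single switch, not a cleanup effort.
        self.is_public = True


if __name__ == "__main__":
    project = Project("promoter-screen")
    project.commit("Initial plasmid design", {"construct.fasta": "ATGGTGAGC"})
    project.commit("Add analysis script", {"align.py": "# alignment code"})
    print(project.replay(1).message)   # -> Initial plasmid design
    project.publish()                  # project and full history become public

The essential point is that every increment of work is kept forever, so replaying an earlier state or opening the whole history to the community is a single operation rather than an after-the-fact cleanup project.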

Access to New Computational Techniques

Another major problem is the speed at which the latest computational techniques are disseminated. New versions of desktop software are often released only once a year because of the overhead involved in developing patches and getting them installed, and the upgrades are often tied to expensive license renewals, which slows uptake further. For a quickly developing field like synthetic biology, the algorithms and methods change too quickly for traditional desktop software to keep up. With web-based software, developers can push updates multiple times per day without any user intervention, so scientists get access to cutting-edge tools the moment they are ready.

More importantly, a cloud solution would eliminate the need to reinvent the wheel for each new computational technique. Imagine you devise a new algorithm for aligning DNA sequences. Sharing even a small data-processing or analysis script shouldn’t require the trouble of making sure it runs properly on each of your colleagues’ machines, yet hosting the tool yourself so that other scientists could use it would mean building your own data storage, visualization, and serving infrastructure. A shared cloud platform would expose web APIs for exactly this functionality, allowing anyone with basic programming skills to develop new tools for the community.
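As a rough illustration, the Python sketch below shows how small the actual science in such a tool can be: a naive identity score and a best-pair search. The platform.publish_tool call in the final comment is purely hypothetical, standing in for the kind of web API a shared platform would expose so colleagues could use the tool without installing anything.

# Sketch: the science lives in a couple of small functions, while a shared
# platform (the hypothetical "platform" call below) would handle storage,
# visualization, and serving for colleagues.
import itertools


def percent_identity(a: str, b: str) -> float:
    """Naive identity score between two equal-length DNA sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length for this toy score")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_pair(sequences: dict) -> tuple:
    """Return the pair of named sequences with the highest identity."""
    pairs = itertools.combinations(sequences.items(), 2)
    return max(pairs, key=lambda p: percent_identity(p[0][1], p[1][1]))


if __name__ == "__main__":
    seqs = {"wt": "ATGGTGAGCAAG", "mut1": "ATGGTGTGCAAG", "mut2": "TTGGTGAGCAAC"}
    (name_a, _), (name_b, _) = best_pair(seqs)
    print(f"Closest pair: {name_a} and {name_b}")
    # With a shared cloud platform, publishing this for colleagues might be a
    # single (hypothetical) API call instead of building your own server:
    #   platform.publish_tool("percent-identity", best_pair)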

IT Overhead

For biologists who understand that easy access to their colleagues’ data facilitates more productive collaboration, the only option is often to develop a one-off database for their organization to use. Some existing products offer expensive shared database solutions, but they still require a scientific lab to obtain and maintain its own machines for running the software. The cloud should be leveraged to offer managed solutions for shared databases of biological data. Rather than having to run their own database infrastructure, scientists should be able to upload their data to an existing cloud repository, where it can be organized, searched, and quickly shared with collaborators around the world. IT will never be biologists’ core competency, and moving their data from servers in their labs to the cloud would let them focus on doing biology.
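Here is a minimal sketch of what working against such a managed repository might look like, assuming a hypothetical web API; the base URL, endpoints, and token below are placeholders rather than a real service, and the point is simply that the lab runs no servers of its own.

# Sketch of using a managed, shared sequence repository over a hypothetical
# web API. Nothing below refers to a real service.
import requests

BASE_URL = "https://repository.example.org/api/v1"   # hypothetical service
HEADERS = {"Authorization": "Bearer <your-token>"}    # hypothetical auth


def upload_sequence(name: str, sequence: str) -> dict:
    """Store a sequence in the shared repository instead of a local server."""
    resp = requests.post(
        f"{BASE_URL}/sequences",
        json={"name": name, "bases": sequence},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def search_sequences(query: str) -> list:
    """Search everything the lab and its collaborators have uploaded."""
    resp = requests.get(
        f"{BASE_URL}/sequences",
        params={"q": query},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    upload_sequence("example-insert", "ATGACCATGATTACG")
    for hit in search_sequences("example"):
        print(hit["name"])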

The benefits of the cloud in reducing IT overhead are even clearer in “big data” applications. In many cases, biologists could clearly benefit from applying cloud technologies that the tech world has been developing for a decade. Bioinformatics depends on large datasets for cross-referencing, yet simple cloud storage solutions that make contributing and accessing data easy remain underused. Even when the data is available, biologists still rely on expensive supercomputing clusters, failing to take advantage of modern distributed techniques that make commodity hardware just as powerful. Although price may not be a significant issue for top-tier labs, the democratizing effect of cloud solutions would enable new ideas to come from anywhere. Furthermore, too many biologists are still unable to work with big data because they aren’t adept programmers. Cloud infrastructure could abstract away the complexity of large-scale data processing, allowing them to write simple, high-level scripts and queries that provide insight into massive datasets.
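As a small illustration of that last point, the following Python sketch counts 8-mers across a set of reads by splitting the work over ordinary CPU cores in a map/reduce pattern; a cloud platform would apply the same pattern across cheap machines, and the read set here is invented for the example.

# Counting 8-mers across many reads with a simple map/reduce pattern.
from collections import Counter
from multiprocessing import Pool


def count_kmers(reads, k=8):
    """Count every k-length substring in a chunk of reads (the 'map' step)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def chunked(items, n_chunks):
    """Split the reads into roughly equal chunks, one per worker."""
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Stand-in for a large read set pulled from shared cloud storage.
    reads = ["ATGGTGAGCAAGGGCGAGGAG", "GTGAGCAAGGGCGAGGAGCTG"] * 1000

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_kmers, chunked(reads, 4))

    # The 'reduce' step: merge the per-worker counts into one table.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)

    print(total.most_common(3))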

The Road Forward

From personal experience in research labs, a few colleagues and I were so fed up with the quality of software used in the life sciences that we set out to solve the problem ourselves. We’ve made it our mission to build beautiful tools for scientists that are easy to use and take the pain out of managing and sharing biological data. Our solution is called Benchling, and it is currently focused on DNA. We offer a free version of our software to academics and biohackers, so I encourage you to check it out at https://benchling.com. Thousands of scientists are already using it to design, analyze, and share DNA, so you’re in good company.

Software tends to be an afterthought in biotechnology. It’s time to recognize that whether you are a biohacker in a garage or one scientist on a team of hundreds, you should be demanding better web-based software tools that enable collaboration and data sharing. The cloud is quickly replacing desktop software in other fields, and its benefits are just as clear for biology.