Big (Sequence) Data in Pre-competitive Pharmaceutical R&D.

William Spooner (Eagle Genomics)
Business & Industry
Location: Buckingham Room

Does pre-competitive collaboration ease the pain of adopting disruptive big-data technologies? Next generation DNA sequencing technologies (NGS) have revolutionised molecular biology over the past 5 years, and now provide key tools in the development of personalised (genomic) medicine. Their adoption is, however, complicated by the prodigious data volumes produced, and by the highly-specialised nature of the analysis software, which is developed mainly by academic institutions. The pharmaceutical industry is increasingly looking to external providers for NGS data management/analysis, as epitomised by the Pistoia Alliance Sequence Services project, whose vision is “a platform to users where they can solve scientific problems based on DNA/RNA sequence information”. This presentation will look at the approach taken by one group (Eagle Genomics/Cycle Computing) funded under this project, and assess the benefits gained from such an open innovation initiative for both “seeker” and “solver”.

Science: It is predicted that genomic, personalised medicine will transform the health of our children and our children’s children. In this context I will present a brief, layman’s introduction to the effect of genome sequence differences on human traits, with examples from personal experience, and show how these associations can be used in the development of genomic medicine.

Logistics: At 3 billion letters long, genomes are big. The data generated by a sequencing experiment may exceed 100Gb per individual, or 100Tb for a typical study involving 1000 people. Genome sequences are structured datasets which are analysed using highly-specialised and rapidly changing techniques. I present options for managing/analysing these data sets on traditional infrastructure vs. the cloud.
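The scaling behind these figures can be checked with a few lines of arithmetic. This is only a rough sketch: the function name `study_storage_tb` is hypothetical, and it assumes that "Gb" and "Tb" above mean gigabytes and terabytes in decimal units (1 TB = 1000 GB).

```python
def study_storage_tb(per_individual_gb: float, individuals: int) -> float:
    """Estimate total study size in terabytes.

    Assumes decimal units (1 TB = 1000 GB) and the same data volume
    for every individual; both are simplifying assumptions.
    """
    return per_individual_gb * individuals / 1000


# Figures quoted in the abstract: 100 GB per individual, 1000 people.
total_tb = study_storage_tb(100, 1000)
print(f"Estimated study size: {total_tb:.0f} TB")
```

Changing either argument shows how quickly storage grows with cohort size, which is the motivation for comparing traditional infrastructure with the cloud.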

Integration of Open Data: Signals in proprietary genomic data can only be interpreted in the context of the public reference genome annotation, and of the very large quantity (Petabytes) of sequence data that have been deposited in public data banks. I present approaches to accessing and exploiting these data, and review the availability of public datasets on the cloud.

Regulation: I introduce the various regulatory restrictions on management of genome sequence data, and the limitations that these impose.

Analysis Platform: Building a platform for the analysis and sharing of genomic data from scratch, to a tight budget and timescale, is challenging. We adopted a loosely-coupled architecture that reused existing open source components wherever possible, addressing the pressing issues of logistics, integration and regulation.

Conclusion: Tools for the effective management/analysis of genome sequences are needed to support one of the grand challenges in healthcare today – the development of truly personalised medicine. It is recognised that, whilst pushing the limits of current information technology, the development of such tools provides no competitive advantage for companies engaged in pharmaceutical R&D. Once a particular technological challenge has been identified as pre-competitive, the stakeholders are free to collaborate on the specification and share the costs of developing a solution. The Pistoia Alliance Sequence Services project is a great example of this approach, with demonstrable benefits for solution “seekers” and “solvers” alike. I commend this approach, and encourage others to consider whether similar collaborative approaches could be applied to solving their particular big data challenges.

William Spooner

Eagle Genomics

William Spooner is a seasoned operational bioinformatician with a track record of delivering tools for high-throughput genomics research. Having worked previously on the Ensembl, BioMart, Gramene and WormBase projects at the European Bioinformatics Institute, the Wellcome Trust Sanger Institute and Cold Spring Harbor, his current focus is on making life easier for users of open source/data in commercial settings. His strategic thinking is driven by the huge opportunities for data analysis in the life sciences provided by the near-simultaneous arrival of NGS and cloud computing.


