Shaping healthcare’s future with genomic data
The cost of genome sequencing is also influenced by the nature of genomes themselves, which are paradoxical in that each is unique yet highly similar from person to person. “You don’t usually have to sequence the whole genome,” Hunter explained. “You can look for particular markers. Human beings are 99.9 percent identical with each other at the sequence level. So we can just look for the parts that are different.” Focusing on the sites where genomes commonly differ is known as genotyping, which makes genetic analysis more efficient for both research and clinical purposes.
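The arithmetic behind Hunter's remark can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: the genome length of roughly 3.1 billion base pairs is an assumed, commonly cited approximation that does not come from the article, and real genotyping panels examine a chosen set of known markers rather than every differing position.

```python
# Assumption: a haploid human genome of roughly 3.1 billion base pairs
# (an approximate, commonly cited figure; not stated in the article).
GENOME_LENGTH_BP = 3_100_000_000

# "99.9 percent identical", taken from Hunter's quote.
IDENTITY = 0.999


def expected_differing_positions(genome_length_bp: int, identity: float) -> int:
    """Estimate how many positions differ between two people's genomes."""
    return round(genome_length_bp * (1 - identity))


differing = expected_differing_positions(GENOME_LENGTH_BP, IDENTITY)
print(f"About {differing:,} of {GENOME_LENGTH_BP:,} positions are expected to differ")
```

Under these assumptions, the positions of interest amount to a few million out of several billion, which is why targeting them can be far cheaper than reading everything.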
From research to clinical care
Spurred in part by the falling cost of analyzing genomes for mutations that contribute to health conditions, genetic data is moving from research settings into clinical use. The University of Colorado incorporates genotyping and genome sequencing into patient care at both a children’s hospital and an adult hospital, Children’s Hospital Colorado and UC Health respectively. RDConnect’s Scientific Advisory Board (a European rare diseases council of which Hunter is a member) has been building out infrastructure to diagnose children with rare diseases. “It’s by far the most effective tool at this point for kids with an undiagnosed problem or detecting disease early,” Hunter said. Clinical use of genomic data is aided by exomes, which Hunter described as “the part of a genome that will be translated into a protein. Most of the genetic changes, not all of them, that will cause disease cause a change in the protein part.” By concentrating on those protein-coding regions, exome sequencing makes genetic testing more productive, which in turn makes genomic analysis much more viable in clinical settings than it was before.
Scale and processing power
The need to compile unstructured, semi-structured and structured data from both internal and external sources presents massive challenges for those working with genetic data. That diversity raises questions of compute power and scale, both of which are critical to such an undertaking. “As these modern sequencing machines generate their data, you’re talking about reams of a hundred to thousands of terabytes,” said Tom Plasterer, head of research of the Research Development Group at AstraZeneca.
AstraZeneca is working to counter cardiovascular, metabolic and respiratory diseases, cancer and other conditions as part of its role in a global genomic initiative that aims to sequence 2 million samples in the next several years. The volume of data multiplies for certain conditions such as cancer, which frequently requires sequencing both normal and cancerous cells to identify variants. “Right there you’re faced with a huge computational problem where you need to reassemble those genomes,” Plasterer said.
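The idea of comparing normal and cancerous cells can be illustrated with a deliberately simplified Python sketch. It assumes the genomes have already been assembled and that variants have been called for each sample; the `Variant` type, the `somatic_candidates` helper and all coordinates are hypothetical and made up for illustration. The computationally heavy steps Plasterer refers to (reassembly and variant calling) are not shown.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A single variant call (hypothetical, simplified representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(normal: set[Variant], tumor: set[Variant]) -> set[Variant]:
    """Return variants seen in the tumor sample but not in the normal sample."""
    return tumor - normal


# Made-up calls: the normal sample carries two inherited variants,
# and the tumor sample carries those plus one additional change.
normal_calls = {
    Variant("chr1", 12345, "A", "G"),
    Variant("chr2", 67890, "C", "T"),
}
tumor_calls = normal_calls | {Variant("chr7", 55555, "T", "G")}

candidates = somatic_candidates(normal_calls, tumor_calls)
for variant in sorted(candidates):
    print(variant)
```

Because two samples must be sequenced and processed for each patient, the data volume per case is roughly doubled before any comparison of this kind can begin.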
AstraZeneca meets those processing demands with elastic computing resources in the cloud, which scale up or down as needed “for additional compute power and then go back to an environment where we’re not paying for all those systems when we don’t need them,” Reinold explained. Additional cloud benefits include compression techniques for more cost-effective storage and the ability to access data from multiple locations.
Integration and aggregation
The true value of genomic sequencing comes from cross-referencing its results with the abundance of external resources dedicated to genetic information, pharmaceuticals and health conditions. “A lot of this data is available in public sources,” Biotricity’s Al-Siddiq said. “You’ve got information from multiple people and databases that study a very specific disease so you know which mutation to go after.” Most of that semi-structured or unstructured data was created with varying data models, formats and taxonomies, which make integration with traditional relational techniques cumbersome and time-consuming.
A more practical approach is to “do a semantic integration that translates all those different databases into an ontological form of knowledge representation, one that’s really just about the biology,” Hunter said. “So now you can query this knowledgebase without having to know which database the information came from or how that database is organized or any of that. All you have to know about is the biology.”
One of the innate benefits of such an approach is that by linking data on an RDF graph with common ontological models, users can accommodate the evolving nature of biological informatics. New developments related to genomic data, characteristics, mutations and more are readily encompassed within the underlying semantic technologies linking what were initially disparate data types. That methodology facilitates both semantic and logical consistency for representing facets of genomic research that may one day change or, perhaps more commonly, become disputed.
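The semantic integration Hunter describes can be sketched with plain Python standing in for an RDF graph. This is a minimal illustration, not a real ontology or triple store: the two source layouts, the predicate names `associated_with` and `targets`, and identifiers such as `GENE1`, `DiseaseX` and `DrugY` are all hypothetical. A production system would use standard ontologies and a graph database with a query language such as SPARQL.

```python
# Hypothetical records from two sources that describe overlapping biology
# with different field names and layouts (all identifiers are made up).
gene_disease_rows = [{"gene_symbol": "GENE1", "disease_name": "DiseaseX"}]
drug_target_rows = [{"compound": "DrugY", "target": "GENE1"}]


def to_triples(gene_disease, drug_target):
    """Translate both source layouts into shared (subject, predicate, object) triples."""
    triples = set()
    for row in gene_disease:
        triples.add((row["gene_symbol"], "associated_with", row["disease_name"]))
    for row in drug_target:
        triples.add((row["compound"], "targets", row["target"]))
    return triples


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def drugs_for_disease(graph, disease):
    """Find drugs that target any gene associated with the given disease."""
    genes = {s for s, _, _ in match(graph, predicate="associated_with", obj=disease)}
    return sorted(s for s, _, o in match(graph, predicate="targets") if o in genes)


graph = to_triples(gene_disease_rows, drug_target_rows)
result = drugs_for_disease(graph, "DiseaseX")
print(result)
```

The query in `drugs_for_disease` refers only to biological relationships and never to the original field names, which is the property Hunter highlights. Adding a new source means writing one more translation into triples, while existing queries stay unchanged.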