How does precision medicine become a reality? The Semantic Data Lake for Healthcare makes it possible
Ontological user interface
One of the most immediate advantages of the SDL's semantic framework is the mutable nature of the ontology, the uppermost layer of the maturity stack, which largely functions as a user interface. Relational data models depend on fixed schemas and require significant rework to accommodate new requirements or sources; ontologies, by contrast, evolve to include them based on semantic standards. Montefiore has been using an overarching ontology as a pipeline to accelerate the integration of new data to fuel its predictive analytics and “do everything that would take days, if not weeks, for a data scientist to be able to bring this data together and begin to do an analytic model,” Mirhaji says. “Now, this can be automatically done in literally minutes.”
The ontology enables data scientists to create a predictive analytics model, use the various forms of data in the SDL as input, and then take the model's outputs into any number of data science platforms for further analysis. Significantly, it allows them to do so in a manner close to self-service so they “basically have the ontology answer their questions during the interrogation process without anyone knowing anything about the backend of the SDL,” Mirhaji explains. Perhaps even more importantly, the ontology empowers users to ask questions, answer them and prepare models on the fly for extremely nuanced subsets of patient classifications, which is the objective of the Precision Medicine Initiative.
Data science
One of the most compelling facets of the maturity stack upon which the SDL is based is that, in addition to the foregoing data discovery process, its top layer provides a pipeline to any number of data science platforms. There are few limits on the tools available; popular ones include R, Python, Spark and KNIME.
“You can take any new algorithm and automatically link it into the Semantic Data Lake,” Aasman says. “It’s very modular. The cool part is that we linked the data science to the platform in such a way that data scientists can now use any technique that they want and plug it in.” They can also do so with a fair amount of organized autonomy that still enables individual access to data without affecting the ability of others to use the same data.
In a project of this magnitude, there are competing models and algorithms of varying effectiveness. Thus, it is essential that Montefiore has “generalized the data access and data preparation aspects of the SDL so we can run multiple models with different prerequisites at the same time on the same sets of data,” Mirhaji says. That way, individual users can concentrate on different subsets of patient groups at the same time. This capability is enhanced by the fact that the SDL’s proponents have “created this concept of disposable or just-in-time data marts; we can actually write them into a relational database so users can start playing with the analysis-ready data without understanding the mechanics of the SDL,” Mirhaji says.
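The idea of a disposable data mart can be illustrated with a minimal Python sketch. It is not Montefiore's implementation: the rows, the table name `cohort`, its columns and both helper functions are hypothetical, and an in-memory SQLite database stands in for whatever relational database the SDL actually writes to.

```python
import sqlite3

# Hypothetical analysis-ready rows, standing in for data prepared by the SDL.
PATIENT_ROWS = [
    ("p1", 64, "diabetes", 1),
    ("p2", 51, "sleep_apnea", 0),
    ("p3", 70, "diabetes", 0),
]


def build_disposable_mart(rows):
    """Write rows into a throwaway relational table and return the connection."""
    conn = sqlite3.connect(":memory:")  # the mart disappears when the connection closes
    conn.execute(
        "CREATE TABLE cohort ("
        "patient_id TEXT PRIMARY KEY, age INTEGER, diagnosis TEXT, readmitted INTEGER)"
    )
    conn.executemany("INSERT INTO cohort VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def readmission_rate(conn, diagnosis):
    """Plain SQL over the mart; the caller needs no knowledge of the backend."""
    row = conn.execute(
        "SELECT AVG(readmitted) FROM cohort WHERE diagnosis = ?", (diagnosis,)
    ).fetchone()
    return row[0]


mart = build_disposable_mart(PATIENT_ROWS)
diabetes_rate = readmission_rate(mart, "diabetes")
mart.close()  # dispose of the mart once the analysis is done
```

Because each user gets a separate connection and table, one analyst's mart can be built, queried and discarded without touching the data another analyst is working on.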
Machine learning
The ultimate trump card of the SDL for achieving the goals of the Precision Medicine Initiative is its application of machine learning. The data sources and the comprehensive list of patient outcomes are significantly enriched because data scientists can take the results of their predictive analytics (and, in some cases, the actual patient outcomes that followed previous predictions) and reinsert them into the graph as additional empirical evidence.
“It’s very important that we take the results of analytics and put them back in the database,” Aasman explains, “so that we can do visual discovery in a sea of results and find interesting relationships in the data, or do cluster analytics over certain properties of patients and assign the patients to the clusters they belong to. So now, being part of a certain cluster is part of a new property of a patient and can be used in follow-up analytics.”
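The feedback loop Aasman describes, in which a cluster assignment becomes a new property of the patient, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: a plain set of tuples stands in for the graph database, the predicate names `hasAge` and `inCluster` are invented, and the nearest-centroid rule is a toy substitute for the clustering actually used.

```python
# Toy graph: a set of (subject, predicate, object) triples.
graph = {
    ("patient:1", "hasAge", 34),
    ("patient:2", "hasAge", 71),
    ("patient:3", "hasAge", 66),
    ("patient:4", "hasAge", 29),
}

# Hypothetical cluster centroids produced by an earlier analytics run.
CENTROIDS = {"cluster:younger": 30, "cluster:older": 70}


def assign_clusters(triples, centroids):
    """Return new triples that link each patient to its nearest centroid."""
    new_triples = set()
    for subject, predicate, value in triples:
        if predicate == "hasAge":
            nearest = min(centroids, key=lambda c: abs(centroids[c] - value))
            new_triples.add((subject, "inCluster", nearest))
    return new_triples


def patients_in(triples, cluster):
    """Follow-up query that uses cluster membership like any other property."""
    return sorted(s for s, p, o in triples if p == "inCluster" and o == cluster)


# Write the analytic result back into the graph, then query it.
graph |= assign_clusters(graph, CENTROIDS)
older = patients_in(graph, "cluster:older")
```

After the write-back, `inCluster` sits alongside the original facts, so later analytics can filter or group on it without rerunning the clustering.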
The SDL facilitates the machine learning feedback process via two principal algorithms: a cluster-based one called Progeny and another called SuperLearner. According to Mirhaji, the latter is vitally important for achieving the Precision Medicine Initiative’s goals because it is “basically an integration algorithm on the results of the predictive models. It is like a process where you have multiple algorithms looking at the same patient; each one of them is probably different on certain patients than the others. The SuperLearner decides for this patient or cluster of patients which one of these algorithms is performing better.”
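The selection step Mirhaji describes can be pictured with a deliberately simplified Python sketch. It is not the SuperLearner code used at Montefiore; it only picks, for each cluster, the model with the most correct predictions on held-out records. The validation records, the model names `logit` and `forest`, and both functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical held-out records: (cluster, true_outcome, {model_name: prediction}).
VALIDATION = [
    ("A", 1, {"logit": 1, "forest": 0}),
    ("A", 0, {"logit": 0, "forest": 0}),
    ("A", 1, {"logit": 1, "forest": 1}),
    ("B", 0, {"logit": 1, "forest": 0}),
    ("B", 1, {"logit": 0, "forest": 1}),
    ("B", 1, {"logit": 1, "forest": 1}),
]


def best_model_per_cluster(records):
    """Count correct predictions per cluster and model; keep the top model."""
    correct = defaultdict(lambda: defaultdict(int))
    for cluster, truth, predictions in records:
        for model, predicted in predictions.items():
            correct[cluster][model] += int(predicted == truth)
    # Iterating over sorted names makes ties resolve alphabetically.
    return {
        cluster: max(sorted(scores), key=scores.get)
        for cluster, scores in correct.items()
    }


def predict(cluster, predictions, chosen):
    """Use the prediction from whichever model was chosen for this cluster."""
    return predictions[chosen[cluster]]


chosen = best_model_per_cluster(VALIDATION)
new_patient_prediction = predict("B", {"logit": 0, "forest": 1}, chosen)
```

In this toy data, `logit` is right on all three cluster A records and `forest` on all three cluster B records, so each cluster ends up routed to a different model.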
Progeny is vital for identifying clusters of patients at big data scale so that patient groups and subgroups can be broken down further, which is well aligned with the aim of the Precision Medicine Initiative. “We have basically taken that Progeny model and implemented it into our pipeline mechanism so we can do these types of dynamic clustering on the fly and make those clusters available to those analytic models,” Mirhaji says.
A work in progress
The predictive analytics deployments of the SDL are still in the clinical trial stage and, according to Mirhaji, have registered a 99 percent accuracy rate for predicting patient outcomes related to treatment options. The subsequent step for the medical group is to gather data about the rate of acceptance of the recommendations yielded from those analytics—determining what percentage of practitioners is actually using those recommendations and what sort of success they engender.
Additional developments are likely to include the incorporation of unstructured data (specifically, images and their reports) and deployments centered on sleep disorders, diabetes and possibly behavioral and mental health issues. Even at this stage, the SDL represents a unique amalgamation of technologies for tailoring healthcare to extremely specific patient populations, from individuals to traditionally neglected groups, according to their personal attributes. Mirhaji says, “That’s currently where we are heading, and that’s what our infrastructure is designed to support.”