A pivotal magazine article helped point medical doctor Parsa Mirhaji along a path to a semantic data lake for healthcare analytics applications, using Hadoop, RDF, graph databases and more.
Nearly 15 years ago when Parsa Mirhaji – then a doctor and medical researcher at the University of Texas – picked up the May 2001 issue of Scientific American, he found an article on semantic data that would prove to be a harbinger of his later work.
That issue of the magazine featured a cover story entitled “The Semantic Web.” Co-authored by Tim Berners-Lee, the inventor of the World Wide Web, the story described an Internet on which webpages were structured so that software agents could extract and manipulate relevant content. It also focused a bit on the Resource Description Framework (RDF), a then-new model for describing Web data.
Mirhaji saw something in the story that spoke to a practical problem he had encountered in searching for and saving information on medical advances, and it set the tone for research and analytics work he continues to do now. “I was being challenged to do analytics on heterogeneous data sets from across the Web,” he said. “I started to learn about artificial intelligence and software agents and how they could ‘grab concepts’ from the Web or databases.”
Mirhaji’s dual interests in medicine and IT led him to work in bioinformatics and, in the wake of the 9/11 terrorist attacks, to create a prototype architecture for public health preparedness systems using RDF semantic technology and a service-oriented architecture.
In the intervening years, the semantic Web has yet to gain currency of the kind enjoyed by the conventional Web, but Mirhaji and others continue to pursue its promise. Today, Mirhaji is director of clinical research informatics at Montefiore Health System in Bronx, N.Y., and CTO of the New York City Clinical Data Research Network. At Montefiore, he’s leading efforts to build a “semantic data lake” – one meant to integrate various data sources to improve care management, risk assessment and other aspects of healthcare.
Not your typical data lake
Mirhaji said the data lake system is still in training mode, being tested out on some specific analytics tasks. It takes in all sorts of genetic, population, health and wellness data; that includes data from the U.S. Census, clinical trials and patient records – for example, heart rate, temperature and blood pressure measurements collected by patient monitoring devices.
The system also includes historical data “from people who have been treated in the past,” Mirhaji said. That places his and his colleagues’ efforts squarely in the realm of evidence-based medicine, which can draw on a hospital’s existing patient records – with some redaction for privacy – to help diagnose current patients. “You build a model and validate it,” he said. “So far, we have retrospectively validated on 60,000 patients.”
Because the Montefiore system employs semantic technology as part of its data integration routines, it isn’t your garden-variety data lake in which a mix of structured and unstructured data may be stored together, largely undifferentiated. The system does make use of Hadoop, the open source distributed processing framework that’s typically associated with data lakes, but there’s more to it than that.
Pocket glossary of semantic data
RDF: Promoted by Tim Berners-Lee and the World Wide Web Consortium, or W3C, the Resource Description Framework is a way to organize data resources.
Triple stores: Also known as RDF stores or databases. These technologies provide a way to handle data to bring out semantic meaning. They store data as triples, each comprising three elements – a subject, predicate and object, roughly analogous to a noun, verb and object – and support the idea of turning chains of data into statements.
Graph databases: NoSQL databases that utilize a node-and-edge structure. The adjacencies of data nodes help represent the interconnectedness of different data elements. Some graph databases are used to organize formal vocabularies, or ontologies.
According to Mirhaji, data in the system is fed into an RDF database, a variant of graph database technology. Graph software is one of the categories of NoSQL databases; it stores data elements in the form of maps indicating the relationships between different elements. RDF databases store “triples,” in which basic relationships are described using a subject-predicate-object representation; as such, they’re also known as “triple stores.”
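To make the subject-predicate-object idea concrete, here is a minimal sketch in plain Python. It is not AllegroGraph's API or Montefiore's data model; the patient identifiers, the predicate names and the `match` helper are all hypothetical, chosen only to show how a store of triples can be queried by leaving some positions open.

```python
from typing import Iterable, Iterator, Optional, Tuple

# A triple is (subject, predicate, object); plain strings stand in for RDF terms.
Triple = Tuple[str, str, str]

# Hypothetical sample data; identifiers and predicates are illustrative only.
TRIPLES = [
    ("patient:1", "hasHeartRate", "72"),
    ("patient:1", "treatedAt", "hospital:A"),
    ("patient:2", "treatedAt", "hospital:A"),
]


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that agree with every position that is not None."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Which subjects were treated at hospital:A? The subject position is left open.
treated = [s for s, _, _ in match(TRIPLES, predicate="treatedAt", obj="hospital:A")]
print(treated)
```

A real triple store would index the triples and expose a query language such as SPARQL rather than scan a list, but the pattern is the same: fix some positions of the triple and ask for whatever fills the rest.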
Make mine a triple
The system has graph database instances running on nodes in a Cloudera-based Hadoop cluster, along with the YARN resource manager and Spark data processing and analytics engine. The database being used is AllegroGraph from Franz Inc., a company based in Oakland, Calif., that was founded in the artificial intelligence heyday of the 1980s.
AllegroGraph was selected partly due to its treatment of data, Mirhaji said, adding that Franz uses a form of RDF with which he and his colleagues were comfortable. “There were a lot of other graph databases out there, but every one had its own way of representing graphs,” he said.
While others are responsible for building out the system’s storage infrastructure, Mirhaji focuses on the data as it resides in the Allegro repositories. “My job,” he said, “is to understand how the information can be turned into a triple so it can be queried and made sense of.”
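As a rough illustration of what turning information into triples can look like, the sketch below flattens one patient-monitoring record into subject-predicate-object statements. The field names, the `patient:` prefix and the `record_to_triples` function are assumptions made for this example, not a description of how the Montefiore system maps its data.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str]


def record_to_triples(record: Dict[str, Any], id_field: str = "patient_id") -> List[Triple]:
    """Turn one flat record into triples that share a single subject.

    The record's identifier becomes the subject, each remaining field name
    becomes a predicate, and each value becomes an object. Fields whose
    value is None are skipped rather than stored as empty statements.
    """
    subject = f"patient:{record[id_field]}"
    return [
        (subject, field, str(value))
        for field, value in record.items()
        if field != id_field and value is not None
    ]


# Hypothetical reading from a monitoring device.
reading = {"patient_id": 17, "heart_rate": 72, "temperature": None, "systolic_bp": 118}
triples = record_to_triples(reading)
for t in triples:
    print(t)
```

In practice the predicates would come from a shared vocabulary or ontology instead of raw column names, which is what lets data from different sources line up under the same meaning.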
Querying the accumulated data “is the area where this semantic lake excels,” Mirhaji said, noting that applying such semantics greatly increases the system’s usability for analytical queries by a variety of end users, including medical researchers and hospital administrators.
The system has many complex parts. Yet the semantic data lake, at its core, rests on a few basic constructs – and Mirhaji mused that there’s nothing at all wrong with the simple nature of the RDF triple store. “In its simplicity, there is a kind of beauty,” he said.
Revenge of the semantic Web
For many casual technology watchers, the notion of the semantic Web came and went without much effect. But the work of Mirhaji and other advocates points to continued interest in at least some quarters.
He sees the semantic data lake being built at Montefiore as something of a resurrection for semantic Web technology that has been a long time coming. The system is intended to support an ever-widening collection of data types for near-real-time predictive analytics and knowledge-based, or cognitive, computing, and he expects that other data management professionals will make similar journeys.
“You just can’t accumulate data in a big repository,” Mirhaji said. “Once you begin to work with massive amounts of data and, then, advanced predictive analytics, you will need to have a deep understanding of your data.”