RFI: Datalake combined with openEHR CDRs - combined data dictionaries etc

erik.sundvall · 30 August 2024 07:16

Karolinska University Hospital (Karolinska) some weeks ago published an RFI at 495422-2024 - Planning - TED

Karolinska is looking for information about opportunities for solutions in the area of Follow-up and Analysis of both clinical data (combination of openEHR and non-openEHR formats) and administrative data (that is usually not in openEHR format).

In this early RFI step, Karolinska has chosen to focus on issues limited to data dictionaries/catalogs (see Data dictionary - Wikipedia) linked to both CDRs and more traditional datalake content (including in combination).

Note: This RFI is separate and not formally related to the framework agreement or the call-offs described at https://discourse.openehr.org/t/karolinska-stockholm-procurement-of-digital-health-platform-cdr-tools-services-consultants/445 so responders do not need to have taken part in that.

Experiences from openEHR-customers and researchers that have explored these things would of course be intersting to hear here in this discussion thread.

P.S.
Personally I was not directly involved in preparing the RFI, but want to highlight it here to openEHR actors and I will try to listen in to the presentations. I suppose openEHR vendors with many and/or big customers will already have experienced questions like these many times and would be intersting to hear solutions/experiences from. I would guess that in, addition to data dictionaries/catalogs that can handle both openEHR and other more traditional datalake content, also solutions regarding how to author and execute combined queries over openEHR and non-openEHR-data will be interesting.

Robert_Ouko · 30 August 2024 08:40

We are currently implementing a DBSM wouldn’t mind sharing our experience and lessons so far @erik.sundvall .

erik.sundvall · 4 September 2024 05:53

@Robert_Ouko feel free to share a short overview/summary of your approach here, I believe more than Karolinska would be interested in reading!

Robert_Ouko · 6 September 2024 07:15

The approach we are using is quite simplistic I’d say. Intergration to a traditional data lake which played a standalone role as a data source then using a non-openehr storage that acts as a DBSM to integrate of where and how the data is consumed in an openehr format.

erik.sundvall · 11 September 2024 04:55

Others reading this thread, might also be interested in the thread…

…that i discovered now, where several people (like @pablo, @ian.mcnicoll, @birger.haarbrandt and @thomas.beale) respond to good questions (from @Jelte and @joostholslag ) in several ways explain things that I often try to (but sometimes fail to) explain in BI contexts:

The sematically rich tree-shaped (or rather “directed graph”) models of openEHR (and many other object oriented data sources/storages) often need to be turned into “flat” table-like-structures in order to make “normal” table-focused BI tools (and BI experts) work the way they usually do, and
when creating those table-like-structures (e.g. using openEHR AQL queries) it really helps if you already know the purpose of the BI-tables/operations you want, so that you can make purposeful tables. Creating a general openEHR-tree to openEHR-table conversion would likely be hard to do in a way that easy to work with in tabular BI tools in a way that fits all general purposes.

Working with the Object–relational impedance mismatch - Wikipedia in the BI world, I assume must be a general problem and not openEHR specific.

Questions:

Is anybody aware of BI tools (query layers/tools, data dictionaries etc) that are good at working with source data that is directed graphs and trees (like RDF, OWL, Object oriented models or rich/deep XML and JSON structures)?
Have those tools been used for openEHR-based data?
How easily can “normal” BI-people, more used to tables, learn/understand such tools?

We did in the paper IOS Press Ebooks - Querying Archetype-Based Electronic Health Records Using Hadoop and Dewey Encoding of openEHR Models of course turn openEHR trees/graphs into tables suitable for efficient relational algebra operations to retrieve data, but I doubt that the structure used there would make “classic” BI-tools and BI-experts happy since the purpose was very different.

JuhaMuinonen · 11 September 2024 06:07

Hello Erik,

Very good points - semantically rich & tree-shape models do have great advantages - my experience is about Epic systems with similar CDR. With Epic normalized flat extract is 18.000 to 40.000+ SQL items. It is real art & effort and usability to have the needed tools for all kind of Datalake & BI purposes. Real value with Epic system is that the vendor supports those tools having new versions every three months for their 100+ customers. I have big brainstorming interest toward solving this base need with openEHR implementations. Could there be on model for Global usage? In my eyes - it is much bigger challenge that have open interfaces like FHIR etc.

lars.lindskold · 11 September 2024 06:21

Thanks, Erik, for understanding why we need to think through things before we act.

You’ve raised a very interesting point about the challenges faced when trying to bridge the gap between semantically rich, directed-graph-like models such as openEHR and traditional table-based Business Intelligence (BI) tools. The Object-relational impedance mismatch is indeed a common problem across various domains, not just openEHR.

When it comes to BI tools that can handle such complex structures, there are a few areas where I think we can explore solutions:

Tools with native JSON or XML support ( Tableau and Power BI as commercial Products )
Graph databases with BI Integration. I Love Neo4J and try to understand how to use relations for better understanding; there are a lot of interesting KG solutions out there to test.
Protege for work with RDF and OWL, especially when we talk about semantic data.

With your experience, Erik, would it be feasible to add a layer of abstraction, such as middleware, to handle the transformation from openEHR’s graphs into a more manageable format for BI?
PS: Will read your paper:-)

bna · 11 September 2024 07:31

This is a very interesting topic. As some of you are familiar with we, DIPS, have developed an openEHR ETL service. The purpose is to use AQL to extract hierarchical clinical openEHR data Into more BI friendly form.

I am currently working on a definition for an oncology application (head and neck cancer). Its fun and meaningful to see data tranformed into reports and insight.

matijap · 11 September 2024 07:57

Here at Better we also have an ETL solution in our offering to make it possible to extract openEHR data via AQL into a relational database. (Our vision is to include FHIR data as well, and to make extraction near-real-time, driven by asynchronous queues, but we’re not quite there yet.) That is in our experience the most appropriate format for BI specialists, and it is also possible to correlate clinical data to all other sorts of data there.

Olibou · 11 September 2024 08:29

The ideal product would be a middleware that automatically generates AQL in order to fetch openEHR data and make them ready for relational databases. This is actually how works most of ETL tools: 1) at data mapping time the ETL tool provides a GUI to import metadata and specify the extract 2) at execution time the SQL to fetch the data is automatically generated and executed. Ideally this kind of middleware (a kind of driver between openEHR and relational data if you want) could be used by traditional ETL tools - already in place in existing datawarehouses- and would make integration super easy.

CzarMich · 11 September 2024 08:56

It seems like the majority of us are following a similar approach, if not already. In my case, I construct and define my AQL based on specific requirements, as I believe it’s important to maintain control over the AQLs rather than relying on vendor-defined ones. I then push the results into a SQLite database that’s created on the fly (though it can also be a persistent database for other uses), and from there, I map the data to a predefined FHIR profile.

erik.sundvall · 11 September 2024 17:43

I’ll add some things that are obvious to openEHR veterans, but might need to be pointed out to traditional BI/datalake/lakehouse-experts discovering this discussion.

A challenge with some such tools may be that the query tools and other thigs may expect the JSON-schema or XML-schema to contain all important business/clinical identifiers. In openEHRs general “canonical” Reference Model (RM) based formats, the schema reflect the RM, rather than most of the clinical details (see openEHR - JSON Schemas (ITS-JSON) Component - development and openEHR - XML Schemas (ITS-XML) Component - development) .

Clinical details are instead in the higher modeling levels of Archetypes and Templates in the openEHR multi-model stack, see examples of models in https://ckm.openehr.org/

It might be tempting to instead build a datalake on openEHR’s “simplified” JSON formats (“flat” and/or “structured”, see Simplified Data Template (SDT)) that actually do contain readable paths containing clinically relevant path segment names from openEHR templates. But then you risk ending up with thousands of schema where several data points may be from the same archetype but will look different in different templates (e.g. different tree depths, diferent text like ”heart rate” instead of ”pulse” and thus often different simplified paths)

You can compare e.g. two formats of an example instance of an actual template we use at Karolinska, released in Release ChemoForm-MBA.v7.0.0-rc.1 · regionstockholm/CKM-mirror-via-modellbibliotek · GitHub

v7_rc1_some_yes-Better_Sandbox-response_1693500892877.flat.json (simplified template specific format - different for every template)
v7_rc1_some_yes-Better_Sandbox-response1693499899288.json (canonical format. Canonical format works fine and follows the same JSON schema no matter what template you base the data on.)

lars.lindskold · 12 September 2024 05:23

It’s great to hear from the openEHR veterans :-); I Agree that it might be a problem with the same, same, same, but just a little different :-). It could be interesting to arrange something with openEHR veterans , FHIR front soldiers, and perhaps some OMOPs supply soldiers, would be happy to arrange a think tank under the Umbrella of EFMI for that topic.

birger.haarbrandt · 12 September 2024 05:40

Also my two cents to the topic: for some time, I considered to use our Template to DTO class generator (openEHR_SDK/generator at develop · ehrbase/openEHR_SDK · GitHub) to derive POJOs which then could be fed into an ORM tool. However, this would lead to the situation described by @erik.sundvall that you would store identical data across different tables because of the templates. So this would require some more thoughts how it would be possible to annotate data concurrently with the archetype ID and the names which is problematic with columns in a normalized schema.

Another approach I explored as part of my academic research was automated translation from openEHR to tranSMART/i2b2 (Automated population of an i2b2 clinical data warehouse from an openEHR-based data repository - PubMed). This approach would tick some boxes regarding Eric’s request for a data structure that works well with hierarchical data, though lacks some of the rich semantics that openEHR provides. In the end, i2b2 uses a path-based EAV schema which works quite nicely with openEHR. You still would need to choose if you map your data to Archetype or Template structures. However, the i2b2 approach could be a good starting point for further enhancements on flexibility and RM representation.

So I think there is still room for innovation and improvement. Right now we recommend using openEHR as the single source of truth and build data marts in a next step so that you have an appropriate data stucture for analytics and you can combine with non-EHR data like admin data + reimbursement data.