Data extraction from openEHR - a demanding and challenging task?

Hi all,

I’ve received a text from a data analyst claiming that extracting data from the openEHR format to serve a data warehouse is an extremely resource-demanding and challenging task. I’ll post it below, with some alterations so as not to disclose its author, nor the hospital or vendor in question.

If some of you out there with experience in analysis on openEHR data can provide examples of how to ease the burden or other comments to the text, I’ll be most grateful. In nice language, please :slight_smile:

The introduction of a structured EHR has the potential to address several challenges arising from the lack of structured data in primary EHR sources at (a named hospital). However, the current archetype-based construction of (a named EHR product) (openEHR) presents multiple challenges, including the inability to use this data directly for analysis. The implementation of openEHR in the system makes it impossible to query the data for analytic purposes. The current solution relies on consultancy services and software from (a named vendor) to query and replicate the data so that it is suitable for analysis purposes, ensuring acceptable response times even for high-volume querying. This results in a sole dependency on consultancy services from the vendor to access the structured data entered by the hospital’s employees, and necessitates a robust plan for maintaining structured views when there are changes to templates/forms or archetypes.

The extraction effort (moving data from openEHR format into normalized relational database structures) is expected to be significant and demanding, requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes.

Comments, anyone? Pinging @birger.haarbrandt @ian.mcnicoll @yampeku

2 Likes

Sounds like bullshit to me. We work with this data in AI, clinical research etc. at HiGHmed.
Define a cohort, identify EHRs → export data.
E.g. export JSONs, or parse into CSV and let the researcher have it.
Response times can take a while, that is true.
What is he/she doing with the data?

I even have an export script lying around, btw, if you need it.

It’s not a full-fledged data warehouse, that’s true; it is missing e.g. the business intelligence layer.
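To make the cohort → export flow concrete, here is a minimal sketch. To be clear: this is not the script mentioned above, and the AQL, archetype ids and field names are made up for illustration; in a real run the rows would come back from POSTing the AQL to the CDR’s query endpoint.

```python
import csv
import io

# Hypothetical AQL pulling systolic blood pressure per EHR for a cohort.
# The archetype id and paths are illustrative only.
COHORT_AQL = """
SELECT e/ehr_id/value AS ehr_id,
       o/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude AS systolic
FROM EHR e CONTAINS OBSERVATION o [openEHR-EHR-OBSERVATION.blood_pressure.v2]
"""

def resultset_to_csv(columns, rows):
    """Flatten an AQL result set (column names plus row lists) into CSV
    text a researcher can load into pandas, R or Excel."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()

# Hard-coded rows standing in for the CDR's query response.
csv_text = resultset_to_csv(["ehr_id", "systolic"],
                            [["e1", 142], ["e2", 118]])
```

That is usually all a researcher needs: one flat file per query, handed over as-is.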

4 Likes

Yeah, we have used openEHR as a data warehouse in projects like Infobanco. Workflows to obtain patient cohorts in other standards were designed to avoid exactly the points described.

If you want to run analytics queries inside your openEHR system, it can be achieved if you prepare your data for that, e.g. by using abstraction archetypes as presented at the openEHR conference. As a simple example, if you compute and store a scale result for your patient that depends on past history, via a trigger, it is trivial to query it later; but it could be hard to do that on demand for a million patients using inference. That would be hard regardless of the underlying data model you use.
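A toy illustration of that trade-off, in plain Python with nothing openEHR-specific about it (the "risk score" and record shapes are invented): compute the derived result once at write time and store it alongside the data, and the later query becomes a trivial filter instead of per-patient inference.

```python
def risk_score(history):
    """Stand-in for inference that is expensive at scale:
    here just the mean of a patient's past values."""
    return sum(history) / len(history) if history else 0.0

ehrs = {
    "p1": {"history": [3, 5, 7]},
    "p2": {"history": [10, 10]},
}

# Write time: a trigger-like step stores the derived result as data.
for record in ehrs.values():
    record["stored_score"] = risk_score(record["history"])

# Query time: selecting patients by score is now a simple lookup,
# which is exactly what makes it cheap to express as a query.
high_risk = [pid for pid, r in ehrs.items() if r["stored_score"] > 6]
```

The same query run without the stored score would have to re-run the inference for every patient, every time.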

4 Likes

:joy:

I imagine the study is confusing querying openEHR (via AQL) with querying the native database of an openEHR CDR. I bet there is no commercial EHR system where you could query the underlying database without consultancy services from the provider (if they allow it at all).

When they say “The current solution relies on consultancy services and software from (a named vendor) to query and replicate the data so that it is suitable for analysis purposes”, they probably refer to things like the module from Better that allows extracting data to a relational database. It is a very nice and useful module, and we used it in the Infobanco project to populate an OMOP database, as @yampeku said. But we will always have AQL as a standard, model-based query alternative, which is much more than many other systems can say.

I also agree with @SevKohler: we are still far from having a native, optimized openEHR data warehouse, and there is still work to be done for that, but that’s a completely different topic.

3 Likes

Thanks guys.
The solution in question is meant to extract data into a warehouse alongside non-openEHR data, to enable ad-hoc querying, but also to serve the objectives of defined studies as well as hospital-wide reports. The other sources for the warehouse come from a wide variety of applications.

I think the concern is about the consequences of potential (sometimes rapid) changes in the data originating from openEHR templates, which may cause challenges for queries in the data warehouse.

They will have to adapt the data to a common model, or use an LLM to extract knowledge.
Otherwise queries get messy; they should implement either openEHR or an openEHR-to-data-warehouse mapping: https://www.sciencedirect.com/science/article/pii/S1532046416300843

Do you mean templates are rapidly changed in that environment?

I have had the same concerns ever since I started working with openEHR, but I accepted the wisdom of experienced people that using a pure RDBMS is not feasible for openEHR CDRs. I also agree it is not trivial at all.

RDBMS and SQL are what people are used to. The hierarchical structure of openEHR is a great barrier to most of them (understandably).

I was always curious whether there is a way to flatten the hierarchy of openEHR data to fit into regular RDBMS tables. This is why I’m trying to do it (what can I say, I get bored sometimes :man_shrugging:).
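For what it’s worth, the naive version of that flattening is straightforward: walk the composition tree and emit one (path, value) row per leaf, which maps directly onto a narrow two-column SQL table. A sketch (the composition dict below is a made-up stand-in, not a real reference-model structure):

```python
def flatten(node, path=""):
    """Recursively flatten a nested composition-like structure into
    (path, value) rows suitable for a two-column relational table."""
    rows = []
    if isinstance(node, dict):
        for key, value in node.items():
            rows.extend(flatten(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            rows.extend(flatten(item, f"{path}[{i}]"))
    else:
        rows.append((path, node))
    return rows

composition = {
    "blood_pressure": {"systolic": 120, "diastolic": 80},
    "notes": ["stable"],
}
rows = flatten(composition)
```

The hard part is not this traversal but deciding which paths become proper typed columns; the path/value form is easy to produce but awkward to join on.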

I’m happy to share my findings with you or the data analyst, but please be aware that, according to the other responses, it might not work. However, learning about Better’s module for extracting data to a relational database gives me some hope.

2 Likes

I would recommend that the people you talk with, @varntzen, try out the openEHR ETL solution we have developed at DIPS.

It takes annotated AQL queries, extracts openEHR data and stores it in a relational database. It even collects metadata such as demographics to build a star data model for analytics.
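Not the DIPS tool itself, obviously, and the table and column names below are invented, but the general pattern of landing AQL result rows in a small star schema can be sketched with Python’s stdlib sqlite3:

```python
import sqlite3

# A tiny star-ish schema: one fact table keyed to a demographics
# dimension. In a real ETL the tables and columns would be derived
# from the annotated AQL projections.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_patient (ehr_id TEXT PRIMARY KEY, sex TEXT)")
conn.execute("CREATE TABLE fact_bp (ehr_id TEXT, systolic INTEGER)")

# Pretend these rows came back from two AQL queries.
conn.executemany("INSERT INTO dim_patient VALUES (?, ?)",
                 [("e1", "F"), ("e2", "M")])
conn.executemany("INSERT INTO fact_bp VALUES (?, ?)",
                 [("e1", 150), ("e2", 118)])

# The analytics side is then plain SQL over the star model.
avg_by_sex = conn.execute(
    "SELECT p.sex, AVG(f.systolic) FROM fact_bp f "
    "JOIN dim_patient p ON p.ehr_id = f.ehr_id "
    "GROUP BY p.sex ORDER BY p.sex"
).fetchall()
```

Once the data is landed this way, the analysts never need to touch AQL or the composition structure at all.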

Yes, there is a learning curve to learn the openEHR reference model and archetypes. Yes, mastering clinical data is an even steeper learning curve. But you can go stepwise, and you can learn from experts around the world.

Welcome to our strong community!

2 Likes

Interesting…!

There is definitely a challenge in making complex, contextual tree-shaped clinical data suitable for downstream analytics systems that expect data in tabular formats. @Seref may have some thoughts here.

It is not restricted to openEHR: the whole basis of SQL-on-FHIR is simplifying FHIR REST search into something close to SQL tabular output, and FHIR CDRs have a similar processing challenge.

So ETL and replication out to a separate, probably SQL datastore is pretty standard practice and IMO not at all problematic.

Creating the AQLs to populate that datastore is a bit of a specialist role, as it requires pretty deep knowledge of how the data has been designed, so that the correct queries can be run to satisfy the analytics need. But, as has been said already, this is possibly even more of a barrier in a complex, native SQL system. I would always encourage the health provider team to get familiar with AQL and the data design, but this is often not practical due to lack of training or resources.

The kinds of tools that the various CDR vendors provide to support ETL are very nice, but it is not rocket science to do a fair bit of this with standard AQL and normal integration tooling.

The challenge of change with archetypes and templates is just the general challenge of change, which I would expect to be even more difficult to manage with traditional methods.

2 Likes

a) I would never run analytics queries on any live system. You should always use a copy.
b) If you copy the data, you can make sure it is in a format optimized for your purpose.
c) While it is a bit tricky, you can come up with a model that maps compositions to relations. Of course you will lose the ability to use AQL, but that’s a trade-off.
d) In fact, I have had good experiences AUTOMATICALLY populating all kinds of analytics tools, including tranSMART, i2b2 and triple stores, in the past.
e) We also have a tool, “Cohort Explorer”, to select patient populations and export the data to CSV.
f) There is nice tooling to include AQL data in R.

And yes, for some queries performance might not be optimal directly on the EHR, but for others it’s fine. And compared to any proprietary database (and I have been there, with Hannover Medical School and HiGHmed), data extraction is WAY easier.

3 Likes

Sigh. Colleagues and members of the community have already given good answers and pretty much explained why secondary use of openEHR data (and FHIR data, for that matter) is bound by constraints that also apply to every other information system. I wish I had better news.

I think there is a bigger issue here: managing expectations. This person is clearly frustrated because … water is wet. What I would like to know is why they thought it would not be.

I would appreciate it if someone could ask them the following:
What were your expectations of openEHR regarding secondary use? Do you have a reference approach or solution against which openEHR pales? Do you have an ideal solution that you could describe, at any level of detail?

I am asking these questions with no hostile intent. I would love to know what someone was expecting if they were so unhappy with a solution that, based on what I’m reading, sounds more or less like the state of the art: not only for openEHR, but for any data analytics solution fed from above-average-complexity data, let alone healthcare data, which is like Cthulhu next to a Chihuahua when compared with retail or finance.

The answers, if we could get them, would help those of us who design, implement and offer these solutions improve our communications. There may be a need for it.

5 Likes

Love it :clap:

4 Likes

Well, I saw this post, and just reading the concerns I feel this is more a case of “you got a bit tangled in lock-in strategies from your vendor, and you are quite disappointed because you didn’t notice it before and now it’s too late”, and “did you get enough openEHR training on how to deal with the data afterwards, including training on the tools provided by your vendor, in a timely manner?”

“This results in a sole dependency on consultancy services from the vendor to access the structured data entered by the hospitals employees, and necessitates a robust plan for maintaining structured views when there are changes to templates/forms or archetypes”

This sounds to me like the long-term outcome of the “low-code vision”: rapid development, then endless time wasted on support, because these solutions are usually not generic enough for every use case; you need to know where they fit and what the consequences are. The case described should not happen in a proper openEHR environment; such issues should be much easier to fix there. But it all depends on what is being used in the end.

“The extraction (of data from openEHR format to normalized relational database structures) is expected to be significant and demanding, requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes.”

I really don’t understand this as a concern. Are they expecting to extract a typical CSV file with columns of variable names and their values, and that’s it? What about the relationships between those? There is always some work to be done in between anyway, whether you use openEHR or not. What do they usually use on their end? REDCap? InnovaClinic? Some other EDC app that stores this data however it feels like? And then put all of that together in a “data warehouse”? I always like to understand these cases, since I work in research, and sometimes I can’t tell whether the concern is:

  • I am used to working this way, and have been for the last 20 years, so I don’t want to change it; or
  • this is new, I don’t understand it or have the right background for it, nor will I take the time to learn it, so I will be 100% against it without a clue about how it could improve the data I will be working with.

Those are the cases I see most often.

Better does indeed have an ETL module to extract openEHR data to SQL databases, but that is the reality of all the ETL engines we build in between: they take time and are not that simple. Perhaps for users who are not so tech-savvy we need to provide more user-friendly tools. But first we need to be aware of those needs.

requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes

I wouldn’t put anyone at the controls of a plane without a licence and sufficient training either. The whole sentence looks odd and poorly reasoned.

Maybe there are more sensible reasons, but in the last three years working in the research area I haven’t seen a sensible position behind a post like this. If someone comes and says “this is why openEHR doesn’t work for me, because X, Y and Z, and I have a factual, proven reason for it”, then we can understand the case and work together.

It would be great to understand the context that data analyst is working in.
As Seref said, it looks like something was promised as the last cookie in the box and didn’t end up being the tastiest.

4 Likes

Yep. Requirements in the EHR tend to change, especially in the first months after go-live. Also, the agile approach to extending functionality in the EHR creates a “less stable” environment for the data scientists. How bad, right?

2 Likes

Are you versioning your templates somehow (and persisting that information in the compositions)? That would help, at least.
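As a toy illustration (the template ids and field names are invented, not actual reference-model paths), an ETL job can branch on the template id persisted with each composition, so a form change adds a mapper instead of breaking the pipeline:

```python
# Each composition carries the id of the template it was created from;
# the ETL selects a mapping per template version.
compositions = [
    {"template_id": "vitals.v1", "data": {"bp_sys": 120}},
    {"template_id": "vitals.v2", "data": {"systolic": 130}},
]

# One small mapper per known template version.
MAPPERS = {
    "vitals.v1": lambda d: d["bp_sys"],
    "vitals.v2": lambda d: d["systolic"],
}

systolics = [MAPPERS[c["template_id"]](c["data"]) for c in compositions]
```

When a template changes, only a new entry in the mapper table is needed; the already-loaded warehouse rows stay valid.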

2 Likes

Thank you all for your useful input, which was exactly what I hoped for from this wonderful community. :smile:

I’m not surprised by the content of your replies (otherwise I would give myself a grade D- in openEHR and forever be shamed).

I understand the text I started out with is similar to things you’ve heard yourselves, so the pushback from some consumers of openEHR data is real; I gather it’s not uncommon. There are misunderstandings and wrong assumptions that led the authors of the text to write it as they did. Please do not interpret the described situation as the whole truth.

What to do about it? Dunno. Maybe just demonstrate that it works, and persuade the non-believers one at a time. As for Vanessa’s questions, quoted below, I think I’ll just reply; they are reasonable questions.

Again, thanks a lot!

3 Likes

Hi Birger, I totally agree with you. We would like to automate the extraction from openEHR to an analytics platform. I see you seem to be ahead of us. Do you have references on c), and on how you automated the extraction in d)? References on e) and f) are also welcome :slight_smile:

Took the time to clean it up and upload it.

4 Likes