# Data extraction from openEHR - a demanding and challenging task? **Category:** [Implementation](https://discourse.openehr.org/c/implem/39) **Created:** 2025-03-18 08:10 UTC **Views:** 618 **Replies:** 25 **URL:** https://discourse.openehr.org/t/data-extraction-from-openehr-a-demanding-and-challenging-task/6566 --- ## Post #1 by @varntzen Hi all, I've obtained a text from a data analyst, claiming the task of extracting data from openEHR format to serve in a data warehouse is extremely resource demanding and challenging. I'll post it below, with some alterations not to disclose its author, nor the hospital or vendor in question. If some of you out there with experience in analysis on openEHR data can provide examples of how to ease the burden or other comments to the text, I'll be most grateful. In nice language, please :-) > *The introduction of a structured EHR has the potential to address several challenges arising from the lack of structured data in primary EHR sources at (a named hospital). However, the current archetype-based construction of (a named EHR product) (openEHR) presents multiple challenges, including the inability to use this data directly for analysis. The implementation of openEHR in the system makes it impossible to query the data for analytic purposes. The current solution relies on consultancy services and software from (a named vendor) to query and replicate the data so that it is suitable for analysis purposes ensuring acceptable response times even for high-volume querying. 
This results in a sole dependency on consultancy services from the vendor to access the structured data entered by the hospital's employees, and necessitates a robust plan for maintaining structured views when there are changes to templates/forms or archetypes* > > *…* > > *The extraction (of data from openEHR format to normalized relational database structures) is expected to be significant and demanding, requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes.* Comments, anyone? Pinging @birger.haarbrandt @ian.mcnicoll @yampeku --- ## Post #2 by @SevKohler Sounds like bullshit, we work with this data in AI and clinical research etc. at HiGHmed. Define a cohort, identify EHRs -> export data. E.g. JSONs, or parse into CSV and let the researcher have it. Response times can take a while, this is true. What is he/she doing with the data? I even have an export script lying around, btw, if you need it. It's not a full-fledged data warehouse, that's true - missing e.g. the business intelligence layer. --- ## Post #3 by @yampeku Yeah, we have used openEHR as a data warehouse in projects like Infobanco. Workflows to obtain patient cohorts in other standards were designed to avoid exactly the points described. If you want to use queries for analytics purposes inside your openEHR system, it can be achieved if you prepare your data for that, i.e. by using abstraction archetypes as presented at the openEHR conference. As a simple example, if you compute and store a scale result that depends on the patient's past history whenever a trigger fires, it is trivial to query it later, but it could be hard to do that on demand for a million patients using inference.
That would be hard regardless of the underlying data model you use. --- ## Post #4 by @damoca [quote="varntzen, post:1, topic:6566"] In nice language, please :slight_smile: [/quote] [quote="SevKohler, post:2, topic:6566"] Sounds like bullshit [/quote] :joy: I imagine the study is confusing querying openEHR (AQL) with querying the native database of an openEHR CDR. I bet there is no commercial EHR system where you could query the underlying database without a consultancy service from the provider (if they allow it at all). When they say "*The current solution relies on consultancy services and software from (a named vendor) to query and replicate the data so that it is suitable for analysis purposes*" they probably refer to things like the module from Better that allows to extract data to a relational database. It is a very nice and useful module, and we used it in the Infobanco project to populate an OMOP database, as @yampeku said. But we will always have AQL as a standard, model-based query alternative, which is much more than many other systems can say. I also agree with @SevKohler: we are still far from having a native, optimized openEHR data warehouse - there is still work to be done for that, but that's a completely different topic. --- ## Post #5 by @varntzen Thanks guys. The solution in question is meant to extract data into a warehouse alongside non-openEHR data, to allow ad-hoc querying, but also to serve the objectives of defined studies as well as hospital-wide reports. The other sources feeding the warehouse come from a wide variety of applications. I think the concern is about the consequences of potential (sometimes rapid) changes in the data originating from openEHR templates, which may cause challenges for queries in the data warehouse. --- ## Post #6 by @SevKohler They will have to adapt the data to a common model, or use LLMs to extract knowledge.
Otherwise queries get messy; they should implement either openEHR end-to-end or an openEHR-to-data-warehouse mapping https://www.sciencedirect.com/science/article/pii/S1532046416300843 . You mean templates are rapidly changed in the environment? --- ## Post #7 by @borut.jures I have had the same concerns ever since I started working with openEHR, but I accepted the wisdom of experienced people that using a pure RDBMS is not possible for openEHR CDRs. I also agree it is not trivial at all. RDBMS and SQL are what people are used to. The hierarchical structure of openEHR is a great barrier to most of them (understandably). I was always curious whether there is a way to flatten the hierarchy of openEHR data to fit into regular RDBMS tables. This is why I’m trying to do it (what can I say - I get bored sometimes :man_shrugging:). I’m happy to share my findings with you or the data analyst, but please be aware that according to other responses it might not work. However, learning about Better's module “that allows to extract data to a relational database” gives me some hope. --- ## Post #8 by @bna I would recommend that the people you talk with, @varntzen, try out the openEHR ETL solution we have developed at DIPS. It takes annotated AQL queries, extracts openEHR data, and stores it in a relational database. It even collects metadata such as demographics to build a star data model for analytics. Yes, there is a learning curve to learn the openEHR reference model and archetypes. Yes, it is an even steeper learning curve to master clinical data. But you can go stepwise. And you can learn from experts around the world. Welcome to our strong community! --- ## Post #9 by @ian.mcnicoll Interesting..! There is definitely a challenge in making complex, contextual, tree-shaped clinical data suitable for downstream analytics systems that expect data in tabular formats. @Seref may have some thoughts here.
It is not restricted to openEHR - the whole basis of SQL-on-FHIR is about simplifying FHIR REST search into something close to SQL tabular output, and FHIR CDRs have a similar processing challenge. So ETL and replication out to a separate, probably SQL, datastore is pretty standard practice and IMO not at all problematic. Creating the AQLs to populate that datastore is a bit of a specialist role, as it requires pretty deep knowledge of how the data has been designed so that the correct queries can be run to satisfy the analytics need, but as has been said already, this is possibly even more of a barrier in a complex, native SQL system. I would always encourage the health provider team to get familiar with AQL and the data design, but this is often not practical due to lack of training or resources. The kind of tools that the various CDR vendors provide to support ETL are very nice, but it is not rocket science to do a fair bit of this with standard AQL and normal integration tooling. The challenge of change with archetypes and templates is just the challenge of change, which I would expect to be even more difficult to manage with traditional methods. --- ## Post #10 by @birger.haarbrandt a) I would never do analytics queries on any live system. You should always use a copy. b) If you copy the data, you can make sure that it is in an optimized format for your purpose. c) While it is a bit tricky, you can come up with a model that maps compositions to relations. Of course you will lose your ability to use AQL, but that's a trade-off. d) In fact, I have had good experiences AUTOMATICALLY populating all kinds of analytics tools, including tranSMART, i2b2 and triple stores, in the past. e) We also have a tool, "Cohort Explorer", to select patient populations and export the data to CSV. f) There is nice tooling to include AQL data in "R". And yes, for some queries performance might not be optimal directly on the EHR, but for others it's fine.
Especially if we compare to any proprietary database (and I have been there with Hannover Medical School and HiGHmed), data extraction is WAY easier. --- ## Post #11 by @Seref [quote="varntzen, post:1, topic:6566"] If some of you out there with experience in analysis on openEHR data can provide examples of how to ease the burden or other comments to the text, I’ll be most grateful. In nice language, please [/quote] Sigh. Colleagues and members of the community have already given good answers and pretty much explained why secondary use of openEHR data (and FHIR data, for that matter) is bound by constraints that also apply to every other information system. I wish I had better news. I think there is a bigger issue here: managing expectations. This person is clearly frustrated because ... water is wet. What I would like to know is why they think it would not be. I would appreciate it if someone could ask them the following: What were your expectations of openEHR re secondary use? Do you have a reference approach or solution against which openEHR pales? Do you have an ideal solution that you could describe at any level of detail? I am asking these questions with no intention of hostility. I would love to know what someone was expecting if they were so unhappy with a solution that, based on what I'm reading, sounds more or less like the state of the art: not only for openEHR, but for any data analytics solution fed from above-average-complexity data, let alone healthcare data, which is like Cthulhu compared to a Chihuahua when set against retail or finance. The answers, if we could get them, would help those of us who design, implement and offer these solutions improve our communications. There may be a need for it.
--- ## Post #12 by @ian.mcnicoll [quote="Seref, post:11, topic:6566"] healthcare data, which is like Cthulhu compared to a Chihuahua [/quote] Love it 👏 --- ## Post #13 by @vanessap Well, I saw this post, and just reading the concerns I feel this is more a case of "you got a bit tangled in lock-in *strategies* from your *vendor*, and you are quite disappointed because you didn't notice it before and now it's too late", and "did you get enough openEHR training on how to deal with the data afterwards, including training on tools provided by your vendor, in a timely manner?" > "This results in a sole dependency on consultancy services from the vendor to access the structured data entered by the hospital's employees, and necessitates a robust plan for maintaining structured views when there are changes to templates/forms or archetypes" This sounds to me like the "low-code vision" in the long term - rapid development, then time wasted afterwards in endless support, because these solutions are usually not generic enough for every use case; you need to know where they fit and what the consequences are. The case described should not happen in a proper openEHR environment - these things should be much easier to fix. But it all depends on what is being used in the end. > "The extraction (of data from openEHR format to normalized relational database structures) is expected to be significant and demanding, requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes." I really don't understand this as a concern - are they expecting to extract a typical CSV file with columns of variable names and their values, and that's it? What about the relationships between those? There is always some work to be done in between anyway, openEHR or not. What do they usually use at their end? REDCap? innovaclinic? Some other EDC app that stores this data however it feels like? And then put all this together in a "data warehouse"?
I always like to understand these cases, since I work in research and sometimes I can't tell whether the concerns are: - I am used to working this way for the last 20 years and I don't want to change it, or - this is new, I don't understand it or have the right background for it, nor will I take the time to learn it, so I will be 100% against it without a clue on how this can improve the data I will be working on. Those are the cases I am seeing the most. Better does indeed have an ETL module to extract openEHR data to SQL databases, but that is the reality of all the ETL engines we build in between - they take time and are not that simple. Perhaps for users who are not so tech-savvy we need to provide some more user-friendly tools. But we need to be aware of those needs. > requiring sufficient allocation of resources, both technical resources and expertise in openEHR/archetypes I wouldn't put anyone at the controls of a plane without a licence and sufficient training either. The whole sentence looks weird and not reasoned. Maybe there are more sensible reasons, but in the last 3 years working in the research area I haven't seen a sensible position behind a post like this. If someone comes and says "this is the reason why openEHR doesn't work for me: because of X, Y, Z, and I have a factual and proven reason for it", then we can understand the case and work together. It would be great to understand the context that data analyst is working in. As Seref said, it looks like something was promised as being the last cookie in the box and didn't end up being the tastiest. --- ## Post #14 by @varntzen [quote="SevKohler, post:6, topic:6566"] You mean templates are rapidly changed in the environment [/quote] Yep. As requirements in the EHR tend to change, especially in the first months after go-live. Also, the agile approach to extending functionality in the EHR creates a "less stable" environment for the data scientists. How bad, right?
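[Editor's note: to make the template-churn concern concrete, here is a minimal, hypothetical Python sketch. The function names and the composition shape are illustrative only - this is not any vendor's API. The idea, as discussed above, is that because a CDR persists the `template_id` on each composition, downstream ETL queries can be pinned to a specific template version so that a changed template does not silently alter a warehouse view.]

```python
# Hypothetical sketch: pinning ETL queries to a template version.
# Assumes compositions carry their template_id (standard openEHR behaviour);
# the AQL paths and dict shapes below are illustrative.

def aql_for_template(template_id: str) -> str:
    """Build an AQL query restricted to one template, so a renamed or
    extended successor template does not silently change the resultset."""
    return (
        "SELECT c/uid/value, c/context/start_time/value "
        "FROM EHR e CONTAINS COMPOSITION c "
        f"WHERE c/archetype_details/template_id/value = '{template_id}'"
    )

def split_by_template(compositions):
    """Group already-fetched compositions by their persisted template_id,
    so each warehouse view can be maintained per template version."""
    groups = {}
    for comp in compositions:
        tid = comp["archetype_details"]["template_id"]["value"]
        groups.setdefault(tid, []).append(comp)
    return groups
```

When a template changes, the old view keeps querying the old `template_id` while a new view is built for the successor - one way to decouple warehouse maintenance from EHR release cycles.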
--- ## Post #15 by @SevKohler Are you versioning your templates somehow (and persisting that information in the compositions)? That would help, at least. --- ## Post #16 by @varntzen Thank you all for your useful input, which was exactly what I hoped for from this wonderful community. :smile: I'm not surprised by the content of your replies (otherwise I would give myself a grade D- in openEHR and forever be shamed). I understand the text I started out with is similar to something you've heard yourselves, so the pushback from some consumers of openEHR data is real. I get that it's not uncommon. There are misunderstandings and wrong assumptions leading the authors of the text to write it as it is. Please do not interpret the described situation as the whole story. What to do with it? Dunno. Maybe just demonstrate that it will work, and persuade the non-believers one at a time. As for Vanessa's questions, quoted below, I think I'll just reply; they are reasonable questions. [quote="vanessap, post:13, topic:6566"] * I am used to working this way for the last 20 years and I don’t want to change it, or * this is new, I don’t understand it or have the right background for it, nor will I take the time to learn it, so I will be 100% against it without a clue on how this can improve the data I will be working on. [/quote] Again, thanks a lot! --- ## Post #17 by @Olibou Hi Birger, totally agree with you. We would like to automate the extraction from openEHR to an analytics platform. I see you seem to be ahead of us. Do you have references on c), and on how you automated the extraction in d)? Actually, references on e) and f) are also welcome :-) --- ## Post #18 by @SevKohler https://github.com/SevKohler/EHRsuction Took the time to clean and upload. --- ## Post #19 by @NY_Frank Hi all, and thank you - grateful to be part of this community.
I’m currently tasked with structuring ~20 years of clinical data for a solo neurology practice that has operated entirely in a flat-file system (well-organized folders, scanned notes, structured PDFs, neuro assessments, etc.). I’m evaluating how best to approach this - whether to lean into the full openEHR stack now (with .opt templates, COMPOSITIONs, AQL, etc.) or begin by implementing an openEHR-*inspired* model first: a normalized relational schema (Subject, Encounter, Assessment, Diagnosis) using DV_CODED_TEXT as lookup tables. This thread has been very insightful. Our **primary goal at this stage is analysis**, not clinical workflow or system interoperability. We want to: * Structure the data into queryable form * Enable patient journey views and cohort analysis * Possibly support a future concierge brain health offering, or explore commercialization paths for the data The option I’m leaning toward (for now) is to start with a relational model - lightweight, queryable, and inspired by openEHR’s structural discipline - while deferring the complexity of full COMPOSITION serialization and AQL authoring. Later, if needed, I could layer on .opt mappings or create a one-way export to openEHR or FHIR resources. **Curious to hear any thoughts or cautions from the community:** * Have others taken a similar path and transitioned back into openEHR once value was proven? * Are there lightweight tooling options for creating .opt templates from existing tabular data? * Any pitfalls I might face in retrofitting openEHR onto an initially RDBMS-based model? Thanks again - and huge appreciation for what this community has already contributed to global health data thinking. --- ## Post #20 by @borut.jures @NY_Frank I’m sure the prevailing advice is NOT to use relational structures for EHRs. As I mentioned above, I was bored and wanted to gain a deeper understanding of why not :blush: It turns out that it is possible to use plain SQL (and all the standard SQL tools) for openEHR.
I’m using a widely used “Rapid Web App Dev Platform for Java Developers” to let **non-openEHR developers** work with the **low-code UI studio** they are accustomed to, and get ACL, DB migrations, an integrated BPM engine and designer, search, reports, audit, … And the framework is open source too :wink: Here is an excerpt of a body temperature archetype: ![oee_obs_body_temperature_v2|690x497](upload://aY1igvR37kTYFNCqmDX3tGBRonx.jpeg) The Studio uses EclipseLink (JPA) and an Entity Designer for quick data model design. All openEHR data types (like DV_CODED_TEXT) have their own types that are embeddable into other entities: ![DvCodedText|582x500](upload://cX7qNpxq1dw0Xh0IqnuH0jz7Yn1.jpeg) This approach makes openEHR just another “domain” that is solved using existing tools and frameworks. It remains a difficult domain, but we get the benefit of using the tooling that developers are used to and that operations specialists can optimize for performance. Data is queryable with standard SQL. Please let me know if you are interested in further details, so that we don’t hijack this thread. --- ## Post #21 by @pablo [quote="NY_Frank, post:19, topic:6566"] Are there lightweight tooling options for creating .opt templates from existing tabular data? [/quote] This is actually the most time-consuming task, but it’s totally independent of the source of the data; it could be a plain text file, CSV, SQL, whatever - even non-tabular. The challenge is to annotate the source data with enough information that it can actually be mapped to an openEHR reference model hierarchy. With that part done, it’s very easy to generate an OPT. So the challenges are: 1. Design the right metadata - that is, to consider the openEHR entry ontology and the data value model. 2. Actually doing the source annotation. With those two steps done, generating the OPT is trivial. [quote="NY_Frank, post:19, topic:6566"] Any pitfalls I might face in retrofitting openEHR onto an initially RDBMS-based model?
[/quote] Our openEHR CDR implementation (https://atomik.app/) is 100% relational, so there’s no problem there. It will always depend on how you design your relational database (you need enough semantics to support your use case and also to extract openEHR-valid stuff from it). You can get creative and use a mixed approach too, like relational+files or relational+document (note that this could be in the same DBMS or a combination of two different things). --- ## Post #22 by @borut.jures openEHR archetypes use two-level modeling. All(?) CDR implementations store data at the “RM” level. The 100% relational approach would be to store the data at the “domain” level. There are research papers on using object-relational mapping to achieve this, but I’m not aware of any implementation using ORM to store openEHR data. What I want to test is transforming RM data to primitive SQL tables and storing openEHR data using traditional RDBMS approaches. Additionally, all relationships between RM objects use native foreign keys. This way clinical modelers use two-level modeling to model archetypes, which are then stored as “low-level” RDBMS tables (I don’t have a good name for this “traditional” way of storing data in relational tables). This means there is a “blood pressure” table in the DB. This has been controversial (but requested by the followers of Domain-Driven Design). I believe @NY_Frank has a similar approach in mind with “an initially RDBMS-based model”. I want to implement a CDR using both approaches and then compare their performance. It is just an interesting project to avoid boredom :blush: --- ## Post #23 by @pablo Well, Atomik uses 100% relational with ORM. It's based on the EHRServer approach and tech stack, though it has been improved, extended and optimized in many areas.
I think you checked EHRServer already, which was the first approach to an open-source openEHR CDR back in 2012-2013, and was derived from my thesis work from a couple of years earlier, called EHRGen, which worked with archetypes instead of templates. In that one I merged the CDR and the data entry together, so it was an app with openEHR storage, with all forms autogenerated on the fly. I think that's also on my GitHub if you want to check it. EHRGen is also on the same tech stack and uses ORM. In fact I've been using that tech stack with ORM since 2007. --- ## Post #24 by @Seref [quote="NY_Frank, post:19, topic:6566"] while deferring the complexity of full COMPOSITION serialization and AQL authoring [/quote] It’s OK to defer these, but you’ll almost certainly find it too challenging a task to bolt them onto your relational implementation later. That’s OK too. I’ve long advocated benefiting from openEHR at whatever level works for you. If you look at the models in the international CKM and draw something on a napkin inspired by those, that’s a win in my book. That being said, analysis on openEHR data is a different beast. I’ll be talking about this at the upcoming Barcelona event: we’ve been running an analytics server for various clients for years now. You can represent openEHR data in a relational format, sure, but scaling that in different dimensions (usability, performance, etc.) is not a trivial task. You can certainly drop some of the OLTP-related design concerns in your design, but I don’t want to say that it’ll be a walk in the park when you mention patient journey views etc. The more you optimise for analytics, the further you’ll have to move from designs that can support AQL and composition persistence/retrieval in a performant way. Others may disagree, but that’s my 2 pennies.
So my feedback is: you can build something for analytics inspired by openEHR, but I’d say you’ll find it difficult to grow that into a more single-EHR, care-centric CDR later. --- ## Post #25 by @ian.mcnicoll Great advice from @Seref. I’m a ‘clinical hacktitioner’, not a developer, and I can see the attraction of using openEHR archetypes purely to guide the clinical content but building in a more traditional RDBMS fashion, particularly if a key output needs to play nicely in the SQL world for reporting purposes. I can also understand how diving straight into ‘full openEHR’ can seem very daunting, and indeed, if you try to build your own CDR, it is a significant challenge to do in a performant way. However, there is a third way, which is to use an existing openEHR CDR - and I guess in your situation an open-source example such as Pablo’s Atomik, or EHRbase. The major advantage is that you can then get into the really tricky areas of designing your data content via archetypes/templates, with real-time deployment and all the advantages of full versioning, queryability via AQL, etc. --- ## Post #26 by @ian.mcnicoll And even if you decide to revert to an RDBMS or even a DIY CDR, you will have a much better idea of what the openEHR CDR ecosystem gives you for free, and how easy or otherwise it is to export AQL resultsets to more SQL-flavoured outputs. And definitely, as Seref said, building for reporting/analytics and then back-building to support direct care is IMO almost impossible; going from direct care to reporting is doable. We have been working with a multi-national specialist renal medicine provider who are using openEHR to normalise outputs from existing local legacy EPRs into a normalised data platform, primarily for analytics purposes but with a view to supporting direct care, with the archetypes and templates shaped accordingly. So not too far from your use case.
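[Editor's note: as an illustration of the export path discussed throughout the thread - AQL resultsets replicated into a plain SQL store - here is a minimal sketch. It assumes the tabular AQL resultset has already been fetched from a CDR (e.g. via its REST API); the table name, columns, and rows below are made up for illustration.]

```python
# Minimal sketch: loading a tabular AQL resultset into a flat SQL table,
# the classic openEHR-to-warehouse replication step. Columns/rows are
# illustrative; in practice they come from the CDR's query endpoint.
import sqlite3

def load_resultset(conn, table, columns, rows):
    """Create (if needed) a flat table matching the AQL SELECT columns
    and bulk-insert the rows using parameterized statements."""
    cols = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()

# Pretend resultset of a blood-pressure AQL query (made-up data).
columns = ["ehr_id", "systolic_mmHg", "diastolic_mmHg", "recorded_at"]
rows = [
    ("e1", "128", "82", "2024-05-01T09:30:00"),
    ("e2", "141", "91", "2024-05-01T10:05:00"),
]
conn = sqlite3.connect(":memory:")
load_resultset(conn, "blood_pressure", columns, rows)
```

Once the data sits in a table like this, standard SQL/BI tooling takes over; the openEHR-specific work is confined to authoring the AQL and scheduling the replication.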
--- **Canonical:** https://discourse.openehr.org/t/data-extraction-from-openehr-a-demanding-and-challenging-task/6566