openEHR archetypes as SQL Tables

The output artifact will be something that you can verify and understand clearly. The GUI tools we have today are meant primarily for authoring; tools with AI would be primarily for reviewing. But honestly, we could make this possible with something simpler - an abstraction layer on top that makes it easy to build GUIs and has all the bells and whistles you’d expect of a modeling tool. Only the internal output would be slightly different (which, again, you don’t really care about).

It would probably be an uphill battle, for sure. But IF we were to solve for this, this makes everything downstream so much cheaper and simpler - implementation of CDRs, building and maintaining apps, integration with existing tooling.

This is a very good point. But I’d assume the future models created by the modellers would also need to tackle how old data gets migrated - for example, in your tooling, you might draw connections from a previous version and map how the old data would be transformed when the new version of an archetype is applied. An uphill battle for the clinical modelers and tooling, yes - but technically simple, since it’s already been solved.

Which, again, absolutely - there should be no compromises on safety. It’s just a different way of thinking about the problem. Instead of the clinical modelers standing exactly where they are and asking the rest of the community for better tooling, the clinical modelers might have to bend a little toward new practices (while still keeping everything safe, of course).

I don’t think this is going to make that much of a difference. One could estimate that FHIR has had 100x more funding than openEHR, and it’s still fundamentally a problematic system to work with. openEHR would still have its fundamental challenges - maybe things would be more performant, and tools could actually help build things faster.

But we’ll still be far behind the rest of the industry in the tooling and technology available on PostgreSQL, because those tools would have received 10000x more funding. It’s always going to be a moving target. The software engineers of the future would still look at an openEHR with 100x more funding and still feel it’s not as good as the other technologies they’re used to.

I understood that Ian was talking about the prototyping phase with no real data. Just quick iteration when a template change is made.

In production, the schema change can be completed “quickly” using the standard tools that SQL DBAs are using in other large databases. No such tools or experienced “CDR administrators” exist for the existing CDRs.

And to be fair, how long do existing CDRs take on the same 200 GB? I often read that they don’t even try to optimize what happens when a new template is uploaded. This means there is no “schema” update of the existing data - but let’s not forget that every AQL query will then return incomplete data (often unknowingly to the users, which might “kill” patients).

Sidharth has experience with both approaches:

Existing CDRs also want to avoid any changes to the RM that would break AQL queries. This is not sustainable and not what openEHR promises. Are we going to stay at RM v1.1.0 indefinitely? Combine that with some CDR vendors declaring that they don’t have plans to move to ADL2. I would be careful as a customer.

Focusing on PostgreSQL would definitely make sense and speed-up the progress. However national procurements would grind to a halt with endless objections from other DB vendors :wink:

3 Likes

This seems to be a key objection you have about openEHR - APIs being ‘bespoke’. The content of any API is always ‘bespoke’, but I would say that the general concept of an API as the connection point between an application (or some other component) and the system is the simplest, most expected thing in the world. There are dozens of frameworks for talking to APIs, as well as creating them. I don’t understand why you consider APIs (in general) ‘bespoke’ but not a bespoke DB schema.

The API shields you from how the persistence is implemented. If you are in the business of building a back-end yourself, then you want to put a standard API on top of it, so that app devs can talk to your system and not care about your DB approach.

Your DB approach could be the one you propose, but it still needs the API and a service on top. There is potentially a lot of logic in the service, e.g. data validation, prioritisation, threading, anything really, which is not going to be much fun to implement in the native DB. If you do implement it in native DB logic, and you decide to change to a different persistence back-end, you have to rewrite the service. That’s why we use multi-tier layered system architectures after all - to reduce impact of changes up and down the stack.

I feel I am missing something here.

1 Like

I’m not saying it’s not possible, I’m saying it’s difficult to implement CONTAINS in a pure relational technology with pure SQL in a generic way - that is, so that: 1. it works for all cases, 2. for any number of levels, 3. for any data structure. So if you implement CONTAINS with JOINs, which would be the mechanism you are thinking about or currently using, it might not be very performant in some cases (lots of data, lots of levels in the JOIN), though it will work if you have 5 minutes to wait for a query result.
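For illustration, a depth-unlimited CONTAINS can be expressed without a fixed number of JOINs by using a recursive CTE. This is only a sketch under assumed names: a hypothetical generic `node` table (one row per RM object, with `parent_id` linking the hierarchy) and illustrative archetype IDs:

```sql
-- Hypothetical generic node table: one row per RM object in a composition.
CREATE TABLE node (
    id             bigint PRIMARY KEY,
    parent_id      bigint REFERENCES node (id),
    composition_id bigint NOT NULL,
    archetype_id   text   NOT NULL
);

-- Generic CONTAINS at any depth: compositions where an OBSERVATION
-- archetype contains (directly or transitively) a given CLUSTER archetype.
WITH RECURSIVE descendants AS (
    SELECT id, composition_id, archetype_id
    FROM node
    WHERE archetype_id = 'openEHR-EHR-OBSERVATION.blood_pressure.v2'
  UNION ALL
    SELECT n.id, n.composition_id, n.archetype_id
    FROM node n
    JOIN descendants d ON n.parent_id = d.id
)
SELECT DISTINCT composition_id
FROM descendants
WHERE archetype_id = 'openEHR-EHR-CLUSTER.level_of_exertion.v1';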

I do not agree that the solution in JSON/XML databases, which have CONTAINS implemented natively, would be less performant than the relational approach, which wasn’t designed for hierarchical data. I’m actually curious about that statement, since it seems biased, but you might have some data I don’t have.

In general apps will have some local persistence (I mean via a direct connection and not through an API). My comment was about this: if you were thinking about having the app persistence also based on openEHR, that would make integration with the backend openEHR CDR much easier. That’s actually a common pattern in imaging workflows with PACS, and I personally designed EHRServer and then Atomik with that in mind, so we can use the same repository either as local storage for one app or as shared/aggregated storage for many apps (we even have a sync API in place to automate that data flow).

At least on EHRServer / Atomik we don’t have any schema changes, since the schema is based on the RM not on the archetypes. The RM doesn’t change (1) but archetypes do.

(1) In general the RM is static, though it’s updated through the SEC to add fixes and new features; you can just pin your implementation to a single RM version so the underlying schema of the persistence doesn’t change at all. Then, since all archetypes are based on the RM, that persistence implementation can store ANY data structure defined in ANY archetype without ANY modification to the database itself.

Archetypes have a different management process and tend to be more dynamic than the RM (more updates, more frequently). A persistence solution based on archetypes therefore requires more frequent changes to the database schema. That’s also why we opted for the RM instead of archetypes to generate our SQL tables.

I think you might be applying rules and patterns without having all the options on the table. I haven’t considered all the options either, but back when I did my own research and tests I found some patterns that worked, and refined them over time (it’s actually been 19 years since my first database implementation of the specs). Anyway, it’s great that you actually go through the process of researching, comparing solutions, and openly discussing them. This is a very interesting discussion.

3 Likes

I’ll just talk about my personal journey here.

What really started challenging my view about “API shielding persistence” was Hasura. It’s a project that, given a database schema, will automatically generate a GraphQL API on top of it. It also ensures all queries you run via GraphQL are performant, converting them to native SQL.

There are other projects that have similar functionality - PostgREST is an excellent one, widely adopted and supported by Supabase, and they make a similar claim about APIs as a leaky abstraction:

Database as Single Source of Truth
Using PostgREST is an alternative to manual CRUD programming. Custom API servers suffer problems. Writing business logic often duplicates, ignores or hobbles database structure. Object-relational mapping is a leaky abstraction leading to slow imperative code. The PostgREST philosophy establishes a single declarative source of truth: the data itself.

Once our team adopted Hasura, there was no going back to writing REST APIs for every service by hand. We cut development times by close to 5-10x. We didn’t feel we were losing out on anything - we had full type-safety in our applications, and you could autogenerate the types from the schema in seconds. As soon as someone changed the schema in the database (again, version-controlled and applied via CI/CD), all types in your system would update and flag every place in the application logic you had to update. The application had a GraphQL layer that let it do WHATEVER can be done in SQL, without writing a separate API route for each piece of functionality by hand. This saved us just SO MUCH time.

Almost all of the generated SQL queries were super performant. We only occasionally had to stop and create indexes - and even these would be suggested by the API engine (“consider creating an index on x”), and our developers promptly did that.

This real-world experience made me believe: Automatically generating scalable APIs on top of a database schema is a solved problem.

What still wasn’t solved well was the security model around these APIs - who was allowed to do what on which resources. This is why we were still building wrapper APIs on top of the Hasura GraphQL APIs to expose to the end user.

But 2 major innovations changed that:

  1. Hasura came up with its own Authorization layer, which had granular row-based / column-based security
  2. PostgreSQL’s Row-Level Security

The reason we didn’t go all in on option (1) was that all of the rules had to be authored as YAML that only worked with Hasura. What if we wanted to move to another API generation engine later? We didn’t want to tie that much of our business logic to Hasura.

Option (2), RLS, was especially interesting because it was already widely supported by other API generation projects like PostgREST (and Supabase / Lovable by extension). Having core business logic like the security model in SQL just made sense - it was portable, not very vendor-specific (given we agree on the premise that “Postgres is a standard”), and we were able to use our CI/CD migration pipeline to apply the security model along with the data models.
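As a rough sketch of what option (2) looks like in practice - assuming a hypothetical `observations` table, a `care_team_members` table, and a session variable (`app.current_user_id`) set per request by the API layer, all of which are illustrative names:

```sql
-- Enable RLS on the table; once enabled, rows without a matching
-- policy are simply invisible to non-owner roles.
ALTER TABLE observations ENABLE ROW LEVEL SECURITY;

-- Allow clinicians to read only rows for patients in their care team.
-- `app.current_user_id` is a session variable the connection layer sets.
CREATE POLICY clinician_read ON observations
    FOR SELECT
    USING (
        patient_id IN (
            SELECT patient_id
            FROM care_team_members
            WHERE user_id = current_setting('app.current_user_id')::bigint
        )
    );
```

Engines like PostgREST and Hasura typically set such per-request session variables (e.g. from a JWT claim) before running the generated SQL, so the same policy applies regardless of which API layer sits on top.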

Previously, RLS had a bad reputation for not being performant, but with recent releases, it’s probably faster than implementing authorization in the API layer.

This also gave us “deep security”. If all your policies about who has access to what, and the data models are present in the database, you can actually give anyone full access to your APIs (GraphQL or SQL, even) while still maintaining peace of mind that they can’t really access what they’re not supposed to access.

Taking this idea even further, I re-read the whole openEHR technical architecture and the requirements of a health data platform. And I realized that every single one of the requirements could be implemented directly in the database. We built triggers and procedures to handle audit logs, and we built versioning into the database - so whether someone makes a call through GraphQL or SQL, all these operations will ALWAYS be performed at the database level.
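To give a flavour of the trigger-based approach (this is a simplified sketch with hypothetical table names, not our actual implementation): an append-only audit table populated by a row-level trigger captures every write, no matter which API the write came through.

```sql
-- Hypothetical append-only audit table.
CREATE TABLE audit_log (
    id         bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    operation  text        NOT NULL,
    row_data   jsonb       NOT NULL,
    changed_by text        NOT NULL DEFAULT current_user,
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- Trigger function: record the affected row as JSONB.
-- NEW is NULL on DELETE, OLD is NULL on INSERT, hence the COALESCE.
CREATE FUNCTION audit_row() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (table_name, operation, row_data)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(COALESCE(NEW, OLD)));
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Attach to any clinical table, e.g. a hypothetical `observations`.
CREATE TRIGGER observations_audit
    AFTER INSERT OR UPDATE OR DELETE ON observations
    FOR EACH ROW EXECUTE FUNCTION audit_row();
```

Because the trigger fires inside the database, a direct SQL connection, a GraphQL mutation, and a REST call all leave the same audit trail.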

Combine this with clinical data models in plain SQL (https://github.com/medblocks/health-tables private repo - ping me if you’re interested in checking this out), and this gave me something I was always looking for (in openEHR and other standards): a platform that implements the hard bits for you and guarantees that security + compliance are taken care of, with uniform information models that are widely accessible. I can open this up to any tool that already works with SQL, and I can have a regular developer who has learned nothing about openEHR working on an app.

We’re also experimenting with mapping these simple relational data models back to openEHR CDRs (via REST APIs), FHIR CDRs, and OMOP - and this works really easily with patterns that are simple to implement on SQL, like the transactional outbox. App developers can optionally set up the common data models in one set of tables and keep their own internal application logic in another - and they can write to the common data model tables and their app tables within the same transaction to ensure all apps stay on the same page. This is something openEHR has always been about, but doing it across multiple systems is technically very complex (sagas and event sourcing are not operationally simple to implement).
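The transactional outbox pattern mentioned above can be sketched as follows - again with hypothetical table and column names, purely for illustration:

```sql
-- Hypothetical outbox table: events to be relayed to an openEHR CDR,
-- FHIR server, or OMOP store by a separate poller/relay process.
CREATE TABLE outbox (
    id           bigserial   PRIMARY KEY,
    aggregate_id bigint      NOT NULL,
    event_type   text        NOT NULL,
    payload      jsonb       NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
    processed_at timestamptz          -- NULL until the relay delivers it
);

-- Writing the clinical row and the outbox event in ONE transaction
-- guarantees the event exists if and only if the data change committed.
BEGIN;
INSERT INTO blood_pressure (patient_id, systolic_mmhg, diastolic_mmhg)
VALUES (42, 120, 80);
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (42, 'blood_pressure.created',
        '{"systolic": 120, "diastolic": 80}');
COMMIT;
```

The relay process then polls for rows with `processed_at IS NULL`, pushes them to the external system, and marks them processed - no distributed transaction or saga machinery required.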

So this is how I came to the conclusion that “A Postgres database directly as the health data platform used by multiple apps is our best bet”.

By adopting this, we get extreme operational simplicity - anyone can run this on any cloud / on-prem; you just need to apply some special SQL migrations. It’s easy to maintain over a long period of time - we’re almost guaranteed future updates to Postgres, and as our data models evolve we just need to think upfront about how to move existing data to newer schemas. All of the complexity and compliance requirements of healthcare are hidden away in database triggers and procedures. Regular developers with simple tools can use the Postgres database as they always do - we just ask that they use the common data models when persisting shared clinical information.

Of course, this would definitely be a thing. But before we get into that, I just wanted a system that worked well for me and my team. I want something I can confidently suggest to healthcare organizations. Something that would scale, be maintainable, and be reliable over time. To me, that answer is “just use Postgres”.

That’s why I think being “technology-neutral” also sometimes makes openEHR “technology-unaware”. And that should never be the case:

The vision of a specification can only be achieved when it knows the engineering possibilities and the constraints:


I think we can implement the principles of openEHR and what it stands for directly on top of Postgres - we can’t implement everything as is, and we might need a rethink of what openEHR should look like, given the engineering constraints.

2 Likes

Thank you for sharing your journey to Postgres @Sidharth_Ramesh !

I went through a similar one in 2020. You will see where my love for generators started if you check it at PostgresApps.com :wink:

Check the “condensed” CREATE TABLE approach. I believe this would be helpful for your DDLs in the health-tables repo. The CLI automatically finds table relations and generates foreign key constraints for them (and many other boilerplate).

It uses Annotations and Snippets to simplify writing repeating patterns. In openEHR we would add annotations for DV_ elements. For example DV_CODED_TEXT would have an annotation:

--- Exercise intensity.
--- @DvCodedText intensity

This gets expanded to (simplified for readability):

--- Value.
value text,

--- Defining code.
terminologyId text not null,
codeString text not null,

--- Internal raw DV_CODED_TEXT value.
rmData jsonb not null,

--- Internal ADL path.
path text not null,

--- Search information used for full-text search (`terminologyId` + `codeString` + `value`).
--- @index using gist
search db.search not null generated always as (to_tsvector('english', terminologyId || ' ' || codeString || ' ' || value)) stored,

Notice that we can keep all the hierarchical RM data in the rmData column and add primitive type columns for indexing.

The CLI also generates row-level and column-level security and grants permissions to the roles (see all annotations).

One thing I didn’t like was writing procedures and triggers in PL/pgSQL. But in your proposal, using PL/pgSQL is just another part of using Postgres as a standard - all logic is written in a language that is part of PostgreSQL.

2 Likes

Hi all, this is my first comment in the community, and I’d like to share a personal perspective (not an official MongoDB position).

Pablo’s point really resonated with me:

I completely agree. Defining a concrete database, or a database schema in the spec would constrain openEHR’s dual modeling philosophy, which is why openEHR is so special. Nothing matches archetypes in representing the complexity of clinical data and putting every piece of information in the right clinical context.

I’m enthusiastic about what openEHR solves. But yes, the extreme flexibility of archetypes and AQL can make it daunting to build an efficient persistence layer: sparse data, hierarchies, path semantics, CONTAINS, etc. It’s tempting to reduce the complexity of the implementation by constraining how it’s used, but the real lesson, aligning with Thomas’s statement, is:

then… pick the right physical design, indexing, and query strategies to honor the model and meet operational needs.

From my side, about a year ago I began experimenting with a document-first approach to support both patient-filtered queries and cross-patient queries at scale. We initially tried entry-level collections, as some of you suggested in this thread, but hit trade-offs such as losing cluster-level queries, among others. While entries often carry more clinical context, that’s not always the case when it comes to analyzing the data. After iterating, we landed on a model that keeps one document per composition while enabling performant queries over indexed paths. I’ve shared pieces of this with some of you already, and we’ll share more (and some tooling) soon - hopefully at EHRCon25 (currently a short paper under review).

This is not about replacing existing approaches, but showing that a document-first persistence layer can be a viable, cost-effective alternative: one that preserves canonical fidelity, supports cross-patient operational queries at scale, and is naturally ready for semantic enrichment and AI.

Sidharth also made a great observation:

I’d add that technology is evolving quickly. PostgreSQL, MySQL, and others are no longer “pure” relational databases. They’ve grown to support JSONB, GIN indexes in PostgreSQL, and more. Likewise, MongoDB isn’t a “pure” document store today; it evolved early to natively support Lucene-based search and continues to add new capabilities (e.g., vector search).

In my view, when we talk about these edge use cases, we’re not really comparing pure database technologies anymore. We’re talking about data platforms that have become hybrid, bringing together multiple capabilities to meet the requirements of today’s data reality.

Additionally, IMO, while openEHR is an incredible foundation, it might not be enough on its own for the future demands of precision medicine and AI. We need a data-first strategy where organizations truly own, enrich, and extend their data, then choose the persistence strategies that fit. Otherwise, solutions built for interoperability risk creating new silos—where data is locked behind the limits of exposed APIs. In this view, APIs (AQL, FHIR, or others) should act as façades over richer internal data models, minimizing data duplication while preserving full clinical context and giving teams room to evolve their data to future requirements.

The journey isn’t easy: indexing, query design, and data modeling all matter. But I’m convinced that persistence models which preserve hierarchies, semantics, and context together (like the document model) have a lot to contribute. I’m here to learn from you all and see how I can humbly add some value to the future of openEHR.

9 Likes

Certainly, the more semantics such persistence models preserve, the smaller the gap between the data and its computational and semantic view. 20 years ago we didn’t even have the inheritance relation in DBs (or maybe just in Oracle and/or Postgres, but it was not in wide use); now there are features we previously had to engineer in the layers above. So the pain level is coming down.

What I think DB builders need to do is to consider exactly what semantics are needed in DBs - e.g. Allen logic would be very useful; null-safety testing in code; and so on.

2 Likes

The proposed approach is full of traps. I just hope that those who go down this path don’t lose faith in openEHR itself, but rather take it as a lesson to know better next time.

3 Likes

@matijap please be explicit - shared experience is of great value for this community. The goal is to help, especially if you already went down that path.

2 Likes

Well, there’s a long thread above that contains, I believe, everything I could say and then some, so repeating things would not really be productive.

And yet…

The number of tables (to properly handle multiplicities at all levels), indices (if you want to support arbitrary queries), joins (just to reassemble a single composition!) and model misalignments (e.g. strings in openEHR are by default unlimited in length, so you’ll end up storing and querying CLOBs) are just overwhelming. In such a model, no sane person would want to write SQL manually (or review AI-generated SQL), so at the end of the day an AQL-to-SQL converter will be needed, as well as a converter from the SQL result (which will contain a Cartesian product for what should be one object) back to RM objects.

I know we were told modern RDBMSs can do all of these joins efficiently (indexing was not mentioned, I think) but I have yet to see one. Granted, I have more trust in Postgres in this regard than any expensive commercial database, but nevertheless I have hit limits too often to believe this approach would work.

That said, I am very happy to see anyone try it out with some serious amount of data and report! (Serious amount is at least several thousand compositions, although any serious installation quickly hits at least millions. For anything <1000 I’d say just serializing all compositions into whatever storage is available, and having them all cached in the memory to be able to filter and use them quickly, is a much simpler and possibly even more performant solution.)

9 Likes

We also conducted a bit of work in this area at the beginning of the year. We had already dumbed down the approach, aiming to define “integration tables” based on DTOs generated from templates. These would only serve as a target structure to allow broader use of ETL tools without XML or JSON capabilities - just plain SQL (@pablo will remember another attempt we tried 5 years ago or so).

Even with this model, it was very difficult to manage all the optional fields and cardinalities. It only worked if you narrowed the use case down quite a lot. And this use case did not even consider any real-time application needs, only non-real-time integration scenarios. This is also where I was a bit disappointed with the SQL-on-FHIR initiative, which did not come up with a proper ORM approach but with a glorified CSV exporter (a very clunky and inelegant way to mitigate the missing SELECT clause, if you ask me).

I can also second what @matijap said about not being able to write any sane SQL queries. With EHRbase, you can technically write SQL yourself (and that’s actually fine if you need to retrieve single values), but typically you would want this to be derived from AQL, because the SQL is quite far abstracted from the domain.

2 Likes

I believe the path you’re talking about, @matijap, has been taken many times; I’ve read a few papers on attempts to implement the entire openEHR specification or ISO 13606 in a SQL RDBMS, and they typically hit all the pitfalls you and @birger.haarbrandt mention. Building the AQL engine is usually very challenging, and often not even attempted. The SQL queries are pretty complex too, even when implemented on JSONB.

Here’s where the approach I’m suggesting diverges significantly from previous efforts in the space:

I’m NOT saying we should implement the openEHR RM and all of its requirements and nuances into a SQL database and expect everything to just work.

But instead, recognise that the real value openEHR brings to the Health IT community is the maximal clinical data models, not the accidental complexity it has introduced in implementing that idea.

The spirit of these data models can be implemented in a SQL database trivially if the data modelling were adapted to traditional database modelling techniques instead of the complexity we’ve introduced with archetypes and templates. We know, because our team has already translated close to 800 archetypes into simple SQL tables - it’s possible. There are still many places where archetypes don’t fully respect normalisation principles, and we had to fix these.
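To make the idea concrete, here is a hypothetical sketch of what a flattened archetype table could look like (illustrative column names only - this is not the actual health-tables DDL), using blood pressure as the example:

```sql
-- Hypothetical flattening of the blood pressure archetype into a
-- conventional relational table; names and constraints are illustrative.
CREATE TABLE blood_pressure (
    id             bigint      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    patient_id     bigint      NOT NULL REFERENCES patients (id),
    recorded_at    timestamptz NOT NULL,
    systolic_mmhg  numeric     NOT NULL CHECK (systolic_mmhg  BETWEEN 0 AND 1000),
    diastolic_mmhg numeric     NOT NULL CHECK (diastolic_mmhg BETWEEN 0 AND 1000),
    position       text,       -- e.g. 'sitting', 'standing'; coded in the archetype
    cuff_size      text
);
```

The archetype’s value ranges become plain CHECK constraints, and its structure becomes columns and foreign keys - exactly the shape a regular developer, ORM, or BI tool expects.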

This approach is more similar to what OMOP has done, but instead of a few observational entities for research, think all clinical entities.

And an organisation doesn’t need to import all of the table definitions at once and end up with 1000+ tables. They can pick and apply only those they need.

Once you have these logical SQL table definitions in place, my argument is that API technology, RLS, and Postgres wire-protocol-native databases have advanced enough to use the authenticated SQL layer directly as the API for multiple apps to interoperate. Or teams can pick a REST/GraphQL layer of their choice that autogenerates APIs from the data models (PostgREST, Postgraphile, Hasura - pick any).

This way, openEHR provides what it is really good at - the data models - in a format that’s accessible to the general population of developers and that works with the tools and frameworks available to everyone.

2 Likes

As someone new to openEHR: although I strongly agree with openEHR’s philosophy, my biggest fear is that after investing a lot of time, I might have to abandon it because its performance doesn’t meet my needs. However, there is very little publicly available information regarding its performance.

@Sidharth_Ramesh Could you please describe it in more detail?

@borut.jures How serious is this? Have things gotten better now?

@pablo Does this mean that, in your opinion, EHRServer is a more reasonable implementation than EHRbase? A paper made a comparison (“Comparison of openEHR open-source servers”); are its conclusions still valid?

I apologize if my question is off-topic. I’m very interested to know more.

Thank you.

Hi Michael, I relate to the concern; performance is often the make-or-break point once you move beyond single-patient retrieval in large clinical repositories (not only in openEHR). That’s where we’ve put most of our focus.

We also have a paper that describes the approach and some measured results. It’s not publicly available yet, but an abstract should appear in Springer in the coming weeks. Happy to connect privately and share the details.

1 Like

@Michael_Lin that’s a tricky question: you can’t say one implementation is more “reasonable” than others, and I wouldn’t say that about my own product - it would be unfair to colleagues working on other implementations. I would also call out anyone who says that about their implementation, because they would be lying to the community (in fact, I did call out some openEHR-related marketing materials released on LinkedIn some time ago).

First, you need to have a set of requirements, then look for a tool that fits into your requirements, and that tool might be “better” than other tools, but just for YOUR requirements. So any comparison can’t be done in abstract and in a generic way, it should be done in a context.

Second, each tool is designed with different principles and the builders have different experience and specializations, so even implementing the same spec, we end up with tools that look and feel really different.

So my tools reflect my requirements, my experience and my principles, and I love my tools. If my tools fit those points for you, then they are a more reasonable implementation than others - but only for YOU.

@pablo Thanks for the reply. I might not have described my question very well, so apologies if it came across the wrong way; no offense intended. I totally agree that a solution doesn’t have to win on every metric to be reasonable.

I’m still getting familiar with the openEHR ecosystem and just wanted to quickly understand the current status. For instance, regarding the instability of EHRServer during concurrent writes mentioned in the paper: was the experimental finding flawed, or has the issue been fixed since then? That is the kind of information I’m looking for.

No worries, no offense taken, it’s just to set the right context where I can actually give you the info you need.

We did concurrent load testing without problems some time ago: CaboLabs Blog - “I tried to kill the EHRServer”.

I’ll recheck the paper to see what they found, though it seems they didn’t raise any issues on the repo, which should be the normal practice for OSS.

In any case, I wouldn’t recommend using EHRServer now for anything other than learning. We are focusing on our commercial offering, https://atomik.app/, which is EHRServer on steroids: it supports demographics, a totally rewritten query engine, demographic duplicate detection, EHR merge/unmerge procedures, better support for the openEHR REST API, an updated tech stack, etc. So Atomik is for serious work; our customers use it mainly as a secondary analytics repository (they run reports, queries, and data services on it, and we integrate data from external sources).

Anyway, if there is a problem with the OSS it can be surely fixed, though we need a report on it with steps that we can reproduce.

But again, it would be useful to actually know what you want to do or build or test.

Many thanks. I have a much better understanding now.