openEHR archetypes as SQL Tables

I’ve been thinking about this for a while now.

I understand openEHR’s maximal modelling approach and how the data models are what really set it apart. After a few serious implementations on the ground, and after teaching a lot of newcomers how to build with openEHR, I also have a good understanding of where it lags behind.

I have some questions about why openEHR is the way it is. I’m just looking for answers from the community. Maybe some of you were here when the decisions were made - @thomas.beale , @DavidIngram, @Sam , @ian.mcnicoll , @heather.leslie , @sebastian.iancu , @jpieraj , @birger.haarbrandt , @Seref , @sebastian.garde

First and foremost, there’s good consensus in the industry (also ref this) that the relational model with SQL is the dominant data model of past, current, and future generations, as far as we know - we see almost all implementations of openEHR built on top of PostgreSQL (and sometimes MySQL). We are also seeing a rise in “New SQL” databases, which are predominantly wire-protocol compatible with Postgres (CockroachDB, YugabyteDB, and many more) or MySQL (Vitess/PlanetScale). But the relational data model and SQL are here to stay, and they’re only growing stronger with vendor-neutral open table formats like Apache Iceberg.

The archetype + template model, as it is right now, is quite challenging to implement directly on relational SQL with normalization. Almost all implementations I know of represent the complex nested data points as JSONB or XML and query the data using JSONPath or XPath. This, however, comes with major performance consequences and limits scalability significantly. Normalization issues (sometimes within archetypes and templates directly) can cause data anomalies, along with problems in the long-term maintainability, scalability, and reliability of the data system (as defined by Martin Kleppmann).

There’s a clear implementation advantage to SQL systems - familiarity, battle-tested schema management, strong developer tooling, framework support in popular programming languages, automatic REST / GraphQL API generation, and low-code and BI tool support.

After we had some performance issues in production with openEHR, we experimented with flat SQL models on Postgres to represent the fields in archetypes, and we were able to achieve 90% of what we could do with the openEHR REST APIs, including multi-level queries, plus a lot that the REST APIs don’t support (more complex queries, more direct optimizations, more direct integrations with our tools). I don’t think this is a new idea - there have been other attempts already.

Recently, we also found that it was much easier to have our developers (and LLMs) build new applications on these SQL schemas we defined (heavily inspired by archetypes), and build applications directly on top of them, and then map them to openEHR / FHIR than it is to build directly on openEHR. It’s also much easier to train new developers to build on SQL they know and love, with whatever programming language + framework they already know instead of teaching them specialised tools and SDKs for openEHR.

My question to the community: When the initial foundations of openEHR were laid, why weren’t the data models (archetypes) just modelled as simple normalised SQL tables? And why wasn’t the query language just SQL?

Another way to phrase it: The idea behind 2-level + maximal modelling is excellent - however, we could implement this pretty well in SQL - why didn’t we?

4 Likes

Thank you @Sidharth_Ramesh for this post! I’ve been working on this and tried to get feedback from others and published an ERD excerpt for the flattened body temperature tables.

We heard about the performance issues in Catalonia which are the consequence of the “two-level” data model used for storage.

I asked myself whether the two-level openEHR data model needs a two-level storage model. What if tables are generated and migrations are done using tools like Liquibase? Using CI would allow automated inclusion of OPT changes without human intervention (I heard this is something clinical modellers like about the current implementations).

We can generate SQL tables based on OPTs and archetypes. And we don’t have to stop with SQL. We can generate “Archetypes as code” (see here, and here for the TypeScript version). I imagine developers would just use these SDKs when interacting with an openEHR CDR, but be able to use their own UI framework and backend in Java/Kotlin/C#/TypeScript/Dart/PHP (these are the languages I have SDKs for).

CDR vendors might not be able to publicly post anything that would contradict their current implementation – it would be bad for their business. New CDRs being built are free to take different approaches.

2 Likes

Definitely understand your approach and your recommendation. Thank you for the take.

Well, this is the view of data-centric people (note the first author of that paper you reference). But real data in many domains just isn’t very tabular; indeed, the nature of data that represents things in the real world (anatomical structures, the parts explosion of an aircraft, weather patterns) is:

  • at least fractal i.e. may contain endless further levels of detail, and
  • usually includes temporal structures

The thing that people who think everything is ‘really relational’ get wrong is that the largest databases in the world (taxation, travel bookings, census, banking systems etc) were built on models whose data was presupposed to be tabular, and indeed, at the time was pretty close to tabular.

Much of this data is no longer tabular; it is fractal - think flight bookings with endless chains of options, each dependent on the last; tax data with exceptions and conditional subordinate information; banks with 150 different mortgage ‘products’ each with 200 choices on average.

The openEHR information model (RM) and archetypes based on that RM are designed to reflect the fractal nature of reality, and how science sees it. Anatomical structures reported in endoscopy are not tabular at all; they are hierarchical. Notes recording a complex diagnosis are not inherently tabular in structure.

Some health data can be pretty close to tabular - e.g. real-time vital signs data comes in time-series structures that can be made into tuples such as {date_time, systolic BP, diastolic BP} and similar.

If you analyse a lot of health data, you see that there can be a mixture of:

  • coarse-grained tabular structures;
  • fine-grained fractal structures;
  • hierarchically arranged documentary data;
  • hierarchically arranged questionnaire data;
  • temporal structures of various kinds;
  • etc.

There are various tricks to ‘unwind’ some of these kinds of structures to fit into fixed-column DB designs, and if the right tricks are used, especially on the 1..N fractal fanout problem, an RDB schema can be used quite effectively.

So the real question is how relational products (which are most definitely not going away, and nor should they - with object-oriented additions like inheritance, they are very powerful) can be used to represent this kind of data. There are various strategies, including:

  • 3NF conceptual schema for higher structures and data blobbing for fractal detailed data - as in EhrBase
  • a key-value approach can always be used to convert hierarchical data to columns, as per this old post of mine
  • mixed, e.g. using different native models of representation as supported in SurrealDB
  • etc

These strategies exist because in manually built databases, you don’t want any domain semantics in the schema. If you have that, you are in for endless schema evolution, data migration and so on. However, in more modern times, it has become realistic to generate DB schemas from other sources, such as archetypes. So you could just do that. You then have to live with the following forever:

  • each new archetype you need will create a new table;
  • each new minor version of an existing archetype will a) add columns and b) possibly relax some referential integrity rules (e.g. make something previously mandatory optional);
  • each new major version of an archetype will require a new table;
  • each template and template version will require either a view table or a join query.

So the DB schema will be changing forever. This might be ok - with sufficiently sophisticated tools to manage all this, it can potentially be made to work. And if it works, it should be fast. The Chinese group (I don’t have the citation at hand) who presented at MedInfo in 2015 followed this kind of approach and, as far as I know, made it work.

The golden rule is that if you can losslessly get the data in the conceptual openEHR form in and out of the DB, for all time, and the data remain perfectly interoperable with the same data represented in different persistence engines, then there is nothing wrong with the persistence architecture.

If you can improve on querying performance, and/or run (some) analytics directly on (a copy of the operational) DB, even better.

In sum, it’s not because ‘all data are really just tabular’ (nope) that one should consider interesting persistence architectures based on relational products. It’s really just a mathematical / computational question: can it be made to work over the long term, without compromising any semantics or interoperability? Can it be used to make querying and query-based activities (reporting etc) faster?

8 Likes

To make this work really fast you need at least a column with the archetype path map of the template represented in an optimised (probably numeric) form, and this column needs to be indexed. This allows a query processor to detect whether the archetype and specific paths being sought are even in that data row. That will avoid nearly all data blob materialisation, except for the candidates needed to resolve the WHERE clause processing, i.e. value comparisons.

If you add an adaptive capability that dynamically adds (and removes) value-based columns (e.g. BP values or similar) and indexes them based on frequency of previous accesses, then WHERE clause processing will also speed up.
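A minimal sketch of the idea in PostgreSQL terms (all table, column and path-code names here are hypothetical, and the blob layout is invented for illustration):

-- each row keeps the serialized composition plus a numeric path map of the
-- archetype paths it actually contains, so a query can discard rows from the index alone
CREATE TABLE composition_store (
    id          bigserial PRIMARY KEY,
    ehr_id      uuid      NOT NULL,
    template_id text      NOT NULL,
    data        jsonb     NOT NULL,      -- the full composition blob
    path_ids    integer[] NOT NULL       -- numeric codes for the archetype paths present
);

CREATE INDEX idx_composition_path_ids ON composition_store USING gin (path_ids);

-- only rows whose path map contains the sought path (say, code 4711) are materialised further
SELECT id FROM composition_store WHERE path_ids @> ARRAY[4711];

-- adaptive value column: a frequently queried value promoted to a real, indexed column
ALTER TABLE composition_store
    ADD COLUMN systolic_magnitude numeric
    GENERATED ALWAYS AS ((data #>> '{systolic,magnitude}')::numeric) STORED;

CREATE INDEX idx_composition_systolic ON composition_store (systolic_magnitude);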

DB engineering is always about optimisation…

More things can be done. In a next generation of openEHR reference model, I have separated out multiple different Observation sub-categories, which should each have their own tables. This has the effect of putting Scores, realtime data, Questionnaire responses, imaging results, and lab results, all in distinct tables, rather than some giant Observation table. As you can guess, each of these kinds of Observation have very different data shape, and also usage in querying.

Such improvements are likely to improve overall performance by at least one order of magnitude, if not more.

2 Likes

Thanks @thomas.beale this does clear up a lot of questions I had in my mind.

I’m not really questioning the real nature of data. There’s been a lot said on the topic, but it basically boils down to:

“All models are wrong, some are useful”.

The relational model initially proposed by Codd was mostly meant to accommodate the needs of enterprise business data banks (he was at IBM Research, after all), but it fit the patterns of a LOT of other industries well enough to be adopted and used widely, and to become the dominant database model.

In my experience, what I’ve seen in healthcare IT is that we have a real constraint in the number of software engineers who even join our industry. And as a result, there can only be so much innovation when it comes to the fundamentally complex engineering challenges like database engines, query engines, storage access patterns, etc. It’s not a mathematical/computational question, but one of scale economics and amortization of fundamental technology.

We’ve heard how open-source and easy-to-use tools are paramount to people actually adopting a standard. I’d argue that healthcare IT, and other industries that are not able to hire enough tech talent, are in a way forced to reuse other industries’ years of tech innovation and stand on their shoulders to build something easy to adopt and use - HTTP REST over a protocol like MLLP, simple JSON over ER7, and so on.

Even IF healthcare data is not tabular (and is fractal, or really a graph), I’d argue that if there’s a simple technique to model the main ideas of openEHR as simple tabular data, we should use it. The sheer number of tools that just work with tabular SQL, and the community around them, is mind-bendingly larger. When it comes to long-term maintenance and performance, SQL will probably always win against any new data model innovation, even if the latter is a better fit for the domain. The same argument is made by Stonebraker et al. in the paper, and by Martin Kleppmann in Designing Data-Intensive Applications - yesterday’s CODASYL vs SQL argument is today’s NoSQL vs SQL and tomorrow’s GraphDB vs SQL.

Most of the object-oriented additions like inheritance in popular databases like PostgreSQL are actively being discouraged, with the following reasoning:

Why not?

Table inheritance was a part of a fad wherein the database was closely coupled to object-oriented code. It turned out that coupling things that closely didn’t actually produce the desired results.

When should you?

Never …almost. Now that table partitioning is done natively, that common use case for table inheritance has been replaced by a native feature that handles tuple routing, etc., without bespoke code.

One of the very few exceptions would be temporal_tables extension if you are in a pinch and want to use that for row versioning in place of a lacking SQL 2011 support. Table inheritance will provide a small shortcut instead of using UNION ALL to get both historical as well as current rows. Even then you ought to be wary of caveats while working with parent table.

And again, I’m not saying that one could never represent the openEHR RM on a SQL database effectively - it’s probably a matter of effort, resources, and time.

What I’m concerned about is, I think we are fighting years of industry best practices when it comes to managing data.

I’ve had first-hand experience building on multiple NoSQL systems, as well as FHIR and openEHR native databases, for relatively large projects (all dealing with more than a million patients per installation). For a system that stores data, there’s always a schema, regardless of which layer it’s implemented in. You almost always want the data evolution and migrations to be UPFRONT and as clean as possible (schema-on-write). There’s a lot more academic literature on this, and I’d have to look up citations, but it’s basically the “schema on write vs schema on read” argument.

As soon as you have v1, v2, and v3 of documents in a database, you move all the complexity to the application that now has to remember and reconcile all these in the queries, which keep getting more complex over time (schema-on-read) - this is sort of what happens with openEHR with multiple versions of templates with multiple versions of archetypes.

We once had a major error on a production system with openEHR because someone decided to model an openEHR template with the pulse v1 archetype and all the AQL queries were made for pulse v2.

I want this sort of maintenance nightmare to be dealt with ONCE - when I’m evolving and updating the schema. There should only be one representation of pulse in my system. And when I write a simple SQL migration script, it takes care of all my old data and moves everything to the new schema in one nice, simple step, with excellent snapshot isolation guarantees and rollback in case something fails. With openEHR, it sometimes takes us days to move compositions from one template version to another through the REST API on a live system (with no transaction isolation guarantees and no rollbacks possible!) - because we didn’t want our AQL query complexity to grow as we made changes to the template.

I think the complexity of data management and schema evolution always exists. By moving the domain semantics from the SQL schema to the openEHR Archetypes+Template, we’re just moving the complexity one step up - and there aren’t that many tools and industry best practices to deal with this complexity.

What I’m saying is not that we should think about different theoretical persistence engines, but that we should actually try to make the existing models work really well with the major persistence engine of today - tabular SQL engines.

This becomes even more compelling when the analytical folks are converging on other open-standards based on tabular data models as well!

The Chinese work that you’re mentioning is probably Archetype relational mapping - a practical openEHR persistence solution? I read the full approach, and I think there’s a more fundamental issue here. When we define the data models as archetypes with their own rules, we also don’t apply the normalization principles that usually apply to relational data models.

For example, in the Blood Pressure archetype, “pulse pressure” depends on “systolic” and “diastolic”, yet it is a field in the same data model that can also be filled out independently. This already violates 3NF (a transitive dependency), and as a result a composition can contain data with insert and update anomalies. This is just one example; my team found a lot of such data modelling issues in the CKM when trying to model the archetypes as SQL data models.
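To illustrate what a normalized rendering could look like (a sketch with illustrative column names, not an actual CKM mapping):

CREATE TABLE blood_pressure (
    id             bigserial PRIMARY KEY,
    composition_id bigint  NOT NULL,
    systolic       numeric NOT NULL,   -- mmHg
    diastolic      numeric NOT NULL,   -- mmHg
    -- pulse pressure is derived, so it is computed rather than stored independently,
    -- which removes the possibility of an inconsistent insert or update
    pulse_pressure numeric GENERATED ALWAYS AS (systolic - diastolic) STORED
);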

To me, using simple SQL data models and explicitly evolving them upfront with migrations seems like a much more maintainable long-term solution that doesn’t compromise semantics. And because so many tools connect to SQL by default, and you can get the data in and out fast, you’re not blocked by the “it’s open, but you should learn the openEHR composition format first” argument. And trust me, I’ve taught a lot of people openEHR, and it’s still quite difficult for them to use it fully. With SQL, I can actually get data out of the system, hire talent on the market, integrate it with thousands of tools, and actually use the data in the system.

I think what we’ve done with openEHR as a community is great. And we probably have the largest number of mature clinical information models. I just want to get these models in the hands of new users of openEHR faster, with the tools they know and love, without learning anything about openEHR. If, for example, we were able to make a SQL implementation of every archetype that exists (correcting for the normalization issues), and model all the audit logging, authorization with just SQL commands (RLS is super performant now), won’t that be really “vendor neutral”? Pick any PostgreSQL database on the cloud or on your machine (Postgres is a standard), use these SQL data models, auto-generate your APIs, or use SQL as the API, and you have your very own health data platform!
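As a rough illustration of the authorization point, a minimal row-level security sketch (the table, column and session-variable names are all hypothetical):

-- assuming a per-row facility/tenant column exists on the clinical tables
CREATE TABLE observation_record (
    id          bigserial PRIMARY KEY,
    facility_id bigint NOT NULL,
    payload     jsonb  NOT NULL
);

ALTER TABLE observation_record ENABLE ROW LEVEL SECURITY;

-- only rows belonging to the caller's facility are visible;
-- the application would set app.facility_id per connection/session
CREATE POLICY facility_isolation ON observation_record
    USING (facility_id = current_setting('app.facility_id', true)::bigint);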

SQL queries are also largely portable across platforms. And yes, you can of course think about federating multiple data platforms via an Iceberg endpoint that syncs with the tables in real time.

All of this with radically less complexity and code that needs to be maintained.

2 Likes

Hi @Sidharth_Ramesh

I think there are mixed concepts, though I understand your view from the development side of things.

First, when talking about SQL today, it’s not pure SQL anymore; each DBMS provides its own extensions and mixes technologies into a SQL box.

For context, all my implementations of openEHR (EHRGen, EHRServer and Atomik) have SQL persistence without extensions or custom plugins, it means it’s as pure SQL as you can get. Lately I’ve been optimizing a lot on Atomik and needed to use some MySQL-specific keywords, but nothing that other RDBMSs don’t have, though the syntax might be a little different.

My take is that the openEHR spec is persistence-agnostic, so even talking about SQL or any database technology or standard in the specs is out of scope. That’s also why I reject the marketing materials that were circulating in the last couple of years about openEHR being an “EHR persistence solution”.

On the other hand, modeling is a process that has nothing to do with persistence, so why mention SQL there? I think the modeling process should be abstracted from the implementation technology as much as possible, so modelers don’t have to even think about it. That’s a core requirement for openEHR IMHO.

Now talking about implementation, the main challenge of persisting openEHR data (hierarchical in nature) is the CONTAINS operator in AQL and how to translate it to SQL, because CONTAINS doesn’t check just for a direct child but for all descendants - an operation that is already provided by JSON and XML (document-oriented) DBMSs, and also by relational databases with JSON extensions. Another approach would be an object-oriented database, which was tested and left behind by the original version of EtherCIS, which is EHRbase now (I think the main reasons were the cost and that the performance wasn’t great).

We (EHRServer / Atomik) came to a compromise solution where we don’t have CONTAINS and don’t represent all the levels in the hierarchy in the database, but represent meaningful levels that make queries expressive enough that can be mapped to SQL easily. I can make a small video showing how queries work.

So if your main point is development, I would say yes, you can do it, though I wouldn’t recommend the approach taken by some projects that generated tables based on OPTs, because depending on how that’s done, you might end up with multiple disconnected pieces of data that are difficult to query together. For instance, if you have two OPTs with the BP archetype and generate different tables for the BP in each OPT, then when you query you might hit one table but not the other. In our case, we have one table for all the DV_QUANTITYs, which means all the QTYs from all the OPTs will be there.

Another item to consider is whether you want a native openEHR database for each app, or an openEHR backend supporting many apps. It might be faster to develop and more performant to have a specific openEHR DB per app, optimized for the data managed in that app, and then a generic openEHR backend that aggregates all the data from the many apps.

3 Likes

I do agree - there’s no dialect-free implementation of SQL today that works. However, I’m seeing the MySQL and PostgreSQL wire protocols starting to get wide adoption.

I completely understand this is how the spec has been proposed and developed. But my question is - should it be?

Given that we’re building a solution for data management, we are almost always limited by what’s available in the database implementation, aren’t we? And a standard can still be “technology-agnostic” if the SQL dialect is a standard. Today, there are a dozen Postgres wire-protocol compatible products on the bleeding edge of databases - CockroachDB, YugabyteDB, Google Cloud Spanner, Google AlloyDB, CedarDB, ClickHouse, AWS Redshift, Firebolt, Yellowbrick, Greenplum, Materialize, TimescaleDB, QuestDB, CrateDB.

And I see this movement in the MySQL space as well, but not to this extent.

That’s already more “vendor-neutral” implementations and technology options to pick from than the total number of openEHR companies today.

I tried to really dig into this and find out WHY this is the case. For instance, I found the earliest description of it in the ANSI-SPARC architecture of 1975. However, that only separates the physical representation (how the bytes are organized on disk) from the conceptual/logical representation, which SQL already does pretty well. Of course, there’s another layer on top exposing the data to the user via a “view”, which in today’s terms would be the API.

The earliest reference to abstract logical domain modelling that’s separate from the logical database schema (SQL doesn’t talk about how bytes are organised, so it already qualifies as a conceptual model according to ANSI-SPARC) seems to be the invention of ER diagramming in 1976. I couldn’t discover exactly when people started separating logical domain models from SQL schemas, but I suspect this is because DB admins were a scarce resource back then, and you could still find people who could capture the nuances of the domain as simpler ER diagrams and have them converted to specific SQL dialects by the DBAs. Please let me know if you know exactly when and why this practice started, @pablo - maybe I’ve missed something here.

Today, I would argue, it’s actually easier for domain experts and clinicians to author and maintain SQL directly using the domain knowledge they have, given LLMs - opening up the domain to everyone. Just try a tool like https://database.build or Lovable and, as a domain expert, you’ll never think the same again. With a quick list of rules on how to model different entities in SQL (unwinding, as @thomas.beale put it), one can start actually putting their ideas and concepts into an interoperable, computable format - SQL CREATE TABLE statements. And many times, it’s actually FASTER to iterate using a declarative piece of text + an LLM assistant than it is using GUI tools. ADL2, in my opinion, is too hard for LLMs to author directly, and they’ve simply been trained on much more SQL.

The way we modelled this is to represent different archetypes in their own tables and have all compositions go in another table. The archetype rows reference which composition they belong to (see the sketch below). Most of the queries we did in the real world were almost always either at the archetype level (give me all pulse readings) or at the template level with a single-level CONTAINS (give me all pulse readings under this template) - both of which can be done super efficiently in SQL directly. The other, more complicated CONTAINS queries can be a bit more verbose in SQL, but they are definitely possible, and probably more efficient to perform on simple normal tables than on JSONB / XML (research needed).
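Roughly, the layout looks like this (a simplified sketch - real tables carry more RM attributes, and the column names here are illustrative):

CREATE TABLE composition (
    id          bigserial PRIMARY KEY,
    ehr_id      uuid        NOT NULL,
    template_id text        NOT NULL,
    start_time  timestamptz NOT NULL
);

CREATE TABLE pulse (
    id             bigserial PRIMARY KEY,
    composition_id bigint      NOT NULL REFERENCES composition (id),
    rate           numeric     NOT NULL,   -- beats per minute
    regularity     text,
    measured_at    timestamptz NOT NULL
);

-- archetype-level query: all pulse readings, straight off the table
SELECT * FROM pulse;

The template-level, single-CONTAINS case is then just a join against composition.template_id.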

Honestly, neither would I.

The way I’m thinking about the problem is not to convert existing templates and archetypes into SQL tables to “fit” with openEHR, but to create all these models in SQL from scratch, inspired by the data models in the openEHR CKM - take what’s best about openEHR (the data models) and use it with a data platform that’s already battle-tested (SQL), without introducing any additional layers of accidental complexity.

This again is something I’ve been reading up on and researching a lot. The problem in question is that of database coupling, or shared databases - and I think we run into the same issues the author talks about EVEN IF we use an openEHR REST API on top of a single database. It’s still the same shared-database pattern, just with HTTP REST on top of a single SQL instance instead of SQL.

Let’s take a look at the arguments against this pattern again:

The drawbacks of this pattern are:

  • Development time coupling - a developer working on, for example, the OrderService will need to coordinate schema changes with the developers of other services that access the same tables. This coupling and additional coordination will slow down development.

  • Runtime coupling - because all services access the same database they can potentially interfere with one another. For example, if long running CustomerService transaction holds a lock on the ORDER table then the OrderService will be blocked.

  • Single database might not satisfy the data storage and access requirements of all services.

Development time coupling - you’ll need that to ensure semantically meaningful data in healthcare. In a way, having a process to reconcile the different data models of different projects is the nature of how data modelling should be done - so this is the case even with archetype modelling.

Runtime coupling - performance on traditional DBMSs has come a LONG way, and I don’t think this should be much of a concern (at least when comparing it with the alternative - putting a REST API on top of a single database). And recent advancements, especially Postgres 18, have directly addressed this concern (PostgreSQL 18 Is So Good, You Might Rethink Microservice DB Isolation; PostgreSQL 18 in Microservices: You Don’t Need a Separate DB for Everything).

Single database might not satisfy - this has so far not been true. Almost all openEHR services we know of are built directly on top of a single SQL database. And compared with the alternative, which is hosting a REST API on top of the DB, you get more expressivity (stored procedures) working against the data models directly than through a REST API. This was one of the main reasons a system like Epic decided to go with something like MUMPS. I’d argue that, with the ecosystem, extensions, and customizability of databases today (specifically eyeing Postgres here), we can easily do everything the healthcare domain needs.

2 Likes

There’s definitely the argument that you should structure your data differently for transactional (3NF) vs analytical (dimensional/star schema) use cases.

I want to clarify my take here: the “health data platform” as defined by openEHR is primarily a transactional system (OLTP) as opposed to an analytical one (OLAP).

This gives domain experts enough context to work on 3NF models directly.

It’s also possible to convert 3NF models to dimensional models (Kimball et al.) for analytics without any loss in context or semantics by just denormalising a bit - and while this can’t be automated, it has been proven to be possible for any 3NF representation if you follow a few rules.
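For a feel of what that step looks like, a hedged sketch reusing the hypothetical blood_pressure and composition tables from earlier (a real Kimball design would have proper dimension tables rather than these degenerate keys):

CREATE VIEW fact_blood_pressure AS
SELECT
    bp.id                           AS reading_id,
    c.ehr_id                        AS patient_key,   -- would normally reference a patient dimension
    c.template_id                   AS template_key,
    date_trunc('day', c.start_time) AS date_key,
    bp.systolic,
    bp.diastolic
FROM blood_pressure bp
JOIN composition c ON c.id = bp.composition_id;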

1 Like

Everything in a data management spec like openEHR is an abstraction, and you need to define the scope and limits of each semantic layer somewhere, all based on the specified requirements. As I see it, persistence isn’t a requirement for openEHR, so yes, I think it should be that way. If persistence definition were a requirement, you’d get a different answer.

On the other hand, there are so many ways of designing the persistence that sticking to one way or technology could generate problems for some part of the industry using a different approach.

For me the current specs have the right balance: they define enough semantics in an abstract way to allow many implementation technologies, though for a full implementation of, for instance, AQL, you MUST have some kind of hierarchical support in the database to solve the CONTAINS, which is why we don’t implement AQL. So even if the current spec is pretty abstract, it still has some technology-specific constraints.

UPDATE: sorry, lots of typos from typing on the phone.

1 Like

It’s hard for me as a domain expert to imagine that SQL + LLM is actually easier than the Archetype Designer GUI + ADL. No doubt it’s easier to get some output, and that output is a lot better for SQL than for ADL. But the combination of a modeller creating a model they don’t understand (understood only through the LLM, which is not understandable itself) with LLMs being inherently faulty (probabilistic) is quite scary. Additionally, you overload clinical modellers with database design responsibilities for which they are unsuited.

A key value proposition of openEHR is its excellent library of domain models; that’s a direct result of design decisions at the spec level, like ADL as a language oriented to clinical modellers and suitable for GUI-first editing, two-level modelling, implementation independence, etc. Reversing those decisions will lead to worse clinical models.

Now I do appreciate your problems around scarcity of developers and tools. So I really like having this discussion.

But I’m very much unconvinced by the basic premises (though I’d love to be wrong):

  1. Developers understanding openEHR are scarce. No doubt true, but clinical modellers understanding SQL are much more scarce (by multiple orders of magnitude).
  2. LLMs make database developers out of domain experts. In addition to the two arguments above (inherently faulty, and the expert not being able to verify the models): if it were true, it would solve your main problem - you’d no longer need technical developers.

Finally, I’m not against doing a direct SQL implementation of archetypes, templates or whatever. That’s the developers’ domain. What does matter to me as a clinical modeller and clinical data repository manager is having the tools to do that job. I need to be in control of updating the models, queries and data. I can do that with ADL and AQL. I don’t think I (or people like me) could do it with SQL. Now, whether you/the CDR implement the models as JSONB or as fully modelled-out ‘tables’, I don’t care, as long as I can do live updates. (With modern CI/CD that should probably be at least as good as the ‘traditional’ approach I’m used to, where applications stay uncoupled from model updates in both the CDR and the client app.) And if I were to accept any additional constraints on my work, what would you want from me, and how would that help us as a team of clinical modellers and software engineers?

4 Likes

This is the perfect kind of rebuttal I was looking for.

I actually don’t need clinical modellers to fully understand SQL and its nuances and be database developers.

Assume that I’m able to provide you with a simple tool where you can describe your clinical requirements. You can upload photos of real-world forms, talk about the data points you are capturing, and the AI generates something whose output you can clearly see - for example, a simple table (or a mind map) with the different fields and what kind of fields they are, which you can easily verify and sign off on.

The next step would be to pass this on to an actual, rigorous database modeller for review. We’re already seeing a lot of this in the vibe-coding space - Vibe Coding Cleanup as a Service. Here, instead, we’d be doing it for clinical models - because there are real database implications when a clinical model doesn’t follow traditional data modelling principles like proper normalization (you get data anomalies and all sorts of other messy issues). However, because the input from the clinical modeller is already computable as SQL (and the LLM will have already tried its best to stick to the best practices we prompted it with, and will probably be 90% correct), a single full-time database modeller could probably review and approve/modify almost all the clinical data models that the CKM has today.

Approving and modifying 1000 archetypes/year is easy for a full-time person dedicated to this. If they are part-time, you’ll probably need at most 2 or 3 of these folks to keep up with the modelling requirements of the whole openEHR community as it is right now. What was always painful before was the back-and-forth between clinicians and developers, where you had to sit them in the same room to really understand the requirements, have the developer implement things, and then come back only to find out “this is not what the clinician actually meant”.

What we’re seeing now is clinicians/domain experts with no programming experience vibe-coding their requirements directly first and quickly figuring out what they actually need - think of it as prototyping at the speed of thought. This is true for entire application workflows and UIs, not just data models. Then, once they are more or less okay with what they have, they send it in for clean-up - a team takes a quick look and makes sure it’s fit for production. We’re already doing this with a lot of our customers, and the time + effort it takes is 10x less than traditional “spec-driven/milestone-driven development”.

In a sense, it is exactly what openEHR promises - a clinician should be able to customize their workflows and data models how they like. There’s a lot of argument about Malleable Software in healthcare, too.

But the way I propose we do this is different. We currently have developers building and maintaining platforms + separate toolchains + GUIs that are “specific to openEHR”, and clinicians and modellers using these tools to output computable artifacts that are then accepted by the platform. Instead, we could get rid of all the “specific to openEHR” tooling + platforms, use what is already working really well for regular developers, and align our specs, principles, platform, and tooling to stand on the shoulders of these giants.

My reasoning behind all this again comes down to this: we don’t have enough IT professionals and architects in healthcare to successfully build and maintain excellent tooling for what we need to do. We have only a few companies that even contribute back to the open-source community. We can’t afford to invent our own tools and our own way of doing things if more than 95% of what we need can be done really well on existing, established data standards like SQL. I’d say it’s sometimes better not to implement the remaining 5% and have something that works remarkably well, most of the time, for most people, over something that could theoretically work 100% but only sort of works, sometimes, might break down at scale, is maybe not that maintainable, and is hard to hire people from the market to work on…

@joostholslag I’d really like to take this up as a challenge. I think we have tools today that can provide you with a good enough experience to do this - not just for the data models, but for the entire software’s workflows and UI - while still sticking to openEHR’s promise of consistent, semantically meaningful data. My goal is for you to have your own version of “malleable software” that you can customize to your liking.

It would take a bit of time and effort to get there, but if I could deliver the same or a better experience than what you have now, would you, in your opinion, really care whether it’s SQL data models or ADL archetypes underneath?

For me, one of the big bonuses of the current CDR approach (also shared by FHIR-native CDRs) is that they are domain-driven and do not require any DB re-engineering to fulfil new or updated clinical requirements, and the RM is optimised to avoid breaking changes that would require some kind of low-level DB updates - though even then this is about path changes, not actual DB schema changes.

I completely understand that this sets a real challenge for CDR developers and is non-trivial engineering, which is why I regard CDR development as niche but very high-value.

The idea of the ‘headless EHR’ will IMO only really work if the client also owns the ability to alter the content, without recourse to low-level engineering for every change. That for me is the real breakthrough - true domain-driven datastores - and as @joost says, it is the live updates that are critical.

As others have said though, there is nothing to stop someone building an openEHR API compliant datastore that does require manual schema management for every content change.

However, I think you are underestimating the complexity that will result as you scale up to even something like a typical GP system.

I certainly don’t have a problem with adding a simplifying layer on top that makes a client developer feel like they are working with a table - as I understand it, that is part of the goal of the SQL-on-FHIR project, and I think we should pick up on that, as our model-based approach should work well beneath a SQL-on-FHIR layer. But the ‘tables/columns’ are going to be very dependent on the exact requirements of the client and the mix of templates and archetypes, not the optimal physical representation. So, if there is a need for SQL to appeal to a set of developers, I would put that on the outside as a facade, not on the inside.

BTW I’m pretty sure you are not correct in asserting that most current CDRs use XML or JSONB

I have some sympathy with this, but let’s do that on the outside and probably increasingly via simple FHIR interfaces, not on the inside.

2 Likes

Hey @ian.mcnicoll really appreciate your insights here.

I want to clarify on this point:

I don’t fully understand where the fear of making database schema changes comes from - maybe bitter experiences of schema changes causing destructive changes in production? And is it something regular devs, let alone clinicians, are never supposed to do - just the DBA’s role?

We’ve had significant developments in safe ways to automatically evolve SQL schemas in the past decade, with tools like AtlasGo and Bytebase, among others. With relatively simple CI/CD safeguards around such tools, it’s practically impossible to destroy data even if you try really hard.

I’m trying to reason about what really makes a SQL migration “low-level engineering”, whereas a template update is a “high-level domain model” change by a domain expert.

And it’s not without consequences - in any system that evolves over time, we are bound to have instances of data that go out of date due to updates to the underlying model, whether that’s rows in a table that now has a new SQL schema or an openEHR composition whose template now has a newer version.

I’m arguing that by shifting the complexity of evolving data models away from SQL and into templates + archetypes, we are forced to tackle the same problem at a different level, but without any of the battle-tested tools.

For example, because SQL can tackle migrations head-on and move all the old data to the new fields, you only ever have to deal with the complexity of a new data version once - when the migration is done. Whereas in a schema-on-read system like openEHR (or other NoSQL stores / FHIR), you have to move this complexity into the application every time there’s a change - I once had to write a REALLY long AQL query that covered all possible paths for a particular data point across different template and archetype versions. And even then, it’s only a matter of time before something goes wrong, as I mentioned earlier.
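To make the “deal with it once” point concrete, here’s a hedged sketch of such a migration (table and column names are hypothetical - say an older schema stored body temperature as a single text field and the new version wants magnitude + unit):

BEGIN;

ALTER TABLE body_temperature ADD COLUMN magnitude numeric;
ALTER TABLE body_temperature ADD COLUMN unit text;

-- backfill old rows from the legacy representation, e.g. '37.2 Cel'
UPDATE body_temperature
SET magnitude = split_part(value_text, ' ', 1)::numeric,
    unit      = split_part(value_text, ' ', 2)
WHERE magnitude IS NULL;

ALTER TABLE body_temperature ALTER COLUMN magnitude SET NOT NULL;
ALTER TABLE body_temperature DROP COLUMN value_text;

COMMIT;   -- Postgres DDL is transactional, so a failure anywhere rolls the whole thing back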

My main question now is - what really makes “templates + archetypes over the openEHR REST API” safer or more high-level than just “SQL on a SQL database”? Is there a way to define this property?

  • Maybe it’s guaranteed non-destructive changes - the ability of new changes not to cause data loss? (Which openEHR does provide)
  • Maybe backward compatibility - new data model version should not break existing applications? (which openEHR sort of provides if you create a new template, but the old application will lose out on the newer data from the newer template)
  • Maybe forward compatibility - old data model data should continue working with all future applications (openEHR doesn’t provide this properly unless you manually migrate old compositions to newer template versions. SQL does this much better by migrating data to newer fields)

And @ian.mcnicoll what you said here, I couldn’t agree more:

openEHR CDRs are probably one of the most non-trivial pieces of engineering in the Healthcare industry, hands-down!

But should the limited development resources available in healthcare really be building and maintaining systems at this level in the first place? Would we have been better off having all these developers just bite the bullet and actually run the “low-level” migrations for every single archetype change, while we focused on building and maintaining a maximal global data model that is already compatible with existing databases?

Despite all the investment, we’re still quite a distance away from scale, performance, and simplicity.

I think the answer to simplicity cannot be “build more tools” - because

  1. We’re already severely constrained by development time and effort - compare any open-source tool in healthcare, even by a tech giant like Google (Open Health Stack), with the usual stuff they put out (Kubernetes), and you’ll see vastly different usability and reliability.
  2. Even a project like SQL on FHIR adds MORE complexity to the entire ecosystem (ViewDefinitions, mapping rules, and targets) - it’s MORE things for new developers to learn and maintain. Not less. A really good article on this - GitHub - zakirullin/cognitive-load: 🧠 Cognitive load is what matters

What I’m trying to do here is dig as deep as possible into what the real ideas behind openEHR are, where it really shines, and what is accidental complexity / unnecessary cognitive load.

If we are able to strip away all the extra complexity (which is going to be really hard for a lot of us), what will remain will probably be the real complexity of data models of the healthcare domain. And we can at least attempt to build for the next generation from there.

1 Like

Even if Sidharth and I implemented openEHR models on SQL in the same way (one table per archetype with compositions in another table(s)), I would keep the models in ADL.

Transforming ADL to SQL is automated and others are free to use non-SQL approaches based on the same ADL. There is no downside to keeping ADL as the source of truth.

In case some archetypes could be improved based on their SQL form (as Sidharth has mentioned), this should be done at the ADL level and re-generated to SQL (or other approaches).

I expect that even with a normalized SQL implementation, there would be some differences among vendors. For example, I already have two approaches: a) a fully flattened RM, and b) a single JSON column for each RM element + separate columns for the searchable fields of that RM element (something Thomas mentioned too).

This discussion is mostly about persistence. For me, the important part is also how the openEHR data is used in the UI. This is another argument to keep ADL. I’m trying to cover the “full-stack” openEHR platform - CDR and UI. This is why I invest in openEHR generators. BMM allows us to generate RM and AM classes for the chosen programming language (i.e. Kotlin for the CDR and TypeScript for the UI). This gives developers classes like DvQuantity to work with. The next step is UI components for DvQuantity, … The final step is to use ADL to generate domain classes and forms for OPTs (with constraints and validation).

2 Likes

This topic reminds me of @birger.haarbrandt ‘s “database from hell” example of trying to (if I remember correctly) get all the assorted blood pressures from all around their legacy system :fearful:

Seriously though, while I’m far from understanding all the technical aspects of this thread, one POV I haven’t seen mentioned is the unique context of each template. Tables for each archetype may or may not work well, and tbh I don’t have an opinion about the best technology to implement openEHR. But when querying data we very rarely do so based on an archetype alone, but rather on an archetype in a specific context within a specific template (which is also why the CONTAINS operator of AQL is so uniquely useful). This is because archetypes are (mostly) designed to be universally applicable models of their respective concepts, while much of the context is applied at the template level.

3 Likes

Hey @borut.jures, this is a valid point. And I agree that there are definitely some semantics in the UI that are still not easily representable using SQL directly. But I’m not really sure archetypes + templates cover these well either. We might need another layer on top of the templates (after trying to annotate templates for everything). One answer to this could well be code directly - a React component / Flutter widget that can be quickly assembled from a library of UI components/widgets for the respective data types - this is sort of what we did with Medblocks UI.

But what I’ve learnt is that it finally comes down to each use case, and building the forms using AI / by hand was the fastest way to practitioner delight. And it’s the validation and constraints at the persistence layer that really dictate what a permissible UI should look like - we can reuse library components or build one from scratch, but as long as it fulfils the field’s validation contract, we have a valid UI.

Things like MCP that interact with different servers, with user approvals, are still sort of a user interface.

Now this is something we’ve really tried and experimented with using SQL a lot - almost all major languages have a tool that can convert SQL definitions into classes in the programming language. Some examples: Drizzle for TypeScript, jOOQ for Java, among others. And not to mention projects like Hasura and PostGraphile that generate a GraphQL schema, after which you can use all the GraphQL code generators too!

I definitely think ADL or another high-level language is useful and has a lot of value, but what I’m worried about now is the complexity of the implementation when it comes down to the data platform and persistence.

1 Like

The beauty of openEHR is that ADL2 has all the information on how a template constrains the used archetypes. Based on this, constraints can be implemented on the CDR side and also in the UI.

The data for every archetype (= table) carries the information about which template was used to create it. Searching for the archetype(s) contained in a template is “direct” when using the proposed implementation. We should do proper benchmarking, but this “direct” search should mean significantly faster querying.

@siljelb Do you also search for Composition CONTAINS ArchetypeA AND ArchetypeA CONTAINS ArchetypeB?

@Sidharth_Ramesh I actually use Drizzle :wink:

Generating RM classes allows for much cleaner code:

// Generated interfaces for the archetype's cluster data, built on the generated RM types (ItemData, ClusterData, DvTextData, ...)
export interface OeeCluLevelOfExertionV0ExerciseIntensityItemData extends ItemData {
    description: DvTextData,
    intensity: DvCodedTextData,
}

export interface OeeCluLevelOfExertionV0ExerciseIntensityData extends ClusterData {
    items: OeeCluLevelOfExertionV0ExerciseIntensityItemData[],
}

// Generated Drizzle table for the archetype; cluster() and dvCodedText() are custom column helpers that map RM types onto columns
export const oeeCluLevelOfExertionV0 = oeEhrSchema.table("clu_level_of_exertion_v0", {
    ...idEmbeddable<OeeCluLevelOfExertionV0>(),

    exerciseIntensity: cluster<OeeCluLevelOfExertionV0ExerciseIntensityData>("exercise_intensity"),

    phase: dvCodedText("phase"),

    ...baseEmbeddable,
})
1 Like

Hey @siljelb excellent point.

This is pretty straightforward with relational tables.

Just for illustration: you could imagine a composition table with id, name, template_id, and other attributes, and another blood_pressure table with id, composition_id, systolic, and diastolic fields.

The CONTAINS with normal SQL will just be:

SELECT bp.*
FROM blood_pressure bp
JOIN composition c ON c.id = bp.composition_id
WHERE c.template_id = 'your_template_id';

Now this could get more complex, and you could have recursive searches as well - it’s just a matter of writing the right SQL (see the sketch below).
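For deeper, multi-level CONTAINS, one option is a generic containment index maintained alongside the typed tables - a hedged sketch, where node_index and all the names in it are hypothetical:

-- node_index(node_id, parent_id, archetype_id, composition_id) records the hierarchy once
WITH RECURSIVE descendants AS (
    SELECT node_id
    FROM node_index
    WHERE archetype_id = 'openEHR-EHR-SECTION.adhoc.v1'      -- the outer CONTAINS
  UNION ALL
    SELECT n.node_id
    FROM node_index n
    JOIN descendants d ON n.parent_id = d.node_id
)
SELECT bp.*
FROM node_index n
JOIN descendants d ON d.node_id = n.node_id
JOIN blood_pressure bp ON bp.id = n.node_id                  -- assumes typed rows reuse their index node id
WHERE n.archetype_id = 'openEHR-EHR-OBSERVATION.blood_pressure.v2';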

Moreover, there’s another pattern I’m seeing quite a bit in the organizations I’m working with - almost all instances of an archetype used across all of an organization’s templates tend to have the SAME exact constraints. For instance, if Diagnosis is limited to a SNOMED CT value set in one template, they actually want the Diagnosis archetype to be used the same way in other places as well. Same with Medication Order - the same pattern with the same clusters and similar constraints everywhere. Has this been your experience too, or maybe something different?

This makes me think that there are probably constraints that apply globally in a CDR and need to constrain an archetype (table-level) to certain rules - which, in SQL, can be done with CONSTRAINTS / FK relationships - and these can be applied globally in the CDR AFTER importing the base table definitions (which will still be maximal and try to cater to all use cases).

Then there are the template-level constraints that shouldn’t apply to the archetype globally in the CDR - these can again be implemented using CONSTRAINTS and checks tied to TRIGGERS that fire when the composition matches a particular template_id or precondition (see the sketch below). What we’re able to do with SQL here is super expressive, and we can pretty much cover anything an openEHR template can throw at us - including terminology validations (we just model these as FK constraints).
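Hedged sketches of the three kinds of rules (all table, column, function and template names here are hypothetical):

-- 1. Global, archetype-level rule as a table constraint
ALTER TABLE blood_pressure
    ADD CONSTRAINT chk_bp_range
    CHECK (systolic BETWEEN 0 AND 400 AND diastolic BETWEEN 0 AND 350);

-- 2. Terminology binding as a foreign key into an imported value set
CREATE TABLE snomed_diagnosis_valueset (code text PRIMARY KEY, display text NOT NULL);

ALTER TABLE diagnosis
    ADD CONSTRAINT fk_diagnosis_code
    FOREIGN KEY (code) REFERENCES snomed_diagnosis_valueset (code);

-- 3. Template-scoped rule via a trigger that only applies under one template_id
CREATE FUNCTION enforce_vitals_template_rules() RETURNS trigger AS $$
BEGIN
    IF (SELECT template_id FROM composition WHERE id = NEW.composition_id) = 'encounter.vitals.v1'
       AND NEW.systolic > 250 THEN
        RAISE EXCEPTION 'systolic above 250 is not allowed under encounter.vitals.v1';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_vitals_template_rules
    BEFORE INSERT OR UPDATE ON blood_pressure
    FOR EACH ROW EXECUTE FUNCTION enforce_vitals_template_rules();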

So we’re able to cover:

  • Querying archetypes/templates in context using SQL joins
  • Global archetype level constraints (not possible with openEHR today) using table-level constraints
  • Template level constraints using TRIGGERS with additional checks

And most of the actual complexity of the implementation can be hidden away from the clinical modeller. I don’t know how many people who have contributed to the CKM today actually read/write ADL. It’ll be similar - there’ll be a tool that abstracts the complexity, but the output artifact would be something that works with a lot of the existing ecosystem and tooling.

1 Like