openEHR archetypes as SQL Tables

But the same archetype may be used several times in different contexts within the same template. For example, OBSERVATION.spirometry_result within one SECTION for the baseline before some procedure was done, and then again within a different SECTION for after the procedure. This is a routine and crucial template modelling pattern.

I normally do this as Composition CONTAINS ArchetypeA CONTAINS ArchetypeB, but yes.

2 Likes

Ah, I see now what you mean.

So we’ve so far been experimenting with the order attribute on the archetypes - as they fit inside the template, we know which goes first and which second, and so on - but we never implemented different constraints on the archetypes within the same template like this.

We usually separated the template into multiple smaller bits - especially when they will be entered at different points in time by different professionals.

1 Like

I’m pretty certain this isn’t the case in our organisations (regional hospital authorities). We have a wide variety of value set, terminology and occurrences constraints as well as different ways of nesting archetypes within one another to express specific information and information contexts. A select few archetypes (mostly the “simple” vital signs) do have some more commonly occurring constraint patterns, but it’s far from universal.

1 Like

To @siljelb’s point about templates, my understanding is that, at least from a validation POV, you almost have to treat each template constraint of an archetype as a separate table/column, e.g. I might have a different set of constraints for a paediatric BP vs an adult one. In this way of working, where are the constraints applied?

And for people looking in - this is not really an openEHR discussion, but one about the merits or otherwise of a domain-driven datastore with a fixed schema (CDR) which accepts everything it gets from clinical content authors, vs. a datastore where the physical persistence continually has to adapt to new or altered content. openEHR does not actually care which approach is used, but Ocean pioneered the CDR, and Better and DIPS followed. I understand that others have used Hibernate, or indeed some kind of hand-crafted SQL (?? the Chinese datastores).

The FHIR native CDRs face exactly the same challenges, possibly even worse. The root cause is the highly nested and fractal nature of clinical data. If your use-case is somewhat constrained, I can see how a more ‘crafted’ approach might be attractive, as long as the datastore consumes/emits/queries data in a standard format.

@Sidharth_Ramesh (Great discussion BTW!!) The case for a domain-driven CDR for me is easy - speed, adaptability and control. When we are working, particularly in integration projects, there is a lot of template updating/adaptation. If we had to wait for the db schema to be updated each time, it would hugely slow the project down. Similarly for on-going updates. Basically - what @joostholslag said - you can’t avoid the complexity of managing the increasingly complex semantic pot inside a mature EHR, which is not just about the db schema but terminology valuesets, quite complex clinical ideas etc., and it feels to me that this must be the responsibility of dedicated clinical informaticians.

But ultimately, I’m perfectly happy with what you are suggesting - more like the choice between a manual and automatic gear-shift, as long as the ‘standard’ is applied at the edges.

4 Likes

Another regularly occurring pattern, often used together with and/or nested within intra-template “micro-contexts” like those mentioned above, is repeating a group of archetypes, often within a 0..* SECTION. An example of this is the combination of an INSTRUCTION and the corresponding initial-state ACTION.

3 Likes

Using the same archetype in the same template is fine. Each use has its own ADL path which is used to differentiate the uses (I store the path for each PATHABLE element).

Note also that there is one table per archetype, but every data instance for that archetype is represented as a separate row in that table. Each row has a template id.
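The layout described in the two paragraphs above - one table per archetype, one row per data instance, a template id per row, and a stored path to differentiate repeated uses of the same archetype - might be sketched like this. This is a hypothetical illustration using sqlite3 as a stand-in; all table, column, and path names are invented, not a real product schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE composition (
    id INTEGER PRIMARY KEY,
    template_id TEXT NOT NULL          -- template the data was captured under
);
-- One table per archetype (major version); one row per data instance.
CREATE TABLE obs_blood_pressure_v2 (
    id INTEGER PRIMARY KEY,
    composition_id INTEGER NOT NULL REFERENCES composition(id),
    archetype_path TEXT NOT NULL,      -- stored path tells repeated uses apart
    systolic_magnitude REAL,
    diastolic_magnitude REAL
);
""")
conn.execute("INSERT INTO composition VALUES (1, 'example_prehab_template')")
# Same archetype used twice in one composition, differentiated by path:
rows = [
    (1, 1, "/content[SECTION,'Before procedure']/items[OBSERVATION.blood_pressure.v2]", 120.0, 80.0),
    (2, 1, "/content[SECTION,'After procedure']/items[OBSERVATION.blood_pressure.v2]", 135.0, 85.0),
]
conn.executemany("INSERT INTO obs_blood_pressure_v2 VALUES (?,?,?,?,?)", rows)
n = conn.execute(
    "SELECT COUNT(*) FROM obs_blood_pressure_v2 WHERE composition_id = 1"
).fetchone()[0]
print(n)  # 2 - both uses of the archetype are rows in the same table
```

The point of the `archetype_path` column is that the two BP readings stay distinguishable even though they land in the same archetype table.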

1 Like

I agree, @siljelb - that is absolutely not our experience. Especially since, at least in the UK, community prescribing works very differently from hospital prescribing.

For things like Diagnosis, sure, there may be a fallback to a large common valueset, but that does not work in specialties like Renal, which want to apply a much tighter or conditional valueset. Or e.g. cancer care, where there is a need to maintain a very different level of Diagnosis recording, capturing detailed staging, morphology etc.

Similarly for vital signs - that can be very simple and common usage for routine vitals, but much more detailed and demanding in e.g. cardiology or ICU (the Pulse/Heartbeat debate!!)

2 Likes

It may have the same path (or not, if the name node is adapted at template level) but it may not have the same constraints on the end of the path - where is that captured for validation purposes?

2 Likes

In the approach proposed in this thread, the schema is automatically (no humans involved) updated in less than 1 min. That should be acceptable :thinking:

The constraints are at the API level or as triggers/procedures directly in the database as Sidharth mentioned (this is extremely low-level @Sidharth_Ramesh :wink: ).

I prefer the API level since most validations can be done with generated SDK. And the entire composition is still validated as in any other CDR.
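The “constraints as triggers directly in the database” option mentioned above might look something like the following sketch. It uses sqlite3 as a stand-in for PostgreSQL (where this would be a trigger function instead), and the 0–400 range is an invented illustrative constraint, not taken from a real archetype:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obs_blood_pressure (
    id INTEGER PRIMARY KEY,
    systolic_magnitude REAL
);
-- A template/archetype range constraint pushed down into the database:
CREATE TRIGGER bp_range_check
BEFORE INSERT ON obs_blood_pressure
WHEN NEW.systolic_magnitude NOT BETWEEN 0 AND 400
BEGIN
    SELECT RAISE(ABORT, 'systolic magnitude out of modelled range');
END;
""")
conn.execute("INSERT INTO obs_blood_pressure VALUES (1, 120.0)")  # accepted
try:
    conn.execute("INSERT INTO obs_blood_pressure VALUES (2, 900.0)")
except sqlite3.DatabaseError as e:
    print("rejected:", e)  # the trigger aborts the invalid insert
remaining = conn.execute(
    "SELECT COUNT(*) FROM obs_blood_pressure"
).fetchone()[0]
```

Whether this level of enforcement lives in triggers or in a generated SDK at the API level is exactly the trade-off being discussed.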

2 Likes

I’d agree with this: I don’t care about how things are implemented at the technical level as long as I can add templates, persist and read compositions, and do AQL queries like normal.

As an example of the templating patterns I mentioned above, this is a template used in hospitals across Norway for documenting an allergen immunotherapy treatment appointment. It’s somewhat old by now and I’d probably do some things differently today, but I think it demonstrates the patterns pretty well while being small enough that it’s possible to get an overview: Archetype Designer

2 Likes

So the only reason for suggesting TRIGGERS here is the appeal of having all the data modeling work exported as a single computable output that’s widely supported across vendors.

The use cases that @ian.mcnicoll and @siljelb brought up are quite valuable. Thanks a lot for sharing the examples. So to this:

I say the tools we have available define the output of what we produce. So, with a different output artifact and a technique with different constraints, how we model, persist, and query may not be EXACTLY the same, but I’m confident we can get 90% of the functionality (at least as far as I can see) - this might mean breaking up some templates, and changing how things are modelled and how constraints are applied.

What we might get may not be an openEHR CDR as it is defined today, but something that embodies the principles of openEHR that’s scalable, performant, and maintainable. Most important of all, widely accessible to every organization and developer who can use SQL, BUT at the cost of some modeling techniques not being possible (the same data constraints can still be represented using other techniques).

How do you think the community (and you personally) would feel about something like that?

From my personal POV, and to be perfectly honest, that sounds like a lot of work for little gain. Despite some quirks our current CDR is working well enough, largely doing what we need it to and apparently scaling fine across millions of EHRs.

Another thought that just hit me: How do the compute requirements compare between a workflow using a less performant persist and query method, and one highly dependent on LLMs, both in order to abstract away a lot of technical complexity?

3 Likes

Hahaha, this is quite an interesting question. So your argument is that using a lot of LLM generation will offset any performance gains made by a newer technique, and we’d actually be spending more on computing resources, right? :laughing:

I never thought of it that way. I think we would probably have a future where there are LLMs on all systems - one where it performs well, and another where it doesn’t. I think a faster API would actually induce a lot more LLM use (to do more useful stuff) because of something like Jevons Paradox. The LLMs on the less performant API would probably not get used a lot because the user experience sucks, and people just use less of it.

All the above is during runtime.

When authoring and creating applications, LLM-driven development should not be compared against compute spend, but human resource spend.

1 Like

For the record, this wasn’t an argument but a genuine question.

1 Like

I cannot go into all details, but want to contribute some perspective beyond openEHR and FHIR. There is often the idea that only these technologies allow for flexible representation of data at runtime. However, I have worked with quite a few clinical databases and systems at Hannover Medical School.

In my experience, there is often some part of the data model using good old MUMPS / Caché, which is quite optimized for sparse(!), hierarchical and complex data. There is a good reason this niche and arcane technology is still used to this day in major systems. I actually supervised a master’s thesis where we explored using Caché to store openEHR, but the openEHR RM broke InterSystems’ Java ORM framework (maybe a skills issue, but back then it was a big issue).

Then there are also databases using the EAV schema. Databases also typically support hierarchical datatypes to help represent this data in rows. Both Caché and EAV databases are complemented by materialized views, indexes, helper tables etc. to optimize for certain use-cases. Of course there is also the purely relational part for more static data. JSON is just a serialization format and says nothing about the storage engine or performance.
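For readers unfamiliar with the EAV pattern mentioned here, a minimal sketch (hypothetical names, sqlite3 for illustration) shows both why it handles sparse clinical data well and why querying it back is awkward:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE eav (
    entity_id INTEGER NOT NULL,   -- e.g. one clinical observation
    attribute TEXT NOT NULL,      -- e.g. an archetype node id or path
    value     TEXT,               -- everything stored serialized as text
    PRIMARY KEY (entity_id, attribute)
)""")
conn.executemany("INSERT INTO eav VALUES (?,?,?)", [
    (1, "systolic", "120"),
    (1, "diastolic", "80"),
    (2, "heart_rate", "72"),      # sparse: entity 2 has different attributes
])
# Reading one logical record back requires pivoting rows into columns:
row = conn.execute("""
SELECT
  MAX(CASE WHEN attribute = 'systolic'  THEN value END) AS systolic,
  MAX(CASE WHEN attribute = 'diastolic' THEN value END) AS diastolic
FROM eav WHERE entity_id = 1
""").fetchone()
print(row)  # ('120', '80')
```

The flexibility is free at write time; the cost shows up in every read query, which is why such schemas get wrapped in materialized views and helper tables as described above.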

So this discussion has been around in healthcare IT forever, independent of openEHR and FHIR. Always good to discuss and maybe LLMs can bring innovation, but in the end there is data structures and the laws of math and physics which won’t go away.

3 Likes

It is agnostic as to how persistence is achieved, but the base assumption in openEHR is that the main informational artifact - the EHR - will be persisted, since that’s what an EHR is - a persistent record of care. The ‘how’ is up to the implementer.

I guess what you really mean here is that each Entry instance is written to a table specific to the archetype (major version) it is based on, with a foreign key for the owning Composition. The problem with this approach is that the data at the RDBMS level is not interoperable with another system that doesn’t know about that archetype, whereas in an RM-based schema design, the data is interoperable, regardless of what archetypes are in use. The SQL queries are unlikely to be interoperable either, unless the archetype tables are absolutely identical in each system.

Whereas AQL queries are completely independent of persistent representation, and can be safely used for CDS products.

That said, I don’t think anyone should be discouraged from exploring this path. It’s interesting industrial R&D. It might produce some interesting new methods.

Not quite - the API (whether REST, Kafka, Google Protobuf or other technology) doesn’t know about SQL or any other concrete persistent representation, only the abstract formal representation (typically available in JSON) and AQL queries. It’s essentially a VM design.

Having multiple independent services accessing the same tables is a pretty bad idea… the usual point of a data service is to manage the transactional changes to specific tables - all requests for change need to go through the service, which will ensure the correct semantics.

Yep…

That bit is definitely doable, and almost certainly will happen in the near future.

That would be an error in my view. Archetypes are models of information semantics, not DB schemas. I’ve actually just finished working at a company that pursued exactly the proposed pathway. Disaster. DB specialists have no clue about the realities of the domain, what the dynamics of different kinds of data are, what the querying patterns are.

Well, SQL doesn’t know how to represent constraints directly, and archetypes with first-order logic expressions are going to be even harder - SQL only has elements of FOL in the WHERE clause. And that’s not even thinking about translations or terminology or ontology binding…

If that’s happening, either the clinical models don’t correctly represent domain intent, or the developer has deviated from the models. Now, it has to be noted that today there is a gap between application logic and the clinical models - the process / workflow is not yet encoded in clinical models. But it will be, and at that point, an application is close to generatable from the models.

Yes and no - having individual clinicians create private / local models of what should be shared semantics is how data interop standards work today - that’s why they create so much mess (aka ‘profile proliferation’). Creating archetypes is not about designing applications, it’s about stating the definitive semantics of the information.

I don’t know what this means - specific to openEHR clinical models? That just means specific to a definitive model of healthcare information. But you probably mean something else here…

There are all sorts of tools and DBs that work well for normal developers. But they mostly produce terrible solutions that do not match the requirements, are not semantically scalable, or even comprehensible (i.e. most semantics hidden as code, DB stored procedures, config files etc). Each ‘nice’ app we see is its own silo, and can’t connect to other apps. I don’t think newer tools and languages will change any of this; the only thing that can is shared, computable domain semantics (= information, process definitions, decision pathways) used to drive info systems. The current state of HIT is still pretty much as bad as it was 30y ago, just shinier.

I think this is the main thesis you are making. When our original group developed archetypes, templates, and used terminology and ontologies, we did look at existing ways of representing complex models that include constraints and FOL. SQL didn’t even get on the list of formalisms contemplated. Times have moved on of course, but the technologies I see starting to edge toward satisfying the representational requirements we have (both at design time and runtime) are things like GraphQL (still has significant limitations), W3C SHACL, and other semantic representational approaches.

Once again though, I don’t want to discourage any R&D or product building in this vein. But it’s worth (I think) contemplating the pros and cons of the proposal (which I think is your intent with this discussion :wink:)

Having spent some years on that part of the problem space, including extant standards addressing it, I’d like to see such tools!

Exactly right.

  • a) Size of the work - endless
  • b) Errors will be made
  • c) Some changes will be too hard to make and will thus not be made - causing deviation of the system from the requirements
  • d) Cost of data migration and ensuring data integrity (some data migration is always going to be necessary, even in openEHR, but it should be absolutely minimised)

We all have experience of DB schema evolution in health for decades. It’s been very expensive and limited.

There aren’t really any battle tested tools to do the job. Semantically, most of the IT industry is stuck in the Jurassic permafrost…

It doesn’t do this automatically. I have seen the nuclear DB core of a very famous product in use at a very famous provider, and you would not believe the complexity of the SQL just to enrich and map the data for EDWs, migration purposes or similar. DBs with 20k tables and typically 100-column tables.

That indeed is a problem not currently addressed well, but it can be, with coding of every node in every archetype. This is an essential next step in openEHR. Then queries will be based on those codes, not the paths.

The first is abstract formal representation of information; the second is a particular concrete data representation. The former can have numerous concrete representations, which is why it exists - to standardise the semantics. You can’t standardise anything at the SQL level, unless you take SQL as your primary / only modelling formalism and convince everyone else to do the same. Just consider trying to use SQL to represent a model of chemotherapy process like this. It’s lacking all the main primitives needed - tasks, events, wait states, decision gates.

Well the question of limits on funding being applied to semantic architectures like openEHR, as opposed to semantically naive / brute force architectures like most technology isn’t a design decision, it’s a commercial outcome. Today’s systems have been built in all those technologies for decades - and they’re terrible (not my opinion - I just spent 2y in the US finding out what US HCPs think of their ‘modern’ IT).

There is necessary cognitive load for a few people - back-end vendors (you only need 5 - 10); tool developers (same number) and clinical modellers (< 1% of all clinicians). IntelliJ is a seriously complex and sophisticated tool; so is Visual Studio, plus a handful of other competitors. You only need about 5 such tools in the world to ensure that truly sophisticated developer requirements can be met. Outside of those 5, the next 100 IDE products are all amateur level.

The problem in the HIT space is that the thinking is decades out of date, and the money is rarely directed to the places where it would generate real value. And that’s changing too slowly, I agree. It’s not a technology problem.

3 Likes

SQL → code is not that interesting - you can do a lot more in BMM than in SQL, so you can get a lot more from BMM → code.

BTW, we trialled GraphQL as an AQL replacement. It can’t represent WHERE clauses with OR operators in them, nor quite a bit of other stuff, without chains of GraphQL transformer shapes. Not necessarily bad, but needs to be understood.

Well, it depends on whether you want to determine whether data fits within constraints at runtime or not. That is a capability of great power; it’s what ADL does. It can be made optimal for runtime execution by transformations of course, but mathematically it’s still the same thing. SQL DBs don’t have any of that built in, so you can either drop it altogether (but then data validation is hard) or you have to engineer it in another technology. XML Schema gets you a bit of the way, but it’s ultimately not very strong and has a horrible meta-model.
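To make the “constraints evaluated at runtime” idea concrete: the essence is that the constraint model is data, loaded and interpreted at runtime, rather than baked into DDL. This is a toy sketch with an invented constraint structure - not real ADL, just an illustration of the property being described:

```python
# Constraint model as data: could be loaded from an archetype-like artifact
# at runtime. The structure below is invented for illustration only.
constraints = {
    "systolic":  {"type": float, "min": 0.0, "max": 400.0},
    "diastolic": {"type": float, "min": 0.0, "max": 300.0},
}

def validate(record: dict, model: dict) -> list:
    """Check a record against a constraint model, returning error messages."""
    errors = []
    for field, rule in model.items():
        value = record.get(field)
        if value is None:
            continue  # absent fields not checked in this toy example
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: wrong type")
        elif not rule["min"] <= value <= rule["max"]:
            errors.append(f"{field}: out of range")
    return errors

print(validate({"systolic": 120.0, "diastolic": 80.0}, constraints))  # []
print(validate({"systolic": 900.0}, constraints))  # ['systolic: out of range']
```

Swapping in a new constraint model here requires no schema migration - which is the capability a fixed SQL schema has to re-engineer in some other layer.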

The key question is: what is the level of semantic sophistication of the representational formalism? Can it represent recursive structure, constraints, FOL, temporal relationships, terminology binding and so on?

It is.

a key question :wink:

With … let’s say 200Gb of data in the affected tables?

Exactly.

You might get close-ish on just data, although I don’t yet know how you are implementing constraints (generated stored procedures?), but when you hit FOL and workflow representation, it won’t be close.

That might still be OK - but you’ll be up for implementing workflow / process semantics on top of the data platform, and making it all work off the same model base.

I personally don’t see the attraction in SQL as a primary representational formalism, I see it as a must-have generated target.

Sure, it’s not that persistence isn’t a requirement for the EHR, it’s that the persistence mechanism itself is not part of the specifications, which would be more a physical spec (which specifies the “how”) while the current openEHR specs are at the logical level (the “what”). So the specification of the persistence layer is not in the scope of the openEHR specs.

Though it imposes a hard requirement on querying hierarchies because of the CONTAINS clause, so a technology that doesn’t have a way to check for contains on ANY descendant, can’t actually implement AQL.
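In SQL terms, “contains on ANY descendant” is typically expressed with a recursive common table expression. A minimal sketch (sqlite3, with an invented parent-child containment table - not any real CDR schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER,            -- NULL for the COMPOSITION root
    rm_type TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO node VALUES (?,?,?)", [
    (1, None, "COMPOSITION"),
    (2, 1, "SECTION"),
    (3, 2, "OBSERVATION"),        # nested two levels below the root
])
# AQL-style: COMPOSITION CONTAINS OBSERVATION, at any depth.
found = conn.execute("""
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM node WHERE id = 1
    UNION ALL
    SELECT n.id FROM node n JOIN subtree s ON n.parent_id = s.id
)
SELECT COUNT(*) FROM node
WHERE id IN (SELECT id FROM subtree) AND rm_type = 'OBSERVATION'
""").fetchone()
print(found[0])  # 1 - the OBSERVATION is found despite the SECTION in between
```

Any engine that can express this kind of recursive descent can in principle support CONTAINS; one that can’t (or can’t do it efficiently) can’t implement AQL.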

1 Like

I’m mostly concerned about the quality assurance: how do I limit the risk of this app killing patients? If I can’t understand the thing that’s actually implemented because it’s AI-generated, I can’t be responsible. Now, I vividly remember you stating ‘India is a free for all’ (regarding clinical safety regulation), so we probably have different needs in this regard. And I’m very much aware of the huge downsides of the EU’s regulation-heavy approach limiting innovation, which definitely also kills patients. So I’m not arguing for one or the other, just trying to explain my point and why it’s so important to me.

Because it’s designed that way. Not trying to be unhelpful, but it just ends up in everything. ADL is doable for GUI tools to model, it’s designed around constraints (not tables), it’s doable for non-engineers to edit as text, etc. etc. SQL is designed for developers working with tabular data. Now, this isn’t to say SQL, with tools and additional assumptions and constraints, couldn’t be made to do the same thing. But it will be an uphill battle, I expect.

True. And this is a good thing, because this lets developers share the burden with domain experts. It’s quite scary that in non-openEHR systems the data is being manipulated by engineers who have a pretty limited understanding of the clinical impact of what they’re doing.

Your arguments are correct, but mostly technical, and thus to me not that important. If you can give me the same safeguards using SQL, I’m OK with it.

There’s definitely a lot of complexity in openEHR, though in my estimate not very much of it is unnecessary. Only some parts fall into the ‘no longer necessary’ category, like ODIN being no longer necessary in a world with YAML. I’m still hoping OpenAPI (or a follow-up) can one day replace BMM.

My take on IT evolution is not to try to ignore inherent complexity, but to model it properly, and then try to abstract away most of the complexity with layers of abstraction. I think openEHR is proving itself to be on the right track here. You state there’s been a lot of investment in openEHR; I would say there’s been crazy little. If we used openEHR for all healthcare software, we could spend 10x the time on tools and CDRs and still have 100x more time available for front-end innovation, instead of spending most of it on modelling the same thing 1000s of times and then trying, and consistently failing, to integrate the two. I do agree we would need the 10x spend on CDR + tooling. But I’m quite happy with recent (10y) progress (EHRbase, Model Context Protocol, the Nedap archetype editor etc. etc.).

It’s not the absence of SQL that’s the issue; it’s the inherent complexity and counter-intuitiveness of healthcare data. I’d like for us, as Ian suggested, to create a SQL interface on openEHR data. Probably over the REST API.

2 Likes

This is where I’m diverging significantly from the standard service/database model. I’m not new in proposing this - there’s been a good number of companies like Supabase, Convex, and Hasura that have built on this idea. But here’s the key idea:

PostgreSQL IS the API.

PostgreSQL IS your Data Platform.

And PostgreSQL is a standard.

The tables, triggers, procedures, and extensions are the “metadata/artifacts” that drive the data platform.

When you ask:

Yes! And I say they should be exactly the same - with the exact same semantics in all systems! Except for the additional constraints on top that could be customized to each setting, all the data in the database should be EXACTLY the same.

BUT, we now have database-level procedures that can ensure these exact semantics - SQL becomes the API, and all the consistency and transactional patterns are built straight into the database.

Fair. But what we have now is clinical domain experts not being aware of, or even “able” to optimize for, query patterns even when they know what to do. Most queries are done on JSONB because the data schemas need to accommodate anything the clinician might possibly throw at them in the future. What exactly happened with the DB specialist approach? What exactly was the approach? How was the modelling done? I’d like to learn what didn’t work, so maybe we don’t make the exact same mistakes again.

Not ANSI SQL, but PostgreSQL is pretty darn good at representing pretty much any constraint you can throw at it. And we’ve had success with terminology / ontology bindings using foreign key constraints on a valueset table as well - performance is really good too. About translations - have not really tried to solve this - but I’d say this is more of a presentation problem, and not a data persistence and modelling problem? A certain column in the table could still have metadata associated with it that can enable some of this.
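The foreign-key-on-a-valueset-table technique described above might be sketched like this (sqlite3 stand-in for PostgreSQL; the table names and the valueset contents are invented for illustration, not a curated terminology):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs FKs enabled explicitly
conn.executescript("""
CREATE TABLE valueset_member (
    valueset_id TEXT NOT NULL,
    code TEXT NOT NULL,
    PRIMARY KEY (valueset_id, code)
);
CREATE TABLE obs_body_position (
    id INTEGER PRIMARY KEY,
    valueset_id TEXT NOT NULL DEFAULT 'body_position',
    code TEXT NOT NULL,
    -- composite FK binds the coded column to one specific valueset:
    FOREIGN KEY (valueset_id, code)
        REFERENCES valueset_member (valueset_id, code)
);
""")
conn.executemany("INSERT INTO valueset_member VALUES (?,?)", [
    ("body_position", "sitting"),
    ("body_position", "standing"),
])
conn.execute("INSERT INTO obs_body_position (code) VALUES ('sitting')")  # ok
try:
    conn.execute("INSERT INTO obs_body_position (code) VALUES ('flying')")
except sqlite3.IntegrityError:
    print("rejected: code not in bound valueset")
stored = conn.execute("SELECT COUNT(*) FROM obs_body_position").fetchone()[0]
```

Updating the binding then becomes a data change (rows in `valueset_member`) rather than a schema change, which is presumably where the good performance comes from.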

This is why I think @borut.jures ’s point about having something like ADL as the source of truth absolutely makes sense, but the ADL-like artifact should definitely be opinionated about how it’s actually implemented in a standard Postgres-like database - because then you can just use the database as your health data platform without reinventing tools and CDRs specific to healthcare.

I absolutely agree with this. What I’m concerned about is our trying to make EAVs and JSONB work with traditional databases (with poor performance/tooling/maintenance support) because of having to support the ever-evolving needs of a clinical modeller providing new data models in the future. However, if we go back to basics and use the best RDBMS of our time (Postgres) in the way it’s supposed to be used, with simple flat tables, and have the clinical modelling process be a little more technical (outputting SQL), we can literally just use a PostgreSQL-compatible instance as the data platform.

What I mean here is this - imagine if tomorrow we were just able to provide a developer who knows nothing about healthcare or openEHR data models access to a PostgreSQL database with simple flat tables and ask them to build a new solution on top of it - they can pick their favorite programming language, the ORM they want, a low-code (like Budibase) or BI tool (like Apache Superset) that they want, and everything just works with these simple tabular data models. They can also use a simple database visualisation tool like DBeaver to access this data and play around with queries.

Now compare that with openEHR REST APIs - you suddenly need bespoke tools for everything. The developer starts asking questions like - Has anyone built an openEHR plugin for DBeaver? Is there an openEHR connector for Budibase? or Superset? How will my Python program interact with openEHR? Are there any libraries I can use to ensure I generate valid compositions in Python?

That’s why I say, we standardize at the level of the database tables - your database is the data platform, and all semantic constraints are in the database, SQL is your API. This is, of course, not how things are done TODAY, but IF you use the database like this, don’t you technically solve these data silo problems, while keeping the wide ecosystem of tooling available for use?

Absolutely. My only goal here is - can I get enough bullets through this idea early on so I can either 1) solve it, or 2) abandon it and move on to other ideas. @siljelb , @ian.mcnicoll and @joostholslag have already given me specific clinical requirements to think about.

And you’ve given me this:

Which are all super valuable in terms of the problems with going down this route. (But I’m probably going to give it a serious try anyway).

Again would love to know your experience with what went wrong at your previous company that decided to do this.

YUP, exactly - stored procedures, and basically pushing what a database can do to its limits while still keeping it a standard PostgreSQL database.