[EHRbase] Storing and querying data without a standalone archetype

damoca · 28 March 2024 17:45

System: EHRbase
Version: v0.32.0

I’ve come across two problems that are probably closely related. I was doing some tests and I created a single archetype containing the full structure I needed. That is, a COMPOSITION, which includes an OBSERVATION, which includes several ELEMENTs.

I know that’s not the common good modeling practice, but it is completely legal.

With that archetype I created the template, and then some instances using the Cabolabs openEHR toolkit.

To my surprise, when I tried to load those instances in EHRbase, I got this error (at0004 is the OBSERVATION):

{
	"error": "Unprocessable Entity",
	"message": "/content[at0002, 2]/items[at0004, 1]: Invariant Is_archetypeRoot failed on type OBSERVATION"
}

My deduction is that the OBSERVATION instance is rejected because its archetype_node_id holds an atNNNN code and not an archetype ID. I can understand that this kind of instance could affect indexing and querying, but nothing in the specifications says that it is an incorrect instance.
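To make the deduction concrete, here is a minimal Python sketch that walks a canonical JSON composition and lists the ENTRY nodes identified by an at-code instead of an archetype ID. The `_type` and `archetype_node_id` keys come from the canonical JSON format; the regular expression is a simplification of the archetype ID grammar, and the function names and the sample data are hypothetical. It only mirrors the check as I understand it, not how EHRbase or Archie actually implement the invariant.

```python
import re

# Simplified pattern for an archetype ID such as
# openEHR-EHR-OBSERVATION.blood_pressure.v2 (assumption: not the full grammar).
ARCHETYPE_ID = re.compile(r"^[\w-]+-[\w-]+-[A-Z_]+\.[\w-]+\.v\d+(\.\d+)*$")

# ENTRY subtypes in the openEHR RM.
ENTRY_TYPES = {"OBSERVATION", "EVALUATION", "INSTRUCTION", "ACTION", "ADMIN_ENTRY"}


def is_archetype_root(node_id: str) -> bool:
    """True if node_id looks like an archetype ID rather than an at-code."""
    return bool(ARCHETYPE_ID.match(node_id))


def inline_entries(node, path=""):
    """Yield (path, rm_type, node_id) for ENTRY nodes identified by an at-code."""
    if isinstance(node, dict):
        rm_type = node.get("_type")
        node_id = node.get("archetype_node_id", "")
        if rm_type in ENTRY_TYPES and not is_archetype_root(node_id):
            yield path or "/", rm_type, node_id
        for key, value in node.items():
            yield from inline_entries(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from inline_entries(item, f"{path}[{index}]")


# Hypothetical, heavily trimmed instance: an OBSERVATION defined inline.
composition = {
    "_type": "COMPOSITION",
    "archetype_node_id": "openEHR-EHR-COMPOSITION.proves_tecniques.v0",
    "content": [
        {"_type": "OBSERVATION", "archetype_node_id": "at0004"},
    ],
}

print(list(inline_entries(composition)))
```

Any node this reports is one I would expect strict validation to reject with the Is_archetypeRoot message.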

This deduction was supported later, when I tried to run an AQL query that returns all ELEMENTs in all instances.

SELECT ele FROM EHR e CONTAINS COMPOSITION c CONTAINS OBSERVATION o CONTAINS ELEMENT ele

First I got no results. Then I ran another test: I created an ELEMENT archetype, inserted it into the OBSERVATION, created instances, and only those identified elements were returned by the previous query. That means that an archetype ID is needed to index and query data, although ELEMENTs without one are accepted by the server without complaint. I see a lack of coherence there:

I cannot store OBSERVATION instances (I guess that it will happen with any kind of ENTRY) if they are not defined as an independent archetype.
I cannot query ELEMENT nodes unless they are defined in an independent archetype, although they can be stored as “anonymous” nodes (as is always the case).

I insist, I completely understand that there could be efficiency reasons for these limitations, but they are affecting valid use cases of the specifications.

Any thoughts about this?

birger.haarbrandt · 28 March 2024 21:58

Would you mind attaching template and an example composition?

Edit: I think the validation is done by Archie. I would have to check how the new AQL engine deals with the structure.

birger.haarbrandt · 28 March 2024 23:32

@damoca Alright, I was able to reproduce this.

In EHRbase you can set a configuration like this:

  # Option to disable strict invariant validation.
  disable-strict-validation: true

This will allow you to store the composition. We might need to discuss with @MattijsK about any implications for Archie.

Then I also checked with EHRbase’s new AQL engine and you can do the following:

{
   "q": "SELECT o FROM EHR e CONTAINS COMPOSITION c CONTAINS OBSERVATION o[at0002] CONTAINS ELEMENT e "
}

and your query also works:

{
   "q": "SELECT ele FROM EHR e CONTAINS COMPOSITION c CONTAINS OBSERVATION o CONTAINS ELEMENT ele "
}

joostholslag · 29 March 2024 08:30

I don’t think it’s Archie, since I made a similar archetype in the Nedap archetype editor, which validates archetypes using Archie (iirc), and it doesn’t give any errors.

I thought ehrbase only used the RM classes from Archie, not the archetype processing, since that’s adl2 based. But maybe I remember wrongly.

joostholslag · 29 March 2024 08:35

Since in ADL2 templates are technically archetypes, I regularly use this pattern of inline definition of archetypeable structures (mostly sections). So I definitely feel it’s a useful pattern to support.

birger.haarbrandt · 29 March 2024 10:15

@joostholslag thanks a lot for taking a look into it! Then I think we need to investigate a bit on the EHRbase side

thomas.beale · 29 March 2024 12:25

The information structure is certainly valid. It’s not necessarily even ‘bad modelling’ - it’s just that the typical archetype joining (created by slots or use_archetype references) is not being used, so there are no archetype ids at the usual locations such as the root node of OBSERVATION.

A vanilla system built on published principles should therefore be able to deal with this. However… to make querying work in a reasonable way we tend to add some extra (non-published) requirements - that root points of ENTRYs for example are always a new archetype. Without this, it’s hard to expect querying to work properly, since some ENTRYs won’t be found by the usual methods.

Note that we already do both kinds of modelling with CLUSTERs, and this is assumed. Thus, an AQL query looking for (say) CLUSTER device under OBSERVATION abc will not find it if there are inline CLUSTER structures inside the containing OBSERVATION rather than device data always being represented by a device CLUSTER archetype.
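As an illustration only, the Python sketch below mimics a containment search over a canonical JSON structure, with and without an archetype ID predicate. It is not how any CDR implements AQL; the `find_nodes` helper, the sample OBSERVATION and all identifiers in it are hypothetical placeholders.

```python
def find_nodes(node, rm_type, node_id=None):
    """Return the nodes of the given RM type at or below `node`.

    If node_id is given, archetype_node_id must match as well, mimicking a
    predicate such as CLUSTER c[openEHR-EHR-CLUSTER.device.v1].
    """
    found = []
    if isinstance(node, dict):
        if node.get("_type") == rm_type and node_id in (None, node.get("archetype_node_id")):
            found.append(node)
        for value in node.values():
            found.extend(find_nodes(value, rm_type, node_id))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_nodes(item, rm_type, node_id))
    return found


# Hypothetical OBSERVATION holding one archetyped CLUSTER and one inline CLUSTER.
observation = {
    "_type": "OBSERVATION",
    "archetype_node_id": "openEHR-EHR-OBSERVATION.abc.v1",
    "protocol": {
        "_type": "ITEM_TREE",
        "archetype_node_id": "at0005",
        "items": [
            {"_type": "CLUSTER", "archetype_node_id": "openEHR-EHR-CLUSTER.device.v1"},
            {"_type": "CLUSTER", "archetype_node_id": "at0010"},
        ],
    },
}

by_archetype = find_nodes(observation, "CLUSTER", "openEHR-EHR-CLUSTER.device.v1")
by_type = find_nodes(observation, "CLUSTER")
print(len(by_archetype), len(by_type))
```

The search by archetype ID sees only the archetyped CLUSTER, while the search by RM type alone also returns the inline one, which is the gap described above.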

There are two possible solutions I can see:

we say that an openEHR ‘system’ (CDR, however we want to call it) breaks information up according to certain levels, driven by some ‘querying profile’ or ‘configuration’
we say that querying needs to be able to deal with any information structure, regardless of where the boundaries are, and provide ways to make it work.

Following the first option, the ‘levels’ would presumably be ones that have semantic significance. We generally agree that ENTRYs make sense as stand-alone statements of truth for example. The story gets more complicated with CLUSTERs, so it’s not quite a question of ‘levels’ but of independent entities having their own archetypes. Hence, a CLUSTER archetype is required for device, but not (generally) for its subparts.

The second option implies that querying would be complicated by trying to allow both modes of representation for what we consider to be information entities representing independent (and therefore independently queryable) entities.

I would therefore favour a modelling approach approximating the first option.

birger.haarbrandt · 29 March 2024 16:05

@thomas.beale Not sure about Better and DIPS, but with the new AQL engine we seem to be able to deal with option 2 in EHRbase. So depending on the general experience of implementers, I think option 2 can work as well.

damoca · 29 March 2024 18:15

I attach the archetype, the template and a JSON instance. As I said it was just a quick technical example.
openEHR-EHR-COMPOSITION.proves_tecniques.v0.adl (3,6 KB)
Proves tecniques.opt (18,3 KB)
Proves tècniques - instància.json (6,2 KB)

thomas.beale · 29 March 2024 18:20

I don’t know that option 2 would be that hard to implement; what we have to think of is the numerous queries that are written over long periods of time, and managed as knowledge resources. Having inline structures representing separate entities will be problematic.

I believe we should have a convention that says that an information object that IS-ABOUT an independent entity (thing or process - including an event, as understood by e.g. BFO2) should always be modelled by its own archetype, whereas parts of entities (things, process segments etc) may be modelled inline or (for re-use purposes) as separate archetypes.

damoca · 29 March 2024 18:23

Thank you! I’m using the docker distribution, and I imagine that option corresponds to this docker run parameter:

SERVER_DISABLESTRICTVALIDATION --> Disable strict validation of openEHR input

I will try it, although it sounds as if it will accept anything you send. I hope not.
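For what it’s worth, the variable name is consistent with the rule Spring Boot documents for binding environment variables to properties: replace dots with underscores, remove dashes, and convert to upper case. The sketch below applies that rule. The property path `server.disable-strict-validation` is my assumption, inferred from the variable name, because the YAML fragment above does not show its parent key; `to_env_var` is a hypothetical helper.

```python
def to_env_var(property_name: str) -> str:
    """Map a Spring Boot property name to its environment-variable form.

    Applies the documented relaxed-binding rule: dots become underscores,
    dashes are removed, and the result is upper-cased.
    """
    return property_name.replace(".", "_").replace("-", "").upper()


# Assumed property path; the parent key "server" is a guess based on the
# Docker variable name, not something confirmed in this thread.
print(to_env_var("server.disable-strict-validation"))
```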

damoca · 29 March 2024 18:51

I think it is important to distinguish between functionality and performance of AQL. It is reasonable to expect that AQL can query at least any LOCATABLE children because, well, they are locatable. If they are archetyped, they can be indexed more easily and query performance is better. That’s fair, but it is a complementary concern.

BTW, this is a good example of what @pablo mentioned some weeks ago about the need to improve the AQL specifications. What can be queried in the FROM clause? Now it is a quite informal definition:

Does that include other non-LOCATABLE classes?

birger.haarbrandt · 29 March 2024 19:09

Hi @damoca,

it won’t accept anything, it will still check for most constraints from the OPT + RM.

I checked with your OPT and example and can confirm that you can do stuff like this:

{
   "q": "SELECT tree FROM COMPOSITION c CONTAINS OBSERVATION o[at0004] CONTAINS ITEM_TREE tree[at0006] CONTAINS ELEMENT e[at0007]"
}
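If you need such queries for several paths, a small helper can assemble the containment chain with at-code predicates. This is a hypothetical convenience function, not part of EHRbase or any SDK; it only builds the query string and a JSON body of the same shape as the example above.

```python
import json


def build_aql(select, chain):
    """Build a simple AQL query from a containment chain.

    chain is a list of (rm_type, alias, node_id) tuples; when node_id is
    None, no [predicate] is emitted for that class.
    """
    parts = []
    for rm_type, alias, node_id in chain:
        predicate = f"[{node_id}]" if node_id else ""
        parts.append(f"{rm_type} {alias}{predicate}")
    return f"SELECT {select} FROM " + " CONTAINS ".join(parts)


query = build_aql(
    "tree",
    [
        ("COMPOSITION", "c", None),
        ("OBSERVATION", "o", "at0004"),
        ("ITEM_TREE", "tree", "at0006"),
        ("ELEMENT", "e", "at0007"),
    ],
)
body = json.dumps({"q": query})
print(body)
```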

damoca · 29 March 2024 19:28

I tried with the parameter you mentioned, and now the instance is accepted

The AQL is not working though. I’m willing to take a look at that new AQL engine implementation

{
	"error": "Bad Request",
	"message": "Could not process query/stored-query, reason: org.antlr.v4.runtime.misc.ParseCancellationException: AQL Parse exception: line 1: char 54 mismatched input 'at0004' expecting ARCHETYPEID"
}

In any case, my idea was to not even need to put the atNNNN code in the AQL, but to filter by a term_binding code. But that’s another story.

birger.haarbrandt · 29 March 2024 19:37

In any case, my idea was to not even need to put the atNNNN code in the AQL, but to filter by a term_binding code. But that’s another story.

Can you share some details on what you have in mind?

damoca · 29 March 2024 19:50

There are many facets of the problem.

We are working with multilingual templates and instances, thus we have to avoid any filter using textual names.
We are working with clones of the nodes, so the atNNNN code is repeated, but the term_binding changes for each clone, so we can use that for filtering results.
And finally, if we really believe in semantic interoperability, we should be able to locate any structure by its term_binding, no matter where it is located in one or multiple archetypes.

Pushing the technology to its limits
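A rough Python sketch of the filtering idea, under strong assumptions: how bindings are attached to cloned nodes depends on the tooling, so here I simply assume a lookup keyed by node path. The paths, codes, values and the `select_by_binding` helper are all hypothetical placeholders, not real terminology codes or an existing API.

```python
from typing import Dict, List, Tuple

# Hypothetical bindings: node path -> terminology code. Both paths carry the
# same at-code (cloned node); only the binding tells them apart.
BINDINGS: Dict[str, str] = {
    "/content[1]/items[at0007][1]": "EXAMPLE::code-a",
    "/content[1]/items[at0007][2]": "EXAMPLE::code-b",
}

# Hypothetical flattened instance data: (node path, value).
VALUES: List[Tuple[str, str]] = [
    ("/content[1]/items[at0007][1]", "first clone"),
    ("/content[1]/items[at0007][2]", "second clone"),
]


def select_by_binding(values, bindings, code):
    """Return the values whose node is bound to `code`, ignoring names and at-codes."""
    return [value for path, value in values if bindings.get(path) == code]


print(select_by_binding(VALUES, BINDINGS, "EXAMPLE::code-b"))
```

The filter never looks at a node name, so it is independent of the language of the instance.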

thomas.beale · 29 March 2024 23:25

Something we are doing in Graphite is to use LOINC codes to name every data node in every archetype, including archetypes created from openEHR archetypes. In ADL2 these will appear as term bindings. These will be published openly soon (need to make a bit more progress on obtaining codes, which means quality checking natural language keys) and I would recommend we think about the same approach in openEHR.

pablo · 1 April 2024 13:37

@damoca did you check the generated instance was correct?

Just double checking whether this is an issue in the openEHR Toolkit rather than in EHRbase. Though I don’t remember making any assumptions in the instance generator about where SLOTs should be used.

damoca · 2 April 2024 07:10

Not in full detail, but I took a quick look at it and it seemed right.

siljelb · 2 April 2024 08:54

That’s a very interesting approach! What happens if there aren’t any semantically identical LOINC codes?