Named element - and occurrences

Seref · 25 January 2023 15:20

Not exactly, though we’re significantly overlapping. Bjorn’s use case for querying and referencing data is different from what I assumed it was, but I don’t have enough of an understanding to qualify my comments further.
That’s why I keep insisting on hearing the use case: to be able to think about whether an approach is a good one in that context and how it affects other contexts/use cases.

A UI form runtime’s use of a path (which may be a full EHR path/URL) to identify an ELEMENT with a DV_TEXT value is different from the use of a path in an AQL query, at least to me. You can justify having sequence_ids or GUIDs or whatever in the former, but I cannot see why they’d ever end up in a path snippet in an AQL query, as I discussed above. Admittedly, I’m rather opinionated about what you should and should not do with AQL, so I’m reluctant to reiterate; I’m obviously coming from a different perspective here.

pablo · 25 January 2023 16:02

bna:

Given a template with i.e. two openEHR-EHR-ACTION.procedure.v1

The first one is given the name “P1_NO” and the second “P2_NO” in the Norwegian language, and “P1_EN” and “P2_EN” in the English language, as seen in the user interface of the archetype designer.
The OPT is generated with Norwegian as the primary language.

In Better Archetype Designer the OPT will be given a constraint on the name to “P1_NO” for the first instance and “P2_NO” for the latter.

We think the data then MUST have the given name. And if there are multiple occurrences, each instance MUST have the same name. Also, the same name will be used to define each element for any language based on the OPT.

As a consequence there is no way to address a specific locatable by using the name. To be able to address a specific procedure instance using e.g. AQL or EHR_URI, some other mechanism must be used. We suggest either:

a) Use the index in the list, i.e. 0 for the first item, like /content[0 and name/value=‘P2_NO’]
b) Use the UID for each locatable - /content[uid/value=‘some_guid’ and name/value=‘P2_NO’] (here the name is not needed since the guid will identify a unique instance)

Note that the term definition will use the localised name. We think this is used for user-interface purposes like the archetype designer or a form renderer. But the data MUST have the given constraint for name.

I’m late to the party, and not sure I understand all the points. What I see is:

1. the modelling time issue: giving constraints to names
2. the store time issue: storing data in a way that makes each object with the same name uniquely addressable
3. the query issue: how can I get one specific instance if there are many occurrences with the same name in the same compo
4. the reference issue: having an EHR_URI or LOCATABLE_REF that references one specific instance if there are many occurrences with the same name in the same compo

For 1. I prefer to define DV_CODED_TEXT constraints, having a code is more robust than a name, name could change, codes change rarely. Even better, it will be the same code in different languages, so instead of 4 options for P1_NO, P2_NO, P1_EN, P2_EN you have P1 and P2 as codes.

For 2. besides storing the code as constrained in the archetype, we store an instance path, though deciding to store the LOCATABLE.uid is also valid.

An instance path would look like this: /content[archetype_id=openEHR-EHR-OBSERVATION.lab_test-blood_glucose.v1](0)/data[at0001]/events[at0002](0)/data[at0003]/items[at0005](0)/value/value

Note we use the archetype_node_id and the index in the collection attributes (kind of the sequence_id mentioned above). So different instances of the same archetype node will be in the same collection with different index numbers (0), (1), …

We chose parentheses to differentiate from the archetype_node_id square brackets.
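To make the format concrete, here is a minimal Python sketch of how such an instance path could be processed. The function names are hypothetical, and the only assumption about the syntax is what the example above shows: a "(n)" index placed directly after a closing square bracket.

```python
import re

# Matches a "(n)" collection index that directly follows a "]" predicate.
# The syntax is taken from the example path above; it is not standard openEHR.
_INDEX = re.compile(r"(?<=\])\((\d+)\)")


def to_archetype_path(instance_path: str) -> str:
    """Drop every "(n)" index, leaving a plain archetype path."""
    return _INDEX.sub("", instance_path)


def collection_indexes(instance_path: str) -> list[int]:
    """Return the collection indexes in the order they appear in the path."""
    return [int(n) for n in _INDEX.findall(instance_path)]


example = (
    "/content[archetype_id=openEHR-EHR-OBSERVATION.lab_test-blood_glucose.v1](0)"
    "/data[at0001]/events[at0002](0)/data[at0003]/items[at0005](0)/value/value"
)
print(to_archetype_path(example))
print(collection_indexes(example))
```

Anchoring the pattern on the preceding "]" keeps it from touching parentheses that might appear elsewhere, for example inside a name predicate.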

For 3. use the code to query, not the name. Names make queries not portable, and make them language dependent.

For 4. I would use the instance path. If that could be standardized so we all use the same format, would be cool.

In general I try to stay away from names in paths. Some JSON flat formats use paths with names: a name changes in the archetype, then you need to change the paths. The same will happen with storing (2), querying (3) and referencing (4). The real problem is that a name could change in an archetype while the version of the archetype doesn’t change, so everything will still reference the same archetype version though all your previous uses of the name are invalid. If the archetype version were updated for name changes, then this wouldn’t be a problem, but that is not how things work today, and I think nobody wants an archetype with v58345 for each single name change that happened through the archetype history.

I hope I understood the problem, if I didn’t get it, please ignore my message

damoca · 25 January 2023 16:08

Just FYI: interestingly, the ADL path specification already talks about supporting the ordered predicate access:

Archetype Definition Language 1.4 (ADL1.4) (openehr.org)

thomas.beale · 25 January 2023 16:59

Yep. Until I see a reason I understand, I agree.

thomas.beale · 25 January 2023 17:42

The first path won’t on its own be an instance path. Are you doing something like this:

/content[archetype_id=openEHR-EHR-OBSERVATION.lab_test-blood_glucose.v1]/data[at0001]/events[at0002](2)/data[at0003]/items[at0005](4)/value/value

This isn’t standard syntax… but it’s easy to convert to a form that is.

The problem with just using the position in the collection is the position of a specific data item can change from one version to the next. So the same instance path won’t necessarily point to the same thing across versions.
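A tiny Python sketch of that failure mode, using made-up data: the same positional index resolves to a different item once an earlier sibling is removed in a later version.

```python
# Hypothetical contents of one "items" collection in two versions of the same
# composition; in version 2 the first item has been removed.
v1_items = ["systolic", "diastolic"]
v2_items = ["diastolic"]


def resolve(items: list[str], index: int) -> str:
    """Resolve a purely positional index against one version of the data."""
    return items[index]


print(resolve(v1_items, 0))  # the item at position 0 in version 1
print(resolve(v2_items, 0))  # same index, different item in version 2
```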

Yep, they do. But if you are querying for historical data, it must be in some language

Indeed!

pablo · 25 January 2023 19:38

It’s discourse formatting, it removed the (1)

Yes, it was that, just updated my comment.

pablo · 25 January 2023 19:49

Is that a requirement?

The index is just a way to identify in the same collection not across versions. The only cross-version identifier is LOCATABLE.uid (we keep coming at this and still keep avoiding it)

Data could be in any language, historical or not. If you have compositions in different languages in the same CDR, then a query using names won’t get you all the data; it will just miss a lot of records. If there are different CDRs with only one language each, again, you can’t use the same query across all CDRs; you need to create as many queries as there are CDRs with different languages in your infrastructure. See the problem? This is a huge issue for cloud platform providers, which is why I decided a long time ago to avoid names in queries at all costs, and to avoid text constraints where possible in favour of coded text constraints (kind of the CDA approach with OIDs to identify each content item, because there are no archetype node ids or paths in CDA to identify the type of, for instance, an observation).
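The effect can be sketched in a few lines of Python. The records are invented for illustration and reuse the names and codes from earlier in the thread (P1_NO / P1_EN as names, P1 / P2 as codes); real data would of course live in a CDR and be queried with AQL.

```python
# Hypothetical stored entries: the same procedure recorded under a
# language-dependent name and a language-independent code.
records = [
    {"name": "P1_NO", "code": "P1", "language": "nb"},
    {"name": "P1_EN", "code": "P1", "language": "en"},
    {"name": "P2_NO", "code": "P2", "language": "nb"},
]


def query_by_name(name: str) -> list[dict]:
    """Name-based filter: only matches records written in one language."""
    return [r for r in records if r["name"] == name]


def query_by_code(code: str) -> list[dict]:
    """Code-based filter: matches regardless of the record's language."""
    return [r for r in records if r["code"] == code]


print(len(query_by_name("P1_NO")))  # misses the English record
print(len(query_by_code("P1")))     # finds both P1 records
```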

Anyway, I’m not sure I fully understood the original problem or if I’m going off-topic rambling about the pains I had and how I solved them in my own platform.

thomas.beale · 25 January 2023 19:56

I would say so. It is going to be confusing and eventually dangerous for a path that points to systolic pressure to wind up pointing to diastolic pressure when resolved against a different version of the data.

Across versions, sequence_id will work fine - if populated correctly. The basic rule is that you always increment the sequence_id for each new sibling data node, no matter in what version. That means that:

the sequence id never changes for a given node
deletions can’t cause problems
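As a rough Python sketch of that rule (sequence_id is a proposed attribute, not something in the current RM, and the class and starting value below are assumptions made for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class Siblings:
    """Sibling nodes under one container attribute (hypothetical structure)."""

    next_id: int = 1  # assumed starting value; only ever incremented
    nodes: dict[int, str] = field(default_factory=dict)

    def add(self, label: str) -> int:
        """Give a new sibling the next sequence_id, in whatever version it is added."""
        sid = self.next_id
        self.next_id += 1
        self.nodes[sid] = label
        return sid

    def delete(self, sid: int) -> None:
        """Remove a node; its sequence_id is never handed out again."""
        del self.nodes[sid]


items = Siblings()
systolic = items.add("systolic")
diastolic = items.add("diastolic")
items.delete(systolic)          # a later version drops the first node
mean = items.add("mean")        # a new node gets a fresh id, not a recycled one
print(diastolic, mean, items.nodes)
```

Because ids are never reused, a reference to a given sequence_id either resolves to the same node or to nothing, never to a different node.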

Certainly true. I was just indicating that simple querying with names will work, e.g. if you know all your text data is in Spanish, and was created in some reliable way, so you know spelling etc is the same for the same words etc. But - you’ll never know if there is some record with a mis-spelling, or in Portuguese (some Brazilian guy comes to Uruguay) and so on. So your technical argument is correct.

pablo · 25 January 2023 22:25

The concept of instance path is to be locally unique, not to be unique across versions. The only thing that is unique across versions is the LOCATABLE.uid; it should be used alongside the instance path (each has its own use case). The instance path is actually a valid absolute (from the locatable root) archetype path when the index parts are removed, so it’s easy to get the archetype path to get a constraint from the OPT. We are also using those paths in our flat format, because they don’t depend on names and allow transforming back to the original locatable instance from the key/value structure.

That would be a challenge, because how can you know which node is which without the UID? This will always depend on the data instance. If the sequence_id is set per object instance, then it is playing the role of a secondary ID while not using the primary instance ID, which is the UID. So why not just use the UID?

That’s the issue: you don’t know!

Let’s say I develop a system, then my client populates the system, and my client is from Switzerland, then records could be in French, German, Italian, Romansh, etc. Then an IT guy needs to get some data to populate a report and he creates a query using names in French. See the issue? Then the report is crap.

I don’t think that is an issue, because the case we are talking about is when the name constraint is in the template, so the composition instance will have the name values coming from the template not typed in. If something is misspelled, the error is in the template itself, which could happen too!

In general, language-dependent conditions are not a reliable way to query data, and language-dependent paths are not a reliable way to represent data. I’m not convinced of many things, most of the time I doubt each step I make, but I’ll argue with anyone on this single point

damoca · 26 January 2023 08:05

I totally agree with Pablo here. We also try to avoid any natural language in paths or queries. For example, in Catalonia each clinician can decide if they write their reports in Spanish or Catalan, and probably freely choose a matching language for the user interface of the EHR system.

yampeku · 26 January 2023 11:16

Seeing that names can change in minor versions and potentially paths can change with names, should name changes always be considered major versions/breaking changes?

thomas.beale · 26 January 2023 11:39

I’m not disagreeing with you here really. Just pointing out that for now, there is very little cross-border interop in most openEHR CDRs or other EMR, and although you as a supplier won’t know what could be in a customer’s system that you built, there will certainly be customers that can know what is in their systems because they know what it is connected to, and they probably know that all apps are in e.g. Slovenian.

I am just saying that if this is the case, then you could use names in querying. But it’s the only circumstance. As interop improves, and/or multiple languages are allowed in the one system (almost certainly happening in health tourism locations), then all bets are off.

Language-dependent querying is definitely not to be relied on in any long-term sense, and no-one should build queries or paths they care about using name/value. But some still will in the short term, and it might well work OK for a while.

thomas.beale · 26 January 2023 11:42

That is an excellent question. I would say not, but we can only say that if we have a formal rule that ‘runtime paths are constructed using predicates of the form [at/id-code, sequence_id='x']’ or similar. We don’t yet have that sequence_id, but if we did, we’d be able to state that as a rule.

Then - breaking changes to constraints on other fields like name are by definition not breaking changes to an archetype or template.

ian.mcnicoll · 26 January 2023 12:34

I agree the use of ‘name’ as a differentiator is problematic but is forced on us by .opt1.4. Archetype Designer does correctly provide codes for ‘renamed’ nodes but has no way of populating the .opt with these.

I don’t think name changes should be a breaking change, at least not in archetypes. We can often minimise any impact of an archetype name change by retaining the same name in the template.

So practically speaking, we have found this much less of an issue than it might seem, even where tools like Better Studio do use name-based paths to hook up the form controls to the templates.

However we do need to get template level name/value codes into .opt and CDRs ASAP.

pablo · 26 January 2023 14:33

David’s example of Catalonia is an intra-border, multi-language case. Though, if that is not a problem now, it will be a problem in 3, 6, or 12 months. Which in specification time is kind of now. We need to plan ahead, not solve issues as they come IMHO.

There are different areas, in some cases might be connected, in other cases not.

Platform provider: creates the platform
Clinical modelers: create archetypes and templates
Content creator: users (clinicians, patients, etc)
Maintainers: could be the same platform provider or some local developer, they create apps, queries, etc.

If modelers take into account language, content creators enter data in different languages, and maintainers don’t take into account language and create content using flat formats with paths that depend on names or queries that depend on names, then that could cause issues, like missing data in query results.

I think you are making too many assumptions, which are understandable but not applicable in all contexts, while my view is to plan ahead for the worst case scenario (i.e. make no assumptions). I know what you mention could work in 80% of the cases, and I try to focus on the other 20%.

pablo · 26 January 2023 14:36

Exactly, that is what I mentioned above which is kind of the CDA approach to differentiate between entries in the body. So to use DV_CODED_TEXT constraints for the LOCATABLE.name instead of DV_TEXT at the OPT level (could be text at the archetype level).

ian.mcnicoll · 26 January 2023 15:31

I’ve been trying to get my head round the various sub-threads/perspectives here on handling multiple occurrences in AQL paths

There are several different places where this might be or seem an issue

AQL-based querying

@Seref is coming at this from an AQL perspective i.e using an AQL path as part of a SELECT or WHERE clause, where I agree it is very unlikely that you would want to do WHERE bp[2]/systolic/magnitude > 120 and that use of predicates could usefully be extended.

Unique identification of a node for referencing purposes

Either for external use e.g to populate a FHIR resource or as the target for an EHR_URI.

e.g. A reference from a medication order Indication element “Asthma” to the problem-diagnosis entry instance where the original Asthma diagnosis is documented.

In this case there is a list of problem-diagnosis entries with identical name/value ‘problem/diagnosis’ and I still believe there is no generic way to construct unique paths to each of these entries without potentially including the values of every optional element within the path.

In this case, we get no help from template-level coded name overrides, as these will be identical.

So either we use uids or some kind of sequenceId. Using the #suffix approach is how this works right now but would be much better if we had a specific attribute, rather than hacking the name/value e.g name/value = ‘Problem/diagnosis#1’ name/value = ‘Problem/diagnosis#2’ etc.
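For readers unfamiliar with the #suffix workaround, a minimal Python sketch of the idea follows. The function name and the exact numbering convention (every repeated name numbered from 1) are assumptions based only on the ‘Problem/diagnosis#1’, ‘Problem/diagnosis#2’ example; actual tools may number differently.

```python
from collections import Counter


def suffix_duplicates(names: list[str]) -> list[str]:
    """Append '#n' to names that occur more than once among siblings."""
    totals = Counter(names)          # how often each name occurs overall
    seen: Counter[str] = Counter()   # how many of each name we have emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}#{seen[name]}")
        else:
            result.append(name)
    return result


print(suffix_duplicates(["Problem/diagnosis", "Problem/diagnosis", "Medication order"]))
```

The sketch also shows why this is a hack: the disambiguator ends up inside the name/value itself instead of in a dedicated attribute.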

My instinct is to use uid at Entry level but sequenceId below e.g for multiple events/activities or multiple clusters and elements, where these have not been renamed explicitly in the template.

I think that lets us work most smoothly with the outside world e.g a FHIR Id might be a uid plus a short AqlPath/sequenceID to the event instance that equates to a FHIR Observation.

Design-time paths.

Subtly different is where we want to reference a specific template constraint path for use in the tooling space e.g form building.

Previous comments about not using names are correct but this is a limitation of opt1.4. Most of these issues would be resolved by getting coded name/values in there.

I’ve not come across a situation in the design space where we would need to work with multiple occurrences of a design-time path where the name/values are identical (problem-diagnosis example above) but perhaps @bna can give an example. So for now, I’m not clear where having sequenceIds would be of assistance.

In summary I think this comes down to

Getting code-based name/value into .opt
using uid and/or sequenceId to allow multiple run-time occurrences of a path based on the same name/value text/codedText.
I favour a mix - uid at Entry level, sequenceID below that

bna · 26 January 2023 22:24

You are right. This is not a design time issue/problem. The problem appears when you at design time duplicate e.g. an archetype root, give it a new name, and define the node with occurrences 0..>1. Current OPT generation in archetype/template designers defines the given name as a constraint on the node. This means the name can’t change, which leads to non-unique names in the COMPOSITION.

When generating an EHR_URI to define a unique path to a node in a COMPOSITION our implementation use the name of the node to define a unique path to the node. If two nodes have the same name you will, of course, get two nodes matching the path.
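A small Python sketch of the ambiguity, with invented data (the uid values are placeholders and the matching helper is hypothetical, not how any CDR resolves an EHR_URI):

```python
# Hypothetical COMPOSITION content: two procedure instances that the OPT
# forces to carry the same name.
content = [
    {"uid": "uid-1", "name": "P2_NO", "archetype_id": "openEHR-EHR-ACTION.procedure.v1"},
    {"uid": "uid-2", "name": "P2_NO", "archetype_id": "openEHR-EHR-ACTION.procedure.v1"},
]


def match(nodes: list[dict], **conditions: str) -> list[dict]:
    """Return the nodes for which every given attribute equals the given value."""
    return [n for n in nodes if all(n.get(k) == v for k, v in conditions.items())]


print(len(match(content, name="P2_NO")))  # name-based path: two candidates
print(len(match(content, uid="uid-2")))   # uid-based path: exactly one
```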

I am sorry to confuse you with AQL in this manner. That’s not the most important issue. When querying you will most likely expect multiple results matching some criteria.

The most important thing to get consensus on is, IMHO, how to address a unique node. This is item 2 in your excellent post @ian.mcnicoll :

I will also support you on this. This is very seldom needed and unlikely. Still I argue that it should be possible in openEHR to do this.

I follow @pablo and @damoca on this. We should plan ahead and establish patterns to make applications portable between systems, installations and languages. This is, IMHO, where openEHR has its strengths. When we use plain archetype-based templates this is solved today. The problem seems to arise in the template design and generation part of the ecosystem.

If we solve the unique path problem described above we will be more prepared.

pablo · 26 January 2023 23:49

ian.mcnicoll:

In this case there is a list of problem-diagnosis entries with identical name/value ‘problem/diagnosis’ and I still believe there is no generic way to construct unique paths to each of these entries without potentially including the values of every optional element within the path.

In this case, we get no help from template-level coded name overrides, as these will be identical.

So either we use uids or some kind of sequenceId. Using the #suffix approach is how this works right now but would be much better if we had a specific attribute, rather than hacking the name/value e.g name/value = ‘Problem/diagnosis#1’ name/value = ‘Problem/diagnosis#2’ etc.

My instinct is to use uid at Entry level but sequenceId below e.g for multiple events/activities or multiple clusters and elements, where these have not been renamed explicitly in the template.

That is clearly a requirement for a data identifier, paths don’t work for that, even the instance paths I mentioned if we consider the need of identifying something across different versions of the locatable (compo, status, folder, person, etc). I think this should be in a best practices guide, not in the spec?: “if you have this case bla bla bla… then if you want to reference a node … then you should set/use locatable.uid …”. Just because the uid is optional.

Not sure why we need to mix mechanisms (uid + sequence), since we can use locatable.uid at any level from compo/folder/status to element.

A recommendation about the name constraints would also be useful, also as a guide and not in the spec, mentioning that it would be better to use coded text instead of text constraints because of the potential problems of having texts in a specific language (mentioned several times in this thread). Using codes, better paths could be created, though those won’t identify a specific item in a multiple attribute collection (content, events, items, etc). If the uid is not used, then data should be included in the path, creating a different type of path, one with conditions in it rather than a static one. This needs processing, because it’s basically a query over the data, like XPath predicates. So if we go further that way, we might want a complete spec for these predicates in the path, so we can filter data by any attribute, making it an approach that both solves this issue and provides a generic way of filtering data in documents: an openehr-x-path kind of thing, but across the RM, not XML. IMO without the UID there is no static identifier of data, and using a path will require a dynamic method (something needs to be evaluated).

Note the idea of the instance paths is to identify the position of a data item within one locatable, not across versions. It’s a local secondary identifier for nodes, and the format helps to get data at that position and, removing the indexes, helps to get constraints from the OPT, because without the indexes it’s just a valid archetype path.

thomas.beale · 27 January 2023 13:00

Because Guids take a lot of space, and putting them on every node significantly increases the space cost of the overall DB. Nearly all such Guids are a complete waste - they will never be used for anything, because direct refs to nearly all data nodes are never created. I’ve done space calculations for an openEHR DB in the past with and without Guids (let’s ignore Guids on Compositions and Entries, that’s probably 2% of the possible total) and the size increase is significant for ‘average’ data. I priced out the difference for long-term ITIL3 data-centre RAID 10 persistence - the difference was significant.

In addition, Guids on all LOCATABLE nodes make a mess of data for any kind of human reading (testing…), and they also don’t tell you the order of accession of the sibling nodes.

There is a deeper semantic argument for only using Guids on Compositions and Entries. In openEHR, these structures are designed to be semantically stand-alone and have safe interpretations. But a data node like just a systolic pressure is not safe on its own - it could be a measurement, a target, or something else. A procedure might mean it was done, not done, recommended, not recommended. And so on.

I think we should always therefore treat such objects as coherent wholes, and only reference internal elements via paths.

It is not intended that name fields be required to be coded. It might be nice, but it won’t usually happen. We have to accept that names of things, like text fields, will be in some language. The key thing (as you and @Seref pointed out earlier) is to avoid the use of such fields in querying or reference paths (e.g. in UI forms).

Using a sequence_id will actually work across versions, as long as the ids are monotonically increasing over time and never re-used.