Named element - and occurrences

Seref · 9 December 2022 09:35

I was being really picky, but that’s not important, let’s leave that aside. My point is that the interpretation of a fixed name that implies uniqueness of the object with that name may be valid, but IMHO it is not compatible with the most common interpretation of containers with elements: a list. Your interpretation is that of a set. People close to implementation are inclined to think in terms of a list most of the time, so when they see set semantics enforced in a tricky way (well, they saw; it has now changed), they get confused.

There are some interesting questions here though. When are set semantics required? I’d love to hear some clinical examples, @heather.leslie @ian.mcnicoll. If the uniqueness of elements of the set is determined by the name, does this also indicate their terminology/term mappings should be unique? The use of lists in the RM types is ubiquitous, but I have not seen a dedicated container type (set) used in entry subtypes etc. So how do the modellers express this? Clusters with different elements with cardinality 0..1? Something else? Just say “you’re lost” if I completely lost the plot here and I’ll go and do my homework.

thomas.beale · 9 December 2022 10:15

Partly right. Sets usually imply that order is not significant, but we always treat order as significant in openEHR, hence the use of List<> containers. (See your second question).

Just for reference - consider the definition of LOCATABLE.name in the spec. Let’s ask the question: why at runtime would anyone give two sibling elements identical names, if the name field is the one intended to identify at runtime each element? If it’s not the name field, what other field makes sibling items uniquely identifiable at runtime (or you could equivalently ask, what set of fields taken together form a unique key for an object)? That was the original design reasoning. I still think it was correct

Seref · 9 December 2022 10:45

hmm, that’s the XSD:sequence semantics. I’ll take your word for it if you say that’s how it’s defined.

Please travel back to the original question that started this thread. I revisited this thread because I came across a number of name_1, name_2 instances in data, where it made sense for the items to have the same name but they ended up as _1, _2 etc. due to the now removed restriction. Example: a list of infection control surveillance actions taken (not RM actions, but actions as in human actions). It is common sense to expect something like “a list of actions taken by the infection control professional” to be represented in a container type in the RM. So I think this is proof that someone may indeed give identical names to two sibling elements at runtime.

thomas.beale · 9 December 2022 12:43

Well the user would, like ‘sample’ or whatever, but if it were me, I’d want the system to either add ordinal numbers or time-stamps, so I could later see the difference. Consider I come back 3 months later (as a clinical user) and want to understand what distinguishes those (say) 10 samples - the names are useless (I know they are ‘samples’) - I have to search for something else. Hopefully the application knows that in Observations, timestamps can be found on the Event objects, but a generic viewer (e.g. a patient portal) that might not even have the archetypes can’t display the samples in any meaningful way to the user.

Removing the unique name rule doesn’t mean systems can’t do unique naming of course. Anyway, I lost that argument a long time ago, so implementers are presumably finding other solutions that work. As it should be.

bna · 24 January 2023 10:57

Given a template with, for example, two instances of openEHR-EHR-ACTION.procedure.v1:

The first one is given the name “P1_NO” and the second “P2_NO” in Norwegian, and “P1_EN” and “P2_EN” in English, as seen in the user interface of the archetype designer.
The OPT is generated with Norwegian as the primary language.

In Better Archetype Designer the OPT will be given a constraint on the name to “P1_NO” for the first instance and “P2_NO” for the latter.

We think the data then MUST have the given name. And if there are multiple occurrences, each instance MUST have the same name. The same name will also be used to define each element for any language based on the OPT.

As a consequence, there is no way to address a specific locatable by using the name. To be able to address a specific procedure instance using e.g. AQL or EHR_URI, some other mechanism must be used. We suggest either:

a) Use the index in the list, i.e. 0 for the first item, like /content[0 and name/value='P2_NO']
b) Use the UID for each locatable: /content[uid/value='some_guid' and name/value='P2_NO'] (here the name is not needed, since the GUID identifies a unique instance)
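
To make the two options concrete, here is a minimal sketch, assuming the content of a composition is held in memory as a plain Python list of dicts. The dict layout, the `guid-1`/`guid-2` values and the helper names `by_index` and `by_uid` are illustrative assumptions, not part of any openEHR API.

```python
from typing import Optional

# Hypothetical in-memory form of a composition's content list:
# two procedure instances that share the same constrained name.
content = [
    {"uid": "guid-1", "name": "P2_NO"},
    {"uid": "guid-2", "name": "P2_NO"},
]


def by_index(items: list, index: int, name: str) -> Optional[dict]:
    """Option a): address by position, using the name only as a sanity check."""
    if 0 <= index < len(items) and items[index]["name"] == name:
        return items[index]
    return None


def by_uid(items: list, uid: str) -> Optional[dict]:
    """Option b): the uid alone identifies the instance; the name is redundant."""
    return next((item for item in items if item["uid"] == uid), None)
```

Option a) only stays stable if the order of the list is preserved for the stored version; option b) requires every locatable to carry a uid.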

Note that the term definition will use the localised name. We think this is used for user-interface purposes like the archetype designer or a form renderer. But the data MUST have the given constraint for name.

Note also that the name of the item is defined by the primary language when generating the OPT. This means that the “same item” will have different paths (name/value) depending on the primary language of the OPT. This is something the person generating the OPT must consider. There is a risk of defining data that is not compatible between languages and installations.

Two differences between Ocean’s Template Designer and Better Archetype Designer:

a) Ocean’s does not update the term definition, only the constraint on name.
b) Better allows multiple occurrences of items with a constrained name.

thomas.beale · 24 January 2023 12:30

Option b) I think is very unpalatable, since it would force the addition of GUIDs absolutely everywhere, which could nearly double the size of EHRs, but also causes other problems (which I have analysed in the past for ISO 13606).

Option a) would be an ‘accession number’ or ‘sequence id’, the topic of SPECRM-63. If this were done, then it is a short step back to just appending that number to each new node when it is created, and we are back to names like ‘name_2’, ‘name#2’ or similar. If all openEHR systems assumed that a trailing ‘#xxx’ in the name field was the sequence id, then it becomes easy to figure out the ‘name’ (i.e. the bit before the ‘#’) and also no complexity is needed to find either a specific node (query with ‘name#N’) or all nodes with a certain name (query with ‘name*’ or ‘name$’).
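
As a rough illustration of the trailing-‘#’ convention described above, here is a minimal Python sketch. It assumes the sequence id is the run of digits after the last ‘#’ in the name; the helper names `split_name` and `find_all` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed convention: "<base>#<digits>", with the digits at the very end.
_SUFFIX = re.compile(r"^(?P<base>.*?)#(?P<seq>\d+)$")


def split_name(name: str) -> Tuple[str, Optional[int]]:
    """Split 'diagnosis#2' into ('diagnosis', 2); names without a suffix get None."""
    match = _SUFFIX.match(name)
    if match is None:
        return name, None
    return match.group("base"), int(match.group("seq"))


def find_all(names: List[str], base: str) -> List[str]:
    """Return every sibling name whose base part equals `base`."""
    return [n for n in names if split_name(n)[0] == base]
```

Finding one specific node is then a plain equality test on the full name, and finding all siblings with a given base name is the `find_all` filter.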

If we don’t do that, we need a specific field, and the current proposal is sequence_id. (I’m not sure if the proposal in that CR is too complicated though… we should revisit that).

Seref · 24 January 2023 13:13

That’s what I advocated as well.

I always find this requirement intriguing. Can you expand the AQL scenario a bit here? Specifically, how do you know the specific locatable is at index N? What makes it specific compared to others sitting in the same collection?

If there’s something about the element at index N that makes it unique, would it not be more robust to use that criterion in the AQL query? The other issue is the ordering of elements in an AQL result set: would JSON serialisation/deserialisation round trips and the technology used by the AQL implementation not make it hard, if not impossible, to assume or guarantee that your AQL query will always return element E at index N?

ian.mcnicoll · 24 January 2023 13:16

The main value of having a specific identifier for a specific path would be for referencing that item from an external system e.g FHIR or even internally.

I agree it is less likely as part of AQL, other than just to pick up that reference directly.

yampeku · 24 January 2023 13:21

also for transformation and validation

bna · 24 January 2023 15:06

I wonder what is implemented on this? Do all follow this principle?

bna · 24 January 2023 21:52

I am not sure if I am able to explain here - but I will give it a try.

To be able to query an item in a list like the example given with one or more ACTION.procedure placed in content some logic must be applied by the CDR. It has to read the incoming composition and build an index of the items. The index must be stored along with the data in the specific version.

The compositions must, of course, be immutable for each version. The serialized form with the index must not be changed.

Use cases for such queries are not common. I have used it a few times to query the first n items of a known collection. In these use cases the data within each item was not important - it was only necessary to query and display the first n items.

Most use-cases with AQL will query for specific data, i.e. the procedure name/type defined by some terminology.

bna · 25 January 2023 07:17

An obvious response to this could be: “Why not use order and limit in the AQL?”

I could have done that - but what to order by? Since the only knowledge I have about the data is the ordered sequence in the composition I have nothing else to order data on.

Seref · 25 January 2023 09:55

Thanks, I appreciate the time you took to respond @bna

I think I can see now that your response above was not a specific question, but more of an elaboration of the situation one could end up with if unique name constraint is not enforced, which we both agree that it should not be (enforced).

The clinical scenario you have in mind seems to be (I’m beginning to sound like clippy), I have a bunch of activities recorded, and they all have the same name, and if I need to reference one or more of these, then what?

It is a pretty good question and I think the answers would be different for AQL and EHR URL scenarios. I think AQL can do pretty well here given ACTION is a type likely to be supported in the FROM clause and AQL gives us the predicate constraint syntax. So you could eliminate/choose using ... ACTION act[atcode AND something_other_than_name/value = 'criteria'...] Syntax-wise, predicate constraints can be any property and logical operators are also supported, so it may help you (assuming I got things right).

EHR URL/path is more interesting because (as far as I can remember) it does not have predicates, which leads to a new can of worms:
what if we considered paths with predicates, and made EHR URLs the XPath to our XQuery (AQL)? If my memory is wrong and it does support predicates in paths, then the same solution as above applies.

I’ll stop now, in case I’m building a tower of assumptions here, but if I got it and you want to discuss a specific use case for accessing activities, happy to discuss that.

ian.mcnicoll · 25 January 2023 12:20

From memory, the original requirement to have unique names was really driven by the need for Template Designer to distinguish unique paths for different constraints on a ‘cloned’ node, rather than being primarily a run-time requirement. However, that behaviour drove CDRs to place pseudo-sequenceIDs onto sibling node names: ‘diagnosis#1’ etc.

That, in theory, has started to go away, at least in AD, where within the raw templates cloned nodes are given a different nodeId (ADL2 syntax), but this does not find its way into the OPT.

So let’s park the design-time need and concentrate on the run-time need to distinguish sibling nodes with identical paths, which is primarily about having a unique ID or path that can be referenced correctly. e.g to provide a FHIR resourceID, or a reference/path as part of an EHR_URI.

Whilst it is theoretically possible to construct a unique path based on predicates, I don’t think this is really practical, as the distinguishing attribute is largely impossible to predict

Let’s say the use case is a list of problem/diagnosis archetypes in a problem list, where each entry needs to be uniquely referenceable.

Example procedure list

Date	Procedure	Laterality	Comment
1995	Hip replacement
1996	Hip replacement		Tricky procedure
2021	Hip replacement	Left	First one failed
2021	Hip replacement	Right	Right implant failed

In this example the patient had a hip replacement in 1995 (laterality not documented) and the same in 1996; then in 2021 those replacements had to be redone.

Other than basing your predicate logic on every possible attribute/element in the Entry, it would be impossible to have any sort of generic approach to unique identification.
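
A small Python sketch of why this is hard to do generically, using the four rows of the table above as plain dicts. The function name `distinguishing_keys` is hypothetical; it simply searches for the smallest combinations of attributes whose values differ for every row.

```python
from itertools import combinations

# The four rows of the example procedure list; None marks an undocumented value.
rows = [
    {"date": "1995", "procedure": "Hip replacement", "laterality": None, "comment": None},
    {"date": "1996", "procedure": "Hip replacement", "laterality": None, "comment": "Tricky procedure"},
    {"date": "2021", "procedure": "Hip replacement", "laterality": "Left", "comment": "First one failed"},
    {"date": "2021", "procedure": "Hip replacement", "laterality": "Right", "comment": "Right implant failed"},
]


def distinguishing_keys(rows):
    """Return the smallest attribute combinations that are unique across all rows."""
    fields = list(rows[0])
    for size in range(1, len(fields) + 1):
        found = [
            combo
            for combo in combinations(fields, size)
            if len({tuple(row[f] for f in combo) for row in rows}) == len(rows)
        ]
        if found:
            return found
    return []


print(distinguishing_keys(rows))  # [('comment',)]
```

On these rows the only single-attribute key turns out to be the free-text comment (with its absence counting as a value), which is not something a query author could predict in advance; add one more uncommented row and the answer changes again.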

My gut feeling is that we need to go for uid (or uid_compositionId) at Entry level plus sequenceIds on multiple nodes like events, activities as well as clusters and elements - that is not so far from what is actually happening right now, other than that the sequenceId name/suffix approach is a bit hacky and non-standardised.

Seref · 25 January 2023 12:50

Allow me to disagree Ian

this,

and this are about the point of uniquely identifying data, but that requirement is context-free in a sense. Why are you in a need to uniquely identify these nodes? What is your “querying or referencing data” context? Your approach sounds to me to be predicated on the assumption that if you have some unique identifier for the node, that guarantees all requirements for querying and referencing sibling data that has the same at code sitting under a collection.

Guess what, it does not. That is because the unique identity of a node, such as the suggested guid, does not necessarily imply any clinical data semantics, and both querying and referencing are based on clinical data semantics. Your suggestion has a chicken and egg problem: you need to pull all the siblings first, filter out the data items (siblings) that fit your criteria, then use their guids to reference them, so you have to establish the association between the semantics (query criteria) and data identifiers (guids) for them to be useful. So you have to fall back to using predicate criteria to get to the guids you’re interested in in the first place. If the data is missing, as in your hip replacement example, you won’t even get to its guid btw.

It gets worse, you now lost the semantics if you build a url or a query (following the first one to get to uids first) using the uids because if you have EHR_URL_1 and EHR_URL_2 pointing at the same set of siblings using uids, just looking at the path of the URL won’t give you any clues as to what that/those guids are fetching. As in
ehr://../../action/../[ae213-2323-...]
and
ehr://../../action/../[ae213-2323-...]

vs

ehr://../../action/../[outcome = 'outcome 0']
and
ehr://../../action/../[outcome = 'outcome1']

(to reuse the examples above to clarify: you’ll need to use the second set to get to guids in the first place)

What entry-level uids may help with is allowing system implementers to distinguish between two sibling nodes both of which match the criteria of ehr://../../action/../[outcome = 'outcome 0'], but that’s an internal concern, much like the .getHashCode() or similar methods supported in mainstream OO languages. They are used all the time when you add/search items in collections, but you never see them exposed or used in actual code.

In light of all the headache I’ve given you above, please reconsider this statement. My objection is: it is not only practical, but also sensible and useful to use predicate-based paths, because they express meaning and stay closer to level 2 of two-level modelling, unlike guids, which fall down to level 1.

If I missed the point as I often do, I’ll buy you as per the usual arrangement…

bna · 25 January 2023 13:34

ian.mcnicoll:

Date Procedure Laterality Comment

1995 Hip replacement

1996 Hip replacement Tricky procedure

2021 Hip replacement Left First one failed

2021 Hip replacement Right Right implant failed

This is a great example - and the simple question to be implemented is: How do you make an unique EHR_URI to the first item in the table?

I am not sure how the table is generated. Some possibilities are:

Multiple instances of the same procedure archetype in a single composition
The result set from some AQL to get a list of procedure archetypes matching the procedure Hip replacement

It’s not important how the table is generated. My only concern is how to make a single, deterministic identifier for the first item. My use case could, for example, be to send Ian some background data about the first replacement in a structured way, to make it possible to update the laterality.

IMHO we need some rule to share between implementations and it can’t be based on data. The item might be selected by some user-interface where I click on the row. The application has no way to tell which data attributes defined my selection. Thus we need an index number where siblings are at the same level with the same name or the UID based approach.

IMHO all implementations today use the #-based approach, building a pseudo-index of siblings with the same name. This works reasonably well in our system and I assume in others as well. This pattern is based on the assumption that all names on siblings must be unique and as such addressable/locatable/pathable. If we remove the unique constraint on name, we need some other approach to get addressable paths.
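
The #-based pattern could be sketched roughly as follows. This is an assumption about its general shape, not a description of any particular CDR: here only names that occur more than once get a suffix, numbering starts at 1, and the function name `add_pseudo_index` is hypothetical.

```python
from collections import Counter
from typing import List


def add_pseudo_index(names: List[str]) -> List[str]:
    """Suffix repeated sibling names with '#N' in list order; leave unique names as-is."""
    totals = Counter(names)  # how often each name occurs among the siblings
    seen = Counter()         # running count per repeated name
    result = []
    for name in names:
        if totals[name] == 1:
            result.append(name)
        else:
            seen[name] += 1
            result.append(f"{name}#{seen[name]}")
    return result


print(add_pseudo_index(["diagnosis", "diagnosis", "sample"]))
# ['diagnosis#1', 'diagnosis#2', 'sample']
```

The resulting names are unique among siblings, so name-based paths stay addressable, at the cost of the stored name no longer matching the constrained name exactly.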

yampeku · 25 January 2023 13:41

Well, in principle you could get away with defining GUIDs to every part that can be potentially problematic with current usage. In particular I think it would solve almost all problems if you put guids just in the sibling ad_hoc sections (or in the ad_hoc structures with no real identifier)

bna · 25 January 2023 14:10

I am looking for a level 1 approach. My intention for this thread was to give my platform developers an answer on how to identify siblings with the same name. They should not need to know anything about the clinical context, only the specifications.

The identification of the siblings will be used in multiple software components like our Form Designer and Renderer, lists of clinical data with references back to the original, integration layer like two-way FHIR binding, etc.

Seref · 25 January 2023 14:27

Thanks @bna , my response so far was mainly addressing this bit of your input:

I consider these to be different than this:

A form designer processing a specific piece of data is quite different from the context I was referring to. In that case you do have a need to identify and distinguish data, which is what I briefly mentioned as:

Read the above as … allow form designer implementers to distinguish between two sibling nodes… and it justifies having uids.

I’ve been specifically referring to use cases for AQL and EHR URLs, which I have strong opinions about. Your form designer may be built on these and then uid would leak into these features/concepts, which is something I’m strongly against, but I’m not working at DIPS, so I’ll leave it at that

thomas.beale · 25 January 2023 14:46

Well, let’s hope that the ‘dates’ are a bit better than just the year

Well that’s probably true, hence the idea of having a sequence_id attribute, so you are guaranteed that that will always be there no matter what could be in other fields.

With the addition of Guids on Compositions as well, that’s pretty much my view as well. It’s guaranteed to work and doesn’t overload the EHR with billions (yes) of Guids, mostly useless.

Well, the main needs were mentioned earlier - to be able to provide a reliable id of a specific atom (e.g. the systolic pressure of the 38th BP sample in a series) to some other system or service, or even for internal use - to generate a citation to just that path.

You are right in your general supposition that you won’t get to the atom via normal querying. Instead a normal AQL query will bring back (say) 50 BP samples, and maybe graph them. You see a 180 mmHg on the 38th sample, and choose it somehow on the UI. Now the application can obtain the runtime path of just that item, and to do so, it would use the sequence_id field (plus Guid of enclosing Entry etc), assuming we had such a field.

I think the first bit is correct; but you don’t need Guids on each item, you just need some reliable id - hence the sequence_id proposal.

Good analysis - this is another reason the universal Guid approach isn’t a good one. Whereas with the sequence_id approach, you will use predicates of the form [atcode='at0051'] or [name/value='BP measurement'] to get the siblings in the first place, and then you can create fully unique paths containing predicates like [atcode='at0051' and sequence_id='38'] or you could do [name/value='BP measurement' and sequence_id='38']. In fact, you would just be able to do [sequence_id='38'] if you really want, since those sequence numbers are always unique amongst siblings. And it might make sense to generate such short paths as a kind of direct ref. But as you say, the semantics are no longer visible.

If I understand correctly, your objection #1 is

don’t avoid putting the semantic signifier in the predicate, otherwise we don’t know what it means

And your objection #2 is:

don’t use name/value='BP measurement' in such predicates, and you are saying that we should use only the semantic signifier (at- or id-code) and some unique qualifier.

I agree with both of these. The name field will not tell you anything more than the at- or id-code (prove me wrong, someone ;), and it could be anything, whereas the at/id code is by definition the semantic indicator.

The 3rd part of your position is: to get a unique path, use predicates of the form [atcode=X and some_other_field=unique_qualifier].

Also correct in my view.

@ian.mcnicoll is saying: yes, but it can’t be just any unique qualifier, because you never know which one will be unique. Probably not 100% literally true, but it is practically true - query authors don’t have unlimited time to figure out which one it could be.

I also agree with this.

So, marrying all that together, we should:

use Guids in key places, generally Composition and Entry
include sequence_id on every LOCATABLE node (only useful on nodes inside containers, but that’s most nodes)
initial querying could be on at/id-code OR name/value. At/id-code will get all the siblings; name/value might select just a subset, and so generate a more relevant initial selection list
construct runtime path predicates using the form [atcode='at0051' and sequence_id='38']
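
A minimal sketch of the last two points, assuming sibling nodes are plain dicts. Note that sequence_id is only a proposal at this stage, so the field, the numbering from 1 and the helper names `assign_sequence_ids`, `select` and `runtime_predicate` are all assumptions for illustration.

```python
from typing import List, Optional


def assign_sequence_ids(siblings: List[dict]) -> List[dict]:
    """Number siblings 1..N in container order (hypothetical sequence_id field)."""
    return [dict(node, sequence_id=i) for i, node in enumerate(siblings, start=1)]


def select(siblings: List[dict], atcode: str, sequence_id: Optional[int] = None) -> List[dict]:
    """Initial selection by at-code; optionally narrow down to one sequence_id."""
    return [
        node
        for node in siblings
        if node["atcode"] == atcode
        and (sequence_id is None or node["sequence_id"] == sequence_id)
    ]


def runtime_predicate(node: dict) -> str:
    """Build a predicate of the form [atcode='at0051' and sequence_id='38']."""
    return f"[atcode='{node['atcode']}' and sequence_id='{node['sequence_id']}']"


# 50 BP samples sharing the same at-code and name.
samples = assign_sequence_ids(
    [{"atcode": "at0051", "name": "BP measurement"} for _ in range(50)]
)
print(runtime_predicate(select(samples, "at0051", 38)[0]))
# [atcode='at0051' and sequence_id='38']
```

The at-code query returns all 50 siblings; the sequence_id then singles out exactly one of them while the predicate still shows the semantic code.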

Are we all on the same page?