Found an interesting case when an OPT might be invalid and I think modeling tools are not checking it

pablo · 20 February 2023 20:35

Someone reported a problem in our openEHR SDK involving the RM data generator and validation on EHRBASE: [INGEN] generated instances are invalid - attributes have 2 occurences whereas OPT tells 0..1 (Issue #149, ppazos/openEHR-OPT, GitHub).

Context: I understand in AOM 1.4 if a sibling C_OBJECT has the same node_id, the name should be different, so we could have CLUSTER.items containing an ELEMENT at0002 that has name “xxx”, another ELEMENT at0002 that has name “yyy” and yet another ELEMENT at0002 with name “zzz”.

If that is correct, we need to consider what happens if someone adds two ELEMENT at0002 with name “xxx” in the archetype or template. IMO there should be an invariant rule for AOM 1.4 ensuring that can’t happen.

The problem gets more complicated because the C_OBJECT constraining the ELEMENT.name could be a constraint for a coded text instead of a string constraint for a single value like “xxx” or “yyy”. So we could have a list of possible names that are coded. Then, if two sibling nodes at0002 have C_CODE_PHRASE constraints for the name (DV_CODED_TEXT defining_code), the first ELEMENT could allow at1111, at1112, at1113 as the name codes, and the second could allow other codes, or even the same ones! This creates a problem analogous to allowing two sibling nodes to have the same node_id and the same name.value.

Or, even more complicated, one ELEMENT could have a String constraint for the name like “xxx” while a sibling with the same node_id has a code list in which one of the codes corresponds to the name.value “xxx”.
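The three cases above (same string, overlapping code lists, string vs. code text) can be sketched as one check. This is only a rough Python illustration, not code from the openEHR SDK or any modeling tool: `NameConstraint`, `Node`, `names_may_collide`, `ambiguous_siblings` and the `term_text` map (code to text) are hypothetical names, and it assumes every sibling has an explicit name constraint.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class NameConstraint:
    """Hypothetical, simplified view of the constraint on a node's name."""
    values: frozenset = frozenset()  # allowed name.value strings
    codes: frozenset = frozenset()   # allowed name.defining_code code_strings


@dataclass(frozen=True)
class Node:
    node_id: str
    name: NameConstraint


def names_may_collide(a, b, term_text):
    """True if at least one name is allowed by both constraints."""
    # Case 1: both siblings allow the same code.
    if a.codes & b.codes:
        return True
    # Case 2: compare texts, so a string constraint can be matched
    # against the text of a code allowed by the other sibling.
    texts_a = set(a.values) | {term_text[c] for c in a.codes if c in term_text}
    texts_b = set(b.values) | {term_text[c] for c in b.codes if c in term_text}
    return bool(texts_a & texts_b)


def ambiguous_siblings(siblings, term_text):
    """Pairs of siblings with the same node_id whose names may collide."""
    return [
        (x, y)
        for x, y in combinations(siblings, 2)
        if x.node_id == y.node_id and names_may_collide(x.name, y.name, term_text)
    ]


term_text = {"at1111": "xxx", "at1112": "yyy", "at1113": "zzz"}
siblings = [
    Node("at0002", NameConstraint(values=frozenset({"xxx"}))),
    Node("at0002", NameConstraint(codes=frozenset({"at1111", "at1112"}))),
    Node("at0002", NameConstraint(codes=frozenset({"at1112", "at1113"}))),
    Node("at0003", NameConstraint(values=frozenset({"xxx"}))),
]
conflicts = ambiguous_siblings(siblings, term_text)
```

Here the first and second nodes conflict through the text “xxx”, and the second and third through the shared code at1112; the at0003 node is ignored because its node_id differs.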

The issues are:

1. Modeling tools don’t seem to check this.
2. I’m not sure the specs contain such a rule.
3. Any data generated from such a template might lead to validation errors. For instance, the OPT allows 0..1 occurrences for each at0002 node, but if two of them have the same node_id and the same name, the data contains 2 occurrences, because validation matches on node_id and name.
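To make the last point concrete, here is a minimal Python sketch of a validator that counts data siblings by (node_id, name). It is a simplification for illustration, not the actual EHRBASE logic; `occurrence_violations` and its arguments are hypothetical.

```python
from collections import Counter


def occurrence_violations(data_items, max_occurrences):
    """Return the (node_id, name) keys that occur more often than allowed.

    data_items: list of (node_id, name_value) tuples for sibling data nodes.
    max_occurrences: dict mapping (node_id, name_value) to the upper bound,
    or None when the upper bound is unbounded.
    """
    counts = Counter(data_items)
    return {
        key: n
        for key, n in counts.items()
        if max_occurrences.get(key) is not None and n > max_occurrences[key]
    }


# The OPT has two at0002 nodes, each 0..1 and both named "xxx".
# The generator emits one instance per OPT node, so the data has two.
data = [("at0002", "xxx"), ("at0002", "xxx")]
limits = {("at0002", "xxx"): 1}
violations = occurrence_violations(data, limits)
```

Each OPT node is satisfied on its own, but because both map to the same key the count is 2 against an upper bound of 1.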

On my side, the data generator based on OPTs is not checking if the OPT is correct in terms of the name constraints consistency, so it generates some data that is detected as invalid in EHRBASE, but the main issue I believe is in the template itself and in modeling tools allowing such cases.

To protect my users from generating wrong data I would add an extra rule to verify OPTs so when they generate data they can know if the OPT is not valid.

ian.mcnicoll · 21 February 2023 09:14

I think the tools do at least some checking/enforcement, Pablo, but this may need to be clarified in the specs. It would be good to nail down the rules for the use of ‘indexed names’.

If I clone a node in a template, AD will automatically rename it e.g. ‘value #2’, so that the design-time path is always unique, and you are correct that this does need to handle different types of leaf constraint or even datatype. It will not allow cloned nodes to have the same name/value.

In other words, the design-time paths are always unique, either by adding some sort of numeric suffix or by enforcing use of a unique name. We all agree this is less than satisfactory but that is another story!
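As a rough illustration of that renaming behaviour (not AD's actual implementation), a Python sketch could look like this. The ` #n` format is taken from the ‘value #2’ example above; `clone_name` is a hypothetical helper.

```python
def clone_name(base, existing):
    """Return the first 'base #n' (n >= 2) not already used by a sibling."""
    n = 2
    while f"{base} #{n}" in existing:
        n += 1
    return f"{base} #{n}"


first_clone = clone_name("value", {"value"})
second_clone = clone_name("value", {"value", "value #2"})
```

With these inputs the first clone becomes ‘value #2’ and the second ‘value #3’, so each sibling keeps a distinct name and therefore a distinct design-time path.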

You can only clone nodes which were originally 0..*, which are then set to 0..1 but can be re-cloned.

As far as I’m aware, the constraints do end up correctly in the .opt.

pablo · 21 February 2023 14:17

Thanks @ian.mcnicoll, the most difficult case to detect is when two siblings with the same node ID have DV_CODED_TEXT constraints in the names and both are lists; then the modeling tool would need to verify that the lists don’t have items in common.

About AD, if you see the template attached to the issue reported in the openEHR-OPT project, that was actually generated by AD, and there are two problems:

1. Exactly the problem I mentioned above: two lists with codes in common for the names of 2 sibling nodes with the same node ID.
2. For the name, they added a list constraint for the code_string AND a string constraint for the name.value, which shouldn’t be allowed (the name.value should depend on the name.defining_code.code_string).
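The second problem could be detected per node with something like the Python sketch below. It is only an illustration: `check_name_constraint`, its arguments and the `term_text` map (code to text) are hypothetical, and it is not the check implemented in any existing tool.

```python
def check_name_constraint(values, codes, term_text):
    """Return a list of problems found in one node's name constraint.

    values: allowed name.value strings (may be empty).
    codes: allowed defining_code code_strings (may be empty).
    term_text: dict mapping each code to its text.
    """
    problems = []
    if values and codes:
        problems.append("name constrains both name.value and defining_code")
        # The value should follow from the code, so any string that is not
        # the text of an allowed code cannot be satisfied consistently.
        code_texts = {term_text[c] for c in codes if c in term_text}
        stray = sorted(set(values) - code_texts)
        if stray:
            problems.append(f"name.value {stray} matches no allowed code")
    return problems


term_text = {"at1111": "xxx", "at1112": "yyy"}
mixed = check_name_constraint({"zzz"}, {"at1111", "at1112"}, term_text)
coded_only = check_name_constraint(set(), {"at1111"}, term_text)
```

A name constrained only by codes, or only by a string, passes; the mixed case is reported, together with any string that no allowed code can produce.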

So I’m guessing some checks are not there, and also that some spec clarifications might be needed.

That OPT is the case you think will never happen, but since it’s allowed somewhere, it happens.

The person using the OPT told me it was downloaded from the CKM (Clinical Knowledge Manager), and it seems you are the author.

pablo · 21 February 2023 23:04

@ian.mcnicoll since the OPT doesn’t have the right indentation on some parts of the XML, I think that was generated using the AD, right?

What I can do is upload it there and try to validate it, do you know if there is any functionality to run a technical validation of the OPT?

With that we will know for sure whether the AD is actually allowing these invalid constraints to be generated in an OPT.

I’m working on adding this specific check to the openEHR SDK and the openEHR Toolkit in order to detect potential issues before running any operations using incorrect OPTs.

pablo · 21 February 2023 23:34

@ian.mcnicoll I have tested this in the Archetype Designer and there is a serious problem in the OPT export there. I have shared a video privately, but I can publish it; maybe it will help the colleagues from Better check the AD and fix what is wrong.

It seems the CKM isn’t detecting the issue either.

thomas.beale · 11 July 2023 23:15

Sorry, only just saw this… we may need to clarify a bit here. In an ADL archetype, two sibling nodes cannot have the same node_id. But in data, two data siblings could be derived from the same archetype node, and they could therefore carry the same node id. They should be distinguished in the name field.

This could never happen in a tool that implements ADL2 properly.

bna · 12 July 2023 05:42

Correct - this will break things.

I think this is related to this thread: “Named element - and occurences”. There the issue is that it is possible to give an element (in a template) a specific name and still have the element’s occurrences set to unlimited.

Then we find a situation where we have two ELEMENTs at0002 with name “xxx” in the template.

For data we use the “well-known” algorithm of adding a suffix. For all elements where there are occurrences greater than one: add suffix #{n}-