OPT2: node identification scheme

Reviving an old thread… we’re looking at OPT2 details right now, and I’m thinking about @pieterbos’s issues on OPT2 (reachable from this PR).

Originally (about 10 years ago) I thought we would just generate a simple OPT structure consisting of C_COMPLEX_OBJECTs and C_PRIMITIVE_OBJECTs (and any remaining ARCHETYPE_SLOTs).

Then we ran into two issues (at least the first described by @ian.mcnicoll in the past):

  • sometimes there are slots with very specific node-ids, e.g. something like ‘work contact’ and ‘home contact’ (maybe in some SDOH kind of archetype) that would both be filled by the same generic CLUSTER.contact_info archetype. We would want to retain the owning archetype’s id-codes (i.e. the work and home) because if we lose them in the data, we can’t distinguish which is the work and which the home data structure - the archetype id CLUSTER.contact_info is no help.
  • the reverse situation can happen: two generic slots e.g. something like ‘other_details’ and ‘extension’ (ok, bad modelling, but still…) being filled by two very specific archetypes, e.g. ‘care_plan details’ and ‘financial data’. In this case, the owning archetype node ids are not very helpful, it’s the filler archetypes that tell you what the data really are.

In summary we may need both the id-code and the filler archetype id at any archetype root point in an OPT in order to form useful paths for AQL queries.

The real problem is that our current approach to marking nodes in an OPT and therefore forming paths is not quite good enough. The current approach has the LOCATABLE nodes being filled as follows:

[openEHR-EHR-COMPOSITION.encounter.v1]
  /content
    [openEHR-EHR-OBSERVATION.ear_exam.v1]
      /data
        [id29]
          [id40]
            /items
              [openEHR-EHR-CLUSTER.device.v1]
              etc
          [id72]
             etc

It is at those root points with the archetype ids that we want to retain the id-codes from the parent archetype nodes. We could do that in the OPT if we change the rules, but not (currently) in the data.

To maintain the id-codes on all nodes (and not get confused as to which archetype any id-code belonged to), we would need some scheme like the following:

[archetype=openEHR-EHR-COMPOSITION.encounter.v1]
  /content
    [archetype=openEHR-EHR-OBSERVATION.ear_exam.v1
     node_id=openEHR-EHR-COMPOSITION.encounter.v1::id4]
      /data
        [node_id=openEHR-EHR-OBSERVATION.ear_exam.v1::id29]
          [node_id=openEHR-EHR-OBSERVATION.ear_exam.v1::id40]
            /items
              [archetype=openEHR-EHR-CLUSTER.device.v1
               node_id=openEHR-EHR-OBSERVATION.ear_exam.v1::id47]
              etc
          [node_id=openEHR-EHR-OBSERVATION.ear_exam.v1::id72]
             etc

In the above, the ‘node_id’ meta-attributes can be used like proper terminology codes, no need to go looking for what archetype they came from. So something like node_id=openEHR-EHR-OBSERVATION.ear_exam.v1::id47 is a fully qualified code telling us what the node is statically defined to mean in the archetype. We also don’t lose id-codes from slot nodes.
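As a rough illustration of treating such a value as a fully qualified code, here is a minimal Python sketch that splits it on the `::` separator used above. The names `QualifiedNodeId` and `parse_qualified_node_id` are hypothetical, invented for this example; they are not part of any openEHR specification or library.

```python
from typing import NamedTuple


class QualifiedNodeId(NamedTuple):
    """Hypothetical holder for a fully qualified node id."""
    archetype_id: str  # e.g. "openEHR-EHR-OBSERVATION.ear_exam.v1"
    id_code: str       # e.g. "id47"


def parse_qualified_node_id(text: str) -> QualifiedNodeId:
    """Split 'archetype_id::id-code' into its two parts."""
    archetype_id, separator, id_code = text.partition("::")
    if not separator or not archetype_id or not id_code:
        raise ValueError(f"not a fully qualified node id: {text!r}")
    return QualifiedNodeId(archetype_id, id_code)


qualified = parse_qualified_node_id("openEHR-EHR-OBSERVATION.ear_exam.v1::id47")
```

With this, `qualified.archetype_id` names the defining archetype and `qualified.id_code` the code within it, so no further lookup is needed to know which terminology the code belongs to.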

The above would of course make for very long paths, but the node codes would be fully specified, just like ‘snomed_ct::123456789’ or similar. That could be shortened by using some replacement for the long archetype ids, e.g. 10-digit codes similar to SNOMED.

Another difference of this scheme is that we would not have [archetype_id] path predicates as such, only predicates of the form [archetype_id::node_id].

At the root points we also have ‘archetype=xxx’ (LOCATABLE.archetype_details) to indicate we have changed archetypes via slot filling or external reference. This might be statically defined (i.e. at modelling time in an archetype or template) or dynamically filled.

The above would lead to different archetype paths than we have today, although the two varieties are easily machine mappable.
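To make the mapping concrete, here is a small Python sketch of one direction only: from the fully qualified scheme to today's predicates. It assumes each path segment is given as a hypothetical `(attribute, node_id, archetype)` tuple, where `archetype` is set only at archetype root points. Going the other way would need the OPT at hand, since today's root-point predicate does not carry the slot's id-code.

```python
from typing import List, Optional, Tuple

# (attribute name, fully qualified node id, archetype id at root points or None)
Segment = Tuple[str, str, Optional[str]]


def to_current_path(segments: List[Segment]) -> str:
    """Render qualified segments using today's predicate conventions."""
    parts = []
    for attribute, node_id, archetype in segments:
        if archetype is not None:
            # Root point: today's paths carry the filler archetype id.
            predicate = archetype
        else:
            # Interior node: keep only the bare id-code after '::'.
            predicate = node_id.partition("::")[2]
        parts.append(f"/{attribute}[{predicate}]")
    return "".join(parts)


segments = [
    ("content",
     "openEHR-EHR-COMPOSITION.encounter.v1::id4",
     "openEHR-EHR-OBSERVATION.ear_exam.v1"),
    ("data", "openEHR-EHR-OBSERVATION.ear_exam.v1::id29", None),
]
current_path = to_current_path(segments)
```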

Thoughts on whether this or some other improved scheme is worth looking at?

In the OPT itself, there are archetype roots in place wherever an archetype root is encountered, including at a filled slot. Such a root can contain both a node id and an archetype id. In Archie we preserve both when creating an OPT 2.

Then in data of course, that’s the big question. Because we encountered this same problem, and did not want to resort to ADL 1.4 style name-based uniqueness constraints, we did the following in data:

  • store the node id in archetype_node_id
  • store the archetype id and template id in archetype_details.

That’s a bit non-standard at the moment, but it is also very easily converted back to standard data.

Then we made the path lookup so that it works either on the node id or the archetype id - since both are stored in the data. The node id is always unique and you don’t need an additional archetype id in the path queries.
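A minimal Python sketch of that kind of lookup follows. It is not Archie's implementation. `Locatable` and `ArchetypeDetails` are cut-down stand-ins for the RM classes, and the id-codes `id5`/`id6` and the full archetype id `openEHR-EHR-CLUSTER.contact_info.v1` are made up for the work/home contact example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArchetypeDetails:
    archetype_id: str
    template_id: Optional[str] = None


@dataclass
class Locatable:
    archetype_node_id: str  # always the node id in this approach
    archetype_details: Optional[ArchetypeDetails] = None


def matches(node: Locatable, predicate: str) -> bool:
    """True if the predicate equals the node id or the archetype id."""
    if node.archetype_node_id == predicate:
        return True
    details = node.archetype_details
    return details is not None and details.archetype_id == predicate


def find(children: List[Locatable], predicate: str) -> List[Locatable]:
    return [child for child in children if matches(child, predicate)]


contact = "openEHR-EHR-CLUSTER.contact_info.v1"  # hypothetical archetype id
work = Locatable("id5", ArchetypeDetails(contact))
home = Locatable("id6", ArchetypeDetails(contact))

by_node_id = find([work, home], "id5")        # only the work contact
by_archetype_id = find([work, home], contact)  # both contacts
```

The node id picks out exactly one of the two slot fills, while the archetype id alone cannot tell them apart.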

This approach would require minimal adaptations in the specification and no changes to AQL or path queries.

Using simple node name overrides in 1.4 to disambiguate slot-fills with the same archetypeId is clearly problematic, but I thought ADL2 solved this since the renames are backed by a specialised node/overlay.

Do I have that wrong? I’m not sure that forcing the child node to carry the slot id is the answer, and at least in some cases it might cause even more confusion, e.g. when there are 2 slot fills with the same archetypeID.

I think I prefer something similar to Pieter’s suggestion which, as I understand it, means carrying both the child and parent node ids (and a specialised node id in the child?).

I don’t think the ‘reverse’ situation, i.e. different archetypes in a generic slot, is at all problematic, or even avoidable. Clearly, over time, we are likely to be able to extend archetypes to include more specific slots, but I do not see it as a particular issue right now.

It does; the node identification in source artefacts (including templates) is not the issue, it’s what gets into a) OPTs and b) the data, and then c) what is visible to Querying.

Ok.

Hmm - but I thought those specialised NodeIds did get into the opt and the data, clearly I’m wrong!

I checked the OPT spec. It does not list anywhere that node Ids should be removed or changed. So that does not need a change because of this, I would say, and I would consider an OPT 2 without the specialized node id as not adhering to the specification.

Then there is this in the RM spec, Common Information Model:

The only exception is at archetype root points in data, where archetype_node_id carries the archetype identifier in string form rather than an interior node id from an archetype.

That is a problem, and it would need a change. That is not the ADL 2 spec having the issue, but the RM (common package). We ignored this bit of the specification in our CDR and in bits of Archie, and we do store the specialized node id in the archetype node id field of Locatable.
Of course converting this back to ADL 1.4 style data is very easy.
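As a sketch of what that conversion could look like, the Python below copies a data tree and, at every archetype root point, overwrites archetype_node_id with the archetype id from archetype_details, as the RM text quoted above requires. Nodes are plain dicts and the `children` key is a simplification made up for this example; real RM data has typed attributes. Note that the slot's id-code is dropped from the converted copy.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def to_standard(node: Node) -> Node:
    """Return a copy using the archetype id as node id at root points."""
    converted = dict(node)
    details = node.get("archetype_details")
    if details is not None:
        # Root point: standard data carries the archetype id here.
        converted["archetype_node_id"] = details["archetype_id"]
    converted["children"] = [to_standard(c) for c in node.get("children", [])]
    return converted


stored = {
    "archetype_node_id": "id4",
    "archetype_details": {"archetype_id": "openEHR-EHR-OBSERVATION.ear_exam.v1"},
    "children": [{"archetype_node_id": "id29"}],
}
standard = to_standard(stored)
```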

Even at an archetype root point? So the paths formed by going down the tree would be misleading then (unless archetype_details was rigorously being taken into account), since it would be id-codes all the way down, but from different archetypes.

I still think we might consider a safer system that uses only fully-qualified node-ids in Locatables - the current schemes rely on inspecting more than one attribute, and could easily be misused.

Late to the party here: are you suggesting we need two different attributes, one node_id that always holds the node id and (maybe) an archetype_id just for the archetype root nodes? If that is correct, I guess we don’t need the archetype_node_id.

Yes. And that attribute is already there: it is called archetype_details. Together with archetype_node_id, that stores all we need.

Path lookup in Archie works on both fields. But the archetype id in paths is not necessary for consistent lookup.

AFAIK archetype_ids in paths are included only for absolute paths, not for archetype paths. Why is the archetype_id in a path? (maybe it’s something about ADL2 that I’m not familiar with)

If you are processing an ADL2 OPT built this way (no archetype_ids in the archetype_node_id attribute), e.g. to visualise it with correct terminology meanings, it means the code is checking if archetype_details is null or not, and if it is not, using the archetype_id it finds as the new scoper of the id-codes starting below that node?

Doing this at a minimum requires using a Stack implementation so that when the traversal drops out of a particular subtree, the correct scoping archetype is again established.
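A minimal Python sketch of such a stack-based traversal, under the assumptions in this thread: nodes are plain dicts, `children` is a made-up simplification, a node's own id-code is scoped by the enclosing archetype, and archetype_details on a node switches the scope for everything below it. The codes `id1` and `id2` are invented for the example; the rest mirror the tree earlier in the thread.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def qualify_codes(node: Node, scope: List[str], out: List[str]) -> None:
    """Collect 'archetype_id::id-code' for every node, depth first."""
    details = node.get("archetype_details")
    if scope:
        owner = scope[-1]  # code belongs to the enclosing archetype
    elif details is not None:
        owner = details["archetype_id"]  # top-level root scopes itself
    else:
        raise ValueError("top-level node has no archetype_details")
    out.append(f"{owner}::{node['archetype_node_id']}")

    if details is not None:
        scope.append(details["archetype_id"])  # enter the new archetype
    for child in node.get("children", []):
        qualify_codes(child, scope, out)
    if details is not None:
        scope.pop()  # leaving the subtree restores the previous scope


def root(code: str, archetype_id: str, children: List[Node]) -> Node:
    return {"archetype_node_id": code,
            "archetype_details": {"archetype_id": archetype_id},
            "children": children}


COMP = "openEHR-EHR-COMPOSITION.encounter.v1"
OBS = "openEHR-EHR-OBSERVATION.ear_exam.v1"
DEV = "openEHR-EHR-CLUSTER.device.v1"

device = root("id47", DEV, [{"archetype_node_id": "id2"}])
id29 = {"archetype_node_id": "id29", "children": [
    {"archetype_node_id": "id40", "children": [device]},
    {"archetype_node_id": "id72"},
]}
tree = root("id1", COMP, [root("id4", OBS, [id29])])

codes: List[str] = []
qualify_codes(tree, [], codes)
```

Here `id72` comes out scoped by the OBSERVATION archetype again, because the CLUSTER scope was popped when the traversal left the device subtree.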

This would obviously work - I just want to be sure that’s what you are actually doing. It’s a bit obscure to the unwary programmer or data user, because having id-codes all the way down the ‘spine’ of an OPT structure that actually contains multiple plugged-in archetypes would normally imply that all of those codes are within the same qualifying scope (i.e. archetype).

If you mean ‘given an instance of something in the RM, find the corresponding node in the OPT and its archetype term from the terminology in the given language’, then yes. Likely the code is quite similar to what is necessary for validation of data instances against an OPT as well.

For retrieving the correct terminology in the OPT, it looks like in Archie I made some kind of path query syntax with both a node id and an archetype id in the predicates. This might very well be non-standard and in need of a better version. At least, that’s what it looks like when looking at the implementation of getChildArchetypeId as called in archie/aom/src/main/java/com/nedap/archie/aom/OperationalTemplate.java (master branch of openEHR/archie on GitHub), and looking at the toString() method of PathSegment.
