I’m going to shoot a horror movie for programmers one day, about what GUIDs do to pretty much all persistence technologies in terms of indexing… The current working title is something along the lines of “Attack of the entropy: the demise of the inverted index”.
Stoopid question (asking for a friend?) but is that not quite challenging to achieve for a specific node, across all possible composition versions?
when a new node is added (e.g. imagine some Elements under some particular Cluster) it gets a sequence id (say, 5). That never changes from then on. The next sequence id to be used needs to be recorded somewhere. In a new version, some more nodes get added. Each time the sequence_id is assigned, and incremented. In the next version our original node (id=5) gets deleted. New nodes get added. And so on.
As long as the ids are never re-used and the next id to assign is always remembered correctly, the scheme works. Remembering the next id is necessary because the current highest-id node(s) could be deleted in some version, e.g. 20, 21, 22; the next node assigned will still get 23.
So it works on the same logic as terminology - never re-use a code, and always generate new codes in a reliable way (incrementing, or some other more interesting scheme).
We do this with id-codes in the ADL Workbench, and in any other ADL2-based tool, so that archetype paths are preserved over versions.
The interesting question is: where does the ‘next id to assign’ get stored? There are various possible answers. But the scheme is not complicated.
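To make the allocation rule described above concrete, here is a minimal sketch in Python (hypothetical class and method names, not any real openEHR API). The key points are that the `next_id` counter is persisted alongside the container and only ever increments, so deleted ids are never handed out again:

```python
# Sketch of the 'never re-use' sequence id allocation rule.
class NodeIdAllocator:
    def __init__(self):
        self.next_id = 1          # persisted with the container in a real system
        self.live_nodes = {}      # sequence_id -> node payload

    def add_node(self, payload):
        node_id = self.next_id
        self.next_id += 1         # incremented on every assignment, never reset
        self.live_nodes[node_id] = payload
        return node_id

    def delete_node(self, node_id):
        del self.live_nodes[node_id]   # the id is retired, never re-used

alloc = NodeIdAllocator()
for i in range(22):
    alloc.add_node(f"element-{i}")    # assigns ids 1..22
alloc.delete_node(20)
alloc.delete_node(21)
alloc.delete_node(22)
assert alloc.add_node("new element") == 23   # deleted ids are not re-used
```

This mirrors the terminology analogy: codes (ids) are generated reliably and never recycled, so paths built on them stay stable across versions.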
Sorry, this can’t be an argument in 2023. As a percentage of the bytes needed to store a full composition, the UIDs are just a small fraction, and today storage is cheaper than ever. Also, LOCATABLE.uid could store OIDs, which, if well designed, could use less space than a UUID. Again, if you store UUIDs as strings, that takes more space than storing them as 128-bit numbers, which is what they really are. So we are talking about 16 bytes per UUID stored as a number.
Then the key is to use them where they are needed, not everywhere. Again, the question to ask, with data to back it up, is: what percentage of the total space does having UIDs where needed require? Then compare that to providing a hack to the model in order to avoid storing that extra data. Finally, is the solution really worth it? I mean: 1. it adds extra complexity, and 2. does it really save that much data?
If you show me a real estimation of how much extra data is needed, and that is significant, I’ll shut up.
BTW, you need to store the sequence_id too, which, if it is an int, will take 32 bits; a long will take 64, vs the 128 bits needed by the current UID. Are we really optimizing at that level?
I think it’s trying to play the role of the locatable.uid, with the uid you know which entry is which even if they have the same name in a collection container. That is why I don’t quite understand the idea of using yet another identifier instead of our old friend the UID
I know. Everyone always says that. I did a set of calcs for the NHS taking into account a careful size estimate of openEHR data, and worked out the incremental cost increase of RAID 10 storage in a data centre, with NHS pricing. It turns out that universal Guids cost real money when you are into terabytes, RAID 10 etc.
I’ll have a look around to see if I still have them.
Exactly right - just what I am advocating (Compositions, Entries, Parties, Plans…).
That is actually a good point. I would use character strings.
When you have terabytes (4TB was the size of one UK GP system storage requirement for probably 10m EHRs in the UK, about 10y ago), and peta-bytes over the long term, you always optimise. If you don’t, you’re always needlessly burning money and other resources.
I am however more interested in semantic reasons for not putting Guids everywhere, rather than reasons of space economy…
One thing to note: LOCATABLE.uid is a String field, not an Integer field.
This is a great and important discussion. In this post I put forward two postulates regarding sequence identifiers. Let’s see if we can agree on them:
- Postulate one: One sequence for each archetype_node_id at the same level
- Postulate two: There is no need to persist the sequence identifier across versions of the COMPOSITION.
Postulate one: One sequence for each archetype_node_id at the same level
Given a container at any level of a COMPOSITION which allows more than one item with the same archetype_node_id
Then the sequence number must be shared across all instances of that archetype_node_id.
One simple example
Given an archetype with a node with archetype_node_id at0003
And this node has the multiplicity 0…*
In a template this node might be cloned and given a name constraint.
The “original” node does not have a name constraint and can have any possible name.
Let’s say the node at0003 has the English term “Comment”.
In a template the node might be cloned and given the name “Other”
We assign A as identifier of the original node and B to the cloned node
In the data the client/user adds multiple instances of the node.
- A1 and A2 can have any name. Current best practice is to use the hashtag pattern, like “Comment#1”.
- Since the name of A is unconstrained it is also allowed to give A2 the name “Other”.
- B1 is constrained on the name and must have the name “Other”
This gives the following names of the nodes
- A1 => name: "Comment#1", value: "openEHR is great"
- A2 => name: "Other", value: "The RM shines over healthcare"
- B1 => name: "Other", value: "DIPS provides great software for health providers"
- B2 => name: "Other", value: "Norway has the biggest ski-jump hill in the world"
Given this dataset in a COMPOSITION, there is no way to tell that A2, B1 and B2 come from different definitions in the template. Since A2 was given the name “Other”, it looks identical to B1 and B2.
The sequence identifiers for this dataset might be:
- A1 = 1
- A2 = 2
- B1 = 3
- B2 = 4
This gives us postulate one:
There must be one sequence of ids for all nodes at the same level with the same archetype_node_id
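Postulate one can be sketched in a few lines of Python (hypothetical function and data shapes, just to illustrate the rule): one counter per archetype_node_id at a given level, shared regardless of which template clone (A or B) an instance comes from:

```python
# Sketch of postulate one: one shared sequence per archetype_node_id
# at the same level, independent of which template clone the node is.
from collections import defaultdict

def assign_sequence_ids(siblings):
    """siblings: list of (archetype_node_id, label) in document order."""
    counters = defaultdict(int)
    result = []
    for node_id, label in siblings:
        counters[node_id] += 1            # one counter per archetype_node_id
        result.append((label, counters[node_id]))
    return result

# A and B are clones of the same node, so they share archetype_node_id at0003:
siblings = [("at0003", "A1"), ("at0003", "A2"),
            ("at0003", "B1"), ("at0003", "B2")]
assert assign_sequence_ids(siblings) == [
    ("A1", 1), ("A2", 2), ("B1", 3), ("B2", 4)]
```

A node with a different archetype_node_id at the same level would get its own independent counter.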
Postulate two: There is no need to persist the sequence identifier across versions of the COMPOSITION.
The sequence identifier is only a weak, local and version-specific way to distinguish nodes at the same level with the same archetype_node_id. In the most extreme case, a clinically identical node might change sequence identifier across versions of the COMPOSITION.
If we follow the example from above: let’s say the user removes A2 and attaches a new instance B3. Then the sequence identifiers will be:
- A1 = 1
- B1 = 2
- B2 = 3
- B3 = 4
Note that B1 has the same content across the versions. Still, it changed sequence id from version 1 to version 2, since A2 was removed and the recalculated sequence gave new numbers.
This is needed to make the sequence identification algorithm stateless, which allows distributed and asynchronous editing of COMPOSITIONs.
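The stateless recomputation can be sketched as follows (again a hypothetical helper, not a real API): each version simply renumbers its siblings in document order, which is why B1’s sequence id shifts when A2 disappears:

```python
# Sketch of postulate two: the sequence id is recomputed statelessly per
# version, so an unchanged node can get a different number when siblings
# are added or removed.
def recompute(siblings):
    """Number the sibling nodes of one version in document order."""
    return {label: i + 1 for i, label in enumerate(siblings)}

v1 = ["A1", "A2", "B1", "B2"]
v2 = ["A1", "B1", "B2", "B3"]     # A2 removed, B3 added
assert recompute(v1)["B1"] == 3
assert recompute(v2)["B1"] == 2   # same clinical content, new sequence id
```

Because no state is carried between versions, two parties editing asynchronously can each derive the numbering from the version content alone.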
You mentioned “…increases the space cost of the overall DB…”; at the DB level we can do whatever we want, as long as when transforming DB data to an RM instance in memory it is a string, we are good. So in the DB it can be a 128-bit number, or a binary(16) like in MySQL. Postgres has a native UUID type which already stores internally as an optimized type (meaning it doesn’t store the UUID as the 36 bytes that the string version “e1fb491b-198f-496c-b5db-72261f9ddc30” would require).
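The size difference is easy to check with Python’s standard `uuid` module: a UUID really is a 128-bit number, so the binary form is 16 bytes versus 36 for the canonical string:

```python
# A UUID is a 128-bit number: 16 bytes binary vs 36 bytes as text.
import uuid

u = uuid.UUID("e1fb491b-198f-496c-b5db-72261f9ddc30")
assert len(str(u)) == 36     # canonical text form
assert len(u.bytes) == 16    # raw 128-bit form, what a binary(16) column holds

# Round-trip: the binary form loses nothing.
assert uuid.UUID(bytes=u.bytes) == u
```

So the spec-level string field and an optimized binary storage representation can coexist, with a lossless conversion between them.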
Check out this interesting post about MySQL: MySQL UUID Smackdown: UUID vs. INT for Primary Key
Yes, we could do something at the DB level. It will be a bit tricky, since the uid field can hold other kinds of id that are not convertible, or at least not easily, to integers.
It might be that in openEHRv2 we simplify that field to a Guid and then its type can be Integer, both computationally and in storage.
@bna I now understand the issue; I was focusing on the data, not on the model.
- Rephrasing: in AOM, at the OPT level, on a multiple-attribute constraint you can have two alternatives.
- These alternatives are generally used for alternative types (e.g. in events, having POINT_EVENT and INTERVAL_EVENT constraints for the same attribute), not as cloned alternatives of the same type (e.g. having two POINT_EVENT alternatives under the same multiple attribute).
- I understand that this is totally valid, and not only having two alternatives, but that the two alternatives can actually have the same archetype_node_id. I believe the model, at least AOM 1.4, wasn’t designed to support that case, and that’s why we are discussing it here. It would be nice to confirm that the case is not supported and have some kind of statement added to the AOM 1.4 spec, and maybe some patch as 1.5, since it’s used a lot.
Then yes, some kind of extra differentiator is needed, so that when you have data for any of those nodes (POINT_EVENT “A” and POINT_EVENT “B”), the data can reference the right node. Without that, no data validation is possible, since given a data set you need to know, for each data node, exactly which AOM node constrains it (considering also that if there is no constraint, that node will be valid).
- the AOM node differentiator should be defined in the archetype or template, depending on where you have the cloned constraint node
- the RM data instance should carry the node differentiator from the AOM, to be able to use it for data validation and other functions
That AOM differentiator has nothing to do with the RM instance index I mentioned above that we use on instance paths:
Those indexes are not ids but locators; they are local and don’t work across versions of the same locatable.
Now I understand that the AOM node differentiator I mentioned above is the sequence_id added to AOM2 @thomas.beale please confirm.
Again, I was not talking about the AOM but the RM above since I didn’t understand the whole picture.
Did you count UUIDs as 36 or 16 bytes? (plain text vs binary)
I believe (others can confirm or not) that most stored locatable.uid values are UUIDs, not OIDs or INTERNET_IDs. Though at the top locatable level the uid is an object version id, since we agreed on using the version.uid there. But for internal uids in locatables, it will be mostly UUIDs.
Either way, we are discussing implementation but should be focusing on the spec. If the spec says “it is recommended to use xxx”, then implementers will figure out the optimizations they need, running the numbers and costs at scale. We can’t consider all those implementation details at the spec level.
If there is a bidirectional conversion possible to optimize storage, I don’t think we need to worry about storage in the spec.