Updating directory, what's the subfolder identification strategy?

pablo · 3 December 2022 19:34

Some CDRs choose to store the FOLDER.uid for all folders inside EHR.directory.

In that case, when creating the directory, the root directory will have an OBJECT_VERSION_ID as its UID since it’s versionable, but the subfolders would have a UID of type HIER_OBJECT_ID or GENERIC_ID.

When updating the EHR.directory, the version tree part of the root’s OBJECT_VERSION_ID uid is modified. Assuming the subfolders don’t change in the update, do the subfolders of the new directory version keep the same UIDs as in the first version, or should they all have new UIDs?
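To make the root part concrete, here is a minimal Python sketch. It assumes the usual `object_id::creating_system_id::version_tree_id` layout of an OBJECT_VERSION_ID and handles only simple trunk versions (no branches). The class and method names are hypothetical and not taken from any CDR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectVersionId:
    object_id: str            # stays the same across versions
    creating_system_id: str   # system that created the version
    version_tree_id: str      # "1", "2", ...; branches are not handled here

    @classmethod
    def parse(cls, value: str) -> "ObjectVersionId":
        object_id, system_id, version = value.split("::")
        return cls(object_id, system_id, version)

    def next_trunk_version(self) -> "ObjectVersionId":
        # Assumption: a plain integer trunk version that is incremented by one.
        next_version = str(int(self.version_tree_id) + 1)
        return ObjectVersionId(self.object_id, self.creating_system_id, next_version)

    def __str__(self) -> str:
        return "::".join((self.object_id, self.creating_system_id, self.version_tree_id))


v1 = ObjectVersionId.parse("object::system::1")
v2 = v1.next_trunk_version()
print(v2)  # object::system::2
```

Only the last segment changes for the root; what happens to the subfolder uids is the open question.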

I couldn’t find that specific case in the specs, and it might be something to consider for conformance verification.

thomas.beale · 4 December 2022 15:57

I don’t see any reason any of the root UIDs would change… only the version info on the top object.

pablo · 5 December 2022 20:18

@thomas.beale In the root directory, only the version tree id part of the uid changes.

Though the question is about the subfolders, not about the root directory (EHR.directory).

The root is the top level folder.

sebastian.iancu · 6 December 2022 20:17

This depends on how it is being used, and I’m not sure if it should be constrained in a specific way. If I remember correctly, it is also not mandatory to support folders; if a system supports them, it is not stipulated what kind of folder structure should be used or for which use case.

In case the Directory tree is a single versioned FOLDER, only the top (root) level has an OBJECT_VERSION_ID uid, and any committed change anywhere in the subtree is considered a new version of the whole Directory FOLDER. In that case folders do not have to be identified by uid; you could just use their names as identifiers, which also looks much nicer when a folder is referenced (as a folder path).

But RM 1.1.0 also adds support for multiple folders, which opens the door to more use cases. In one possible variant you can create a virtual directory where all subfolders are versioned, so they all have a uid filled in. Any committed change then creates a new version only of the folder containing that change, and perhaps of its parent (depending on the implementation).

I would rather focus conformance testing (if FOLDERs/Directory are supported) on verifying that they are versioned and archetyped and, in case we support them in AQL (note that this is not yet specified), that querying with folders works as expected.

pablo · 7 December 2022 16:11

@sebastian.iancu that’s an implementation decision. My question is in the context of an implementation that actually uses UIDs on all subfolders of the EHR.directory. I prefer that approach, since I wouldn’t recommend relying on name uniqueness just for the sake of identification: it’s not reliable, and it depends on how uniqueness is checked, e.g. whether special characters are allowed, whether uppercase is considered equal to lowercase, or whether an accented letter is considered equal to the same letter without the accent (o vs. ó vs. ö). So naming is not part of the equation here.

This question is in the context of a single EHR.directory without considering the EHR.folders attribute, since I guess the solution for the first question can be applied to any top-level versionable folder in EHR.folders and their subfolders.

For simplification, just consider EHR.directory and subfolders. For instance:

1. create directory

   root => object::system::1

   subfolder 1 => uid1
   subfolder 2 => uid2

2. update directory

   root => object::system::2

   subfolder 1 => uid1 or new uid? (not changed)
   subfolder 2 => uid2 or new uid? (not changed)
   subfolder 3 => uid3 (created by the update)

Note that the update generates a new VERSION for the root, which includes the tree of all subfolders, and that subfolder 1 and subfolder 2 are not modified by the update. Logically, would those folders keep the same uids as in the first version tree, or should they have different uids because those instances are different from the instances created in step 1 (since they belong to a different version tree)?

In terms of conformance, the verification process has to follow the implementation strategy. That is: if the vendor’s Conformance Statement says they don’t support folders, that’s OK, and the RM support section of the report would mark that as “not supported”. If the Conformance Statement says it supports folders, then it should specify the folder identification strategy. If they use names, then we will need to use the test cases that use names and check whether they handle name uniqueness correctly. If they use uids to identify folders, then we are in the case I’m posting here. All cases should be considered for conformance.

What do others think? @thomas.beale @Seref @ian.mcnicoll @pieterbos @yampeku

sebastian.iancu · 7 December 2022 20:27

I certainly understand what you are looking for (regarding situation with Directory, not with folders, conformance & implementation strategy).
It is indeed an implementation choice at this time. But are you implying that we should enforce uniqueness of sibling nodes only by their uid, and not allow the use of ‘name’ for the same purpose? The consequence is that we’ll accept as valid a case where two sibling folders have the same name (but different uids), which IMO does not make much sense functionally. A further consequence is that when we support folders in AQL, they will have to be used/filtered by their uid, as their name might not be unique.

Colin_Sutton · 7 December 2022 23:22

Logically, subfolders 1 and 2 would not have a new uid: they would belong to both versions of the folder.
Analogy: my street (folder) has three houses (subfolders). If I add a house to the street, the street name does not change. If I change the street name, the houses do not change. If I want to search for a house (AQL), I should be able to find it with either street name by default, but I might want to constrain the street version to match a source document version.

pablo · 9 December 2022 18:08

Nope, I’m trying to understand how one option should work considering the current spec. If a system implements FOLDERs, with the current specs we can have three options:

1. do not constrain sibling names to be unique and do not require uid
2. require unique sibling names
3. require uid

A 4th option would be 2 and 3 together.

My question is about option 3 only: IF a system requires the uid for all subfolders in EHR.directory, then when the directory is updated, does it make more sense to keep the same subfolder UIDs, or is it OK for them to be modified?

I’m not saying we need to enforce any of them, just trying to understand what happens in each case. In fact I’m in agreement with enforcing unique sibling names, though the rules for string comparison should be given by the spec if we are going to enforce this (what I mentioned about comparing uppercase to lowercase characters, or the same character with different accentuation marks). We need to know what “unique” means in this context.
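As an illustration of why “unique” needs a definition, here is a minimal Python sketch in which the comparison policy is explicit. The function names and the two policy flags are hypothetical; nothing here comes from the openEHR specs, which currently don’t define such rules.

```python
import unicodedata


def name_key(name: str, ignore_case: bool = True, ignore_accents: bool = True) -> str:
    """Build a comparison key for a FOLDER name under a given policy."""
    # Decompose so that accents become separate combining characters.
    key = unicodedata.normalize("NFD", name.strip())
    if ignore_accents:
        key = "".join(ch for ch in key if not unicodedata.combining(ch))
    if ignore_case:
        key = key.casefold()
    # Recompose so that equal strings always yield equal keys.
    return unicodedata.normalize("NFC", key)


def has_unique_sibling_names(names, **policy) -> bool:
    """True if no two sibling names collide under the chosen policy."""
    keys = [name_key(n, **policy) for n in names]
    return len(keys) == len(set(keys))


siblings = ["episode", "Episode", "episodé"]
print(has_unique_sibling_names(siblings))  # False
print(has_unique_sibling_names(siblings, ignore_case=False, ignore_accents=False))  # True
```

The same three sibling names are either valid or invalid depending on the policy, so a conformance test for name uniqueness can only be written once the policy is fixed.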

I don’t think UIDs should be in AQL at all in the FROM or WHERE clauses. The UID, IMHO, is only to get full resources from the REST API and for internal organization/management of a CDR.

That’s two assertions in one. The statement “they would belong to both versions of the folder” implies a “delta” versioning scheme (only the updated parts are created, and the rest of the structure references the previous versions of the objects that didn’t change).

With “full copy” versioning, you will have a completely different tree for the second version of the EHR.directory. If the uid is used internally as a key in the database, I now see that “full copy” implementations can’t have the same uid for the FOLDERs that didn’t change, though the folders will have the same name. For “delta” versioning, keeping the uid values makes sense.

So now I see it really depends on how versioning is done internally! And I think the Conformance Statement of a CDR should include this information in order to verify conformance correctly, so the tests know what to expect in terms of FOLDER identification when uids are required.
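A minimal Python sketch of the two schemes, under simplifying assumptions: `Folder`, `new_version_full_copy` and `new_version_delta` are hypothetical names, uids are plain UUID strings, and the root’s OBJECT_VERSION_ID handling is left out so that only the subfolder uids are compared.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Folder:
    name: str
    # Assumption: the uid doubles as the storage key of this FOLDER instance.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    folders: list = field(default_factory=list)


def new_version_full_copy(root: Folder) -> Folder:
    """Full copy: every FOLDER in the new version is a new instance with a new uid."""
    return Folder(root.name, folders=[new_version_full_copy(f) for f in root.folders])


def new_version_delta(root: Folder) -> Folder:
    """Delta: a new root is created, unchanged subfolders are reused by reference."""
    return Folder(root.name, folders=list(root.folders))


def subfolder_uids(root: Folder) -> list:
    return [f.uid for f in root.folders]


v1 = Folder("root", folders=[Folder("subfolder 1"), Folder("subfolder 2")])

v2_full = new_version_full_copy(v1)
v2_full.folders.append(Folder("subfolder 3"))

v2_delta = new_version_delta(v1)
v2_delta.folders.append(Folder("subfolder 3"))

print(subfolder_uids(v1) == subfolder_uids(v2_delta)[:2])  # True: uids kept
print(subfolder_uids(v1) == subfolder_uids(v2_full)[:2])   # False: new uids
```

A conformance test that compares subfolder uids across versions would need to know which of the two outcomes the CDR declares.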

thomas.beale · 9 December 2022 18:41

I’m not sure why a system would need to do that - FOLDERs beneath a top FOLDER are just like CLUSTERs beneath an EVALUATION or similar.

But, if you did, then ask: what gives a FOLDER its identity? Normally, its name (but you might say: its other meta-data as well). If those things don’t change, you have the same FOLDER over time, and the UID should stay the same. If you change its name, then you are (presumably) changing its meaning, and most likely its contents. In that case, change the UID. However you might change the name in some small way, and intend to use the FOLDER as it was before, i.e. you consider it the same FOLDER as before, maybe with a more informative name. Then retain the original UID.

The general approach has to do with identity: whether the change to the FOLDER is considered just an adjustment that doesn’t change its semantics, or equivalent to a deletion followed by the creation of a new FOLDER.

This approach means that any logic that is tracking / searching on those UIDs over versions, will treat unchanging UIDs as the ‘same’ FOLDERs over time, which is probably the effect you want.

pablo · 19 December 2022 05:02

The question is not why a system would do that, but what would happen in that case. Let’s stick to the hypothesis.

I don’t think the name is the identity of a folder. Names can change while folders stay the same. Consider a file system structure: there is an identity that is managed by the operating system, and there is the structure the user sees. The internal part is for management, while the external part is for user interaction. When creating, modifying, moving and deleting things, the user interacts with the external parts, but internally things are different. In practical terms, I don’t see much difference between EHR FOLDERs and file system folders. So my question is: regardless of how the user sees the FOLDERs, if a system decides to implement the internal management based on mandatory UIDs, what do others think about the UID management?

Just for fun, consider Linux file descriptors: the name and the folder metadata are different things, and to find a file or folder some kind of translation has to be done over the path or name to get the right pointer to the folder/file contents (see “inode” on Wikipedia).

We can’t assume why a folder name changes; there are many reasons why that could happen besides a change of meaning, like fixing a typo or making the name more specific (“episode” > “episode (asthma)”).

The issue is that the system can’t guess the semantics of the change, so it can’t tell which case applies and therefore whether the UID should change or stay the same. Human intervention might be needed, but if UIDs are managed by the platform, there shouldn’t be any human intervention.

I believe normal EHR operations would use the latest versions of the FOLDERs independently of the UIDs, though detailed management of the FOLDER structure and data integrations can rely on the UIDs. Some time ago I sent a proposal to extend the FOLDER operations of the REST API so that they work with UIDs and operate on individual FOLDERs, instead of needing to update the whole EHR.directory each time a small change is needed; for this we could use the UIDs.

For now, I think the simpler approach to the UIDs is what @Colin_Sutton mentioned about maintaining them, combined with my analysis of delta vs. full copy versioning.