EHR/DIRECTORY API proposals (Pablo)

thomas.beale · 7 December 2021 01:23

The following is an annex from @pablo’s conformance documentation for the EHR service directory sub-component (i.e. the part of the EHR API for handling EHR.directory). I’m moving this here to get it out of the specification, which should just contain tests for the existing API. (@pablo - feel free to modify it any way you like).

Annex: proposal for FOLDER API

REF: Log In - Confluence

openEHR ticket SPECPR-338

Background

The current service model for EHR.directory might lead to some complexities and issues for internal implementation and for the REST API. The goal is to discuss those issues and decide upon our internal implementation rules for FOLDER in general, and particularly for EHR.directory.

Operations for the Service Model

Current operations

has_directory(ehr_id): Boolean
has_path(ehr_id, path): Boolean // path from the root EHR.directory; the idea is that this path is defined by archetypes (this is another issue, mentioned below)
create_directory(ehr_id, folder) //root directory
get_directory(ehr_id): FOLDER // this might need to be VERSION
get_directory_at_time(ehr_id, time): FOLDER // this might also be VERSION
update_directory(ehr_id, folder) // folder is the full EHR.directory modified
delete_directory(ehr_id)
has_directory_version(ehr_id, version_uid): Boolean
get_directory_at_version(ehr_id, version_uid): FOLDER // this might also be VERSION
get_versioned_directory(ehr_id): VERSIONED_FOLDER

Issues

To update, the client has to get the full EHR.directory structure and make the changes on the client side (so the management happens on the client), and the update then has to commit the whole structure with the modifications. This adds a lot of complexity on the client side and might not be the most natural way of managing an EHR.directory. (Luis: agree. FOLDERs in my view should be self-standing structures, and the system should allow updating only one of them (e.g. changing its items or details) as long as editing the FOLDER does not introduce inconsistencies in other FOLDERs.)
The has_path operation uses a path that should be defined by an archetype (mentioned by Thomas on the SEC Slack). My interpretation was that those were instance paths over the EHR.directory tree structure, which makes sense since it is impractical to have the whole EHR.directory structure defined by archetypes, and some of those FOLDERs will be created in an ad-hoc way (IMO most will be created this way, using a generic archetype for definition; this is also the approach of Code24, which has been using folders for 6 years). Also, paths are name-based, which makes them language dependent and creates the need for a constraint that sibling FOLDERs have unique names. (Luis: the uid inherited from LOCATABLE (now optional, 0…1) should be mandatory (1…1) in our implementation, for implementation reasons. At the moment it is the Primary Key in the Database, thus it is mandatory and unique.)
Also related to paths: the current spec shows name-based paths to reference internal FOLDERs, but to reference items in a FOLDER the path uses numeric indexes, which seems inconsistent. One possibility is to use the item name in the path. The issue there is that the items are really VERSIONED_OBJECTs, which don’t have a name; but VERSIONED_OBJECT.latest_version(), which is a VERSION<T>, has a name (via its data) if T is LOCATABLE, so FOLDER.item[i].latest_version().data.name could be used in the path. That creates another couple of issues: a. the name is not really from the item but from the contained data, and b. since the data could be updated, the name could change, changing the path. So the name-based path IMO is not really useful for any use case.
That last part makes me think about name-based paths for FOLDERs too: FOLDER.name could also change, since FOLDERs can be created, renamed, deleted, etc., so paths that were valid at one point could be invalid later. One idea of these paths was to use them also for AQL, but IMO it is almost impossible to get something very detailed from AQL using paths for FOLDERs, since I think most FOLDERs will be created ad-hoc and might not have a full structure defined by archetypes, only the basic structure, and maybe the new FOLDER.details structure, which could be archetyped but could also be used in an ad-hoc way.
Not issues from the operations but from the model: a. a FOLDER could have more than one parent, b. a FOLDER could have an ancestor as a subfolder. These break the tree structure, and openEHR needs to add some invariants to the model to prevent this.
We should clearly commit to implementing FOLDER directories as trees in the computational sense. The aim is to guarantee certain performance characteristics (approx. O(log n) when rearranged optimally) and to avoid possible cycles that may arise from graph-like directories. This contradicts some implementations that allow graphs to be defined virtually using the LINK class.
The operation 'has_directory(ehr_id): Boolean' makes sense for EHRs; for phenotyping in clinical research, however, the relationship may actually be the opposite. For example, a clinical study on back pain surgery may have a folder containing many EHRs rather than the other way around.
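The two model-level problems listed above (a FOLDER with more than one parent, and a FOLDER that has an ancestor as a subfolder) can be detected with a single traversal. The following is only a minimal sketch: the `Folder` class and `tree_violations` function are hypothetical stand-ins, not openEHR RM classes or invariants, and every FOLDER is assumed to carry a uid.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Folder:
    # Hypothetical, simplified stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)


def tree_violations(root: Folder) -> List[str]:
    """Return the problems that stop the structure under `root` being a tree."""
    problems: List[str] = []
    parent_of: Dict[str, str] = {}  # child uid -> uid of the first parent seen

    def visit(node: Folder, ancestors: Set[str]) -> None:
        for child in node.folders:
            if child.uid == node.uid or child.uid in ancestors:
                problems.append(f"cycle: {child.uid} is its own ancestor")
                continue  # do not descend, otherwise the walk never ends
            if child.uid in parent_of:
                problems.append(f"{child.uid} has more than one parent")
                continue
            parent_of[child.uid] = node.uid
            visit(child, ancestors | {node.uid})

    visit(root, set())
    return problems
```

A server could run a check like this before committing a directory, or avoid the problem altogether by only ever creating children under a given parent.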

Proposals for operations

has_directory(ehr_id): Boolean // MAINTAIN
has_folder(ehr_id, folder_uid): Boolean // NEW, uses uid not path
has_path(ehr_id, path): Boolean // MAINTAIN - 1. spec needs to explicitly state 'path' is an archetype path, not an instance path, 2. add an example with archetype paths to show how this operation will work; I think it looks good on paper but it can be difficult to implement
create_directory(ehr_id, folder) // MAINTAIN - discuss the EHR and whether to support self-standing FOLDERs that do not belong to an EHR.
get_directory(ehr_id): FOLDER // MAINTAIN
get_directory_at_time(ehr_id, time): FOLDER // MAINTAIN
get_folder(ehr_id, folder_uid): FOLDER // NEW, like cd + ls commands (this is optional since the information will be included in the result of get_directory). This will return the latest version of the directory provided that folder is not versioned.
create_folder(ehr_id, parent_folder_uid, new_folder) // NEW, like mkdir command, if no parent_folder_uid is provided, the new_folder will be created under the EHR.directory
update_folder(ehr_id, updated_folder) // NEW, allows to modify an individual FOLDER and what it contains, including name, details, folders and items. The updated_folder contains its uid so there is no need for an extra parameter. If subfolders are deleted in the updated folder, they are deleted in the directory as well in EHRbase.
remove_folder(ehr_id, folder_uid) // NEW, like rmdir -r (removes also subfolders and items)
add_item(ehr_id, folder_uid, versioned_object_uid) // NEW, like the touch command, adds the item to the FOLDER.items via OBJECT_REF (TODO: verify OBJECT_REF needs namespace and type values but I think those could be set to default values set on the server config so we might not need to add extra parameters for those)
remove_item(ehr_id, folder_uid, versioned_object_uid) // NEW, like the rm command, removes the versioned object reference from the FOLDER.items
delete_directory(ehr_id) // MAINTAIN, but is contained in remove_folder when it is invoked with the EHR.directory.uid as folder_uid value
has_directory_version(ehr_id, version_uid): Boolean // MAINTAIN
get_directory_at_version(ehr_id, version_uid): FOLDER // MAINTAIN
get_versioned_directory(ehr_id): VERSIONED_FOLDER // MAINTAIN
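To make the proposed folder-level operations concrete, here is a rough in-memory sketch. Everything in it is an assumption for illustration: `DirectoryService` and `Folder` are hypothetical names, one service instance stands for one EHR (so the `ehr_id` parameter is dropped), items are plain uid strings rather than OBJECT_REFs, `create_folder` only handles a new FOLDER without subfolders, and versioning is ignored.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare folders by identity, not by content
class Folder:
    name: str
    uid: str = ""
    folders: List["Folder"] = field(default_factory=list)
    items: List[str] = field(default_factory=list)  # versioned object uids


class DirectoryService:
    """Hypothetical in-memory stand-in for the directory of a single EHR."""

    def __init__(self, root_name: str = "directory") -> None:
        self.root = Folder(name=root_name, uid=str(uuid.uuid4()))
        self._by_uid: Dict[str, Folder] = {self.root.uid: self.root}
        self._parent: Dict[str, Optional[Folder]] = {self.root.uid: None}

    def has_folder(self, folder_uid: str) -> bool:
        return folder_uid in self._by_uid

    def get_folder(self, folder_uid: str) -> Folder:
        return self._by_uid[folder_uid]

    def create_folder(self, new_folder: Folder,
                      parent_folder_uid: Optional[str] = None) -> str:
        # Like mkdir: without a parent uid the folder goes under the root.
        parent = self._by_uid[parent_folder_uid] if parent_folder_uid else self.root
        new_folder.uid = str(uuid.uuid4())  # uid is always assigned server-side
        parent.folders.append(new_folder)
        self._by_uid[new_folder.uid] = new_folder
        self._parent[new_folder.uid] = parent
        return new_folder.uid

    def add_item(self, folder_uid: str, versioned_object_uid: str) -> None:
        self._by_uid[folder_uid].items.append(versioned_object_uid)

    def remove_item(self, folder_uid: str, versioned_object_uid: str) -> None:
        self._by_uid[folder_uid].items.remove(versioned_object_uid)

    def remove_folder(self, folder_uid: str) -> None:
        # Like rmdir -r: subfolders and item references go with the folder.
        parent = self._parent[folder_uid]
        if parent is None:
            raise ValueError("removing the root is not covered by this sketch")
        folder = self._by_uid[folder_uid]
        parent.folders.remove(folder)
        self._forget(folder)

    def _forget(self, folder: Folder) -> None:
        for child in folder.folders:
            self._forget(child)
        del self._by_uid[folder.uid]
        del self._parent[folder.uid]
```

Because a folder can only be created under one existing parent and its uid is assigned by the service, this shape cannot produce a folder with two parents or a cycle.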

Notes

Referencing FOLDERs by uid requires that the FOLDER.uid is set for all FOLDERs by the server. In the RM the uid is optional, so this could be an implementation constraint but still 'spec valid'.
The added operations seem to be a more natural way of managing FOLDERs and their items, like a user could do on a Linux terminal. They avoid the extra complexity of managing the whole EHR.directory on the client side when creating new FOLDERs, adding new references to items, or deleting things: instead of having one big operation, we could map one action from a user to one operation on the Service Model. The create_folder() operation could still receive a full FOLDER structure with subfolders and references to items, or just the basic data like name and details, which could then be modified using the other operations, or using the same create_folder() to add subfolders to it. That also adds more flexibility for client-side implementation.
About versioning: per the spec, the only versionable FOLDER is the EHR.directory; no internal FOLDERs can be versioned. Considering the new operations, each creation, update and removal of FOLDERs and items would generate a new version of the containing EHR.directory, so this is an implementation consideration. Either way this should be done with the current operations in the SM spec; this is just to note that individual FOLDERs shouldn’t be versioned (Code24 is versioning individual FOLDERs and they might propose a change request to make that valid in the spec, but that won’t be any time soon).
Using the parent_folder_uid to create new FOLDERs prevents the generation of non-tree structures, since a. FOLDER.uid should always be assigned by the server and b. only children of a given parent can be created.
TODO: we still need to discuss AQL requirements for FOLDERs and what will be needed to support those (from archetype modeling to internal implementation).

pablo · 7 December 2021 02:44

Thanks @thomas.beale for reviving this discussion!

The rationale behind this is to have operations that allow modifying the EHR.directory without having to provide the full directory structure for each modification, and to manage internal FOLDERs either using the concept of a path, like a file system uses paths to manage files and folders, or using the UID of the FOLDER directly (which requires a mandatory UID for FOLDERs). So the basic idea is to optimize the payloads for EHR.directory/FOLDER operations. Currently we only have EHR.directory operations, but no operations for internal FOLDERs.

Since for any of these operations the only versioned object is the VERSIONED_FOLDER for the EHR.directory, each change to an internal FOLDER will still create a new version of the whole EHR.directory, but it can be seen as a delta modification (instead of sending the whole structure, we send only the structure that was modified). This applies to creating a folder, modifying a folder (changing its name, adding or removing a reference to a LOCATABLE, etc.), or deleting a folder (and its subfolders and references).
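The "delta in, whole-tree version out" idea can be sketched as follows. This is only an illustration under assumptions: `VersionedDirectory`, `commit_delta`, `find` and `rename` are hypothetical names, and a real CDR would create proper VERSION objects with audit details rather than keep a list of copied trees.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Folder:
    # Hypothetical, simplified stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)
    items: List[str] = field(default_factory=list)


def find(folder: Folder, uid: str) -> Folder:
    """Depth-first lookup by uid; raises KeyError if the uid is absent."""
    if folder.uid == uid:
        return folder
    for child in folder.folders:
        try:
            return find(child, uid)
        except KeyError:
            pass
    raise KeyError(uid)


class VersionedDirectory:
    """Keeps every version of the whole tree; clients only send the change."""

    def __init__(self, root: Folder) -> None:
        self.versions: List[Folder] = [root]

    def latest(self) -> Folder:
        return self.versions[-1]

    def commit_delta(self, change: Callable[[Folder], None]) -> int:
        tree = copy.deepcopy(self.latest())  # earlier versions stay untouched
        change(tree)                         # apply the small modification
        self.versions.append(tree)           # the new version is the whole tree
        return len(self.versions)            # number of versions so far


def rename(uid: str, new_name: str) -> Callable[[Folder], None]:
    """Build a delta that renames one internal folder."""
    def change(tree: Folder) -> None:
        find(tree, uid).name = new_name
    return change
```

For example, `vd.commit_delta(rename("f1", "new"))` stores a second full tree in which only the folder with uid `f1` differs, while the first version still holds the old name.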

Feel free to add your comments and thoughts about this little proposal, it’s nice to move this discussion along. I believe this will add more flexibility to our API and Service Model.

sebastian.iancu · 7 December 2021 12:13

Wow @pablo you’ve done a lot of work analyzing these.

Not sure how to contribute for now, as I have several remarks and suggestions - but I’m afraid we’ll lose the overview if we do it here on Discourse. Perhaps a joint meeting might be more appropriate.

But meanwhile I could just add that with the introduction of EHR.folders in RM 1.1, a lot of the problems you mention above are solved or at least simplified. But EHR.folders is not yet “exposed” in the REST API, nor in the SM. I would prefer not to mix it with EHR.directory, which is something different.
Also, a lot of the issues you mention are, I guess, theoretical problems which I don’t think we should solve by introducing more rules, features, invariants, etc., but rather address as best practice or recommendations.

sebastian.iancu · 7 December 2021 12:30

(TB edit: I fixed this post to quote Pablo rather than me, since this is his text)

This is not true, as of RM 1.1 - check EHR IM - Folders.
Some folders can be versioned, i.e. the top (root) of each tree, while others are just subtree nodes (what you also call “internal folders”). An EHR might have more than one such versioned tree (the first being the Directory, to maintain compatibility with RMs pre 1.1). Also, even before RM 1.1, a VERSIONED_FOLDER was not necessarily the EHR.directory, whereas the EHR.directory was always a VERSIONED_FOLDER - my point is that you could have VERSIONED_FOLDERs, but you could not “reach” them from the EHR.

pablo · 7 December 2021 12:30

Thanks @sebastian.iancu, it would be wonderful to have a working meeting about this topic to discuss it in detail. I don’t have any strong opinions right now about the decisions we need to make for this to work.

sebastian.iancu · 7 December 2021 12:44

Is this a problem, or a requirement to be solved?
How is it now on a PC or Mac: if someone stores a path (a reference) to a file somewhere, and later renames one of the parent folders present in that path, will they still be able to use the stored path? Or will they have to update that path manually … ?

pablo · 7 December 2021 15:08

On file systems there are tables that store the paths and IDs of each folder and file. Names can change, changing the paths, but IDs don’t change. See inodes on Linux file systems: inode - Wikipedia

IMO the issue we have is how paths in folder instances are defined, and that references pointing to those folders could be created inside and also outside the CDR. There are two requirements: one is to have paths available at the API level to work on folders; the second would be to store paths as references to folders. I would try to solve the first requirement. If the second is needed, I would suggest just using uids.

Then we could say how paths are created, and if those paths use the names or other attributes.

Note that “paths” in this context are all data instance paths, not archetype/templates paths.
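The difference between a stored name-based instance path and a stored uid can be shown with a small sketch. The `Folder` class and both resolver functions are hypothetical and simplified; sibling folder names are assumed to be unique.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Folder:
    # Hypothetical, simplified stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)


def resolve_path(root: Folder, path: str) -> Optional[Folder]:
    """Walk a name-based instance path such as '/episodes/2021'."""
    node = root
    for segment in [s for s in path.split("/") if s]:
        node = next((f for f in node.folders if f.name == segment), None)
        if node is None:
            return None  # some segment no longer matches any folder name
    return node


def resolve_uid(root: Folder, uid: str) -> Optional[Folder]:
    """Find a folder by uid, wherever it sits and whatever it is called."""
    if root.uid == uid:
        return root
    for child in root.folders:
        found = resolve_uid(child, uid)
        if found is not None:
            return found
    return None
```

If the folder named `episodes` is renamed, `resolve_path(root, "/episodes/2021")` stops resolving, while `resolve_uid` with the uid of the `2021` folder still finds it. This is the same behaviour as a stored file path versus an inode number.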

EDIT: lot of words changed from typing on the cellphone, now fixed, sorry.

thomas.beale · 8 December 2021 19:22

Just to re-iterate a very old point: the intended structure of the content of EHR.directory or EHR.folders is just trees of FOLDERs containing refs to COMPOSITIONs (and maybe other things). The whole structure in each case is versioned, just as if it were a CLUSTER/ELEMENT tree. The formal type of the top-level object is therefore VERSIONED_FOLDER, but it doesn’t contain VERSIONED_FOLDERs or any other esoteric structures inside - it’s just a single VERSIONED_OBJECT container with each VERSION being a tree structure.

When doing a commit, the current semantics are that a new version of the tree is committed. Now, an implementation might store this as diffs (like Git does). In a smarter API, we could enable a ‘commit_change()’ kind of operation, which would be some structure or transaction that changes the tree, rather than having to supply the changed tree.

If you look at it like this then quite a few of Pablo’s operations become ‘diff’ operations, and a simpler API interface can supply the same semantics, e.g.:

commit_change (op: enum OP_NAMES, args: ...)

which then gives the ability to do things like:

commit_change (op: REMOVE_PATH, old_path: "/path/to/remove")
commit_change (op: RENAME_PATH, old_path: "/path/to/rename",  new_name: "new_name")
commit_change (op: MOVE_PATH, old_path: "/path/to/old",  new_parent: "/path/to/new/parent")
commit_change (op: ADD_PATH, target_path: "/path/to/parent",  new_child_path: "/x/y/z")
commit_change (op: ADD_ITEM, target_path: "/path/to/target",  new_item: Composition1)

There are many variations on this idea - to be discussed, if there is interest.
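One possible reading of the `commit_change` idea, as a minimal sketch: the `Op` enum, the `Folder` class and the path handling below are assumptions, only three of the listed operations are covered, and the tree is mutated in place, whereas a real implementation would commit a new version of the whole tree.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Op(Enum):
    REMOVE_PATH = auto()
    RENAME_PATH = auto()
    ADD_ITEM = auto()


@dataclass
class Folder:
    # Hypothetical, simplified stand-in for the RM FOLDER class.
    name: str
    folders: List["Folder"] = field(default_factory=list)
    items: List[str] = field(default_factory=list)


def _walk(root: Folder, path: str) -> List[Folder]:
    """Return the chain of folders from the root down to the folder at `path`."""
    chain = [root]
    for segment in filter(None, path.split("/")):
        match = next((f for f in chain[-1].folders if f.name == segment), None)
        if match is None:
            raise KeyError(path)
        chain.append(match)
    return chain


def commit_change(root: Folder, op: Op, **args) -> None:
    """Single entry point; the operation is selected by the `op` parameter."""
    if op is Op.REMOVE_PATH:
        chain = _walk(root, args["old_path"])
        if len(chain) < 2:
            raise ValueError("cannot remove the root with REMOVE_PATH")
        parent, target = chain[-2], chain[-1]
        parent.folders = [f for f in parent.folders if f is not target]
    elif op is Op.RENAME_PATH:
        _walk(root, args["old_path"])[-1].name = args["new_name"]
    elif op is Op.ADD_ITEM:
        _walk(root, args["target_path"])[-1].items.append(args["new_item"])
    else:
        raise NotImplementedError(op)
```

Each branch could equally be exposed as its own operation; the dispatch on `op` is the only thing the single-interface variant adds.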

pablo · 8 December 2021 20:13

@thomas.beale I understand your examples; my proposal is something like that, though I don’t like the idea of a single interface for multiple operations encoded as parameters. IMHO it is simpler to define and implement a single operation for each purpose.

thomas.beale · 9 December 2021 16:06

I don’t mind too much either way - am just providing it as a technical possibility for discussion.