Tags: Initial thoughts on RM representation

thomas.beale · 2 April 2020 11:55

Starting from this wiki page put together by Better/Code24/DIPS, we have some ideas about what tags look like in 3 current systems.

The job is to decide what to try to standardise on. We could just standardise on the API and a representation of a ‘tag’, and ignore where they are stored. However… if we do that, and some site decides to move some EHRs from, say, Better or DIPS to EhrBase, the tagging will most likely break because the RM-level representation won’t be compatible. So I think we have no choice but to specify it at both the RM and API levels.

For the RM level, based on the explanations on that page, and the basic assumption that tags should not change the content that they are tags for (i.e. cause any new versions of anything), a simple model would be (pseudo-code, until I do some UML):

class EHR {
    compositions[0..*]: List<OBJECT_REF>;
    contributions[0..*]: List<OBJECT_REF>;
    folders[0..*]: List<OBJECT_REF>;
    tags[0..*]: List<OBJECT_REF>;  // our new addition
    etc: ...
}

class TAG {
    key: String[1];
    value: String[0..1]; // or maybe [0..*]
    target_locator: Uri[1]; // or some other path-like thing
    target_uid: UID_BASED_ID[0..1];  // for fast access?
}

Any of the details of the TAG class can be changed. So the above would most likely mean the addition of a TAG table in the same way as COMPOSITION (or VERSIONED_COMPOSITION, depending on how the DB is done).

It would also mean that if you hard-delete or move an EHR, the tags get deleted or moved, i.e. they are ‘part’ of an EHR. However, in the above they have no versioning.
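
A minimal Python sketch of this containment idea, under stated assumptions: the `EhrStore` class, its `hard_delete` and `move` methods, and the use of plain strings as stand-ins for OBJECT_REF and UID_BASED_ID are all hypothetical and not taken from any openEHR specification. It only illustrates that tags travel with the EHR while the tagged content itself is never touched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tag:
    # Mirrors the draft TAG class; field names follow the pseudo-code.
    key: str
    target_locator: str                 # URI or other path-like locator
    value: Optional[str] = None
    target_uid: Optional[str] = None    # optional, for fast access


@dataclass
class Ehr:
    ehr_id: str
    compositions: List[str] = field(default_factory=list)  # stand-ins for OBJECT_REFs
    tags: List[Tag] = field(default_factory=list)          # the new addition


class EhrStore:
    """Hypothetical in-memory repository, for illustration only."""

    def __init__(self) -> None:
        self.ehrs: Dict[str, Ehr] = {}

    def add(self, ehr: Ehr) -> None:
        self.ehrs[ehr.ehr_id] = ehr

    def hard_delete(self, ehr_id: str) -> None:
        # Tags are 'part of' the EHR, so they disappear with it.
        del self.ehrs[ehr_id]

    def move(self, ehr_id: str, other: "EhrStore") -> None:
        # Moving the EHR carries its tags along unchanged.
        other.add(self.ehrs.pop(ehr_id))
```

Adding a `Tag` to `Ehr.tags` leaves `Ehr.compositions` untouched, which is the "no new versions" assumption in code form.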

The above draft model is more or less just inferring from the descriptions on the wiki page, and may be completely wrong. But I guess we need to start somewhere.

Comments welcome.

sebastian.iancu · 2 April 2020 12:33

Agree!

Now, something about issues I see above:

My impression is that target_uid is mandatory (in all implementations), whereas target_locator is optional, pointing to something inside the target itself, for instance a DV_* thing, etc.

Secondly, I see you added tags: List<OBJECT_REF>, which implies that TAG should also have a uid attribute?
I don’t mind having such a list accessible from the EHR object, but under the hood our tags are actually directly ‘bound’ (only) to versioned_objects (instead of the EHR).

Also (as I’m probably famous by now on this ‘topic’), does this imply tags are EHR-related? Because I think they are actually also applicable to DEMOGRAPHICS (and we do that), targeting PARTY objects.

thomas.beale · 2 April 2020 13:43

@sebastian.iancu I think all of those points are valid. Rather than trying to model too much too early, I’ll just throw in a few other possible requirements:

tag ‘groups’, i.e. named groups of tags pertaining to some purpose, e.g. ‘research project CV19’; presumably there would be a default group; how to distinguish between ‘real clinical’ tag sets and speculative / research ones?
audit trails - recording the creator, time etc of a tag and tag group
versioning, or not?
do we need to distinguish between tags that belong to an EHR; other tags that belong on the ‘EHR repository’ or ‘Demographic repository’ as a whole - maybe only the former are ‘part of the EHR’?

sebastian.iancu · 2 April 2020 15:07

Perhaps Better & DIPS see this the same way, but I will reply from the experience we have had with tags at Code24 (6 years, since about 2014):

no need for versioning or audit (rather use folders)
no need for grouping (rather use folders)
tags (and their values) have no semantic ‘payload’ (you can’t compare tags and say that one is more clinical than another), nor is there any need to distinguish between tags for the EHR and external tags; they are all just tags
tags are mainly (or ‘only’ if I’m not mistaken) used for querying purposes, a sort of index or indexing technique for versioned data.

You may argue that this last point is reason enough to have them versioned and to keep an audit trail, but I see ‘tags’ more as a requirement emerging from the implementation space, bringing performance improvements or some extra functionality while using data, and I guess they just have to be used properly. So they should not be misused and overloaded with clinical meaning or content (better to use FOLDERs instead). At least in the Code24 case, this is the way we use them, and this is the reason that tag values are always inferred from the target itself. But I’m also curious about others’ opinions…
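
A rough Python sketch of the ‘tags as an index’ idea, assuming tag values are derived from the target by simple rules. The `build_tag_index` function, the `template_rule` rule and the dict-based content are hypothetical illustrations, not a description of how Code24 actually implements this.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# A versioned object is reduced here to (uid, content dict); purely illustrative.
VersionedObject = Tuple[str, dict]
# A rule derives an optional (tag, value) pair from the target's own content.
TagRule = Callable[[dict], Optional[Tuple[str, str]]]


def build_tag_index(
    objects: Iterable[VersionedObject], rules: List[TagRule]
) -> Dict[Tuple[str, str], Set[str]]:
    """Build an inverted index: (tag, value) -> uids of matching objects."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for uid, content in objects:
        for rule in rules:
            derived = rule(content)
            if derived is not None:
                index[derived].add(uid)
    return index


def template_rule(content: dict) -> Optional[Tuple[str, str]]:
    # Example rule: tag each object with the template it was created from.
    template = content.get("template_id")
    return ("template", template) if template else None
```

Because every tag value is recomputed from the target, the index can be rebuilt at any time, which is one way to read the argument that such tags need no versioning of their own.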

thomas.beale · 8 April 2020 12:10

Some further ideas on tags modelling, based on @sebastian.iancu’s comments and my further thoughts. In terms of features, the minimum model will need to encompass the following requirements:

definitions: tag groups and tags, e.g. as you see in Discourse; the idea here is that ‘tags’ can be viewed and edited independently from the things they apply to; deleting a tag from the definitions would presumably mean deleting from all items it was used on;
tag instances: for each object to which tags can be applied (presumably LOCATABLEs), record the id/locator of that object, and all the tags currently applied to it;
tag collections: distinct collections of tag instances, that can be separately created and removed; potentially allow 0..* tag collections on an EHR; also on higher-level objects such as EHR_REPOSITORY and DEMOGRAPHIC_REPOSITORY; this would mean some tag collections were considered ‘part of’ the EHR, and others (attached to the repository object) are not;
querying: some ability to indicate which tag collections are active for querying. E.g. we don’t want ‘Bob’s crazy research project’ tags to start affecting clinical query results and what appears on screens other than in Bob’s private lab.
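
The querying requirement could be sketched as follows in Python, assuming a tag collection is simply a named set of tag instances and that a query is given the uids of the collections it should consider. The function `items_with_tag` and the string ids are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Tag:
    tag_def: str              # uid or name of the tag definition
    target_items: List[str]   # ids of the items marked with this tag


@dataclass
class TagCollection:
    uid: str
    name: str
    purpose: str
    tags: List[Tag] = field(default_factory=list)


def items_with_tag(
    collections: Iterable[TagCollection], active_uids: Set[str], tag_def: str
) -> Set[str]:
    """Return items carrying `tag_def`, looking only at active collections."""
    hits: Set[str] = set()
    for coll in collections:
        if coll.uid not in active_uids:
            continue  # e.g. a private research collection stays invisible
        for tag in coll.tags:
            if tag.tag_def == tag_def:
                hits.update(tag.target_items)
    return hits
```

With only the clinical collection active, tags from a research collection cannot leak into clinical query results, even when both use the same tag name.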

Some modelling ideas. Firstly, the definition level:

// definitions level
class TAG_GROUP {
    uid: Uid[1];
    name: String[1];
    purpose: String[1];
    // some authoring info as well?
    tags: TAG_DEF[*]; // the tags in this group
}

class TAG_DEF {
    uid: Uid[1];
    name: String[1];        // the name of the tag
    has_value: Boolean[1];  // true for tags that can have value(s)
    // some authoring info as well?
}

Now the instance level, i.e. actual tags.

// instance level
class TAG {
    tag_def: Uid[1];
    value: String[0..1];           // or maybe [0..*]
    target_item[*]: UID_BASED_ID;  // the items marked with this tag OR PATH
}

class TAG_COLLECTION {
    uid: Uid[1];
    name: String[1];
    purpose: String[1];  // purpose of this collection
    tags: TAG[*];
}

class EHR {
    ...
    tag_collections[0..*]: List<OBJECT_REF>;
    ...
}

class EHR_REPOSITORY {
    ...
    tag_collections[0..*]: List<OBJECT_REF>;
    ...
}

class DEMOGRAPHIC_REPOSITORY {
    ...
    tag_collections[0..*]: List<OBJECT_REF>;
    ...
}

Notice here that in this approach, a TAG is really the extension of a tag, i.e. the list of all items carrying that tag. With this model, deleting a tag is just deleting one of these objects; changing a tag name is changing just a single String.

We could also do it the other way around, i.e. record tags on each item, e.g.

class ITEM_TAG {
    target_item: UID_BASED_ID[1];  // e.g. some Lab result OBSERVATION
    tags: Uid[*];                  // Uids of the applied tags
}

In this approach, deleting or changing tags is more complicated - you have to iterate through all the ITEM_TAGs looking for the tags you want to remove. Changing a tag name however is still just changing a single string.
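
The trade-off between the two layouts can be sketched in Python. The dictionaries below are hypothetical stand-ins for the two storage shapes (tag uid to target items, and target item to tag uids); the function names are made up for illustration.

```python
from typing import Dict, Set


def delete_tag_tag_centric(tags: Dict[str, Set[str]], tag_uid: str) -> int:
    """Tag-centric store: tag uid -> set of target item ids.

    Deleting a tag removes one record; returns the number of records touched.
    """
    return 1 if tags.pop(tag_uid, None) is not None else 0


def delete_tag_item_centric(item_tags: Dict[str, Set[str]], tag_uid: str) -> int:
    """Item-centric store: target item id -> set of tag uids.

    Deleting a tag means scanning every item; returns the number of records touched.
    """
    touched = 0
    for tag_uids in item_tags.values():
        if tag_uid in tag_uids:
            tag_uids.remove(tag_uid)
            touched += 1
    return touched


def rename_tag(tag_names: Dict[str, str], tag_uid: str, new_name: str) -> None:
    """In both layouts the name lives once in the definitions, keyed by uid."""
    tag_names[tag_uid] = new_name
```

For a tag applied to three items, the tag-centric delete touches one record and the item-centric delete touches three; renaming touches one definition entry in either case.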

The above is not a proposal at this stage, just some modelling ideas to give us a feel of how sophisticated or not we want to be.

thomas.beale · 9 April 2020 12:29

Further requirements:

only set by applications; admins?; not end users
no need to view tags / tag sets other than by pure extraction, i.e. no definition tool / view
tags can have values (i.e. optional)
include advice in specs to say how not to misuse
need support in AQL
move an EHR = move the tags as well
can point to fine grained things - any LOCATABLE, including PARTY, FOLDER, also VERSIONED_XX objs

Questions:

should they be in serialised form of Composition etc, i.e. visible in REST APIs?
how can AQL queries that mention tags be reliably interoperable?
essential Q: how connected to Composition? Linked externally

Possible simplified model based on the above.

class EHR {
    ...
    tags[0..*]: List<OBJECT_REF>;  // points to ITEM_TAGS[*]
    ...
}

// List of all tags associated with an item
class ITEM_TAGS {
    target_item: UID_BASED_ID[1];  // can point to any LOCATABLE
    tags: ITEM_TAG[*]; 
}

// a single tag instance
class ITEM_TAG {
    tag: String[1];     // the tag
    value: String[*];   // optional value(s)
}
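
For a feel of how the simplified model might be used, here is a small Python sketch. It assumes plain strings as stand-ins for UID_BASED_ID, and the lookup helper `targets_with` is hypothetical; nothing here is a proposed API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemTag:
    tag: str
    value: List[str] = field(default_factory=list)  # optional value(s)


@dataclass
class ItemTags:
    target_item: str                                # stand-in for UID_BASED_ID
    tags: List[ItemTag] = field(default_factory=list)


def targets_with(ehr_tags: List[ItemTags], tag: str, value: Optional[str] = None) -> List[str]:
    """Return target ids carrying `tag`, optionally restricted to a given value."""
    return [
        entry.target_item
        for entry in ehr_tags
        if any(t.tag == tag and (value is None or value in t.value) for t in entry.tags)
    ]
```

A tag without values (for example a bare ‘reviewed’ marker) is simply an `ItemTag` with an empty value list.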

matijap · 10 April 2020 12:03

I think tags could (and should) be made an entity that is completely orthogonal to existing RM model (and EHR class in particular), both to support Code24 cases with demographics and to simplify things. They should have some kind of a “link” to their target entity (which might be a versioned composition, composition, an object within it, a demographics entity, etc.), a tag name and an optional tag value. That’s it.

Implementation details will be different. We will probably retain our current model (that supports tagging compositions, and interprets tagging a versioned composition as tagging all of its versions) for the foreseeable future and make a translation layer for the API. Also the way AQL queries work (in an efficient manner) with tags will be an implementation detail, although the predicates and/or functions we will define in AQL for purposes of working with tags will need to have a dependency towards the tagging spec (as opposed to the RM model which we already decided we do not want to see as a dependency in the AQL spec, except as part of the examples).
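
One possible shape for such a translation layer, sketched in Python under loud assumptions: the function `effective_targets` and the `versions_of` lookup are hypothetical, and this is not a description of any vendor's actual code. It only shows the rule that a tag on a versioned composition is read as a tag on all of its versions.

```python
from typing import Dict, List


def effective_targets(target_uid: str, versions_of: Dict[str, List[str]]) -> List[str]:
    """Expand a tag target into the version uids it effectively applies to.

    `versions_of` maps a versioned-composition uid to its version uids
    (a hypothetical lookup; a real system would consult its own storage).
    """
    if target_uid in versions_of:
        # Tagging the versioned composition is read as tagging every version.
        return list(versions_of[target_uid])
    # Otherwise the tag targets one specific version (or other object).
    return [target_uid]
```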

thomas.beale · 10 April 2020 15:45

I should have been clearer above. We would define the TAGxx classes in the Common IM (where FOLDER, PARTICIPATION, AUDIT_DETAILS etc are defined). So I think that makes them separate enough.

For the formal representation, I think you are proposing this:

class ITEM_TAG {
    target: UID_BASED_ID[1];
    tag: String[1];
    value: String[0..1];
}

The association with the EHR class in my suggestion above means that ‘EHRs may include tags’. They could be used in other places as well, e.g. some implementation-specific location that represents (say) a demographic repository. Or are you saying you don’t even want tags attached to EHRs? I thought we said yesterday that we consider tags to be a part of an EHR that would be deleted or moved with it?

sebastian.iancu · 13 April 2020 11:46

I did not reply yet, as I was wondering whether the last one(s) are a good final representation - from my point of view they are good, especially the last variant of ITEM_TAG (target, tag, value).

In our case, the collection of an EHR’s tags can always be inferred from the target above; so if you really want, there can be a function (i.e. tags()) in the EHR to retrieve them - but I’m also OK without it. ‘Moving’ or ‘deleting’ tags is cascaded as soon as targets are moved or deleted, and implicitly also when the owner of those targets (the EHR) is moved or deleted.
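
A minimal Python sketch of this inference and cascade, assuming a hypothetical `TagStore` that knows which EHR owns each target. The class and its methods are illustrative only; string ids stand in for UID_BASED_ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ItemTag:
    target: str                  # stand-in for UID_BASED_ID
    tag: str
    value: Optional[str] = None


class TagStore:
    """Hypothetical store where tags hang off targets, not off the EHR."""

    def __init__(self) -> None:
        self._owner_of: Dict[str, str] = {}   # target uid -> owning EHR id
        self._tags: List[ItemTag] = []

    def register_target(self, target: str, ehr_id: str) -> None:
        self._owner_of[target] = ehr_id

    def add(self, tag: ItemTag) -> None:
        if tag.target not in self._owner_of:
            raise KeyError(f"unknown target: {tag.target}")
        self._tags.append(tag)

    def tags(self, ehr_id: str) -> List[ItemTag]:
        # The EHR's tag collection is inferred from the targets it owns.
        return [t for t in self._tags if self._owner_of.get(t.target) == ehr_id]

    def delete_target(self, target: str) -> None:
        # Deleting a target cascades to its tags.
        self._owner_of.pop(target, None)
        self._tags = [t for t in self._tags if t.target != target]

    def delete_ehr(self, ehr_id: str) -> None:
        # Deleting the owner deletes its targets, and so, implicitly, their tags.
        for target in [t for t, owner in self._owner_of.items() if owner == ehr_id]:
            self.delete_target(target)
```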

thomas.beale · 27 April 2020 15:36

Ok troops… more on tagging. Two questions:

Do we agree that the target may be a:

VERSIONED_OBJECT<T> e.g. a VERSIONED_OBJECT<COMPOSITION>
specific VERSION<T> e.g. a VERSION<COMPOSITION>
anything else?

Do we want to add any other field to make it easier to determine what kind of thing the tag is pointing to?

Do we want value to be a single string or an Array<String>, or do we want to treat it like a JSON or other structured data field? In the latter case, do we want to introduce a lightweight type that acts like DV_PARSABLE, i.e. indicates language and value?

sebastian.iancu · 29 April 2020 10:00

My opinion:

VERSIONED_OBJECT and VERSION
just (plain) String

matijap · 8 May 2020 07:05

We also have an aqlPath in the tags so that we can target specific objects within a composition. The tags are still attached to the composition though and the AQL Engine is not very intelligent around aqlPath.

matijap · 8 May 2020 07:05

Quoting thomas.beale: “I thought we said yesterday that we consider tags to be a part of an EHR that would be deleted or moved with it?”

Sounds reasonable.

thomas.beale · 8 May 2020 13:34

I forgot the path. So we want this.

class ITEM_TAG {
    target: UID_BASED_ID[1];
    target_path: String[0..1];
    tag: String[1];
    value: String[0..1];
    owner: OBJECT_REF[1];
}

Plus a connection from EHR to ‘tags’. See here for what this looks like in the UML.

Note particularly the owner field. Putting this in the tag is an explicit DB foreign-key kind of approach, which we don’t usually do. If you take that owner away, the connection to the owning object (usually an EHR) will still be needed, but will be invisible to any interoperability view, e.g. how the data look through the REST API. We can do it either way. I’d like to get a better feel from implementers for which you think will work better.

[NB: we can call that path field aql_path if you want, but I think it can be an RM path as well].
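
The two options for the owner field can be sketched in Python. This assumes a flat dictionary as a stand-in for whatever the REST API would actually serialise; the `to_wire` function and its `include_owner` switch are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ItemTag:
    target: str                        # stand-in for UID_BASED_ID
    tag: str
    owner: str                         # stand-in for OBJECT_REF to the owning EHR
    target_path: Optional[str] = None
    value: Optional[str] = None


def to_wire(tag: ItemTag, include_owner: bool) -> dict:
    """Serialise a tag, dropping unset optional fields.

    With include_owner=False the ownership link still exists in storage
    but is invisible in the interoperability view.
    """
    data = {k: v for k, v in asdict(tag).items() if v is not None}
    if not include_owner:
        data.pop("owner")
    return data
```

Either variant keeps the same stored object; the choice only affects what a client sees and whether it can tell, from the tag alone, which EHR it belongs to.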

sebastian.iancu · 14 May 2020 11:39

In order to be consistent, can we name the owner field owner_id, as it is in VERSIONED_OBJECT?
https://specifications.openehr.org/releases/RM/Release-1.0.4/common.html#_versioned_object_t_class