Closing the two final holes in openEHR's FAIR support?

In FAIRness of openEHR Archetypes and Templates, two remaining potential holes in openEHR’s FAIR support are identified, namely FAIR principle A1 (“(meta)data are retrievable by their identifier using a standardised communications protocol”) and A2 (“metadata are accessible, even when the data are no longer available”).
I think A2 can be remedied by using published archetypes. A published resource can be deprecated from the CKM, but it can’t be deleted.

A1, however, would, if I understand it correctly, require that each resource includes the URI of the CKM where it’s maintained. Could this be added by each CKM as a new field under “other_details”? Or would something else/more be needed?

And/or ‘deleted’ could just be a logical action of moving archetypes to a special Deleted folder, like operating systems do. That folder would maintain the original structure and, for each deleted item, the date-time of deletion and maybe a reason. This would be useful anyway - a lot of things have been deleted while they were in use in research, which has caused problems in the past.
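A minimal sketch of what such a tombstone entry might look like - all names and fields here are illustrative, not an existing CKM feature:

```python
from datetime import datetime, timezone

def make_deletion_record(archetype_id, original_path, reason):
    """Tombstone entry for a hypothetical 'Deleted' folder: keeps enough
    metadata for researchers to see what was removed, when, and why."""
    return {
        "archetype_id": archetype_id,
        "original_path": original_path,   # position in the original folder structure
        "deleted_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }

# Hypothetical draft archetype being moved to the Deleted folder
record = make_deletion_record(
    "openEHR-EHR-OBSERVATION.some_draft.v0",
    "Incubators/Some Project",
    "superseded by a published v1",
)
```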

You don’t generally want an artifact to contain any information about the container / server it might be in today (what about copies?); such info should be provided separately via a web service call that specifically retrieves an artifact, but wrapped in a small info object containing that extra info.
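This separation could be sketched roughly like this: the artefact text itself stays server-agnostic, and a hypothetical retrieval service hands it back inside a small envelope carrying the server-side info (all names here are made up for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ArtefactEnvelope:
    """Retrieval-time wrapper: the artefact stays server-agnostic, while
    the envelope carries the info about where it came from."""
    adl_source: str       # the archetype text, unchanged
    retrieved_from: str   # URL of the server that answered this request
    retrieved_at: str     # ISO-8601 timestamp of retrieval

def wrap_artefact(adl_source: str, server_url: str) -> ArtefactEnvelope:
    # A real service would fetch adl_source from its store; the point is
    # only that server info lives in the envelope, not in the artefact.
    return ArtefactEnvelope(
        adl_source=adl_source,
        retrieved_from=server_url,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
    )

env = wrap_artefact("archetype (adl_version=2.0.6) ...", "https://ckm.openehr.org")
```

Copies served from a different host would then differ only in the envelope, never in the artefact itself.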


We have the original namespace and publisher as well as the current custodian in the archetype.
The idea was to then have an index linking namespaces to CKMs and, where applicable, Git repos - maintained and hosted by openEHR - I think this was the original intention of this list here: CKM Instance Details.

This page has the required information and seems to be a simple but reasonable approach given Thomas’ concerns. This page could of course also be a fancy web-service that automates what Thomas describes.
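A minimal sketch of such a namespace index, assuming a simple lookup service; the entries and URLs are illustrative, not the actual registry:

```python
# Hypothetical namespace index, as the wiki page (or a web service wrapping
# it) might expose it. These mappings are made up for illustration.
NAMESPACE_INDEX = {
    "org.openehr": "https://ckm.openehr.org/ckm",
    "uk.org.clinicalmodels": "https://ckm.apperta.org/ckm",
}

def resolve_namespace(namespace: str) -> str:
    """Map an archetype's namespace to the repository that maintains it."""
    try:
        return NAMESPACE_INDEX[namespace]
    except KeyError:
        raise LookupError(f"no registered repository for namespace {namespace!r}") from None
```

The artefact then only needs its namespace; the index, not the artefact, knows where the current home is.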

We have the mechanism for this, using the status ‘rejected’ for draft archetypes and ‘deprecated’ for already published archetypes.
As Silje says, we cannot usually delete once published, but for draft archetypes (or even archetypes in initial state in incubators), complete deletion seems unavoidable as a CKM feature. You have to draw the line somewhere, and the published state seems reasonable to me.

Thanks, Thomas and Sebastian, for your responses!

I see your point about the namespaces, but this page isn’t easily located using only the information contained in the persisted data, or even in an archetype ADL file. If I understand FAIR correctly, you should be able to find the metadata from the data itself.

Agree. We’re constantly warning people to take special care if using v0 archetypes for real data.


So I agree with all of these responses!! - we need to make that Wiki page much more visible and find a way to include other sources such as open git repos.

I suspect the problem with the FAIR expectations is that they are based on whole-dataset publication, whereas we are publishing each archetype separately, and I agree with Thomas and Sebastian that we are actually following better practice in terms of not providing the physical endpoint in the artefact.

However …! I wonder if we might compromise by adding something light like ‘Last accessed at’, in the same way as we do academic references, perhaps auto-updated? That keeps us conformant with FAIR but also provides some utility to someone wondering where an archetype has come from, without any guarantee that the site still exists, or is current. Add it as ‘soft metadata’?


To my knowledge, it’s not usually ‘real data’ that’s the problem - it’s long-term (PhD level) research… and having archetypes literally disappear is a real problem for people doing that - they have to scratch around looking for local copies etc.

In a way, visible .v0 archetypes are almost an advertisement to researchers, saying, this is ‘an idea’ of what we think is needed here, please do some work with it and make proposals for improvement…

I think if it is good enough to be posted on the CKM site (even as a ‘very draft’ artifact), it’s good enough to keep there for some time, e.g. 3y, in a ‘deleted’ bucket.


Well, different copies from different servers will have that field with a different value; how is it to be kept up to date in the long run? A namespace + well-advertised service at a known location (https://ckm.openEHR.org/something) is a much more durable solution. Note that the current approach already assumes that a/the CKM is known.

If FAIR requires literal URLs to be buried in (copies of) managed artefacts, then FAIR is broken, not openEHR :wink:

You can keep them for as long as you want - that is what the ‘rejected’ status is doing in CKM.
But there is always the case that you want to completely delete something that was e.g. uploaded accidentally.
It therefore seems a policy decision, but likely not something that can be (fully) enforced by tooling.
I don’t think that the 3y bucket offers much in terms of fulfilling the stated FAIR principle, or do the FAIR principles mention timelines for this?
I think I have also seen an Archetype Graveyard project or incubator which could be used in addition (e.g. as such a bucket). The main purpose from my view, however, is that the archetypes are then no longer owned by any particular project.

From my point of view, the new Clinical Program should look at this, and decide on and publish their terms for governing the models, keeping FAIR in mind. If this leads to something that can reasonably be supported/enforced by CKM, all the better.


AFAIK, ‘rejected’ means that the artefact has been considered and found to be deficient for its intended purpose in some way. So researchers would realise (I think) that rejected archetypes are not going to ever be published. They might still want to use them for their own needs though - it would be as if they had built a local archetype.

My understanding is that the majority of v0 archetypes are ones that no-one has gotten around to reviewing or working on, and for which no conscious assessment has been made at all. I think this is mainly what researchers have used, and for at least some, the v0 disappears when work is done to create v1 archetypes, usually after a long time.

My suggestion of the 3 years was for the latter category - archetypes that got uploaded but not subsequently worked on for some time.

I think all this can change anyway, if the Clinical Program Board wants to create a new approach for ‘development’ level archetypes, archetypes offered from industry but not yet reviewed and so on. I don’t personally have an opinion about any of that - was just reporting the experience of various researchers that were working on archetypes that subsequently disappeared.

Yes, I guess that’s more an undo operation.

Absolutely. I had better add FAIR as one of the considerations for the CPB when it gets going!


This is a very interesting topic with many implications. I’m taking into account two premises.

  • First, the openEHR CKM is the main reference for internationally validated and quality-curated archetypes. But, as things stand, there can be many other archetype repositories. Any solution devised for openEHR archetype FAIR compliance should also keep them in mind.
  • Second, CKM is the well-known and recognized archetype repository and governance system of openEHR. But it is also the name of a commercial solution. It is important to bear this in mind when thinking of assigning a unique and universal identifier to archetypes, as I discuss next.

That said, for me there are at least three main topics to discuss.

1. Identification of archetypes (and templates?). We currently have archetype ids, namespace+archetype ids, and UIDs that help to uniquely identify a version of an archetype. But they don’t help to locate them. Yes, we have URLs such as https://ckm.openehr.org/ckm/archetypes/1013.1.2881 but, will ckm.openehr.org be a valid reference in 20 years? And what happens to archetypes stored in other repositories or domains? What if an organization publishes its archetypes in GitHub or whatever exists in the future?
In other domains, such as publications, this is solved by using the standard doi.org domain, which links to, but is independent of, the URL where the publication can currently be accessed. I think openEHR should follow a similar approach: assign an openehr.org/resourceid to archetypes or other accessible resources, pointing to the place where they are published, whatever the server is.
For sure, this is not easy to maintain, and there we have the infamous experience of OIDs, but it is something to explore. The document Archetype Identification (openehr.org) is also a good starting point.

2. Long term persistence. I have little to add to what has been already discussed. It is necessary to have a policy about the minimum time of persistence, and to have rules to decide when a unique identifier and locator is assigned to archetypes in development.

3. Access and retrieval of ADL. We lack a formal REST API for the retrieval of archetypes. I know CKM has one, but it is not part of the openEHR REST API specifications. There, we can only find a service for the retrieval of templates, oriented to those that are stored in a data repository and not specifically to archetype governance systems. ISO 13606 part 5 has some ideas for that, but it needs quite a lot of real technical definition.
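As a rough illustration of points 1 and 3 together, a DOI-style resolver could keep a registry of stable ids and answer with a redirect to the current location. The registry contents and response shape here are assumptions, not an existing openEHR service:

```python
# DOI-style resolution sketch: a stable id under openehr.org maps to wherever
# the resource currently lives; the mapping, not the artefact, is updated
# when the resource moves. Registry contents are illustrative.
RESOURCE_REGISTRY = {
    "1013.1.2881": "https://ckm.openehr.org/ckm/archetypes/1013.1.2881",
}

def resolve(resource_id: str):
    """Return (status, headers) as a resolver endpoint would:
    302 with the current location if known, 404 otherwise."""
    location = RESOURCE_REGISTRY.get(resource_id)
    if location is None:
        return 404, {}
    return 302, {"Location": location}

status, headers = resolve("1013.1.2881")
```

Like doi.org, the stable id outlives any particular hosting domain; only the registry entry changes when the archetype moves.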


Absolutely

Well v0 and v1 keep the same asset identifier in CKM, so any link will still work and you can go to the revision history to see it all. I think this is a non-issue.

For v1->v2, or vice versa, we have implemented links to predecessor and successor archetype.

It is - quite a bit of work has gone into this document, inspired by CKM practical problems and solutions and vice versa.

Agree. The simplistic approach is the webpage mentioned before: using the namespace to find the correct location (whether a CKM, a different type of CKM, or just a git repository) is not tied to a particular CKM instance or other type of repository, and can change on demand. But independent of any particular tool, it is only helpful as long as we have agreed on one or more possible identifiers for the resource.

The actual resource id could also be the archetype id for example (+/- namespace).
Note that you can also just use https://ckm.openehr.org/ckm/archetypes/openEHR-EHR-CLUSTER.laboratory_test_analyte.v1 - this has the caveat that archetype ids may change before publication, while still being the same asset. This could be the same as in a file-based (git) repo.
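For illustration, the general shape of a (possibly namespaced) archetype id can be parsed like this; the pattern is a simplification of the grammar in the Archetype Identification spec, not a complete implementation of it:

```python
import re

# Simplified pattern for an archetype id:
#   [namespace::]rm_publisher-rm_package-rm_class.concept.vN
# This follows the general shape described in the openEHR Archetype
# Identification spec, but omits minor-version and other details.
HRID = re.compile(
    r"^(?:(?P<namespace>[A-Za-z][\w.-]*)::)?"
    r"(?P<rm_publisher>[A-Za-z]\w*)-(?P<rm_package>[A-Za-z]\w*)-(?P<rm_class>[A-Za-z]\w*)"
    r"\.(?P<concept>[A-Za-z]\w*)"
    r"\.v(?P<major>\d+)$"
)

def parse_archetype_id(hrid: str) -> dict:
    m = HRID.match(hrid)
    if not m:
        raise ValueError(f"not a valid archetype id: {hrid!r}")
    return m.groupdict()

parts = parse_archetype_id("openEHR-EHR-CLUSTER.laboratory_test_analyte.v1")
```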

There’s also the UID (automatically) added to each archetype that is stable for the archetype’s main version as well as the build UID which changes for each [minor, patch, unstable] revision of the archetype. Both could be used as well, depending on the use case.

In addition, for better or for worse, we have actually implemented an OID registry in CKM (for a national programme a long time ago) where you can assign OIDs for each resource as well if CKM’s “Citeable Id” (1013.1.2881 in your example) is not sufficient for external purposes.
This may be a useful starting point to link OIDs (or, with some work, external unique ids with a different syntax) to a more generic approach (in either direction) if none of the ids is sufficient.

The downside to this is of course the maintenance overhead. I think we need to be smart or this will be a theoretically sound, but practically very challenging approach - which I assume is pretty much what you have dubbed the “infamous experience of OIDs” above?


I think it would make sense to include other non-CKM peer-managed repositories on the same wiki page, though we should rework that page and make it much more visible.

Perhaps the minimum viable approach would be to ask any published workspace to have a self-assigned Custodian namespace, along with the current ‘baseURL’ - e.g. that might be a github account rather than an individual repo.

Definitely something for the Clinical program to look at.

I certainly think the Clinical Program should not restrict its remit to one tool, one specific repository instance, but rather ‘managing clinical models’ generally.

Well, anyone can do what they want of course, but I think the openEHR Clinical Program needs to develop a set of quality criteria that could be used to determine whether some Git repo or other repository really can be trusted as a source of viable clinical models - e.g. maintained; internally coherent; follows openEHR modelling patterns; etc.


I think we should expose all 3.

  1. Formal CKMs
  2. Quality assured non-CKM repos
  3. Peer-produced content - as long as it is minimally correctly namespaced, and clear that this is not formally QA’d. That can still provide valuable content for others, and a starting point for (1) and (2)