VDFAI Validation interpretation

Hi

From the Archetype Object Model 2 (AOM2):

VDFAI: archetype identifier validity in definition. Any archetype identifier mentioned in an archetype slot in the definition section must conform to the published openEHR specification for archetype identifiers.

Given that we usually have regexes in slots, I can read “conform to” in two ways:

  1. The specified regex MUST NOT allow any strings that are not valid archetype ids, or
  2. It must simply still be possible to construct a correct archetype id from the regex.

Concrete example:
Is this a valid archetype id constraint in a slot:
A)

/openEHR-EHR-CLUSTER\.my_concept.*/

or does it need to be
B)

/openEHR-EHR-CLUSTER\.my_concept(-[a-zA-Z0-9_]+)*\.v[0-9]+/

(Noting this is still a compromise, since it accepts .v01, which is not really a valid version).
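To make the difference between A and B concrete, here is a minimal Python sketch that runs both patterns against a handful of candidate ids. The candidate ids are made up for illustration, and full-string matching (`re.fullmatch`) is an assumption about how a tool would apply a slot regex; a tool that searches for a substring match would behave differently.

```python
import re

# The two slot patterns from the example above, without the surrounding slashes.
PATTERN_A = r"openEHR-EHR-CLUSTER\.my_concept.*"
PATTERN_B = r"openEHR-EHR-CLUSTER\.my_concept(-[a-zA-Z0-9_]+)*\.v[0-9]+"

# Hypothetical candidate ids, invented purely for this comparison.
CANDIDATES = [
    "openEHR-EHR-CLUSTER.my_concept.v1",
    "openEHR-EHR-CLUSTER.my_concept-child.v2",
    "openEHR-EHR-CLUSTER.my_concept.v01",
    "openEHR-EHR-CLUSTER.my_concept_other.v1",
    "openEHR-EHR-CLUSTER.my_concept",
]


def accepts(pattern: str, candidate: str) -> bool:
    """True if the whole candidate string matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, candidate) is not None


def compare(candidates: list[str]) -> dict[str, tuple[bool, bool]]:
    """Map each candidate to (accepted by A, accepted by B)."""
    return {c: (accepts(PATTERN_A, c), accepts(PATTERN_B, c)) for c in candidates}


if __name__ == "__main__":
    for candidate, (a, b) in compare(CANDIDATES).items():
        print(f"{candidate:45} A={a} B={b}")
```

Under these assumptions A accepts every candidate, including the one with no version part and the one with extra characters after my_concept, while B rejects those two but still lets .v01 through.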

The original Java reference implementation actually uses a compromise and checks a few things: that the regex contains the correct number of hyphens and dots for an archetype id, that there is a “.v” part, and that after the (last) .v the regex allows a number.

It seems to me that the first interpretation is quite strict and the second maybe too loose to be meaningful.

Any comments on what is (and also what should be) the right interpretation of this rule?

Regards,
Sebastian

It’s a good question. My view is that slot regexes are not the primary definers of correct or complete archetype ids; we assume that is done elsewhere, e.g. in the parser rules / AOM type defs of ARCHETYPE.archetype_id. In fact, if the definition of a correct archetype_id were to change slightly, e.g. to allow some v1.2.3+xxx pattern or whatever, and slot regexes were performing the maximal strict check, then all those slot regexes would instantly break when you made such a change.

So I would say that slot regexes should be doing a ‘minimal match’, not a maximal validation.

I tend to agree, but a few comments:

  1. I have no real idea how VDFAI could then be implemented, generally speaking, for such a regex slot.
  2. It is easy to mix up the regexes for “allow all versions” with “allow all versions and anything that comes after my_concept”. In the above example A: is my_conceptBLAXYZ.v1 allowed, or only my_concept.v1? And if you allow the first, is that really what you wanted to express with your regex?

I think these two points may have been the reason (in the Java Ref Impl) for at least doing some cursory checking of

  • a “.v” being present and
  • the correct number of (escaped) dots, so that at least the three main parts of the id can be identified, etc.
  • (maybe) the correct total number of hyphens between rm publisher, rmpackage and class: openEHR-EHR-CLUSTER

But with “minimal match” conformance, not even that can then be reported as part of VDFAI.
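For illustration, the kind of cursory checking listed above could look roughly like the following Python sketch. This is not the Java reference implementation's code; the function name `cursory_slot_regex_check` and the exact rules are assumptions, and the checks inspect the regex source text only, so they are deliberately crude.

```python
import re


def cursory_slot_regex_check(pattern: str) -> list[str]:
    """Return a list of problems found by looking at the regex text itself.

    Hypothetical heuristic: it never compiles or runs the regex.
    """
    problems = []

    # The three main id parts are separated by two literal (escaped) dots.
    escaped_dots = len(re.findall(r"\\\.", pattern))
    if escaped_dots < 2:
        problems.append("fewer than two escaped dots")

    # A version part introduced by an escaped dot followed by 'v'.
    if r"\.v" not in pattern:
        problems.append("no '\\.v' version part")
    else:
        # After the last '\.v' the regex should allow a number.
        tail = pattern.rsplit(r"\.v", 1)[1]
        if not re.match(r"\[0-9\]|\\d|[0-9]", tail):
            problems.append("no number allowed after the last '\\.v'")

    # Publisher, RM package and class are joined by exactly two hyphens.
    head = re.split(r"\\\.", pattern, maxsplit=1)[0]
    if head.count("-") != 2:
        problems.append("expected two hyphens before the first escaped dot")

    return problems


if __name__ == "__main__":
    for p in (
        r"openEHR-EHR-CLUSTER\.my_concept.*",
        r"openEHR-EHR-CLUSTER\.my_concept(-[a-zA-Z0-9_]+)*\.v[0-9]+",
    ):
        print(p, "->", cursory_slot_regex_check(p))
```

A check like this would flag example A (one escaped dot, no version part) and pass example B, which is exactly the sort of report that a pure “minimal match” reading no longer justifies.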

This is ADL1.4-ish stuff…!

Ideally, we would be doing no lexical matching (other than a direct hit), and would instead rely on the << operator etc. to allow children or this+children, and I think we will move to this at some point.

In the practical world, I think that perhaps a middle-of-the-road regex is ok for now.

Thanks Thomas.
Yes, kind of 1.4 but I wasn’t (only) talking about children: my_conceptBLAXYZ.v1 vs my_concept-BLAXYZ.v1 .

When specifying a constraint, this is what AD allows you to explicitly specify (“all” or “only specialisations” or “nothing”), plus the choice to add a version or not.

With “minimal match” conformance, a regex like

openEHR-EHR-CLUSTER.ww.*

would be allowed (or maybe better: not prohibited), but certainly makes it harder to meaningfully check if encountered in the wild. [Note: As part of the regex it is .* at the end, not just *]

Here: Did the creator really want to express that anything after ww is ok, including for example:

openEHR-EHR-CLUSTER.wwmytest.v0
openEHR-EHR-CLUSTER.ww-myspecialisation.v5
openEHR-EHR-CLUSTER.ww.v1
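A quick way to see what this regex admits is to run it, as in the Python sketch below. Full-string matching is assumed here, and the fourth id is a made-up string added only to show the effect of the unescaped dot before ww.

```python
import re

# The "minimal match" slot regex from above, exactly as written (dots unescaped).
SLOT_REGEX = r"openEHR-EHR-CLUSTER.ww.*"

# The three ids from the post.
IDS_FROM_POST = [
    "openEHR-EHR-CLUSTER.wwmytest.v0",
    "openEHR-EHR-CLUSTER.ww-myspecialisation.v5",
    "openEHR-EHR-CLUSTER.ww.v1",
]

# Hypothetical extra string: 'X' where the first dot should be.
NOT_AN_ID = "openEHR-EHR-CLUSTERXww.v1"


def fits(pattern: str, archetype_id: str) -> bool:
    """True if the whole string matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, archetype_id) is not None


if __name__ == "__main__":
    for archetype_id in IDS_FROM_POST + [NOT_AN_ID]:
        print(archetype_id, fits(SLOT_REGEX, archetype_id))
```

All four strings match: the three ids because of the trailing .*, and the last one because an unescaped dot stands for any character.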

Well that example regex is clear on what it allows: any of your 3 ids would fit. I am inclined to think we should work on the basis that tools that enable archetype ids to be stated as slot fillers will themselves ensure that the ids are legal archetype ids, and ideally that the ids actually exist in the current archetype library or repository. This would imply that more minimal regexes are ok.

The corollary of this is that people writing regexes by hand in archetype slot definitions (surely a tiny minority of hard-core obsessives :wink:) are required to know what the hell they are doing!

I had a look at what validation code I have for VDFAI in the ADL workbench - turns out I never implemented it… maybe I am too trusting :wink:

Yes, no doubt. So if tooling is used and gets it right, ok.

I think you would have a hard time writing it: we can of course easily check if a given string fits a certain regex, but checking if any regex you can possibly define can match at least one valid archetype id is at least hard, and maybe impossible to do efficiently, generally speaking. At least I do not have a good idea for that and think this leaves some very minimal validation based on looking at the regex and typical patterns used.
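The easy direction, checking given strings against a regex, can at least be used as a cheap stand-in: test the slot regex against the ids that actually exist in a library and warn if none fit. The Python sketch below does only that; it is not a general decision procedure for "can this regex match some valid id", and the function name and the library contents are assumptions made up for the example.

```python
import re


def ids_matched_by_slot(slot_regex: str, library_ids: list[str]) -> list[str]:
    """Return the library ids that the slot regex accepts as a whole string."""
    compiled = re.compile(slot_regex)
    return [i for i in library_ids if compiled.fullmatch(i)]


# Hypothetical library contents, for illustration only.
LIBRARY = [
    "openEHR-EHR-CLUSTER.ww.v1",
    "openEHR-EHR-CLUSTER.ww-myspecialisation.v5",
    "openEHR-EHR-OBSERVATION.blood_pressure.v2",
]

if __name__ == "__main__":
    slot = r"openEHR-EHR-CLUSTER\.ww(-[a-zA-Z0-9_]+)*\.v[0-9]+"
    hits = ids_matched_by_slot(slot, LIBRARY)
    if hits:
        print("slot can be filled by:", hits)
    else:
        print("warning: no archetype in the library fits this slot")
```

An empty result does not prove the regex is wrong, only that nothing in the given library fits it, so this is better suited to a warning than to a VDFAI error.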

Regexes should be “contracts”, and they can be built in a very user-friendly way (the interface can be as easy as a series of checks).
I remember doing a while ago a review of all the archetypes in the CKM to clean it of wrongly defined regexes, but maybe more have popped up in the meantime.