Multi RM JSON schema validation and current schema issues

pablo · 17 September 2022 02:28

I’ve been playing around with the JSON Schemas and found some issues in the published ones.

First, the entry points to validate the root nodes are missing some types (ITEM_LIST, ITEM_SINGLE, ITEM_TABLE, ELEMENT, HISTORY, etc.)

Then some types for OBJECT_REF could be constrained. For instance, EHR.ehr_access and EHR.ehr_status have PARTY_REF available as a possible type, which won’t happen in practice, and I believe ACCESS_GROUP_REF won’t happen for EHR.ehr_status either. There are other examples of this in the schemas: every OBJECT_REF item allows all the reference types. The same applies to EHR.directory, EHR.compositions and EHR.contributions.
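To make the proposed tightening concrete, here is a minimal Python sketch. It assumes, hypothetically, that an OBJECT_REF-typed property is expressed in the schema as a `oneOf` list of `$ref` entries pointing into `#/definitions/...`; the published schemas may be laid out differently. `restrict_ref_types` is an illustrative helper, not part of any openEHR tooling.

```python
import copy


def restrict_ref_types(prop_schema, allowed):
    """Return a copy of prop_schema that keeps only the oneOf alternatives
    whose $ref points at a type named in `allowed`."""
    restricted = copy.deepcopy(prop_schema)
    restricted["oneOf"] = [
        alt for alt in restricted.get("oneOf", [])
        if alt.get("$ref", "").rsplit("/", 1)[-1] in allowed
    ]
    return restricted


# Assumed shape of EHR.ehr_status: every OBJECT_REF subtype is allowed.
ehr_status = {
    "oneOf": [
        {"$ref": "#/definitions/OBJECT_REF"},
        {"$ref": "#/definitions/LOCATABLE_REF"},
        {"$ref": "#/definitions/PARTY_REF"},
        {"$ref": "#/definitions/ACCESS_GROUP_REF"},
    ]
}

# Drop PARTY_REF and ACCESS_GROUP_REF, following the reasoning in the post.
tightened = restrict_ref_types(ehr_status, {"OBJECT_REF", "LOCATABLE_REF"})
```

The original property schema is left untouched, so the same helper could be applied per attribute with a different allowed set.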

In some schemas EHR.contributions is required while in the RM it’s not.

DV_QUANTITY had a property attribute that is not currently in the specs. I checked the DV spec history and it seems that attribute was there but removed for v0.9 (like 15 years ago!).

In LOCATABLE there is an assertion Archetyped_valid: is_archetype_root xor archetype_details = Void, and I think types COMPOSITION, EHR_STATUS, FOLDER, and subtypes of PARTY (ROLE, PERSON, etc.) will be archetype roots, so why not add archetype_details as required for those types?
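As a rough illustration of that suggestion, the Python sketch below adds archetype_details to the `required` list of the archetype root types. It assumes a hypothetical `definitions` dictionary keyed by RM class name, and the set of root types only lists the classes named above; the remaining concrete PARTY subtypes would need to be added.

```python
# Types assumed to always be archetype roots (the concrete PARTY
# subtypes beyond PERSON and ROLE are deliberately left out here).
ARCHETYPE_ROOT_TYPES = ["COMPOSITION", "EHR_STATUS", "FOLDER", "PERSON", "ROLE"]


def require_archetype_details(definitions):
    """Mark archetype_details as required on every root type that is
    present in `definitions` (modifies and returns the same dict)."""
    for name in ARCHETYPE_ROOT_TYPES:
        if name not in definitions:
            continue
        required = definitions[name].setdefault("required", [])
        if "archetype_details" not in required:
            required.append("archetype_details")
    return definitions
```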

The question above has a second consideration: the rm_version field is in archetype_details, and without it there is no way to know which schema version to use if a system supports more than one RM version. So IMO archetype_details should be mandatory, so that any system can look it up before validating and then choose the right JSON schema version to validate the instance.

Making archetype_details mandatory in the schema doesn’t really solve the problem, because a system has to access archetype_details BEFORE the validation is executed, but it does help clarify that point to implementers. So the existence of archetype_details should be checked by code before the JSON Schema validation is executed. This of course assumes a system can handle multiple RM versions. I guess most systems just assume JSON objects will comply with one specific version supported by the system, and they will have rules to check for that. I’m more concerned about Conformance Verification, though, and for that I need to use different RM versions.
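The pre-validation lookup described above could look roughly like the following Python sketch. The function name and the set of supported versions are assumptions for illustration; only the archetype_details.rm_version path comes from the RM.

```python
import json

# Hypothetical list of RM versions this system has schemas for.
SUPPORTED_RM_VERSIONS = {"1.0.2", "1.0.3", "1.0.4", "1.1.0"}


def pick_rm_version(payload):
    """Read archetype_details.rm_version from a JSON string so the caller
    can choose a schema BEFORE running JSON Schema validation."""
    instance = json.loads(payload)
    if not isinstance(instance, dict):
        raise ValueError("expected a JSON object at the root")
    details = instance.get("archetype_details")
    if not isinstance(details, dict):
        raise ValueError("archetype_details is missing; cannot choose a schema version")
    version = details.get("rm_version")
    if version not in SUPPORTED_RM_VERSIONS:
        raise ValueError(f"unsupported or missing rm_version: {version!r}")
    return version
```

The returned version would then be used to load the matching schema file, and the schema validation itself runs as a second step.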

DV_EHR_URI and DV_URI are missing the value attribute as required.

Then I modified the 1.0.3 schema to be compliant with 1.0.2, and used that 1.0.2 schema to create a schema that validates the openEHR REST API JSON payloads, which is different from the schema for validating RM JSON instances (e.g. EHR.ehr_status is EHR_STATUS instead of OBJECT_REF).

I will check the 1.0.4 and 1.1.0 schemas in the next few days.

UPDATE:

In JSON Schema 1.0.4 there is an ARCHETYPE_HRID class. It seems to be a class from AOM2, not from the RM; should it be in the RM JSON Schema?

Found ISO8601_TYPE on the schemas, that class is abstract. Other abstract RM classes are not in the schemas, should that one be there?

Schemas 1.0.4 have also URI which is abstract, same question as above.

EXTRACT_ENTITY_MANIFEST.other_ids in schema 1.0.3 is “object” while in 1.0.4 it’s array of string, which is the most accurate for the RM type other_ids: List.

RESOURCE_DESCRIPTION_ITEM.other_details is optional in the RM but required in the schemas.

RESOURCE_DESCRIPTION_ITEM.language is CODE_PHRASE in the RM (1.0.3, 1.0.4) but in the 1.0.4 schema has type TERMINOLOGY_CODE

FOLDER.details is in the 1.0.4 Schema but was added in the RM 1.1.0, so should be removed from the 1.0.4 Schema.

GENERIC_CONTENT_ITEM.other_details is a Hash<String, String> in the RM. In the 1.0.3 schema it is “object”, but in the 1.0.4 schema it’s “array of string”, which I don’t think represents a map.

TRANSLATION_DETAILS.language is CODE_PHRASE in RM 1.0.3 and 1.0.4, but in schema 1.0.4 its type is TERMINOLOGY_CODE.

ACTIVITY.action_archetype_id is mandatory in the RM but is not required in schemas 1.0.3 and 1.0.4.
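For reference, a string-to-string map is usually expressed in JSON Schema as an object whose additional properties are strings. The Python sketch below shows that schema fragment next to a hand-written check with the same meaning; both names are illustrative, and this is not taken from the published or fixed schemas.

```python
# JSON Schema fragment for a Hash<String, String>: any keys, string values.
STRING_MAP_SCHEMA = {
    "type": "object",
    "additionalProperties": {"type": "string"},
}


def is_string_map(value):
    """Hand-rolled equivalent of STRING_MAP_SCHEMA for parsed JSON:
    a JSON object (dict) whose values are all strings."""
    return isinstance(value, dict) and all(
        isinstance(v, str) for v in value.values()
    )
```

An “array of string” such as `["a", "b"]` fails this check, since it carries no keys at all.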

pablo · 19 September 2022 12:14

Current schemas are defined here openEHR - JSON Schemas (ITS-JSON) Component - latest

Fixed schemas can be found here openEHR-OPT/src/main/resources/json_schema at master · ppazos/openEHR-OPT · GitHub

pablo · 20 September 2022 21:34

Open questions from my review:

VERSIONED_OBJECT.owner_id is “Reference to object to which this version container belongs, e.g. the id of the containing EHR or other relevant owning entity.” (a somewhat vague definition; in EHRServer it is actually the EHR id).

In the schemas that can be OBJECT_REF or any of its subclasses: LOCATABLE_REF, PARTY_REF or ACCESS_GROUP_REF. If owner_id IS actually the EHR, then only the OBJECT_REF type is needed in the schema: the EHR is not LOCATABLE, so the reference is not a LOCATABLE_REF; it is certainly not a PARTY, so it wouldn’t be a PARTY_REF; and it’s not an access group, so no ACCESS_GROUP_REF is needed there. But the definition leaves the door open to other possible owners, so I don’t know if the other possible types can be safely removed from the schemas.

What would be the role of the EXTRACT types in the API JSON schemas?

That is: X_VERSIONED_OBJECT, X_VERSIONED_EHR_STATUS, X_VERSIONED_PARTY, X_VERSIONED_COMPOSITION, X_VERSIONED_EHR_ACCESS.

Do we have an API for that? If not, I would remove those types from the API schemas.

For the ENTRIES the workflow_id and protocol_id are OBJECT_REF in the RM. I guess no protocol or workflow will be represented as a PARTY class or as an ACCESS_GROUP class, so I would like to remove the possible types in the schemas for all the entries for those attributes, that is: PARTY_REF and ACCESS_GROUP_REF.

If ORIGINAL_VERSION.data and IMPORTED_VERSION.data are just object in the schema, when we have a VERSION of something, the schema won’t validate the data. IMO it should match one of the versionable types, even though the RM leaves that open to any type T, we know it will be COMPOSITION, FOLDER, EHR_STATUS or the concrete subclasses of PARTY (maybe EHR_ACCESS too).
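A sketch of the check that a tighter schema would perform on VERSION.data, written in plain Python. It assumes the `_type` discriminator used in openEHR JSON, and the set of versionable types is the one guessed at above (with only PERSON and ROLE standing in for the concrete PARTY subtypes), not a list taken from the specifications.

```python
# Assumed set of types that can appear as VERSION.data.
VERSIONABLE_TYPES = {
    "COMPOSITION", "FOLDER", "EHR_STATUS", "EHR_ACCESS", "PERSON", "ROLE",
}


def data_type_is_versionable(data):
    """True if `data` is a JSON object whose _type names a versionable type."""
    return isinstance(data, dict) and data.get("_type") in VERSIONABLE_TYPES
```

In a schema, the same restriction would replace the bare “object” with a choice between the definitions of those types.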

pablo · 21 September 2022 04:57

Here you can find all the fixed JSON schemas, for all the RM versions, and also the API “flavor” of the JSON schemas. openEHR-OPT/src/main/resources/json_schema at master · ppazos/openEHR-OPT · GitHub

pieterbos · 31 October 2022 13:59

Most of these can be easily fixed in the BMM or the generator. So I think we should.

I will add the entry points to the generator so they get generated.

Object_ref/party_ref: I will check later, need a bit more time

DV_QUANTITY; these BMMs are used to validate archetypes, where a property is often still present… So removing it will invalidate a lot of archetypes. Maybe we should remove it in the generator?

Locatable: archetype_details: sounds like a good idea. Maybe we should also add that to the BMM?

DV_EHR_URI: that’s odd, BMM is correct. I will check the generator

ARCHETYPE_HRID: that is from base, and in BMM in base 1.0.0 and that is used in 1.0.3 and 1.0.4. That is why it is present.

ISO8601_type is not marked abstract in the base bmm (1.0 and 1.1). Should that be changed?
URI is also not marked abstract in the base BMM

EXTRACT_ENTITY_MANIFEST.other_ids is a hash<String, String> in both BMM versions (well, sort of, missing the key type in one of the BMMs, hence the difference)

RESOURCE_DESCRIPTION_ITEM.other_details should be optional indeed, is a mistake in the BMM

RESOURCE_DESCRIPTION_ITEM is TERMINOLOGY_CODE in base (Resource Model ) but code phrase in the RM (Common Information Model). Since AFAIK this is only used in the AOM and that uses TerminologyCode, should we leave it TerminologyCode or change it?

folder details should be fixed in the BMM

GENERIC_CONTENT_ITEM.other_details is missing the key type in 1.1.0 and 1.0.4 BMM, has it correctly in 1.0.3, needs a fix in BMM or perhaps an assumption on key type in the generator

TRANSLATION_DETAILS.language: same problem as RESOURCE_DESCRIPTION_ITEM: difference between base and RM, and afaik never used directly from RM, always from base, so we probably should fix Base?

ACTIVITY.action_archetype_id: good catch. I remember this was a discussion before. Does anyone remember this?

pablo · 31 October 2022 16:59

@pieterbos

In AOM, property is an attribute of C_DV_QUANTITY, not of DV_QUANTITY; see page 17 of https://specifications.openehr.org/releases/1.0.2/architecture/am/openehr_archetype_profile.pdf

I don’t think so, look at base/foundation and base/base and there is no such type (not in 1.1.0 or 1.2.0):

It’s an AOM2 type: Archetype Object Model 2 (AOM2)

pieterbos · 31 October 2022 17:24

In AOM 2, there is no C_DV_QUANTITY, so not possible to use that. I can just mark the property as to be ignored in the configuration of the generator and it will be removed, that should fix it.

Archetype hrid is in the base BMM. That should be moved to AOM 2 BMM then.

thomas.beale · 31 October 2022 21:22

Given that the BMMs could be used for many purposes (archetype checking is just one), we should really remove property from DV_QUANTITY. But I won’t just yet, since things will break. It might actually be that we have to give in and create an ADL2 special type for Quantity (a bit like C_DV_QUANTITY in ADL1.4)

Don’t understand this one - archetype_details is already in LOCATABLE in the BMMs.

Yes… fixed.

It shouldn’t be, if you mean the ‘Uri’ type.

1.0.4 and 1.1.0 were wrong; 1.0.3 was ok. Odd! Fixed now…

The 1.0.2 one was wrong; fixed now.

(You meant the language attribute I think)
I’d leave the two versions different, since that’s what the two versions in the source specs say, i.e. they do use those different types. Annoying I know - but does it pose a concrete problem?

Not sure what the problem is here - it is a new attribute in RM 1.1.0 only.

I fixed the BMM. BTW, these Hash<> properties really should be of meta-type P_BMM_INDEXED_CONTAINER_PROPERTY, as per the spec, but I don’t think Archie supports this yet, so we didn’t change the BMMs - these fields are all faked with P_BMM_GENERIC_TYPE. That’s ok for now.

These differences correctly reflect what is in the specifications. It’s annoying that they are different, but are they causing a problem in the schema generation? Remembering that the schemas generated for specific versions of RM, BASE etc should be different if the underlying models are. Or maybe we are trying to erase some differences - happy to discuss.

I probably didn’t get all the errors, but see this commit - should fix a fair few.

thomas.beale · 31 October 2022 22:35

Yep. It should actually be moved to BASE, but we have not yet done that, so I’ve corrected the BMMs on this. See this commit.

BTW some of these errors may have caused @borut.jures problems as well since he is relying on generation from BMM as well. Might be worth checking…

pablo · 1 November 2022 00:33

C_DV_QUANTITY is part of the Archetype Profile spec that extends AOM 1.4 C_DOMAIN_TYPE.

I don’t know how that works in AOM 2.

In any case, property for DV_QUANTITY has nothing to do with the RM, so IMO it shouldn’t be in the JSON Schemas.

Maybe there is a mismatch between the spec and the BMMs there. I thought the specs were generated from the BMMs (table definitions and UMLs). If moving the definition in the BMMs solves the issue, that’s great.

pablo · 1 November 2022 00:38

I guess he is pointing to:

@thomas.beale see my first message about this item.

pieterbos · 1 November 2022 13:29

Yes, I am pointing to Pablo’s first message about locatable and archetype details being mandatory in practice for several classes, due to the combination of invariants on those classes. The question is, should we enforce it in json schema?

The problem right now is that in both cases, the base BMM is used. This means the RM has the same thing in json schema as the AOM. While the specification has two different versions. So, the question is, which should we use? My question was, are these used outside of the AOM? If not, we can safely stick to the version from Base/AOM. If the other version is actually in use, we may not be able to do that, and may have to change the BMM. Since that will involve steps such as moving this class out of base, or creating a specific base BMM file to solve this, I would prefer not to do that.

thomas.beale · 1 November 2022 14:32

Well, the current version of BMM doesn’t have the invariants in it, so you are looking at either including it manually somehow in JSON schema, or not enforcing it. I’d be inclined to leave it out. At some point in the future a new version of BMM will have the invariants, and a revised schema generator will use them. Anything we leave out of the schemas today can’t be used to validate data, of course, so full validation will always rely on some deeper layer, e.g. in Java or whatever.

So the question is really if we want to validate data 100% in JSON schema and nothing else. That’s got its attractions, since it would be nice to know that everything can be checked in one hit.

If we want those invariants in BMM today to enable that, I can find a way to do that, it’s not too hard, so we can explore it.

Just to be clear, we are talking about AUTHORED_RESOURCE and its subordinate classes. The version in BASE is used by AOM2, and the older version in RM (Common IM) is used in ADL/AOM1.4. AUTHORED_RESOURCE (the one in BASE) is also used by Task Planning, and could be used in other places - it doesn’t contain any archetype-specific semantics.

I would have done this differently if I had thought ADL1.4 was never going to die. Ideally, everyone would upgrade their AOM1.4 software to use the BASE version, but I don’t know if that will actually happen.
I’m not even sure if it would cause a problem to just use the one from BASE and ignore the other one. We are talking about meta-data fields that would only appear in OPTs, in the context of the REST API. Anyway, the differences are:

in the BASE version, the classes contain more fields, but they’re all optional
the language attribute in 3 classes is CODE_PHRASE in the RM (old) version and Terminology_code in the BASE version
the RESOURCE_DESCRIPTION.lifecycle_state attribute is of type String in the RM version and Terminology_code in the BASE version.

See above. If the aim is just to support ADL1.4 OPTs in the REST API, then the RM version is the one to use.

pieterbos · 1 November 2022 15:59

thomas.beale:

Just to be clear, we are talking about AUTHORED_RESOURCE and its subordinate classes. The version in BASE is used by AOM2, and the older version in RM (Common IM) is used in ADL/AOM1.4. AUTHORED_RESOURCE (the one in BASE) is also used by Task Planning, and could be used in other places - it doesn’t contain any archetype-specific semantics.

I would have done this differently if I had thought ADL1.4 was never going to die. Ideally, everyone would upgrade their AOM1.4 software to use the BASE version, but I don’t know if that will actually happen.
I’m not even sure if it would cause a problem to just use the one from BASE and ignore the other one. We are talking about meta-data fields that would only appear in OPTs, in the context of the REST API. Anyway, the differences are:

in the BASE version, the classes contain more fields, but they’re all optional

the language attribute in 3 classes is CODE_PHRASE in the RM (old) version and Terminology_code in the BASE version

the RESOURCE_DESCRIPTION.lifecycle_state attribute is of type String in the RM version and Terminology_code in the BASE version.

Do I understand correctly that task planning instances can be used together with RM content in a single json file? If so, I think the RM should use the same version as task planning uses, which is the same version as the AOM 2.

AOM 1.4 is a separate json schema, that we do not yet have. That will not be mixed with RM content, and the RM does not reference these classes anyway, so that should be no problem? OPT 1.4: not sure.

pablo · 1 November 2022 16:01

If the BMMs should be the source of truth in terms of the RM, the invariants should be there too. If not, we are missing part of the metadata of the model.

For me validation is 100% or nothing, RM IMHO is: classes, attributes, relationships and invariants/rules.

Validation in practice:

When exchanging openEHR data, the JSON/XML schema-based validation should catch everything that is violating the RM, so the JSON/XML can be parsed securely into an RM object instance. Then the second validation level is validating the object against the OPT constraints. So all the errors should be caught between those two levels. But I don’t think RM invariants should be validated at the second level, since when you have an RM object instance, I would expect that to comply with the RM (including invariants).

thomas.beale · 1 November 2022 16:25

TP instances can have RM instances, e.g. various DV types, PARTY_PROXY, so yes, can be mixed.

In fact, it is only TP and AOM that use AUTHORED_RESOURCE - the RM defines in it the Common IM (the old version), but doesn’t use it internally.

Ideally we would move that old version out of RM/Common IM into the AOM1.4 model, which is the only place that old version is needed.

But even if we did that, it would only appear in the latest versions of the relevant BMMs - the older ones will still have this problem. It’s obviously easy to hack the older BMMs to look different to the models they are based on, and that could be done to make JSON schema generation happier, but then someone is maintaining a fork of BMMs somewhere…

Indeed - hence the reason for me developing the new version of BMM that includes a full Expression language that supports those invariants. But the current version of BMM in use doesn’t include those upgrades.

sebastian.iancu · 10 November 2022 10:39

Hi,
I’m behind catching up with a lot of things on discourse, but this long thread does not help… Although intentions are all really good (thanks @pablo and @pieterbos), I would like to mention that ITS-BMM is an official component, and even in DEV state, as it is already in-use (by Archie?), editing it should still follow the maintenance lifecycle with JIRA issues, SEC approval, etc.

pieterbos · 10 November 2022 11:33

Of course Sebastian - I said this merely to indicate that ITS-BMM contains issues, and that I would prefer to fix them at the source rather than fixing the json schema generated from it manually. I assume we can make patch-releases to the BMM if it deviates from the specification, which it actually does at the moment.

pieterbos · 10 November 2022 11:39

Also for validating invariants, apart from terminology code invariants, I do not think an expression language will help in generating json schema from the BMM, and I think another approach will be necessary if we want to include these in a next version of the json schema.

thomas.beale · 10 November 2022 16:51

In the ‘dev’ state, we don’t have to follow the PR / CR workflow… however, it’s a question as to what state the BMM ITS should be in - I would think ‘trial’ would be more appropriate. And then your comments apply

But I do think we should be able to shortcut a bit when we agree that bugs are being fixed, and just patch directly. I take the blame for not having been super-disciplined on this - I’ve preferred just to fix any errors ASAP, but I agree we need to be more careful on documenting what we are doing.