Validation of REST API payloads against JSON/XML schemas

pablo · 13 December 2021 17:01

Coming back to an old topic. Our current JSON/XML schemas reflect the structures in the RM. However, the serialized representations we use at the REST API level differ slightly from the RM JSON/XML representations, so we can’t use the same schemas to validate them.

For instance, POST /ehr returns an EHR in JSON that contains the ehr_status, which is the actual EHR_STATUS serialized to JSON (the last version of the status). In the RM JSON schema, ehr.ehr_status is an OBJECT_REF. So the result of POST /ehr can’t be validated using the JSON schema because of the EHR_STATUS vs. OBJECT_REF mismatch.
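To make the mismatch concrete, here is a minimal Python sketch. The two attribute sets are hand-written, heavily simplified stand-ins for the OBJECT_REF and EHR_STATUS definitions (they are not the official schemas), and the payload fragment is invented for illustration.

```python
# Simplified stand-ins for schema definitions (assumption: only mandatory
# attributes are listed; the official schemas contain much more).
OBJECT_REF_REQUIRED = {"id", "namespace", "type"}
EHR_STATUS_REQUIRED = {"archetype_node_id", "name", "subject",
                       "is_queryable", "is_modifiable"}


def missing_attributes(obj: dict, required: set) -> set:
    """Return the required attribute names that are absent from obj."""
    return required - set(obj)


# Invented ehr_status fragment in the shape the REST API returns
# (an inline EHR_STATUS rather than a reference).
api_ehr_status = {
    "_type": "EHR_STATUS",
    "archetype_node_id": "openEHR-EHR-EHR_STATUS.generic.v1",
    "name": {"value": "EHR Status"},
    "subject": {},
    "is_queryable": True,
    "is_modifiable": True,
}

# Checked as OBJECT_REF (what the RM schema expects for ehr.ehr_status): fails.
rm_errors = missing_attributes(api_ehr_status, OBJECT_REF_REQUIRED)
# Checked as EHR_STATUS (what the API actually returns): passes.
api_errors = missing_attributes(api_ehr_status, EHR_STATUS_REQUIRED)

print(sorted(rm_errors))   # ['id', 'namespace', 'type']
print(sorted(api_errors))  # []
```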

@thomas.beale and I talked about this some time ago, and he defined some extra types to be used at the Service Model level (openEHR Platform Service Model), which in theory should be the spec on which the REST API spec is based (but we worked on the SM after the REST API was released).

Now I don’t think we have a formal way of validating API results against a schema, for those results that differ from the RM-based schemas, like when in the RM we have an OBJECT_REF and in REST we have the actual object.

This is an issue for conformance but also for implementation, since a JSON parser, instead of simply parsing an openEHR EHR, must first check whether it is an RM-valid EHR or an API-valid EHR. This is not a huge technical issue, but having a formal way to express API objects and a formal schema for those objects would be a better solution than type checking on the fly to know how to validate and parse a JSON document.
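The kind of on-the-fly type checking described above might look like the following sketch. The function name `ehr_flavor` and the heuristic itself are assumptions made for illustration; nothing in the openEHR specs defines this check.

```python
def ehr_flavor(ehr: dict) -> str:
    """Guess whether an EHR document follows the RM or the API shape.

    Heuristic (an assumption of this sketch): an OBJECT_REF carries 'id'
    and 'namespace', while an inline EHR_STATUS carries 'subject'.
    An explicit '_type', when present, takes precedence.
    """
    status = ehr.get("ehr_status", {})
    declared = status.get("_type")
    if declared == "OBJECT_REF":
        return "rm"
    if declared == "EHR_STATUS":
        return "api"
    if declared is None and {"id", "namespace"} <= set(status):
        return "rm"
    if declared is None and "subject" in status:
        return "api"
    return "unknown"


# Invented example documents, reduced to the attribute that matters here.
rm_doc = {"ehr_status": {"id": {"value": "abc"}, "namespace": "local",
                         "type": "EHR_STATUS"}}
api_doc = {"ehr_status": {"_type": "EHR_STATUS", "subject": {}}}

print(ehr_flavor(rm_doc))   # rm
print(ehr_flavor(api_doc))  # api
```

A parser would then pick the schema and the target classes based on the returned flavor, which is exactly the branching that a formal API model and schema would make unnecessary.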

How are implementers dealing with these issues of validating and parsing RM vs. API JSONs/XMLs?

pablo · 2 January 2022 15:49

Another thing we are missing, and might be causing issues for implementers, is the documentation of which RM attributes in the REST API are different than the RM spec, just to mark there is a difference so implementers can take that into account when working with the REST API and the available schemas.

I’m not sure if this topic is on the SEC agenda; IMO it is super important, especially for conformance testing.

borut.jures · 2 January 2022 18:20

I’m learning why the efforts around conformance are so important. I expected that every system built on such detailed specifications would conform to them (or at least eventually conform once the specifications are stabilized).

I plan to work on a REST API implementation and I’m concerned that if I follow the RM it will not be what is expected by the REST API.

Since EHR.ehr_status is an OBJECT_REF, “ehr_status” in the POST response should be the JSON representation of an OBJECT_REF. The example on the ITS-REST page is almost correct (except for "type": "EHR_STATUS", which should be OBJECT_REF as stated by Pablo):

  "ehr_status": {
    "id": {
      "_type": "OBJECT_VERSION_ID",
      "value": "8849182c-82ad-4088-a07f-48ead4180515::openEHRSys.example.com::1"
    },
    "namespace": "local",
    "type": "EHR_STATUS"     <= should be "_type": "OBJECT_REF" or omitted
  },

Since we know that “ehr_status” is an OBJECT_REF, the "type": "EHR_STATUS" isn’t needed. An implementation reading the JSON above would skip the “type” property, since the type is known from the RM model (the “_type” property is needed only when an abstract type is used in the RM).

Instead of writing more:

…the REST API should instead be aligned with the RM specifications. Is this possible?

I’ll search for past discussions on why the differences were needed in the current REST API (links are appreciated). Maybe I’m missing some information.

thomas.beale · 2 January 2022 20:25

It is very important indeed! My solution to this was to ensure that every structure in any API (or maybe specific APIs, if REST needs to be different to e.g. Kafka, Protobuf etc) has a formal type definition - currently provided by classes in the UML and the BMM equivalent classes.

Where there need to be deviations from the RM, which is a persistence model, new classes need to be defined, e.g. the UPDATE_VERSION classes defined here in the platform service specification.

I’m not sure if this approach is widely accepted, but if it is not, some equivalent needs to be adopted; otherwise, finding out what the allowed structures are (e.g. what fields can be missing etc) in REST API calls requires ad hoc documentation in the REST API docs.

BTW it is not unusual or wrong for data structures going through an API to differ from the persistence structures of the target back-end (for commit calls), or the data structures applications want (i.e. call results) - communicating through an API can easily use differential structures, which we can think of as ‘updates’, which will be partial representations designed to be applied to the current state of some already persisting data item. This is why we get structures like the UV types mentioned above.
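As a generic illustration of a differential structure (this is not the UPDATE_VERSION definition from the platform service spec), the sketch below applies a partial representation to an already persisted item. The function name `apply_update` and the example data are assumptions.

```python
import copy


def apply_update(current: dict, update: dict) -> dict:
    """Return a copy of current with the fields present in update applied.

    Nested dicts are merged recursively; any other value replaces the
    stored one. Fields absent from update are left untouched.
    """
    result = copy.deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_update(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Invented persisted state and a partial update that touches one field.
persisted = {"is_queryable": True, "is_modifiable": True,
             "name": {"value": "EHR Status"}}
update = {"is_modifiable": False}

print(apply_update(persisted, update))
# {'is_queryable': True, 'is_modifiable': False, 'name': {'value': 'EHR Status'}}
```

The update structure here has every field optional, which is why it cannot share a type definition (or a schema) with the full persisted structure.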

If we supported differential versioning (in addition to the current approach of new version = new copy) at the specification level in openEHR, we would have a lot more of these differential structures, which would all require type definitions.

sebastian.iancu · 3 January 2022 11:46

I also think that for now this is the solution.

The OBJECT_REF.type is an attribute indicating the type referred to by the id, in this case an EHR_STATUS. It is not related to the JSON serialization rule where _type is used to indicate the instance type (in this case an OBJECT_REF) - see Base Types. The _type is optional: it can always be present, but it is only needed for deserialization, and in this case the API specs already state how the payload should be deserialized, i.e. as OBJECT_REF. Thus the example above is good the way it is; there is no issue or error there.
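The distinction between `type` (an RM attribute of OBJECT_REF) and `_type` (a serialization hint) can be sketched as follows. The class and function names are hypothetical, the `id` is kept as raw JSON, and OBJECT_REF subtypes are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class ObjectRef:
    id: dict        # OBJECT_ID, kept as raw JSON in this sketch
    namespace: str
    type: str       # RM attribute: name of the type the id refers to


def parse_object_ref(data: dict) -> ObjectRef:
    """Parse JSON that the caller already knows to be an OBJECT_REF."""
    # '_type' is an optional serialization hint; when absent, the static
    # type known from the model (OBJECT_REF) is assumed.
    declared = data.get("_type", "OBJECT_REF")
    if declared != "OBJECT_REF":
        raise ValueError(f"expected OBJECT_REF, got {declared}")
    # 'type' is ordinary data and is read like any other attribute.
    return ObjectRef(id=data["id"], namespace=data["namespace"],
                     type=data["type"])


# The ehr_status fragment quoted earlier in the thread.
example = {
    "id": {
        "_type": "OBJECT_VERSION_ID",
        "value": "8849182c-82ad-4088-a07f-48ead4180515::openEHRSys.example.com::1",
    },
    "namespace": "local",
    "type": "EHR_STATUS",
}

ref = parse_object_ref(example)
print(ref.type)  # EHR_STATUS
```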

pablo · 3 January 2022 16:16

The REST API should be aligned with the models defined in the Service Model, not with the Reference Model. RM is to manage data internally, SM is to manage data exchanged.

The issues I’m signalling are

the partial definition of the SM models,
the need for harmonization between SM and REST,
the need for JSON/XML schemas compliant with the SM models,
a gap analysis/diff between the SM and RM models, so implementers understand the differences

All these elements will be applied directly in the conformance spec.

thomas.beale · 3 January 2022 16:26

Exactly right. We should note however, that data structures documented in the SM should generally be fairly simple / lightweight derivatives of the underlying RM - i.e. the SM doesn’t create its own ‘big’ data model.

pablo · 3 January 2022 16:30

That is key, we also need to be sure those models are updated in the SM spec.

IMHO this is the right approach and I can’t think of a better alternative. It is also consistent with what we are doing with the rest of the specs: having a model and serialization formats defined for that model. Since with the API we had the serialization formats first, we now need to create the corresponding models and document the differences between that exchange model and the RM, because at some point the EHR needs to map the exchange model to the Reference Model. Architecturally this is something like:

1. REST API receives JSON/XML payload
2. Payload is validated against SM schemas (syntactic validation)
3. Payload is parsed to SM models, creating SM object instances
4. SM object instances are mapped to RM objects
5. RM objects are validated against OPT (semantic validation)
6. RM objects are mapped to persistence model objects
7. Persistence model objects are persisted
…

Of course any implementation could go directly from step 1 to step 7, but the process would be overcomplicated, entangling different processes that could be perfectly separated if the right models and schemas are defined.
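The seven steps listed above can be sketched as a chain of separate stages. Everything here is a placeholder under stated assumptions: the function names, the payload shape, and the in-memory `STORE` are invented, and the schema and OPT validations are reduced to trivial checks.

```python
import json

STORE = []  # stand-in for a database (assumption)


def receive(raw: str) -> dict:
    # Step 1: the REST layer receives the JSON payload.
    return json.loads(raw)


def validate_sm_schema(doc: dict) -> dict:
    # Step 2: syntactic validation (placeholder: one required key).
    if "ehr_status" not in doc:
        raise ValueError("payload does not match the SM schema")
    return doc


def parse_to_sm(doc: dict) -> dict:
    # Step 3: build an SM object (a tagged dict in this sketch).
    return {"model": "SM", "ehr_status": doc["ehr_status"]}


def map_to_rm(sm: dict) -> dict:
    # Step 4: the inline EHR_STATUS becomes a separate RM object and the
    # EHR keeps only an OBJECT_REF-like reference to it.
    status = sm["ehr_status"]
    ref = {"id": status.get("uid"), "namespace": "local",
           "type": "EHR_STATUS"}
    return {"model": "RM", "ehr": {"ehr_status": ref}, "status": status}


def validate_against_opt(rm: dict) -> dict:
    # Step 5: semantic validation against the OPT (placeholder: passes).
    return rm


def map_to_persistence(rm: dict) -> list:
    # Step 6: flatten the RM objects into storable rows.
    return [("ehr", rm["ehr"]), ("ehr_status", rm["status"])]


def persist(rows: list) -> int:
    # Step 7: store the rows and report how many were written.
    STORE.extend(rows)
    return len(rows)


STAGES = [receive, validate_sm_schema, parse_to_sm, map_to_rm,
          validate_against_opt, map_to_persistence, persist]


def handle(raw: str) -> int:
    """Run the payload through every stage in order."""
    value = raw
    for stage in STAGES:
        value = stage(value)
    return value
```

Because each stage only depends on the output of the previous one, a stage such as the schema validation can be tested or replaced without touching the mapping or persistence code.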

That opens a new door: do we really need RM schemas? I think we do, since some SM objects will be a 1-to-1 mapping to the RM objects, but also for tasks like ETL or data migration in general, the RM schemas (serialization formats) might be used instead of the SM formats/schemas.

Of course other implementers might have other opinions. If this approach sounds suboptimal, it would be great to have a better solution so the REST formats are specified correctly.

pablo · 3 January 2022 16:37

I’m in total agreement, the SM model should be just a thin layer with the diffs from what we are actually using in the REST API and the schemas based on the RM. Most differences are about inline objects instead of OBJECT_REFs, but I didn’t do any exhaustive analysis yet.
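One way to picture that thin layer is as a small set of overrides applied to the RM-based schema. The fragment below is a sketch only: the schema is a tiny invented excerpt in JSON Schema style, and `API_OVERRIDES` and `derive_api_schema` are hypothetical names.

```python
import copy

# Tiny invented fragment in JSON Schema style; not the official RM schema.
RM_SCHEMA = {
    "definitions": {
        "EHR": {
            "type": "object",
            "properties": {
                "ehr_id": {"$ref": "#/definitions/HIER_OBJECT_ID"},
                "ehr_status": {"$ref": "#/definitions/OBJECT_REF"},
            },
        }
    }
}

# The "thin layer": only the properties whose type differs in the API,
# keyed by (type name, property name).
API_OVERRIDES = {
    ("EHR", "ehr_status"): {"$ref": "#/definitions/EHR_STATUS"},
}


def derive_api_schema(rm_schema: dict, overrides: dict) -> dict:
    """Return a copy of rm_schema with the listed properties replaced."""
    api_schema = copy.deepcopy(rm_schema)
    for (type_name, prop), replacement in overrides.items():
        api_schema["definitions"][type_name]["properties"][prop] = replacement
    return api_schema


api_schema = derive_api_schema(RM_SCHEMA, API_OVERRIDES)
print(api_schema["definitions"]["EHR"]["properties"]["ehr_status"])
# {'$ref': '#/definitions/EHR_STATUS'}
```

The override table doubles as documentation of where the API deviates from the RM, which is the gap/diff list asked for earlier in the thread.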

pablo · 7 October 2022 02:46

FYI, I have created JSON schemas for API payloads, so they can be validated on the REST service without hacking the RM-based JSON schemas, which differ from some of the payloads used in the REST API spec.

Here you can find multiple versions of the schemas, based on the corresponding RM version openEHR-OPT/src/main/resources/json_schema at master · ppazos/openEHR-OPT · GitHub

Just today I found an error in the demographic classes, which I’ve fixed in the v1.0.2 schemas; I still need to check and fix it in the other versions and flavors (a flavor is the “rm” vs. the “api” schema).

The schemas might not be 100% perfect as they are right now but have many fixes over the official openEHR schemas, and I think when I’m done testing and a PR is created, these will become official, so feel free to test them!

@thomas.beale some weeks ago @sebastian.iancu or @pieterbos suggested the errors in the schemas might come from errors in the BMM files. If you do a diff between one of my RM schemas and the corresponding official openEHR schema you can see the differences and that might help spot issues in BMM. Though I’m not sure if the current schemas were generated or manually created. My edits are 100% manual and checked against the corresponding spec version.
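A structural diff between two schema files can be produced with a few lines of Python. This is a minimal sketch: `diff_schemas` is a hypothetical helper, the two inline dicts are invented stand-ins for schemas that would normally be loaded with `json.load`, and lists are compared as whole values.

```python
class _Missing:
    """Marker for a key that exists on one side only."""

    def __repr__(self) -> str:
        return "<missing>"


MISSING = _Missing()


def diff_schemas(a, b, path=""):
    """Yield (path, a_value, b_value) wherever the two documents differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            yield from diff_schemas(a.get(key, MISSING),
                                    b.get(key, MISSING),
                                    f"{path}/{key}")
    elif a != b:
        yield path, a, b


# Invented stand-ins for an official schema and a manually edited one.
official = {"definitions": {"TYPE_A": {
    "required": ["x"],
    "properties": {"x": {"type": "string"}},
}}}
edited = {"definitions": {"TYPE_A": {
    "required": ["x", "y"],
    "properties": {"x": {"type": "string"}, "y": {"type": "integer"}},
}}}

for path, old, new in diff_schemas(official, edited):
    print(path, old, "->", new)
# /definitions/TYPE_A/properties/y <missing> -> {'type': 'integer'}
# /definitions/TYPE_A/required ['x'] -> ['x', 'y']
```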