JSON Schema and OpenAPI: current state, and how to progress

We recently had a discussion on the SEC call about JSON Schema. I was asked to write down the current state and to get the discussion going on how to progress. So, here it is:

Several options exist to define a JSON format. Two of the most often used are:

  • JSON Schema
  • OpenAPI

Tools for working with JSON Schema are widespread. It is very suitable for validation purposes, including built-in support in text editors. However, with openEHR's extensive use of polymorphism, it is not suited for code generation.

OpenAPI builds on JSON Schema: its schema objects are an extension of JSON Schema. The latest version is fully compatible with JSON Schema, but it still contains extensions. To make it compatible, the extensions are defined as a JSON Schema dialect with added vocabularies, which is possible in JSON Schema. Processing those extensions requires OpenAPI tooling; JSON Schema based tooling works, but it will not be complete.
One of the extensions defined in OpenAPI is a feature to specify a discriminator column, which makes it possible to generate code for models using polymorphism. Its use cases include API specification, validation and code generation. If we include it in the REST API specification, it is possible to generate code for an openEHR REST API client in many languages, including all archetype and RM models. It is also possible to generate documentation for these APIs in many formats, and to plug this definition into tooling to try an API manually.
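To ground the 'dialect' remark: an OpenAPI 3.1 document can state explicitly which JSON Schema dialect its schema objects use. A minimal sketch, with placeholder title and version:

{
  "openapi": "3.1.0",
  "info": { "title": "openEHR REST API", "version": "0.0.1" },
  "jsonSchemaDialect": "https://spec.openapis.org/oas/3.1/dialect/base",
  "paths": {}
}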

Current state of JSON Schema

We currently have two JSON Schema files:

specifications-ITS-JSON

The schema in specifications-ITS-JSON is generated by Sebastian Iancu from the UML models. The code to generate this is closed source, and not available to others. It has the following benefits:

  • very extensive, including all documentation about the models and which functions are defined
  • it has a well organised structure, with separate files in packages

And the following drawbacks:

  • it cannot be used for validation, because it does not implement the openEHR polymorphism, including the _type column to discriminate types. If it encounters a CLUSTER in a place where the model says ITEM, the items attribute of the cluster plus any of its content will not be checked by the validator at all (see the sketch after this list).
  • no root nodes are specified
  • it cannot be hand edited because it is so big, and all information is duplicated in several files, and the code to generate it is not released
  • slower to parse because it has many extra fields.
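
To illustrate that first drawback, here is a sketch of the kind of instance data that would slip through unchecked - the attribute names are real RM attributes, but the values are made up:

{
  "_type": "CLUSTER",
  "name": { "_type": "DV_TEXT", "value": "Blood pressure protocol" },
  "archetype_node_id": "at0001",
  "items": [
    { "_type": "ELEMENT", "name": 42 }
  ]
}

A validator that only knows the declared ITEM type never descends into items, so the clearly invalid name inside it goes unreported.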

The Archie generated JSON Schema

The Archie version is autogenerated from BMM. The code to do so is available as part of Archie, at archie/JSONSchemaCreator.java at master · openEHR/archie · GitHub. It has the following benefits:

  • can be used to validate, including all polymorphism except some generics
  • tuned for speed of validation and quality of validation messages
  • extensively tested against the Archie json mapping for the RM, even with additionalProperties: false set everywhere to test for completeness.
  • the code to generate it is open source
  • contains no extra information, so fast to parse.

drawbacks:

  • not currently separated in packages or files, just one file, but the information to do so is present in the BMM models.
  • nearly no documentation included, because the BMM files contain very little documentation.

Proposal for JSON Schema

We probably need a standard JSON Schema. It can be (a next iteration of) the Archie JSON Schema, or a next iteration of the current specifications-ITS-JSON. To switch to the Archie JSON Schema, we have to decide if the current form is good enough, or if it needs a couple more improvements, for example in the form of documentation or splitting it into a different package structure. It will also need to be tested against other implementations - the current one works with Archie and EHRBase, but it has not been tested against other implementations yet.
To keep the current schema, we will need to adjust it so that it contains the constraints for polymorphism, and it will need to be extensively tested. We also need to decide whether it is acceptable that this JSON Schema can only be generated with unreleased tooling, or if we want an open variant to generate it.

Opinions?

OpenAPI

There is only one current openEHR OpenAPI model that I know of, covering both the archetype model (AOM) and the RM. It is generated from BMM. For the AOM, a BMM is first generated from the Archie implementation, because no BMM model is available. The code is open source at First version of a working open API model generator by pieterbos · Pull Request #180 · openEHR/archie · GitHub, and the output, including a demo of how to use it for code generation, is available at GitHub - nedap/openehr-openapi: An example project to show how OpenAPI can work with OpenEHR.
The current model works to validate and to generate code and human readable documentation. However, it has the following problems:

  • code to generate the files is still in a branch in Archie, needs to be updated, reviewed and merged
  • one file, not structured in packages
  • nearly no documentation from the models, will have to be added to the BMM files
  • it contains only the models, no REST API definition yet.

I think it would be good to do further work on it, by creating an OpenAPI definition of the openEHR REST API, referencing the autogenerated models. This would mean clients could be autogenerated, rather than having to rely on someone hand-coding a library such as Archie. The generated models can easily be referenced in such an API, and it is possible to mark fields as mandatory for specific APIs from within the API definition, without changing the schema at all. It would also be easier to test whether a given implementation conforms to the openEHR models.
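As a sketch of how a fragment of the paths section could look, assuming the generated models live under #/components/schemas - the path matches the current REST spec, but the required fields here are purely illustrative:

{
  "paths": {
    "/ehr/{ehr_id}/composition": {
      "post": {
        "parameters": [
          { "name": "ehr_id", "in": "path", "required": true, "schema": { "type": "string" } }
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "allOf": [
                  { "$ref": "#/components/schemas/COMPOSITION" },
                  { "required": ["name", "archetype_details", "content"] }
                ]
              }
            }
          }
        },
        "responses": {
          "201": { "description": "Composition created" }
        }
      }
    }
  }
}

The allOf wrapper adds the per-API required constraint on top of the referenced COMPOSITION schema, so the generated model files themselves stay untouched.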

Again, opinions?

6 Likes

Excellent summary. I’ll just make the initial observation that I think we could benefit if we (most likely me) implement the UML → BMM automatic extractor (i.e. the same thing that already generates the specifications class tables, but targeted at BMM files instead). This would have the effect of:

  • extracting all the documentation elements as well as structure
  • performing the corrections to the UML to generate correct BMM (the UML doesn’t represent container types properly, among other things; I have special code to fix this).
  • always up to date with the original models (as long as we have them in UML at any rate :wink:)
  • open source.

I may try an initial version of this quite soon, and I think it would be a couple of weeks’ work, so it wouldn’t be instant, but not too far off either.

3 Likes

Adding the documentation to the BMM would solve the documentation issue right away, or with about 10 minutes of work on the generator, both for the JSON Schema and the OpenAPI models for the RM - sounds good!

Right now for the AOM we use a BMM generated from the Archie AOM implementation with reflection. Does the UML → BMM automatic extractor work on the AOM by any chance?

1 Like

It will work on everything in UML, which includes AOM2 and AOM1.4, even BMM itself. That’s how the AOM and BMM specifications are generated (the class tables). So it will literally suck out everything. Of course, I have to write that outputter :wink: But the extractor side code (UML 2.x openAPI calls - ugly) won’t change, or not much. So it’s mainly a question of instantiating BMM objects and writing them out to P_BMM2 format (the current format); later I will write a BMM3 outputter… which will look like Xcore format, and includes the new EL.

1 Like

Many thanks for writing this @pieterbos . The mail notification somehow escaped me. I’ll keep this on the screen and make time to read it.

I just recently concluded the project I was working on, so I’m only commenting on this now. Once again thanks @pieterbos , much appreciated.

The way I see it, OpenAPI is significantly better than JSON Schema for our technical and political goals. Where JSON matters most from a standards perspective is the system periphery, where it is the serialisation format.

However, JSON in the context of OpenAPI has a much larger ecosystem around it compared to JSON Schema. Code generation, API specification etc. help us a lot more than just being able to validate that payload content is valid. I think the nice thing about the OpenAPI approach is that we can do it incrementally. Even if we only have an OpenAPI model (which you already support as a downstream artefact from BMM), that gives us code generation and validation for data, which we can extend to service definitions later (which’ll be based on data definitions anyway).

This actually takes us beyond FHIR, despite FHIR always being very bullish on the system periphery by design, because last time I checked, FHIR did not have an OpenAPI spec (not sure where it is now).

With JSON Schema, whatever we produce will leave developers in the cold in terms of figuring out how to use it from their applications, whereas with OpenAPI, we have a well supported stack/eco-system to point at, even if we don’t have it initially (i.e. working on service definitions later).

From a specification p.o.v., even if we had only the current OpenAPI output of the code in the Archie branch, that’s a usable artefact, giving us what we’d get from JSON Schema pretty much right now, with more to add if we want to.

Re adding documentation capabilities to BMM: I’d see that as a layer above BMM, no different than using templates for UI generation. A meta mechanism in BMM similar to annotations, which’d let the user link to some documentation may be a better separation of concerns, but I’m not really taking part on BMM development so I’ll stop at this point.

I was not aware of your work on OpenAPI, great job as usual, I’ll go and take a look now.

1 Like

Just an aside - the ‘documentation’ @pieterbos is talking about here is just the primary model documentation, i.e. class descriptions, feature descriptions etc - all that stuff you see in the Class Definitions tables in the openEHR specs. Because we don’t yet extract BMM from the UML, which is where those documentation fragments currently sit, the BMM files are missing them.

My goal is to get a UML->BMM extractor working, so that those documentation fragments will be exported automatically, along with all changes made to the UML. It’s not 100% ideal having the UML as the primary expression in the toolchain, but I guess its utility is still sufficient to justify it (i.e. we stop using UML, we have no diagrams ;), and we can compensate for its annoyances with extractor hacks, which I already do.

Being able to add annotations to BMM in another layer could indeed be useful…

Back to the main conversation on OpenAPI, also very educational for me.

1 Like

Ah, sorry, I misunderstood the documentation bit then, though my suggestion still stands, but back to the main convo as you say.

The way I see it, OpenAPI will slowly but almost surely kill JSON Schema. As its adoption at the REST endpoints increases, it’ll be the primary source of formalism (it hurts me to use this word for JSON, but anyway…) for the JSON content pushed to backends, so I cannot see how any greenfield work can kick-start with JSON Schema. Anyway, the point is that for once we may have a latecomer advantage here, by not having invested too much into JSON Schema and jumping to OpenAPI.

1 Like

The nice thing is - we get to have both, without much work!

Note that the current .apib files for the REST API can also represent models, even if that capability is not currently used in those files. For those, a generator that does OpenAPI → .apib is available. Or OpenAPI could replace it, eventually.

Now we just have to update the Archie OpenAPI-generator and make sure it works in the latest version again, I created that pull request ages ago as more of an experiment…

2 Likes

Thanks, Pieter, I missed the point re apib files. Good news then!

Re the latest version: do you think it’d be better to target OpenAPI 3.0 now rather than 3.1? Last I checked, almost the whole tool stack, including UI tools etc., was still supporting 3.0 at most, so the exact latest version may not be the most convenient for our purposes :slight_smile:

Well, to my knowledge JSON Schema is not replacing OpenAPI, nor the other way around.
OpenAPI is an alternative to API Blueprint. I guess both (OpenAPI and especially API Blueprint) can use JSON Schema for models.

The plan we had a few years ago in SEC was to document the API using API blueprint, with references to JSON Schema for the resources definition, and optionally export it as OpenAPI specs so that we can generate code, etc.

There are also tools to convert .apib files to OpenAPI equivalent, although I never tried them.

Nowadays API Blueprint is not that popular anymore; OpenAPI probably won the battle on documenting APIs. But I think there are still benefits to keeping our current REST specs as .apib files, while keeping an eye on developments in this field and adapting later if necessary.

I see @pieterbos responded also, supporting my thoughts about conversion from .apib files.

I’m not sure, but I think the differences should be small between those two versions.

Yes, except that the OpenAPI dialect of JSON Schema has some extensions that are absolutely necessary if you want to use any kind of automated tooling to map to classes with polymorphism. So, for the model part of OpenAPI, you would need a different file than plain JSON Schema to express the openEHR model. Which is why I generated both a JSON Schema file for validation, and an OpenAPI file with just the models for validation, API specification and code generation. The main difference is that they express polymorphism in very different ways - one only describes validation rules for JSON, the other a way to map to OO concepts.

I know we had this discussion before, but can’t remember the answer - does the orthodox variety of JSON schema not handle polymorphic typing?

Yes and no. You can do oneOf or anyOf, then reference several types. However, we need a discriminator column to determine which type is used, so we need to do a bit more.
For that, you can do, pseudocode:

if ( _type == "DV_TEXT") {
 apply this part of the schema
}

Or you can do (slightly less pseudocode):

"oneOf": [
  "allOf": [
     { "ref", "reference to the type DV_TEXT"},
    {
      "type": "object",
      "properties":
        "_type": {
          "const": "DV_TEXT"
     }
}, ... add more subtypes here
]]

The first works well with validators in the sense that the output is understandable and it is fast. The second one with most validators produces tons of possible output on a validation error, basically a message per possible variant, and the validators I have tried were rather slow with it. It should be possible to write tools that recognise these patterns and validate well and generate code, but I have not found any tools available that can generate any type of code from these constructions.
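
For reference, the first pattern written out as actual JSON Schema (draft-07 or later) looks roughly like this - a minimal sketch, with an illustrative $ref target; in a real schema there would be one such conditional per subtype, combined in an allOf:

{
  "if": {
    "properties": { "_type": { "const": "DV_TEXT" } },
    "required": ["_type"]
  },
  "then": { "$ref": "#/definitions/DV_TEXT" }
}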

OpenAPI defines a discriminator column, as in Inheritance and Polymorphism. That solves the problem entirely. Note that the paragraphs ‘model composition’ and ‘polymorphism’ are still standard JSON Schema. Tools to generate code are widely available, but I am not sure if they all support the discriminator column mechanism.
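
For illustration, a discriminator in an OpenAPI components/schemas section looks roughly like this - the schema names and $ref paths are illustrative:

"DATA_VALUE": {
  "oneOf": [
    { "$ref": "#/components/schemas/DV_TEXT" },
    { "$ref": "#/components/schemas/DV_CODED_TEXT" }
  ],
  "discriminator": {
    "propertyName": "_type",
    "mapping": {
      "DV_TEXT": "#/components/schemas/DV_TEXT",
      "DV_CODED_TEXT": "#/components/schemas/DV_CODED_TEXT"
    }
  }
}

A code generator can use propertyName and mapping to select the right class when deserialising, which the plain JSON Schema constructions above cannot express.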

4 Likes

Just learning about OpenAPI ft openEHR. And I had a question that might be very stupid. But if the reference and archetype models are in OpenAPI, could we also model archetypes in OpenAPI (instead of ADL)? And templates? This would make it a lot easier for people new to openEHR (or the occasional strong-headed CIO not willing to learn ADL), to get into openEHR. Since with OpenAPI code generation they could use the models in their native programming language directly.
Would be even better if e.g. CKM could export archetypes/templates that would self validate, so the relevant RM+AOM would be part of the OpenAPI schema file.
And it would mean we no longer would have to maintain ODIN and ADL.

There is a categorical difference between what generic formalisms like OpenAPI (or even JSON) can represent and native syntaxes. When a generic syntax is used, it is just representing instances from some meta-model. AOM is a meta-model for archetypes; there are meta-models for every programming language, and so on.

Now, one might ask, why write Java code, or Python, C#, TypeScript etc - why have all these syntaxes when we could just have everything written in JSON, or maybe (somehow) in openAPI (which is really a cut-down version of OMG IDL). The reason is that the native syntaxes allow us to represent directly the semantics of the language, using keywords, specific symbols etc - so we can think in the concepts that the languages support.

ADL is just another programming language - its main features not found in most other languages are:

  • constraints, including nested (e.g. x ∈ {|0.0 .. 250.0|})
  • semantic overloading of model instances with domain markers (achieved by the id-codes, e.g. ELEMENT[id41|systolic|]).
  • terminology & terminology binding

To make these things happen (just like all specific features in all those other languages) you need a meta-model, whose classes express exactly those features. For example, the ability to write ELEMENT[id41] in ADL is supported by the C_OBJECT class in the AOM meta-model. Specifically:

[image: C_OBJECT class definition]

The 2 fields rm_type_name and node_id are what allows ELEMENT[id41] to be written.

So when we write a fragment of ADL (or Java, or Python, or anything) we are using a native syntax that makes sense to humans to write out (i.e. serialise) an instance of some class in the meta-model of that language.

But there is always a generic way to serialise instances of a meta-model, which is in ‘object dump’ syntaxes like XML, JSON, YAML etc (that’s what openEHR ODIN is).

Here’s an example:

native ADL:

EVALUATION[id1] matches {	-- Adverse reaction risk
	data matches {
		ITEM_TREE[id2] matches {
			items cardinality matches {1..*; unordered} matches {
			    ELEMENT[id3] matches {	-- Substance
					value matches {
						DV_TEXT[id130] 
					}
				}
				ELEMENT[id64] occurrences matches {0..1} matches {	-- Status
					value matches {
						DV_CODED_TEXT[id131] matches {
							defining_code matches {[ac1]}		-- Status (synthesised)
						}
						DV_TEXT[id132] 
					}
				}

And the equivalent in JSON dump format:

"definition": {
	"rm_type_name": "EVALUATION",
	"node_id": "id1",
	"attributes": [
		{
			"rm_attribute_name": "data",
			"children": [
				{
					"rm_type_name": "ITEM_TREE",
					"node_id": "id2",
					"attributes": [
						{
							"rm_attribute_name": "items",
							"children": [
								{
									"rm_type_name": "ELEMENT",
									"node_id": "id3",
									"attributes": [
										{
											"rm_attribute_name": "value",
											"children": [
												{
													"rm_type_name": "DV_TEXT",
													"node_id": "id130"
												}
											]
										}
									]
								},
								{
									"rm_type_name": "ELEMENT",
									"node_id": "id64",
									"occurrences": "0..1",
									"attributes": [
										{
											"rm_attribute_name": "value",
											"children": [
												{
													"rm_type_name": "DV_CODED_TEXT",
													"node_id": "id131",
													"attributes": [
														{
															"rm_attribute_name": "defining_code",
															"children": [
																{
																	"rm_type_name": "CODE_PHRASE",
																	"node_id": "id9999",
																	"constraint": "ac1"
																}
															]
														}
													]
												},
												{
													"rm_type_name": "DV_TEXT",
													"node_id": "id132"
												}
											]
										}
									]
								},

That’s 18 lines of native syntax compared to 60 in JSON. The ADL is also directly comprehensible (assuming one has read the ADL manual ;), whereas the JSON serialisation just looks like… a pile of objects.

The same argument holds for any programming language - consider an ‘if / then’ statement in Java, PHP, TS etc - generally easy to read and write, and makes sense at a high mathematical / logic level. However, written out as generic instances of the language meta-model, it would be impossible to read.

So the purpose of any native syntax is for humans to write and read, but also for computers to be able to understand. To do that, a native language parser is required, which consumes texts in native ADL, Java etc, and pumps out (usually) an ‘augmented abstract syntax tree’ (augmented AST), which is an in-memory representation much more like the JSON.

That in-memory structure could be serialised out using e.g. Java’s native JSON serialiser, or any other similar tool, to save it in JSON. It’s now no longer useful to humans, but it can be read back in in an instant by a standard JSON reader, to re-instantiate those in-memory objects. Whereas parsing the native form is quite a lot more complex and resource intensive. For files that are known to be valid, writing and reading to JSON or some other object dump format is thus quite useful.

Native languages have another under-rated purpose: to teach the language, i.e. teach its concepts. This fragment of Java using a ‘lambda’ - myIntegerList.forEach( (n) -> { System.out.println(n); } ) - makes sense to Java programmers and is very concise. Without that special syntax, it’s very hard to teach and learn.

Can native languages be avoided? Sure, with visual programming Apps that allow you to program purely in the UI. Archetypes are in fact a candidate - most clinical modellers just use the Archetype Designer or similar tool. But it’s surprising how many people write or modify native ADL. Visual programming of Java or Kotlin or Dart isn’t going to happen any time soon however, because the concepts are more sophisticated than any possible visualisation.

On to OpenAPI. It’s a generic language for expressing APIs, so directly comparable to OMG IDL - it’s like a programming language, except missing the ability to actually write the code inside routines. It doesn’t (as far as I can tell) contain any constraint semantics beyond the notion of cardinality, and it doesn’t know about terminology in any native way. So it’s probably not a good candidate for trying to express archetypes.

The use of new languages for specific purposes used to be thought of as bad, but today, everything has its own language, and instead of one language to rule them all, we have notionally one tool + meta-model approach to rule them all, since everything can be made to fit the schema:

native language → native lang Parser → in-memory AST (meta-model instance) → tools.

The important thing is to have meta-models strong enough to represent native language semantics.

Environments like JetBrains MPS are addressing this ecosystem.

With these kinds of tools, lots of languages is no longer a problem, since they allow us to represent concepts in numerous domains natively.

2 Likes

I was thinking about this yesterday. I “need” a computable version of the specifications for my approach but I’ll use JSON for the operational templates. That means I won’t use ADL.

I was wondering if there was a poll asking clinical modelers whether they prefer GUI tools or text (ADL)?

My thinking is that most would prefer a GUI tool making ADL optional. But you are saying there are people writing directly in ADL. It would be interesting to know how many % are using GUI / ADL.

It’s undoubtedly more GUI, less ADL.

There is another reason to maintain ADL and a native parser however. In the past, it was assumed that there would always be a fixed generic syntax to use. This used to be XSD 1.0. But that depends on the schema you design - the XML will be different for different variant schemas. If you move to XSD 1.1, then it all changes, multiplied by variant schemas. Then the world decides XSD is annoying (it is…) and wants everything in JSON. So we do that (quite easily). Then we have to sort out the '_type' thing in JSON. Then some of us move to JSON Schema based JSON, which (probably) changes things again in some annoying way. Then (imagine) some people want to move to JSON5 (I would ;), others want YAML (but which variant?). And so it goes.

Meanwhile, there’s only one ADL syntax at each release, and we always know what it means.

We could imagine serialising a repository of archetypes into today’s YAML vX.Y.Z, and forgetting about it for a few years. Then you come along later, and want to work with those archetypes, and have trouble locating a YAML vX.Y.Z reader, because the world’s moved on.

So things are not as black and white as one might first imagine.

2 Likes