ADL formalisms

Following a discussion in the SEC yesterday, where there was agreement on all the outstanding issues for the migration of the community to ADL2.4 (a huge success for the community, proving a willingness to work together on difficult topics and to make significant compromises/investments for a shared future!), a discussion came up about what the migration to ADL2.4 will mean for all the different ADL artefacts and their formats/serialisations. This will become important during the migration of the different software components in the ecosystem (ADL editing, CKM, CDR). It’s quite a niche topic, so mostly relevant to the SEC and implementers.
Currently there are many different data formats and serialisations:

- serialisations: ADL, JSON, XML, ODIN
- file extensions: .opt, .opt2, .adl, .adls, .adlt, .json, .oet
- ADL 1.4 artefacts: archetype (ADL, JSON, XML), ADL 1.4 template (.oet, Better’s native JSON), ADL 1.4 operational template (.opt), Better’s web template (JSON?), SDT, TDD, and probably others
- ADL 2.4 artefacts (unchanged from [2.0-2.4]): .adls (default differential archetype), .adl (flattened archetype), .adlt (differential template, technically identical to .adls; a file extension invented by Nedap), .opt2 (unspecified)

In my mind there will be two main artefact formalisms/serialisations in the future:

1: design artefacts: archetypes and templates will be in the ADL2.4 format and serialisation. File extensions will be .adls for differential archetypes and .adlt for differential templates (and let’s try to deprecate the flattened archetype .adl formalism). From ADL3 on, I’d like to change the serialisation of these artefacts to YAML (file extensions to be decided; we could consider JSON, but I’m not in favour of including it in the specs).

2: operational artefacts: a flattened format as a use-case-specific dataset format. This will be used mainly at the API level IMHO, so an OpenAPI format probably makes sense (with JSON and/or YAML serialisations, I don’t care which). This should ‘replace’ the Better-specific web template, and the openEHR TDD, SDT, etc., IMHO. And if needed (probably useful for validation and for less implementation/language dependency) we can retain the .opt2 in OpenAPI ADL YAML, which still contains at-codes instead of (English) language keys, so it is not recommended for use by client app developers.

There are probably a lot of details to fill in, and probably some major controversies, so let’s expect a bit of chaos in this topic. Very curious about your thoughts @SEC @SEC-experts

Edit: the current formats are described in Simplified Data Template (SDT)

Please keep discussions on how to achieve/generate/validate the formalisms in other topics, like this one: JSON Schema and OpenAPI: current state, and how to progress.
I’d like to focus this topic on discussing a desired scenario and how to simplify and standardise the currently available openEHR formalisms, since their number and complexity are a barrier to entry to openEHR.

2 Likes

3 posts were split to a new topic: What’s new in ADL2.4?

2 posts were merged into an existing topic: What’s new in ADL2.4?

@pablo let’s do the discussion here:

Ideally I’d have ADL artefacts primarily in YAML, because of its (arguably) better legibility than JSON, do validation using JSON (or YAML) Schema, and work together on an exporter/serialiser to JSON (in this case of the archetype, not the AOM/ADL meta-schema).
Of course I’m not against also serialising archetypes to JSON. But the risk I see, if we don’t pick a single preferred format for hand editing, is that we end up with a lowest common denominator.

These are some important advantages I think we wouldn’t want to lose.
Also, tool support can become confusing. A ‘single’ conversion algorithm from YAML to JSON would be much more scalable than each tool having to support both YAML and JSON editing and conversion.

This is indeed a major disadvantage of yaml.

Even if in practice it happens, archetypes are not meant to be edited manually; that’s why we have editors, and why IMHO the format is not important, not even for legibility.

There is a high risk in manually editing things that have an underlying schema, since any change can break the whole thing and make it incompatible with editors. That’s why we should rely on editors.

Technically we need formats for storage and exchange, and even for displaying. For storage and exchange you would use the smallest format (that’s YAML in most cases), while for displaying you would use the most native format. For instance, for web visualization that’s JSON (since JSON is JavaScript, and JS is supported by all browsers, so you don’t even need to parse it). On the other hand, if some tool uses an XML database and wants to store archetypes, maybe the XML representation is better. I prefer to discuss based on use cases rather than on personal preferences, because personal preferences can’t be debated.

If we don’t prefer one format over the others, but just have standard serializers and parsers to each format, we can convert from-to any format (that’s “model based transformation”, for instance YAML ==(parse)=> AOM instance ==(serialize)=> JSON or JSON ==(parse)=> AOM instance ==(serialize)=> YAML).
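That model-based pipeline can be sketched in a few lines. This is a minimal illustration only: `CObject`, `parse_json` and `to_dict` are hypothetical stand-ins, not the real openEHR AOM classes, and a YAML parser/serializer would plug into the same in-memory model in exactly the same way.

```python
import json
from dataclasses import dataclass, field

# Hypothetical, minimal stand-in for an AOM node -- illustration only,
# not the real openEHR AOM class hierarchy.
@dataclass
class CObject:
    rm_type_name: str
    node_id: str
    attributes: list = field(default_factory=list)

def parse_json(data: dict) -> CObject:
    # Format-specific parser: JSON-derived dict ==(parse)==> AOM instance.
    return CObject(
        rm_type_name=data["rm_type_name"],
        node_id=data["node_id"],
        attributes=[parse_json(c) for c in data.get("attributes", [])],
    )

def to_dict(obj: CObject) -> dict:
    # Format-specific serializer: AOM instance ==(serialize)==> JSON-ready dict.
    d = {"rm_type_name": obj.rm_type_name, "node_id": obj.node_id}
    if obj.attributes:
        d["attributes"] = [to_dict(a) for a in obj.attributes]
    return d

src = '{"rm_type_name": "Direct_observation", "node_id": "id1"}'
model = parse_json(json.loads(src))   # JSON ==(parse)==> AOM instance
out = json.dumps(to_dict(model))      # AOM instance ==(serialize)==> JSON
```

The point of the pattern is that each format only ever talks to the model, so adding a new format means adding one parser/serializer pair rather than N×N direct converters.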

I would discourage direct format transformation (YAML ==> JSON) since it’s costly and more difficult to maintain (a single change on one format affects the whole thing).

If we only discuss YAML and JSON as possible formats, the conversion should be direct, using standard libraries for converting between them, and thus removing a dependency on “less” used and tested AOM serializers.

This doesn’t sound right. A well-supported and maintained free & open source library is surely not more expensive than asking each openEHR vendor to implement the same conversion with custom serializers?

Such conversions are widely used by non-openEHR projects so we could expect that the libraries will be maintained and tested.

Both YAML and JSON are text formats and “small” compared to other data we store (e.g. images). I would expect that each system stores archetypes/OPTs only once. The difference in % might be large, but in KB it should be insignificant.


If only editors are used to edit archetypes, they could save them in YAML and JSON. Or even only JSON.

p.s.
As an engineer it seems logical to only use JSON for archetypes and OPT2, however Silje and Thomas are saying there are “many” editors who edit archetypes by hand. It would be great to hear what “many” means in this case and if they only make “small” changes by hand (which wouldn’t be that painful in JSON).

1 Like

Very arbitrary, but:
My best guess would be it’s maybe a hundred to a thousand people worldwide. Most of those will be developers who (arguably) would be OK with editing JSON as well. The main audience should be clinical modellers, but most of those (90-99%) will only edit using a GUI.

So yes, it will mostly be small changes. But even the jump from a GUI to editing ADL by hand is huge. I think the jump to JSON will be too discouraging for people like me. FWIW, I’ve edited quite a few (>10) YAML files, but never JSON; it just feels too hostile/machine-like.

1 Like

It is certainly not ‘common’ to have to edit ADL manually but it is definitely needed occasionally.

I’ve edited the YAML samples to use 2 spaces, and I think the nesting is probably not a visual issue; I presume a decent editor will help validate incorrect spacing, e.g. if new content is added.

One disadvantage of JSON is that comments are not supported, but TBH these are AFAIK only ever used to orientate node identifiers in ADL, e.g.

			ELEMENT[at0003] occurrences matches {0..1} matches {    -- Document type
				value matches {
					DV_TEXT matches {*}
				}
			}

Perhaps the use of SNOMED-like piped rubrics on the nodeId would be a better option. This was discussed in connection with ADL paths. Always optional and always ignored.

  children:
    - rm_type_name: "DV_TEXT"
      node_id: "at0003 | Document type |"

I’m pretty agnostic re JSON and YAML. I think both would be editable to the level we require, as long as some of the ‘long-hand’ expressions from the raw AOM are compressed, e.g. multiplicity and slot constraints.

1 Like

Unsupported comments in JSON are a “feature”. I use all the “comments” and documentation in my generators to include as much helpful info as possible in the generated artifacts.

The node identifiers are taken from the terminology section and added to the nodes. We will not lose these in YAML/JSON.

p.s.
I have edited around 100 archetypes myself (mostly fixing small inconsistencies that were caught by strict validation in my tools).

Can you give an example of your ‘commented’ JSON archetype?

In these discussions about ADL we tend to forget that ADL is just a format for serialization and exchange; the real modeling formalism is the AOM.

I mention this because probably we should not waste time discussing choosing one format or another, but assume that systems should support several formats. For example, we all assume that a CDR will accept JSON, XML or SDT as data instances. We should also assume that a system could accept ADL, YAML, or JSON (or even XML), and the users/consumers of the models will choose the most appropriate format depending on their use case or technology stack. Just as FHIR does.

The work for the SEC should be to provide the correct schemas for the accepted formats, not to impose a single format on all systems.

1 Like

Comments in JSON could use a similar approach as types:

"_comment": "some text"

I’m not suggesting to add _comment to the JSON – I prefer the current documentation properties/attributes.

I’m not using comments in JSON. I only wanted to point out that -- Document type in your example would not require a “comment” in JSON, since the text used is found in the terminology for at0003.

If we talk about the cADL part of ADL (the definition part), ADL is a human-level format like a programming language. All the other serial formats are direct serialisations of in-memory AOM structures. So if we think about hand-editing JSON, YAML, etc., you have to know the AOM. There is no ‘syntax’ to help you remember what to do, and the structures for things that are very simple in ADL, like 'occurrences matches {0..1}', are quite voluminous.
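For example, drawing on the AOM-dump JSON structure shown later in this thread, the single ADL phrase `occurrences matches {0..1}` on a node expands to something like:

```json
{
  "_type": "C_COMPLEX_OBJECT",
  "rm_type_name": "ELEMENT",
  "node_id": "at0003",
  "occurrences": {
    "lower": 0,
    "upper": 1
  }
}
```

One line of cADL becomes a nested structure, and nothing in the notation itself tells you that `lower`/`upper` is an occurrences interval rather than any other kind of range.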

For the rest of an archetype, other than expressions / rules, the AOM structures are mostly maps & lists, and both JSON and YAML will seem fairly natural formalisms, because they are natively based on maps and lists.

EDIT: for validated archetypes & templates, any serial object dump format (XML, JSON, YAML, …) can be conveniently used to persist and read the artefact, bypassing the original validation - as long as no-one has touched the artefact in the meantime. This means these formats (I’m talking about 100% JSON or YAML, i.e. no cADL or EL left) are good for use in operational systems, persisting validated archetypes in libraries (like CKM) and so on. It is very useful to be able to deserialise an archetype straight from (say) JSON into memory and not have to use any AOM level validation.

This would normally be the case.

I would agree with that.

1 Like

I don’t think we are settled on just those two. We might also need to consider XML, which has been the alternative to ADL as an AOM serialization format for a while.

One element to consider is that our formats tend not to be 100% what the model is, but are slightly optimized, which makes generic library-based transformations a little difficult, since there is no canonical transformation between the formats we use. Consider the XML and JSON schemas for the RM, for instance: it’s not straightforward to do a direct transformation between those formats.

I’m not sure I understand the argument about “less used and tested AOM serializers”; we are actually talking about creating those from scratch, and they will require proper testing like any other piece of software.

If there are no canonical transformations between the formats (just using a library, without introducing custom mappings), then direct format transformation has no advantage over model-based transformation, and it generates coupling between formats meant for different purposes. What we do know is that each format should be transformable to/from a valid AOM instance. Model-based transformations, on the other hand, allow supporting any future format without touching the current ones, and even allow supporting different flavors of the same format, like a simplified JSON or XML if needed.

You are assuming such a conversion is possible; we don’t know that yet, since the schemas are not defined. Based on my experience with RM transformations: I did implement direct transformations between XML and JSON, and that was a pain to maintain for years. When I put the model in the middle and separated the logic into plain XML and JSON parsers and serializers, it was easier to maintain and use. The issue with the RM is that the XML and JSON schemas are not 1-to-1 compatible, so generic library-based transformations need some tweaks to make the transformation work.

Also, consider that not all vendors will be able to use the open source libraries you mention (though you didn’t mention them specifically), since there are vendors in Java, PHP, .Net, etc. IMO we can’t choose the libraries vendors should use either. We should focus on the spec side and let vendors choose which technology to use (openEHR reference implementations, external libraries or their own). Then we can focus on the formats, the schemas, etc.

I understand this generates great debate because of personal preferences and different experiences; I think the focus should be on the specs and use cases more than on specific technologies and our own preferences.

Either way, the goal of my proposal in the last SEC meeting was to try to get away from custom openEHR serialization formats and try to adopt more common options for the new AOM.


REF: https://openehr.atlassian.net/wiki/spaces/spec/pages/2201812993/2023-11-15+16+Arnhem+SEC+Meeting

I would wonder why modelers are doing such things, maybe modeling tools are not good enough for their needs?

It’s nothing to do with JSON v ADL. The definition part of a JSON archetype looks like this:

	"definition": {
		"_type": "C_COMPLEX_OBJECT",
		"rm_type_name": "Direct_observation",
		"node_id": "id1",
		"attributes": [
			{
				"_type": "C_ATTRIBUTE",
				"rm_attribute_name": "data",
				"children": [
					{
						"_type": "C_COMPLEX_OBJECT",
						"rm_type_name": "Node",
						"node_id": "id38",
						"attributes": [
							{
								"_type": "C_ATTRIBUTE",
								"rm_attribute_name": "value",
								"children": [
									{
										"_type": "C_COMPLEX_OBJECT",
										"rm_type_name": "Text",
										"node_id": "id174"
									}
								]
							}
						]
					},
					{
						"_type": "C_COMPLEX_OBJECT",
						"rm_type_name": "Node",
						"node_id": "id7",
						"occurrences": {
							"lower": 0,
							"upper": 3
						},
						"attributes": [
							{
								"_type": "C_ATTRIBUTE",
								"rm_attribute_name": "items",
								"children": [
									{
										"_type": "C_COMPLEX_OBJECT",
										"rm_type_name": "Node",
										"node_id": "id8",
										"attributes": [
											{
												"_type": "C_ATTRIBUTE",
												"rm_attribute_name": "value",
												"children": [
													{
														"_type": "C_COMPLEX_OBJECT",
														"rm_type_name": "Coded_text",
														"node_id": "id175",
														"attributes": [
															{
																"_type": "C_ATTRIBUTE",
																"rm_attribute_name": "term",
																"children": [
																	{
																		"_type": "C_COMPLEX_OBJECT",
																		"rm_type_name": "Terminology_term",
																		"node_id": "id223",
																		"attributes": [
																			{
																				"_type": "C_ATTRIBUTE",
																				"rm_attribute_name": "concept",
																				"children": [
																					{
																						"_type": "C_TERMINOLOGY_CODE",
																						"rm_type_name": "Terminology_code",
																						"node_id": "id9999",
																						"constraint": "ac1"
																					}
																				]
																			}
																		]
																	}
																]
															}
														]
													}
												]
											}
										]
									},

That’s a direct dump of an AOM (object meta-model) instance of an archetype. It’s not that devs don’t know JSON, the problem is that you have to do mental somersaults to create the AOM structure in your mind in order to understand what to do.

The same archetype content in ADL is:

definition
	Direct_observation[id1] matches {	-- Audiogram test result
		data cardinality matches {1..*; unordered} matches {
			Node[id38] matches {	-- Test result name
				value matches {
					Text[id174] 
				}
			}
			Node[id7] occurrences matches {0..3} matches {	-- Result details
				items cardinality matches {2..*; unordered} matches {
					Node[id8] matches {	-- Test ear
						value matches {
							Coded_text[id175] matches {
								term matches {
									Terminology_term[id223] matches {
										concept matches {[ac1]}		-- Test ear (synthesised)
									}
								}
							}
						}
					}

Any developer can learn this - it’s a block-structured language like all the other ones they use, and there’s not much mental work to understand what is going on.

It’s the same reason we program with source code languages like Java and C# (Go, Python, whatever) rather than in bytecode, MSIL, or other post-parse machine formats. Humans learn formalisms through such languages; post-parse meta-model representations are hard to mentally deal with.

JSON and YAML are pretty readable for the description and terminology parts of an archetype, but no-one’s going to read either for the definition part.

That’s why a human- (and machine-readable) archetype should be in YAML + cADL (+EL) or JSON + cADL (+EL), whereas a pure machine-readable form can be 100% JSON (I don’t see the point of YAML for this form, but it would work).

It’s several different reasons:

  • modelling tools not doing everything we expect them to do (like switching the original language with one of the translations, or adding terminology-based (as opposed to UCUM) units to a DV_QUANTITY)
  • modelling tools not doing things we don’t necessarily expect them to do (like being able to do a search-and-replace for a specific word or phrase throughout the archetype)
  • modelling tools lagging a bit behind specs (such as new units added to the units file)
  • modelling tools and CKM or CDRs not agreeing on what the correct syntax is
4 Likes

Expressions and rules could be represented in a declarative way instead of embedding an imperative expression in the syntax.

Declarative would mean representing what looks like a programming-language expression, like “if x then y”, as data (JSON, XML, etc.). Just as an example: https://jsonlogic.com/

I think the advantage of that strategy is that:

  1. There is no need of a custom syntax
  2. So there is no need of a custom grammar
  3. Then there is no need to generate custom parsers
  4. And then to integrate custom ASTs in our code in order to run that logic

A declarative expression will just parse as JSON, XML, YAML, etc., and engines can be created in any language to evaluate the expressions. From my experience of building a rule engine that way, it doesn’t require a lot of work, and it is similar to the last step of using the ASTs of parsed expressions of an embedded language, minus the custom parsing part.
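As a sketch of the idea, here is a toy evaluator for jsonlogic-style rules. The operator set and semantics are illustrative only, not the actual JsonLogic specification, which defines many more operators.

```python
import json

def evaluate(rule, data):
    # Minimal evaluator for jsonlogic-style declarative expressions:
    # a rule is either a literal, or a one-key dict {operator: arguments}.
    if not isinstance(rule, dict):
        return rule                          # literal value
    op, args = next(iter(rule.items()))
    if op == "var":                          # look up a value in the data
        return data.get(args)
    vals = [evaluate(a, data) for a in args]
    if op == "==":
        return vals[0] == vals[1]
    if op == ">":
        return vals[0] > vals[1]
    if op == "and":
        return all(vals)
    raise ValueError(f"unknown operator: {op}")

# "systolic > 140 and diastolic > 90" expressed as plain JSON data,
# so it needs no custom grammar or parser -- only a JSON library:
rule = json.loads(
    '{"and": [{">": [{"var": "systolic"}, 140]},'
    '         {">": [{"var": "diastolic"}, 90]}]}'
)
result = evaluate(rule, {"systolic": 150, "diastolic": 95})  # True
```

Because the rule is ordinary JSON, the same document can be evaluated by equally small engines written in Java, TypeScript, etc., without sharing any parser code.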

Just an idea to think about!

Thanks @siljelb that’s what I thought. With a correct modeling tool there wouldn’t be a need for manual adjustments.

Maybe more modelers should be involved in the design of the modeling tools.

1 Like

The tools are actually very good IMO, but there will always be gaps, bugs and omissions. That’s why a few of us will always like to be able to visualise and occasionally edit the raw files.

I agree with Thomas that we need two broad types of representation

  1. A pure machine-processable form of the AOM (essentially what we have right now with the archetype XML and .opt), for use internally by openEHR CDR and tool builders

  2. Something more like ADL that avoids this, and applies equally to third-party devs and to clinical modellers.

It’s not that devs don’t know JSON, the problem is that you have to do mental somersaults to create the AOM structure in your mind in order to understand what to do.

I understand the argument for using cADL + e.g. YAML, but I think we really have to get away from needing custom low-level parsing to make use of this type of file. It remains a considerable barrier to use by third-party devs.

@Joost - I’m not sure the jump from ADL to YAML is really all that different from the jump to JSON. In both cases you need to understand how objects, arrays and nesting work, and TBH in most circumstances we would only be editing a small fragment (I hope!). It is the way we represent the cADL sections that is the challenge, and that would be broadly similar in both YAML and JSON.
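To make that concrete, here is the same small AOM fragment in both notations (an illustrative sketch following the dump style used earlier in the thread; the node content is hypothetical):

```yaml
children:
  - _type: "C_COMPLEX_OBJECT"
    rm_type_name: "DV_TEXT"
    node_id: "at0003"
```

```json
"children": [
  {
    "_type": "C_COMPLEX_OBJECT",
    "rm_type_name": "DV_TEXT",
    "node_id": "at0003"
  }
]
```

The indentation rules differ, but the objects, arrays and nesting you need to understand are the same.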

There definitely needs to be a variety of outputs, but I would (just) favour JSON.

Here is a snippet of Web template JSON.

        "children" : [ {
          "id" : "weight",
          "name" : "Weight",
          "localizedName" : "Weight",
          "rmType" : "DV_QUANTITY",
          "nodeId" : "at0004",
          "min" : 1,
          "max" : 1,
          "localizedNames" : {
            "en" : "Weight"
          },
          "localizedDescriptions" : {
            "en" : "The weight of the individual."
          },
          "aqlPath" : "/content[openEHR-EHR-SECTION.adhoc.v1,'Body mass metrics']/items[openEHR-EHR-OBSERVATION.body_weight.v2,'Weight']/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value",
          "inputs" : [ {
            "suffix" : "magnitude",
            "type" : "DECIMAL"
          }, {
            "suffix" : "unit",
            "type" : "CODED_TEXT",
            "list" : [ {
              "value" : "kg",
              "label" : "kg",
              "validation" : {
                "range" : {
                  "minOp" : ">=",
                  "min" : 0.0,
                  "maxOp" : "<=",
                  "max" : 1000.0
                }
              }
            }
1 Like

This is already the case, with the BEL (Basic Expression Language) embedded in ADL, in commercial use by at least Nedap.

GDL2 also does ‘when condition then action’ rules.

Doing expressions really properly might look more like EL/DL - see examples here.

Everyone coming to this problem always says the same thing: why don’t we use (JS, Java, Python, …)? There is a 40-year history of languages in decision support, from Arden 2 onward, created to address the limitations of existing languages.

Some of the things general purpose languages don’t do:

  • have any notion of terminology
  • have any notion of ‘binding’ data items (e.g. ‘is_diabetic’, ‘date_of_birth’) to data sources
  • Allen operators, i.e. time operators like ‘before’, ‘during’
  • have any notion of ‘currency’, i.e. how stale a variable (e.g. SpO2) is with respect to reality
  • tabular decision logic
  • close-to-domain conceptual model

Trying to use any general purpose programming language gets painful very quickly because of the lack of support for the above.

For this reason, there are still contemporary attempts to create decision / process languages to solve some of these needs:

  • HL7 CQL - has Allen operators, terminology, and an approach to binding
  • OMG Decision Modelling Language (DMN) - related to BPMN and CMN - developed for the insurance industry - supports tabular decision programming
  • OMG FEEL - simple expression language underpinning DMN
  • Gello - an older HL7 attempt
  • openEHR GDL, BEL, EL, DL, Task Planning.
  • various languages in use in CDS products like SaVia Health, Lumeon and others

Attempts to cross-compile expression and decision language to existing languages can work. It’s a bit of work, but it’s a workable approach, depending on the choice of actual language. A few years ago, when we were working on Task Planning, Better implemented this approach with (from memory) TypeScript as the target language. It worked, sort of, although wasn’t very performant.

Writing a new language is easy these days. With tools like Antlr4, you can create and debug a powerful grammar in days / weeks, and the tools will generate much of the code of the parser.

In my experience, fighting against a commodity language to implement domain-specific concepts is always worse than having a proper language to do the job. If you think about it, a custom language is just a way of formalising how to do the job, and (usually) it vastly reduces the amount of code.

I think approaches like jsonlogic etc might be reasonable targets for cross-compilation.

Do you mean a parser for cADL? It’s just a parser like any other, and there are open source implementations such as Archie available to parse it.

That’s probably what devs working with templates would want to use. Clinical modellers editing description meta-data or doing translations might not love JSON as much :wink: If JSON can be supported, YAML can be supported.