Moving on from ODIN & beyond JSON (but staying compatible)

I’ve been thinking about whether we might want to update or replace ODIN, the JSON-like format used for representing the meta-data and terminology in archetypes. For those who don’t know, ODIN was invented 20 or so years ago, when there was no JSON to speak of, and we’ve kept using it because it’s regular and includes a) a lot more leaf types than JSON, particularly Intervals and Date/time types, and b) type-markers. However, it’s not visually very close to JSON, or even YAML, which is also in relatively wide use.

We have a number of situations in which we want to be able to use literal data structures including at least the following:

  • in the openEHR REST APIs
  • in the descriptive parts and terminology section of archetypes, templates, OPTs
  • in the terminology and constants (‘reference’) section in Decision Logic Modules (example)
  • in the Serial /Flat Template Data Format (early spec here) - exemplified by EhrScape web templates.

Only the first of these is completely regular JSON; everything else has requirements that JSON on its own is too weak to satisfy. However, it is almost standard practice for technology platforms to define JSON-like and/or YAML variants that suit their purposes and then guarantee a (possibly lossy) down-transform to standard JSON.

The nice ‘bar’ trick invented by Better for EhrScape JSON is an example of a non-conforming JSON-like syntax:

{
    "|code": "238",
    "|value": "other care",
    "|terminology": "openehr"
}

We would also like to be able to use space-efficient alternatives, e.g. [icd10AM::F60.1], that are not supported in JSON.
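For comparison, this is roughly what that single code looks like when expanded into plain JSON objects (the rendering is an illustrative sketch based on the RM CodePhrase type, not an exact serialisation):

    {
        "terminology_id": { "value": "icd10AM" },
        "code_string": "F60.1"
    }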

I’ve thought about this question from the direction of defining literal expressions in the Expression Language. This provides a more programming-language-like approach, but one designed for down-conversion to JSON, and it enables smart structures like the following (see here):

    |
    | qRisk3 Risk table
    |
    Risk_factor_scales = {
            [female]: {
                [has_atrial_fibrillation]:              1.59233549692696630,
                [atypical_antipsychotic_medication]:    0.252376420701155570,
                [on_corticosteroids]:                   0.595207253046018510,
                [has_impotence]:                        0,
                [has_migraines]:                        0.3012672608703450,
                [has_rheumatoid_arthritis]:             0.213648034351819420,
                [has_chronic_kidney_disease]:           0.651945694938458330,
                [has_severe_mental_illness]:            0.125553080588201780,
                [has_systemic_lupus]:                   0.758809386542676930,
                [on_hypertension_treatment]:            0.509315936834230040,
                [has_family_history_CV_disease]:        0.454453190208962130
            },
            [male]: {
                [has_atrial_fibrillation]:              0.882092369280546570,
                [atypical_antipsychotic_medication]:    0.130468798551735130,
                [on_corticosteroids]:                   0.454853997504455430,
                [has_impotence]:                        0.222518590867053830,
                [has_migraines]:                        0.255841780741599130,
                [has_rheumatoid_arthritis]:             0.209706580139565670,
                [has_chronic_kidney_disease]:           0.718532612882743840,
                [has_severe_mental_illness]:            0.121330398820471640,
                [has_systemic_lupus]:                   0.440157217445752200,
                [on_hypertension_treatment]:            0.516598710826954740,
                [has_family_history_CV_disease]:        0.540554690093901560
            }
        }
        ;

        |
        | a Map<Term, Interval<Quantity>> structure
        |
        ranges = {
            ------------------------------------
            [mild_low_risk]:  |<= 99 /min|,
            [mild_at_risk]:   |100 .. 120 /min|,
            [moderate_risk]:  |>= 121 /min|
            ------------------------------------
        }
        ;

We could potentially use this syntax to replace some uses of ODIN today, e.g. like this:

    term_definitions = {
        "en": {
            "date_of_birth": {
                text = "Date of birth",
                provenance = {"GDL2": ["gt0009"]}
            },
            "age_in_years": {
                text = "Age (years)",
                provenance = {"GDL2": ["gt0010"]}
            },
            "age_category":  {
                text = "Age category",
                provenance = {"GDL2": ["gt0017"]}
            },
            "gender": {
                text = "Gender",
                provenance = {"GDL2": ["gt0009", "gt0016"]}
            }
        }
    }

Note that the syntax uses {} and [] for containers in the same way as JSON, so it reads close to JSON and is easy to down-convert. The = syntax makes it easier to distinguish objects from Map / Array structures.

I am interested in thoughts from the community.

I’d certainly be all in favour.

We might also have a look at JSON5:

{
  // comments
  unquoted: 'and you can quote me on that',
  singleQuotes: 'I can use "double quotes" here',
  lineBreaks: "Look, Mom! \
No \\n's!",
  hexadecimal: 0xdecaf,
  leadingDecimalPoint: .8675309, andTrailing: 8675309.,
  positiveSign: +1,
  trailingComma: 'in objects', andIn: ['arrays',],
  "backwardsCompatible": "with JSON",
}

I don’t think the Better use of the | symbol is non-conformant JSON; certainly none of my tooling complains. It is just part of the name.

Ah, that’s certainly better than JSON, with the ability to use identifiers on the left, not just Strings. So we’d potentially think about JSON5 as a/the down-conversion target. Surprising they didn’t think of adding ISO 8601 date/time strings.

I think it is a really good idea to replace ODIN with something standard such as JSON.

I think it’s not a good idea to replace it with something new that nobody else is using. That just perpetuates the same problem, because we will all have to build new tools that nobody else in the world is using - parsers, serialisers, syntax highlighters, validators, editors (text and GUI), etc. It could be somewhat less of a problem if it can be converted to JSON and back into your own format without loss of data.
Your example cannot be converted back without changes. Also, building the editing and validation tools will still be the same problem, and they will simply never be up to the same standard as those for the standard serialisation languages.

JSON5 could be a great idea, but unfortunately implementation support is still severely lacking, and I do not consider it ready for use in standards yet. I do not know if it will ever reach maturity, or whether another future standard will make it obsolete.

I do not consider JSON to be ‘too weak to satisfy requirements on its own’. Several libraries can already serialise and parse the sections you name to JSON or YAML. I think a benefit of JSON is that it is extremely simple to understand, write and implement. The ‘|’ syntax is an entirely separate issue, as Ian points out, and is still 100% valid JSON.

It is also absolutely OK to define a custom short-hand syntax for some types. You define them in JSON-schema, and in the data they will be strings with a certain set of constraints on them. This is what many people do, and JSON-schema has support for it. You mention the date/time types; these have been standardised as string formats (see string — Understanding JSON Schema 7.0 documentation), as have URI types.
For both terminology codes and strings, it is easy to define a shorthand notation, including validation, in JSON-schema using regular expressions, as in string — Understanding JSON Schema 7.0 documentation.
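As an illustration of what that could look like (the property names and regular expression below are invented for the example, not part of any agreed openEHR schema), a draft-07 JSON-schema fragment might define the shorthand once and reference it wherever it is needed:

    {
        "definitions": {
            "term_code": {
                "type": "string",
                "pattern": "^\\[[A-Za-z0-9._-]+::[^\\]|]+\\]$"
            }
        },
        "type": "object",
        "properties": {
            "diagnosis":     { "$ref": "#/definitions/term_code" },
            "date_of_birth": { "type": "string", "format": "date" }
        }
    }

A value such as "[icd10AM::F60.1]" would then validate against the term_code definition, while remaining a plain string in the JSON data.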

JSON parsers and mapping frameworks are generally really good at handling these custom syntaxes: they either have built-in solutions for these kinds of constructions, or it is really easy to add them to the more low-level libraries.
For YAML, much the same applies as for JSON.

So my opinion is to either stick with ODIN, because we have already built support for it, or, if people are willing to change, to switch to something that really is 100% standard. I think that will help adoption.

3 Likes

In theory I would agree with your general sense, and indeed, we probably should think about a way to support ODIN + JSON (swappable somehow) in archetypes. We would really need support for the various micro-syntaxes though, for reasons indicated below.

However, I am not so sure the industry is all that standard, or agrees so clearly on everything…

I started looking at YAML, which is more expressive, and came across StrictYAML, which tries to fix problems in YAML. These FAQs by the author are a really interesting read (Why not JSON, why not JSON5, why not JSON-schema, why not xyz, …)

Two things make me want to wait a bit to see how things emerge:

  • in JSON, the inability to have identifiers on the left is kind of dumb: it reduces readability and makes it hard to distinguish between Map keys and attribute names. I have to say, JSON5 really does look like what JSON should have been.
  • historically, we have used solutions like ODIN or XML / JSON + built-in micro-syntaxes (for Intervals, literal term codes and so on), because not doing so adds 8-12 lines of XML or JSON every time an interval is encountered, say in an archetype saved as JSON - which means every ‘occurrences’, ‘cardinality’ and every numeric constraint. So the size of an archetype or an OPT has historically been nearly 10x that of the form using ODIN and/or micro-syntaxes (I even built a counter into the ADL Workbench to prove it). We had 10 years of complaints in openEHR about this problem alone (mainly about XML, but JSON has the same problem).

We need a solid way to solve at least the second of these problems - if common structures like intervals add 8+ lines every time, it destroys readability and vastly inflates the files (see the sketch below). If we can solve that using JSON-schema, then I’d say that’s the minimum first step, and we should probably put some effort into it (which I am prepared to do).
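To make the inflation concrete, here is the same heart-rate range written once as a plain JSON object tree (the field names are an approximate sketch of the RM DV_INTERVAL / DV_QUANTITY rendering, not an exact serialisation) and once as the kind of shorthand string shown earlier:

    {
        "heart_rate_range": {
            "_type": "DV_INTERVAL",
            "lower_included": true,
            "upper_included": true,
            "lower": { "_type": "DV_QUANTITY", "magnitude": 100, "units": "/min" },
            "upper": { "_type": "DV_QUANTITY", "magnitude": 120, "units": "/min" }
        },
        "heart_rate_range_shorthand": "|100 .. 120 /min|"
    }

Multiply the first form by every ‘occurrences’, ‘cardinality’ and numeric constraint in an archetype and you get the size blow-up described above.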

Out of interest - do the main Java frameworks not support JSON5?

There will always be people unhappy with almost every framework, and people will keep building derivatives. Only some will become bigger.

Intervals could simply be defined using the string syntax as in ODIN or ADL. That is a matter of defining the correct regular expression in JSON-schema. The frameworks I know have no problem parsing or serialising them with a very small amount of custom code. I would be interested to know how this is handled by frameworks in other languages.
However, are these actually used much in the ODIN parts of archetypes, i.e. outside the definition part of the archetype?

Jackson does not, but it has options to accept nonstandard input, as documented in JsonReadFeature (Jackson-core 2.11.0 API), which covers a subset of the JSON5 features. Gson does not either, but someone wrote a replacement plus a preprocessor called Jankson; it has a very low number of stars and watchers on GitHub. There is also GitHub - brimworks/json5.java: JSON5 library for Java, which has the same issue, plus it is 20x slower than other frameworks. All the other frameworks, and there are many, do not support JSON5.
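For what it is worth, a minimal sketch of what enabling some of those relaxations looks like (using the older JsonParser.Feature flags, which the JsonReadFeature constants replace in 2.10+; the input string is just an example):

    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class LenientJsonDemo {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // accept a subset of JSON5-like relaxations
            mapper.configure(JsonParser.Feature.ALLOW_COMMENTS, true);             // // and /* */ comments
            mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true); // unquoted keys
            mapper.configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true);        // 'single quoted' strings

            String relaxed = "{ // a comment\n  unquoted: 'value' }";
            JsonNode node = mapper.readTree(relaxed);
            System.out.println(node); // printed back as strict JSON: {"unquoted":"value"}
        }
    }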

On the positive side, Babel and Chromium now support JSON5 for their configuration files, which should help it gain some traction. However, all they need is one good tool in one language, whereas the situation is different when defining a standard.

I think the biggest problem with JSON for archetypes might actually be that comments are not allowed in JSON. There are other nonstandard supersets of JSON as well, such as HOCON, HJSON and JSON6.

1 Like

No, they are not - if we just wanted to do something about the archetype description and terminology sections, we could get away with JSON without much problem, though it doesn’t buy us much on its own. However, if people start wanting OPTs in JSON, it would be useful, but then we’re going to need the micro-syntaxes to handle the main definition. I looked at the JSON-schema link - it appears that all JSON-schema provides is unidentified regexes to control certain string fields, but that doesn’t constitute a bidirectional micro-syntax parser that will plug into a serialisation/deserialisation library. It also appears that you would have to repeat the same regex for every field of a certain type, e.g. CodePhrase (i.e. [terminology::code|rubric|]), but I probably missed something there.

(I was originally thinking about the DLM syntax, for which standard JSON isn’t important, because it’s an abstract syntax.)

Aside: one useful thing that JSON5/6 + ISO8601 dates would allow is to standardise the following usage:

{
    //
    // a Map e.g. Map<String, String>
    //
    thing_map: {
        "key1": "val1",
        "key2": "val2",
        "key3": "val3"
    },

    //
    // an object
    //
    object5: {
        attr1: "val1",
        attr2: 38495,
        attr3: 2018-11-13
    }
}

If you define your syntax as JSON5, there is no difference between object properties and map keys with a value. The two notations you mention are fully equivalent; the only difference is notation style. Many higher-level libraries will not be able to tell the difference between the two. This is not a problem at all, since the model you map this onto will already have that distinction, except when you want to convert this to ODIN without such a model.

Whether one wants a shorthand notation for intervals or not will often depend on the amount of code one wishes to write in the application using the OPT. If you simply want to be able to use the interval object model directly, it can be worth just leaving them as quite a lot of data and relying on the GZIP/deflate compression that is usually already present during transport anyway to solve the size issues.
If you need a shorter, more readable form, a shorthand could be useful. Both forms could have their uses.

A ‘bidirectional micro-syntax’ is not available in any format definition without some amount of custom implementation - not in any of these notations. If you implement ODIN, that means custom parsing for everything. For JSON, you just have to implement the tiny mini-parsers and serialisers for the shorthand notations. In the JavaScript parser on json5.org, you would use the reviver, as in GitHub - json5/json5: JSON5 — JSON for humans, to parse this shorthand notation, and the replacer, as in GitHub - json5/json5: JSON5 — JSON for humans, to serialise it.
In Jackson, this would be done by using a converter or an object creation method with a string parameter. This is simply built-in functionality in these kinds of frameworks and is very little work to implement, much less than building an ODIN parser/serialiser.
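For example, something along these lines should work (a sketch only - the IntInterval type and the ‘lower..upper’ shorthand are invented for the example):

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonValue;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class IntInterval {
        private final int lower;
        private final int upper;

        public IntInterval(int lower, int upper) {
            this.lower = lower;
            this.upper = upper;
        }

        // parsing: Jackson calls this factory method when it finds a plain JSON string
        @JsonCreator
        public static IntInterval parse(String text) {
            String[] parts = text.split("\\.\\.");
            return new IntInterval(Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim()));
        }

        // serialising: Jackson writes the object back out as the same shorthand string
        @JsonValue
        public String asShorthand() {
            return lower + ".." + upper;
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            IntInterval range = mapper.readValue("\"100..120\"", IntInterval.class);
            System.out.println(mapper.writeValueAsString(range)); // "100..120"
        }
    }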

I do think the multiline strings, comments and perhaps unquoted property names could be useful for things that are hand-written, and a simple conversion to the equivalent JSON is easy enough to add. For automatically generated notations, it is usually much less of an issue.

1 Like

The official JSON5 parser at json5/lib/parse.js at main · json5/json5 · GitHub even throws away the information about whether an identifierName or a string property key (with single or double quotes) was used. In the parsed form, this difference is no longer present; they are truly equivalent for all purposes but human readability. When serialising from object form to string, the keys are written using identifierNames whenever possible: json5/lib/stringify.js at main · json5/json5 · GitHub

I completely agree with @pieterbos on this thread.
I also believe there will always be people who do not like JSON enough, or who find flaws and weaknesses which they try to solve themselves. But such efforts remain niche solutions (e.g. JSON5), and we have no clue whether they are going to reach wide adoption and maturity.

It is clear to me why you like it, @thomas.beale: it has some benefits over ODIN and looks cleaner. But I still think it is better (if really needed) to choose plain JSON or YAML, or just keep what we already have.

1 Like

@sebastian.iancu @pieterbos Thanks for that input. For all of the very good reasons you have laid out, I agree that JSON5 is probably not the right choice right now, but I think it is worth keeping an eye on it to see if it gains momentum/market maturity, as it certainly solves some of the issues many of us face (much as I love JSON as-is).

2 Likes

For complex structures I have been looking at YAML, e.g. for representing bindings and other such things.

That sounds like a bad error to me. If you serialise the keys of a Map<String, Whatever> as identifiers, that’s just incorrect. That means the next reader of that file cannot figure out (without some schema) even the types of the keys, or infer that it is indeed a Map<>. I assume I must be missing something basic here!

That is not an error, it is a design choice. JSON has the following two structures, apart from the basic types, as defined on json.org:

  • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
  • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

JSON5 is simply a more convenient notation format for hand-edited (configuration) files that can be converted to JSON. It relaxes some of the rather strict JSON rules. One of the things it allows is unquoted names in the name/value pairs, which then must adhere to the ECMAScript syntax for an identifierName. They are conceptually still just the same name/value pairs from JSON, and nothing more. JSON5 could still be very useful for hand-coded or human-readable parts of archetypes, but perhaps less so for generated artefacts that are mainly intended to be machine-readable, such as OPTs.

YAML has a built-in type system, and it defines many included types, including two map types, at Language-Independent Types for YAML™ Version 1.1, which can be included as tags in YAML.
This does not change the notation; it only adds the tag defining the type, so there is still no different notation for the property names of an object versus the keys of a map. Often this tag is omitted, since the YAML is mapping to a non-polymorphic part of a model:
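A made-up fragment to illustrate (the keys and values are invented for the example):

    # untagged: the usual case, where the model defines what this maps to
    ranges:
      mild_low_risk: "<= 99 /min"
      moderate_risk: ">= 121 /min"

    # the same notation with an explicit tag; only the type information is added
    tagged_ranges: !!map
      mild_low_risk: "<= 99 /min"
      moderate_risk: ">= 121 /min"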

One of the things that makes it very hard to define a parser and serialiser for ODIN, and to map ODIN to objects with standard mapping tools, is this distinction between maps and objects. It is also one of the reasons it is impossible to generate correct ODIN from just JSON, or from YAML without tags.
Another nonstandard, problem-causing feature is that indexed lists do not have a different notation from maps.

1 Like

Sure. But we had certain design goals back in the early 2000s, and one of them was to have a serial format for large complex objects that could be written and read without a separate schema, with Objects, Maps and Arrays all correctly handled - plus one that was efficient for all the leaf types. This has all worked perfectly for 20 years with ODIN. At that time, the world was completely XML-centric, so that was the ugly alternative. JSON back then was only just emerging, and was purely a web-dev thing (its original purpose) - it’s really only in the last 5+ years that people talk as if JSON is the first preference for complex data, including through REST APIs.

In any case, we didn’t even think about JSON for over a decade with respect to archetypes - the request from everyone was to be able to read and write them in XML, not JSON.

The fact that ODIN → JSON is a lossy conversion isn’t anything to do with ODIN; it’s to do with JSON not representing any difference between objects and maps, and quite a few other types - i.e. everything is just a map and/or array, i.e. it’s a DOM concept (unless you inject ‘_type’). If JSON had been designed for large complex data, it would have been very easy to make it a lot better - the reason it is so simple is its original design use, which was just small web-form objects.

I realise that ODIN seems painfully custom today (which is why I am looking at how we can phase it out), just pointing out the historical background.

I’m surprised that these more recent variants (JSON5 etc) didn’t take the opportunity to do a very obvious thing and distinguish true objects from map and array structures, which would be dead easy and extremely useful. For example, in a map you know you can add in some more items with new keys, and they will end up in the in-memory Map object. But with a true object, you can’t just add fields - well you can in the JSON, but they will just be ignored on the way in. This is extremely useful for anyone hand-editing a JSON file. Seems like a missed opportunity. Of course it can be worked around in code and with JSON-schema.

Anyway, these are theoretical considerations. Practically speaking:

  • for the Archetype terminology, straight JSON will work well enough, but I don’t know how much it really matters, unless the entire archetype is supposed to be read and written in JSON. I implemented this years ago in the ADL Workbench, and the definition part is close to unreadable (it’s better than XML, but still really inefficient), so it’s really only interesting as a machine persistence method (e.g. for tools), which is where it does actually make sense, with JSON-schema behind it.
  • for the Archetype description, long strings and formatting need to be handled properly. YAML looks better for that, but of course it doesn’t make sense to make mixed JSON and YAML archetypes. At the moment my JSON outputter writes out ‘\n’ for newlines, making the text hard to read (but correctly represented in a computational sense) - see the example below, and the YAML sketch after it. I’m not sure how to make standard JSON do this better. Again, this doesn’t matter for machine use, only for human use.

Summary: for machine use, straight JSON is OK, but it’s not much good for human readability or size efficiency. For humanly readable archetypes, we could consider mixed JSON and ADL (i.e. better than mixed ODIN and ADL), but we need to solve the text formatting problem. Or else we go YAML + ADL.

Not thinking of doing anything right now, just thinking around the issues and looking for a strategy.

Example of large formatted text in std JSON:

		"details": {
			"en": {
				"language": "ISO_639-1::en",
				"purpose": "To record measurements of hearing acuity using a calibrated hearing test device, and their interpretation by a clinician.",
				"use": "Use to record measurements and related findings for a single identified test of hearing acuity, for each ear tested separately or both ears simultaneously, via air conduction and/or bone conduction, with masking when required.\n\nUse to record the interpretation of all measurements of hearing acuity for each ear or both ears if tested simultaneously, and an overall interpretation (or audiological diagnosis). \n\nThis archetype has been designed to capture hearing threshold determination for air conduction and/or bone conduction (with or without masking) for the following tests: \n- Pure Tone Audiometry;\n- Play Audiometry;\n- Auditory Brainstem Response; and\n- Visual Reinforcement Orientation Audiometry.\n\nAll of the data elements are recorded using a single method or protocol. If, during the test, any of the protocol parameters need to be modified, then the subsequent part of the test will need to be recorded within a separate instance of the test data, using the updated protocol parameters.",
				"keywords": ["hearing", "test", "audiogram", "audiometry", "acuity", "threshold", "decibels", "ABR", "VROA", "VRA", "play"],
				"misuse": "Not to be used for hearing screening assessment - use the OBSERVATION.hearing_screening archetype.\n\nNot to be used to record other auditory assessments such as:\n- Behavioural Observation Audiometry (BOA);\n- Most Comfortable Listening Level (MCL) and Uncomfortable Listening Level (UCL); and\n- Auditory Brainstem Response (ABR) for any purpose other than hearing threshold determination.\nThese assessments need to be recorded in specific archetypes for the purpose."
			}
		},
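For comparison, a rough sketch (illustrative only, not a proposed archetype serialisation) of how the same kind of long text could read using YAML block scalars, where the line breaks survive without ‘\n’ escapes:

    details:
      en:
        language: "ISO_639-1::en"
        purpose: >
          To record measurements of hearing acuity using a calibrated hearing
          test device, and their interpretation by a clinician.
        misuse: |
          Not to be used for hearing screening assessment - use the
          OBSERVATION.hearing_screening archetype.

          Not to be used to record other auditory assessments such as:
          - Behavioural Observation Audiometry (BOA);
          - Most Comfortable Listening Level (MCL) and Uncomfortable Listening Level (UCL);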
2 Likes