Simplified Data Template (SDT) - data types

thomas.beale · 1 April 2020 21:10

Following our decision to provide a more basic specification of the Simplified Data Template (SDT), I thought I would start at the ‘easy’ end, i.e. data types and various low level RM types. Here is a wiki page that I am starting to populate on this.

One question I thought I would pose is this. For primitive types, the representation is generally obvious and non-controversial (i.e. it’s the JSON built-in types plus date/times as Strings, and a few extras like URI strings etc).

However for DV types and things like PARTICIPATION etc, we potentially want a parsable structure, in order to avoid the more voluminous standard JSON.

So for example, in the EtherCIS approach that you see here, we have:

DvCodedText: "terminology::code|value|"
DvQuantity: "78.500,kg"
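To show what parsing these two strings involves, here is a minimal Python sketch. The regular expression and the function names `parse_coded_text` and `parse_quantity` are assumptions made for this example, not something taken from EtherCIS or any openEHR spec. In particular, it assumes that neither the code nor the value contains a '|', and that the precision is simply the number of digits written after the decimal point.

```python
import re
from decimal import Decimal, InvalidOperation

# Assumed shape of the compact coded text: terminology::code|value|
CODED_TEXT = re.compile(
    r"(?P<terminology>[^:|]+)::(?P<code>[^|]+)\|(?P<value>[^|]*)\|"
)


def parse_coded_text(text: str) -> dict:
    """Split 'terminology::code|value|' into its three parts."""
    match = CODED_TEXT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a compact coded text: {text!r}")
    return match.groupdict()


def parse_quantity(text: str) -> dict:
    """Split 'magnitude,units' and derive precision from the digits written."""
    magnitude_text, sep, units = text.partition(",")
    if not sep:
        raise ValueError(f"not a compact quantity: {text!r}")
    try:
        magnitude = Decimal(magnitude_text)
    except InvalidOperation:
        raise ValueError(f"bad magnitude in {text!r}") from None
    if not magnitude.is_finite():
        raise ValueError(f"bad magnitude in {text!r}")
    # Assumption: precision is the count of digits after the decimal point.
    precision = max(0, -magnitude.as_tuple().exponent)
    return {"magnitude": magnitude, "precision": precision, "units": units}
```

Using `Decimal` rather than `float` keeps the trailing zeros of "78.500", which is the only place the precision is recorded in this string form.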

The Marand Web Template (MWT) uses a different approach, adding suffixes to normal paths:

DvCodedText

{
"|code": "238",
"|value": "other care",
"|terminology": "openehr"
}

and
DvQuantity is (I assume) the same kind of thing, i.e.

{
"|magnitude": "78.5",
"|precision": 3,
"|units": "kg"
}

Now, if the aim is to compress some of these types more like the EtherCIS approach, I would propose the following:

DvCodedText: "{[terminology::code|value|]}"
DvQuantity: "{78.500,kg}"

In both cases, you have a String field - unavoidable, from a JSON point of view. What comes next is {}, which is intended to say: this is an object, not a primitive value. So, this is how a DvCodedText would be distinguished from a Terminology_code (i.e. CodePhrase), which we consider a primitive - it would just be "[terminology::code|value|]"

(Note: I added the [] that ADL and ODIN use, because this is an openEHR standard).
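A rough Python sketch of how a reader might tell these wrapped forms apart follows. The function name `classify` and the shape of its result are assumptions for illustration only; it just peels off an outer {} (object) and then an outer [] (terminology code) if present.

```python
def classify(text: str) -> dict:
    """Peel off the proposed {} and [] wrappers from a compact string.

    Hypothetical helper: 'object' is True when the string is wrapped in {},
    'term' is True when what remains is wrapped in [].
    """
    is_object = len(text) >= 2 and text.startswith("{") and text.endswith("}")
    body = text[1:-1] if is_object else text
    is_term = len(body) >= 2 and body.startswith("[") and body.endswith("]")
    if is_term:
        body = body[1:-1]
    return {"object": is_object, "term": is_term, "body": body}
```

Note that a plain text value that happens to start with '{' and end with '}' would be classified as an object by this sketch, so the wrappers alone do not carry real type information.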

This could potentially be used to write short forms of Participation and so on.

I don’t know if this is a good idea, or even if people want this compressed style of data type serialisation, so just putting it out there.

sebastian.iancu · 2 April 2020 08:17

On the wiki page, what is the purpose of the last column (now called JSON) - is that the canonical version?
For the sake of comparing apples only with other kinds of apples, can we just use JSON for all columns (where applicable) on that wiki page, and keep the last column for the simplified (JSON) format (while also keeping the canonical format somewhere)?

thomas.beale · 2 April 2020 08:38

The last col is meant to be standard JSON; the other cols are various other micro-syntaxes that would form fields within a larger JSON of either ‘flat’ or ‘full’ form. E.g. this EtherCIS string "openehr::227|emergency care|" would be inside something like:

{ 
    ...
    "/context/setting":"openehr::227|emergency care|",
    ...
}

matijap · 10 April 2020 11:46

I prefer the original EtherCIS format. It’s obvious from the context (you need the template to have that context, of course) that the string value is in fact “an object”, or rather a specific syntax for the kind of object that the attribute is. Right?

I’d just like to kindly ask to have “our” variant as another valid option as well so that we can be backwards compatible for the many customers that use our format. It basically boils down to the fact that you can address attributes of a (for instance) DvCodedText, or the whole DvCodedText. In the flat format:

{
    "a/b/c/d|code": "abc",
    "a/b/c/d|value": "def",
    "a/b/c/d|terminology": "ghi"
}

is equivalent to

{
    "a/b/c/d": "ghi::abc|def|"
}

And in structured mode, you would have d nested inside c nested inside b nested inside a, but then d could be a string attribute with value of "ghi::abc|def|" or alternatively an object {"|code":"abc", "|value":"def", "|terminology":"ghi"} .
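The flat-format equivalence above can be sketched in Python as follows. The function names `to_compact` and `collapse_flat` are hypothetical, and the sketch assumes that a '|' in a key only ever separates the path from a leaf attribute name.

```python
def to_compact(attrs: dict) -> str:
    """Build 'terminology::code|value|' from the three leaf attributes."""
    return f"{attrs['terminology']}::{attrs['code']}|{attrs['value']}|"


def collapse_flat(flat: dict) -> dict:
    """Merge 'path|code', 'path|value', 'path|terminology' into one entry."""
    grouped = {}
    result = {}
    for key, value in flat.items():
        path, sep, attr = key.partition("|")
        if sep:
            grouped.setdefault(path, {})[attr] = value
        else:
            result[key] = value  # already a whole-object entry
    for path, attrs in grouped.items():
        if set(attrs) == {"code", "value", "terminology"}:
            result[path] = to_compact(attrs)
        else:
            # Not a complete coded text: keep the attribute entries as they were.
            for attr, value in attrs.items():
                result[f"{path}|{attr}"] = value
    return result
```

For example, `collapse_flat` turns the three `a/b/c/d|...` entries into the single `"a/b/c/d": "ghi::abc|def|"` entry, and leaves anything it does not recognise untouched.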

thomas.beale · 10 April 2020 15:36

So I think you are saying: if you hit this, you have to always be looking up paths, in this case, a/b/c/d to check what kind of info model object you have - here, a DV_CODED_TEXT. Probably unavoidable I guess; my idea of adding {} may not be that helpful without actual type information.

I’m fine with keeping the '|' alternative alive; it should be easy to spot, and is regular in form, so I’ll include that in the spec.

Field names like "|code" appear to be legal in JSON as well. Interesting…

thomas.beale · 4 May 2020 19:11

I’ve put up an initial attempt to document the various JSON possibilities just for a few DV types now. Don’t worry about which document this is currently in, we can move it to somewhere else if we need to.

Right now I’d like to know if the possibilities shown are real.

thomas.beale · 4 August 2020 15:56

I’ve made some updates to the wiki page, and made the columns clearer, as well as adding some notes and questions.

There are three columns with red titles - those appear to me to be the legal possibilities, i.e. a proposed compact format, the Marand web template format, and standard JSON. Is that what we need to allow? Can they be mixed? And other questions - see the page.

thomas.beale · 10 August 2020 20:33

More improvements to the wiki page - it’s getting closer to something useful.

There are red marks/comments that people here could help with.

I will at some point soon convert this to JSON files and build the initial draft spec.

I will also most likely define the format patterns, where they are not just basic JSON, as ANTLR4 patterns, unless people want something else (EBNF or whatever).

ian.mcnicoll · 11 August 2020 11:19

One thing I have played with (GraphQL experiments) is being explicit about the use of the ‘Ethercis/Cambio’ tokenised values.

{
    "a/b/c/d|token": "ghi::abc|def|"
}

Could token() or similar be added to the spec for each datatype? That would allow its use beyond just the JSON formats, for selected datatypes.

That may not be necessary. I know the use of the ‘|’ to signify datatype leaf attributes is not universally popular but I find it helpful when reading the JSON formats, and I would definitely support optional use for compatibility purposes.

thomas.beale · 11 August 2020 13:18

Not sure I get your meaning. The thing that comes after the ‘|’ in Marand Web Template format is just the name of an RM attribute, except in some cases, Better shortened it (e.g. ‘identifier’ -> ‘id’). E.g.

"_participation": [
    {
       "|function": "requester",
       "|mode": "face-to-face communication",
       "|name": "Dr. Marcus Johnson",
       "|id": "199"
    }
]

or

{
    "|code": "238",
    "|value": "other care",
    "|terminology": "openehr"
}

If we take those ‘|’ characters out, we have normal JSON. It’s only if we want paths rather than just attribute names that the bar tells us something, as far as I can see.
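As a small illustration of that point, here is a Python sketch that removes the leading bar from every key. The name `strip_bars` is made up for this example. The output is syntactically ordinary JSON, but it is not claimed to be canonical openEHR JSON: shortened names such as ‘id’ stay as they are and no type markers are added.

```python
def strip_bars(node):
    """Recursively drop a leading '|' from every object key."""
    if isinstance(node, dict):
        return {
            (key[1:] if key.startswith("|") else key): strip_bars(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_bars(item) for item in node]
    return node  # strings, numbers, booleans, None are left alone
```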

Personally I actually like the bar, it’s a neat idea, and would have been great in JSON. But in the real world, we have to treat it as a custom thing.

Anyway, I still don’t know what you mean by ‘token()’…

ian.mcnicoll · 11 August 2020 14:11

Re the pipe symbol - it’s not just an attribute marker; I think it implies a leaf node, not just any RM attribute.

‘tokens’ - quite probably a bad term. I mean the stringified version of a datatype object.

All I am really asking is whether there is value in making the ‘stringified_value’ (also horrible but at least the meaning is clear!!) an RM function, and therefore clearly identifiable whether in JSON or anything else.

So instead of just

"/context/setting": "openehr::227|emergency care|",

we make it explicit as

"/context/setting|stringified": "openehr::227|emergency care|",

Maybe it is not necessary. I guess I am just wary about getting stringified terms disambiguated from ordinary text values.

Where a DV_TEXT can take either a coded_term or plain text, it would be nice to have a neat way of making the datatype/usage clear.

Sorry - went off on a tangent - too hot

thomas.beale · 11 August 2020 15:29

It seems to mark an attribute of a logical ‘leaf type’, but in this scheme, even PARTY_IDENTIFIED is such a type…

I don’t think we need to mark the compact format (‘compact’) on every node - the idea of the simplified flat format AFAIK is that the whole thing is in one of the allowed formats, i.e.:

compact JSON format
Marand JSON format
standard JSON
maybe sparse JSON as well, i.e. with paths on the LHS

So somewhere at the top of such a JSON text there is a format marker that indicates which of these the rest of the text is in. I.e. something like "_format": "compact | marand | standard | sparse". (We could use ‘ehrscape’ for ‘marand’ if people prefer).
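A minimal Python sketch of reading such a marker might look like this. Only the "_format" key and its four values come from the suggestion above; the function name `read_format` and the decision to reject a missing or unknown marker (rather than fall back to a default) are assumptions.

```python
import json

# The four flavours named above; 'ehrscape' could replace 'marand'.
ALLOWED_FORMATS = {"compact", "marand", "standard", "sparse"}


def read_format(text: str) -> str:
    """Return the declared flavour of a simplified JSON text."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    fmt = doc.get("_format")
    if fmt not in ALLOWED_FORMATS:
        # Assumption: a missing or unknown marker is an error, not a default.
        raise ValueError(f"unknown or missing _format: {fmt!r}")
    return fmt
```

A processor could then pick one parser per flavour up front, instead of guessing the form of each node as it goes.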

I.e. you want this to be clear in a ‘blank compact template’ so a dev can see what choices s/he has? Makes sense.

matijap · 12 August 2020 04:17

ian.mcnicoll · 12 August 2020 08:07

What about a scenario where we have a DV_TEXT, potentially sub-classable to DV_CODED_TEXT. How do we disambiguate

"x": "some text"

"x": "c::b|a|"

just pattern matching or ??

thomas.beale · 12 August 2020 09:17

That case would normally just be pattern matching. But for larger (non-leaf) objects we would need to use the JSON _type marker. Which means the JSON processor needs to know if it’s on a logical leaf attribute or not.

So… good point.
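For the leaf case, the pattern matching could be sketched in Python as below. The regular expression and the name `guess_text_type` are assumptions for illustration; the pattern expects `terminology::code|value|` with no '|' inside the code or the value.

```python
import re

# Assumed pattern for a compact coded text: terminology::code|value|
CODED_TEXT = re.compile(r"[^:|]+::[^|]+\|[^|]*\|")


def guess_text_type(value: str) -> str:
    """Guess DV_CODED_TEXT vs DV_TEXT purely from the shape of the string."""
    if CODED_TEXT.fullmatch(value):
        return "DV_CODED_TEXT"
    return "DV_TEXT"
```

The weak spot is visible straight away: free text that happens to have this shape would be read as a coded term, so the guess is only as good as the assumption that ordinary text never looks like that.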

thomas.beale · 12 August 2020 09:19

We can certainly spot that difference easily enough. But we presumably should allow standard JSON as well, which is a 3rd variant, and if you want to allow sparse JSON, i.e. with paths on the left-hand side, that’s 4 variants. Is there to be a rule that any given simplified JSON is all one flavour only, i.e. no mixing and matching?

ian.mcnicoll · 12 August 2020 10:30

The Better formats do have a RAW attribute that allows you to drop into (escaped) canonical JSON at datatype level. That is definitely helpful for obscure use-cases that are not (yet) supported by the flattened forms.

At least in terms of datatypes, I think the ability to mix and match between the stringified, structured and standard formats will be necessary, but yes, only a single flavour of simplified (with perhaps a little wriggle room for legacy formats).

thomas.beale · 12 August 2020 10:39

I think it would be reasonable to assume a text is all Marand or all standard compact format +/- std JSON? I.e. it can’t be mixed Marand & standard compact (the thing we are specifying right now).

matijap · 12 August 2020 10:40

The question is whether we need/want the shorthand format for DV_TEXT. I’d just assume DV_CODED_TEXT when not specifying a leaf field. For DV_TEXT you’d use “x|value”, which just has an overhead of “|value” and not of two additional lines.

sebastian.iancu · 12 August 2020 10:45

That’s why I don’t much fancy these formats (which require regex parsing) over canonical JSON (natively parsed by most languages). You may have the impression that the data is more compact and therefore easier to handle, but in fact it gives developers lots of chances to introduce bugs…