Could YAML replace dADL as human readable AOM serialization format?

system · 22 November 2011 11:51

Hi!

A little suggestion/thought (that might be of value also for CIMI-folks and others looking at “archetyping” using ADL and AOM and wondering if a specific language is needed).

Limitations:
For efficient handling of RM (Reference Model) instances (patient data) flying back and forth between systems you’d probably want some binary format (protobuf, thrift datatypes, serialized Java objects or whatever); this is NOT what this suggestion is about. For development and debugging of RM-instance exchange you may also want some fairly human-readable serialization that is supported by many platforms (like JSON, YAML, XML or whatever); this is NOT what the suggestion is about either. Also note that the current suggestion only aims at replacing dADL, not cADL, and that the AOM and the XML serialisations of the AOM are not affected by it.

Background:
cADL (Constraint ADL) is a compact DSL that is aimed at defining constraints on an object model, while dADL (Data ADL) on the other hand is mainly a general object-graph serialization format.
If I understand section 1.7.5 in the ADL 1.5 spec correctly, ADL 2.0 will allow the option to define all parts of an archetype (including what is now done in cADL) as a dADL serialization of the AOM (Archetype Object Model). Is that correct Tom?

Suggestion:
Investigate if YAML can replace or complement dADL as object-graph serialization format for archetypes. (Perhaps there is interest from people using an openEHR AOM implementation in a language that already has YAML serializers to make a quick experiment?)

Motivation:

YAML parsers converting YAML documents to native object graphs already exist for a number of languages (C/C++, Ruby, Python, Java, Perl, C#/.NET, PHP, OCaml, Javascript, Actionscript, Haskell) so there would be less work creating and maintaining archetype parsers that turn archetype files into in-memory object graphs. (If you write an archetype authoring tool and need to validate archetypes, not just instantiate already validated archetypes, then the “Validity Rules”, such as the ones in blue under 4.3.1.1 in the AOM spec, will of course still need to be implemented in software.)
Having an archetype specific object-serialization language like dADL might make “archetyping” look more mysterious and suspect and might hide the fact that the semantics expressed in the AOM is the interesting thing that can be serialised in many different ways.
And (admittedly subjective) YAML lists and objects look slightly better and more readable than dADL. A notable exception is probably intervals/ranges that have a compact representation in dADL (see section 4.5.2 of the ADL 1.5 spec) but not natively in YAML.

Observations:
YAML is extensible, so data types for intervals etc. can be added as in http://yaml.org/YAML_for_ruby.html#ranges; also see the discussion at http://stackoverflow.com/questions/3337020/how-to-specify-ranges-in-yaml. A similar approach could be taken to dADL’s “Plug-in Syntaxes” (see section 4.6) using YAML. A number of language-independent extra YAML datatypes (timestamp for example) are listed at http://yaml.org/type/index.html and you can define your own if you need more.

It seems like specification 1.1 (http://yaml.org/spec/1.1/) is the most widely implemented, so to be fair any dADL comparisons should probably be made against that version.

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

P.s. Tom Beale and I sort of started a brief off-list discussion about YAML, here is now an attempt to get input from more people.

thomas.beale · 22 November 2011 12:24

Hi!

A little suggestion/thought (that might be of value also for CIMI-folks and others looking at “archetyping” using ADL and AOM and wondering if a specific language is needed).

Limitations:
For efficient handling of RM (Reference Model) instances (patient data) flying back and forth between systems you’d probably want some binary format (protobuf, thrift datatypes, serialized Java objects or whatever), this is NOT what this suggestion is about. For development and debugging RM-instance exchange you may also want some fairly human-readable serialization that is supported by many platforms (Like JSON, YAML, XML or whatever) this is NOT what the suggestion is about either. Also note that the current suggestion only aims at looking for replacement of dADL not cADL. Also note that the AOM and XML serialisations of the AOM are not affected by this suggestion.

Background:
cADL (Constraint ADL) is a compact DSL that is aimed at defining constraints on an object model, while dADL (Data ADL) on the other hand is mainly a general object-graph serialization format.
If I understand section 1.7.5 in the ADL 1.5 spec correctly, ADL 2.0 will allow the option to define all parts of an archetype (including what is now done in cADL) as a dADL serialization of the AOM (Archetype Object Model). Is that correct Tom?

Actually, ADL 2.0 as reported in this document is now obsolete. The ADL 1.5 compiler already does this, and will use it as a fast save/retrieve format. See below for an example, or download the current release of the ADL Workbench to play with it. I am intending to document the ‘P_’ classes on which this serialisation is based, and on which I think any JSON / YAML / XML serialisation should be based, once we can agree on it. It is in these classes that things like occurrences are changed from MULTIPLICITY_INTERVAL to String.

Suggestion:
Investigate if YAML can replace or complement dADL as object-graph serialization format for archetypes. (Perhaps there is interest from people using an openEHR AOM implementation in a language that already has YAML serializers to make a quick experiment?)

My motivation for making pure dADL archetypes is to have a fast, efficient serialisation of the object graph of an archetype, so that when an archetype compiles successfully, it can be saved in this form and later retrieved, bypassing the ADL compiler. The value in this is that formats like dADL / JSON / YAML are low-level graph serialisations, and that really fast parsers can be written for them for use on persisted files known to be correct (i.e. generated by a serialiser in a previous save). My own dADL parser is not such a fast parser, but that’s only a matter of time.

So the same arguments would apply to JSON or YAML in my view. At least for this purpose (fast save & retrieve of previously compiled archetypes), any such format could be used.

Motivation:

YAML parsers converting YAML documents to native object graphs already exist for a number of languages (C/C++, Ruby, Python, Java, Perl, C#/.NET, PHP, OCaml, Javascript, Actionscript, Haskell) so there would be less work creating and maintaining archetype parsers that turn archetype files into in-memory object graphs. (If you write an archetype authoring tool and need to validate archetypes, not just instantiate already validated archetypes, then the “Validity Rules”, such as the ones in blue under 4.3.1.1 in the AOM spec, will of course still need to be implemented in software.)

Having an archetype specific object-serialization language like dADL might make “archetyping” look more mysterious and suspect and might hide the fact that the semantics expressed in the AOM is the interesting thing that can be serialised in many different ways.

And (admittedly subjective) YAML lists and objects look slightly better and more readable than dADL. A notable exception is probably intervals/ranges that have a compact representation in dADL (see section 4.5.2 of the ADL 1.5 spec) but not natively in YAML.

Observations:
YAML is extensible, so data types for intervals etc. can be added as in http://yaml.org/YAML_for_ruby.html#ranges; also see the discussion at http://stackoverflow.com/questions/3337020/how-to-specify-ranges-in-yaml. A similar approach could be taken to dADL’s “Plug-in Syntaxes” (see section 4.6) using YAML. A number of language-independent extra YAML datatypes (timestamp for example) are listed at http://yaml.org/type/index.html and you can define your own if you need more.

One area where dADL beats JSON and YAML (I think) is its better support for Xpath-like paths. Plus it’s much more compact than JSON. Personally I find YAML hard to read because there are so many syntax elements (triple ‘-’, triple ‘.’ etc.) but that might just be me.

thomas


(P_ARCHETYPE) <
original_language = <[ISO_639-1::pt-br]>
translations = <
["en"] = <
language = <[ISO_639-1::en]>
author = <
["name"] = <"Sergio Miranda Freire">
["organisation"] = <"Universidade do Estado do Rio de Janeiro - UERJ">
["email"] = <"sergio@lampada.uerj.br">
>
>
>
description = <
original_author = <
["name"] = <"Sergio Miranda Freire & Rigoleta Dutra Mediano Dias">
["organisation"] = <"Universidade do Estado do Rio de Janeiro - UERJ">
["email"] = <"sergio@lampada.uerj.br">
["date"] = <"22/05/2009">
>
details = <
["en"] = <
language = <[ISO_639-1::en]>
purpose = <"Representation of a person's demographic data.">
use = <"Used in demographic service to collect a person's data.">
keywords = <"demographic service", "person's data">
misuse = <"">
copyright = <"© 2011 openEHR Foundation">
>
["pt-br"] = <
language = <[ISO_639-1::pt-br]>
purpose = <"Representação dos dados demográficos de uma pessoa.">
use = <"Usado em serviço demográficos para coletar os dados de uma pessoa.">
keywords = <"serviço demográfico", "dados de uma pessoa">
misuse = <"">
copyright = <"© 2011 openEHR Foundation">
>
>
lifecycle_state = <"Authordraft">
other_contributors = <"Sebastian Garde, Ocean Informatics, Germany (Editor)", "Omer Hotomaroglu, Turkey (Editor)", "Heather Leslie, Ocean Informatics, Australia (Editor)">
other_details = <
["references"] = <"ISO/TS 22220:2008(E) - Identification of Subject of Care - Technical Specification - International Organization for Standardization.">
>
>
artefact_object_type = <"DIFFERENTIAL_ARCHETYPE">
archetype_id = <"openEHR-DEMOGRAPHIC-PERSON.person.v1">
adl_version = <"1.5">
artefact_type = <"archetype">
definition = <
rm_type_name = <"PERSON">
node_id = <"at0000">
attributes = <
["1"] = <
rm_attribute_name = <"details">
children = <
["1"] = (P_ARCHETYPE_SLOT) <
rm_type_name = <"ITEM_TREE">
node_id = <"at0001">
occurrences = <"1">
includes = <
["1"] = <
expression = (EXPR_BINARY_OPERATOR) <
type = <"Boolean">
operator = <
value = <2007>
>
left_operand = (EXPR_LEAF) <
type = <"String">
reference_type = <"attibute">
item = <"archetype_id/value">
>
right_operand = (EXPR_LEAF) <
type = <"C_STRING">
reference_type = <"constraint">
item = (C_STRING) <
regexp = <"(person_details)[a-zA-Z0-9_-]*\\.v1">
is_open = <False>
regexp_default_delimiter = <True>
>
>
precedence_overridden = <False>
>
>
>
is_closed = <False>
>
>
is_multiple = <False>
>
["2"] = <
rm_attribute_name = <"identities">
children = <
["1"] = (P_ARCHETYPE_SLOT) <
rm_type_name = <"PARTY_IDENTITY">
node_id = <"at0002">
occurrences = <"1">
includes = <
["1"] = <
expression = (EXPR_BINARY_OPERATOR) <
type = <"Boolean">
operator = <
value = <2007>
>
left_operand = (EXPR_LEAF) <
type = <"String">
reference_type = <"attibute">
item = <"archetype_id/value">
>
right_operand = (EXPR_LEAF) <
type = <"C_STRING">
reference_type = <"constraint">
item = (C_STRING) <
regexp = <"(person_name)[a-zA-Z0-9_-]*\\.v1">
is_open = <False>
regexp_default_delimiter = <True>
>
>
precedence_overridden = <False>
>
>
>
is_closed = <False>
>
>
is_multiple = <True>
>
["3"] = <
rm_attribute_name = <"contacts">
children = <
["1"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"CONTACT">
node_id = <"at0003">
occurrences = <"1">
attributes = <
["1"] = <
rm_attribute_name = <"addresses">
children = <
["1"] = (P_ARCHETYPE_SLOT) <
rm_type_name = <"ADDRESS">
node_id = <"at0030">
occurrences = <"1">
includes = <
["1"] = <
expression = (EXPR_BINARY_OPERATOR) <
type = <"Boolean">
operator = <
value = <2007>
>
left_operand = (EXPR_LEAF) <
type = <"String">
reference_type = <"attibute">
item = <"archetype_id/value">
>
right_operand = (EXPR_LEAF) <
type = <"C_STRING">
reference_type = <"constraint">
item = (C_STRING) <
regexp = <"(address)([a-zA-Z0-9_-]+)*\\.v1">
is_open = <False>
regexp_default_delimiter = <True>
>
>
precedence_overridden = <False>
>
>
["2"] = <
expression = (EXPR_BINARY_OPERATOR) <
type = <"Boolean">
operator = <
value = <2007>
>
left_operand = (EXPR_LEAF) <
type = <"String">
reference_type = <"attibute">
item = <"archetype_id/value">
>
right_operand = (EXPR_LEAF) <
type = <"C_STRING">
reference_type = <"constraint">
item = (C_STRING) <
regexp = <"(electronic_communication)[a-zA-Z0-9_-]*\\.v1">
is_open = <False>
regexp_default_delimiter = <True>
>
>
precedence_overridden = <False>
>
>
>
is_closed = <False>
>
>
is_multiple = <True>
>
>
>
>
is_multiple = <True>
>
["4"] = <
rm_attribute_name = <"relationships">
children = <
["1"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"PARTY_RELATIONSHIP">
node_id = <"at0004">
attributes = <
["1"] = <
rm_attribute_name = <"details">
children = <
["1"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"ITEM_TREE">
attributes = <
["1"] = <
rm_attribute_name = <"items">
children = <
["1"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"ELEMENT">
node_id = <"at0040">
attributes = <
["1"] = <
rm_attribute_name = <"value">
children = <
["1"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"DV_TEXT">
>
["2"] = (P_C_COMPLEX_OBJECT) <
rm_type_name = <"DV_CODED_TEXT">
attributes = <
["1"] = <
rm_attribute_name = <"defining_code">
children = <
["1"] = (P_CONSTRAINT_REF) <
rm_type_name = <"CODE_PHRASE">
target = <"ac0000">
>
>
is_multiple = <False>
>
>
>
>
is_multiple = <False>
>
>
>
>
is_multiple = <True>
>
>
>
>
is_multiple = <False>
>
>
>
>
is_multiple = <True>
>
>
>
ontology = <
term_definitions = <
["pt-br"] = <
["at0000"] = <
text = <"Dados da pessoa">
description = <"Dados da pessoa.">
>
["at0001"] = <
text = <"Detalhes">
description = <"Detalhes demográficos da pessoa.">
>
["at0002"] = <
text = <"Nome">
description = <"Conjunto de dados que especificam o nome da pessoa.">
>
["at0003"] = <
text = <"Contatos">
description = <"Contatos da pessoa.">
>
["at0004"] = <
text = <"Relacionamentos">
description = <"Relacionamentos de uma pessoa, especialmente laços familiares.">
>
["at0030"] = <
text = <"Endereço">
description = <"Endereços vinculados a um único contato, ou seja, com o mesmo período de validade.">
>
["at0040"] = <
text = <"Grau de parentesco">
description = <"Define o grau de parentesco entre as pessoas envolvidas.">
>
>
["en"] = <
["at0000"] = <
text = <"Person">
description = <"Personal demographic data.">
>
["at0001"] = <
text = <"Demographic details">
description = <"A person's demographic details.">
>
["at0002"] = <
text = <"Name">
description = <"A person's name.">
>
["at0003"] = <
text = <"Contacts">
description = <"A person's contacts.">
>
["at0004"] = <
text = <"Relationships">
description = <"A person's relationships, especially family ties.">
>
["at0030"] = <
text = <"Addresses">
description = <"Addresses linked to a single contact, i.e. with the same time validity.">
>
["at0040"] = <
text = <"Relationship type">
description = <"Defines the type of relationship between related persons.">
>
>
>
constraint_definitions = <
["pt-br"] = <
["ac0000"] = <
text = <"Códigos para tipo de parentesco">
description = <"códigos válidos para tipo de parentesco.">
>
>
["en"] = <
["ac0000"] = <
text = <"Codes for type of relationship">
description = <"Valid codes for type of relationship.">
>
>
>
>
is_controlled = <False>
is_generated = <True>
is_valid = <True>

system · 1 December 2011 21:37

Hi!

Let the battle begin, see:
http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html

actually, ADL 2.0 as reported in this document is now obsolete. The ADL 1.5 compiler already does this, and will use it as a fast save/retrieve format.

Will cADL become optional or go away somehow?

One area where dADL beats JSON and YAML (I think) is its better support for Xpath-like paths.

Why would that be different? I guess most path queries will run on instantiated object trees rather than on documents and then there is no difference - and if paths were run directly on documents, then please explain why dADL would support them better.

Plus its much more compact than JSON.

Much? Less noisy I would agree to though.

Personally I find YAML hard to read because there are so many syntax elements (triple ‘-’, triple ‘.’ etc) but that might just be me.

Have a look at…
http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html
…again.

The triple ‘-’ and triple ‘.’ are (mostly optional) start and end markers of documents that make life easier when concatenating streams/documents, see the YAML specification.

Am I the only one that thinks YAML is more readable than dADL?

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

thomas.beale · 2 December 2011 00:30

Hi!

Let the battle begin, see:
http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html

Nice page - it’s quite fun to see them all pasted up there.

My question is: what’s the/your purpose for human readability? Is it:

education e.g. in some kind of class-room / training situation
debugging
self-learning
something else

Just a question…

actually, ADL 2.0 as reported in this document is now obsolete. The ADL 1.5 compiler already does this, and will use it as a fast save/retrieve format.

Will cADL become optional or go away somehow?

It’s not my intention. To be honest, I am not sure whether a streaming cADL parser that knows it is parsing guaranteed-correct cADL might not be faster than the equivalent dADL parser for the archetype definition. But either way, cADL is a notation that really gives you a direct feel for the implied semantics, so for understanding what you are looking at it has to be better. dADL / XML / JSON etc. don’t give you a direct picture; they give you a serialised object picture from which your brain has to infer an object structure (but admittedly this is unambiguous, so your brain will probably get it right). In my view ‘proper syntax’ is nicer for direct comprehension and therefore learning.

One area where dADL beats JSON and YAML (I think) is its better support for Xpath-like paths.

Why would that be different? I guess most path queries will run on instantiated object trees rather than on documents and then there is no difference - and if paths were run directly on documents, then please explain why dADL would support them better.

Looking at the JSON again, I might have to eat my words… I guess if the attribute names / hash tags are turned into Xpath predicates the implied set of paths has to be the same.
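A minimal sketch of that point, in Python: once a document has been parsed into a plain object tree, the set of Xpath-like paths depends only on the tree and not on whether it came from dADL, JSON or YAML. The `iter_paths` helper, the path notation and the small `tree` are hypothetical illustrations, not part of any openEHR tool.

```python
from typing import Any, Iterator


def iter_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield an Xpath-like path for every leaf value in a parsed object tree."""
    if isinstance(node, dict):
        # Attribute names / hash keys become path segments.
        for key, child in node.items():
            yield from iter_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        # List positions become 1-based predicates.
        for index, child in enumerate(node, start=1):
            yield from iter_paths(child, f"{prefix}[{index}]")
    else:
        yield prefix


# Hypothetical fragment, loosely modelled on the serialised archetype above.
tree = {
    "definition": {
        "rm_type_name": "PERSON",
        "node_id": "at0000",
        "attributes": [
            {"rm_attribute_name": "details", "is_multiple": False},
            {"rm_attribute_name": "identities", "is_multiple": True},
        ],
    }
}

paths = list(iter_paths(tree))
for path in paths:
    print(path)
```

Any parser that produces the same dictionaries and lists would yield the same paths, which is why the choice of surface syntax should not matter here.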

Plus its much more compact than JSON.

Much? Less noisy I would agree to though.

Personally I find YAML hard to read because there are so many syntax elements (triple ‘-’, triple ‘.’ etc) but that might just be me.

Have a look at…
http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html
…again.

The triple ‘-’ and triple ‘.’ are (mostly optional) start and end markers of documents that make life easier when concatenating streams/documents, see the YAML specification.

Am I the only one that thinks YAML is more readable than dADL?

When I get a moment I will add YAML to the serialiser club in the tool and we can then see whether proper YAML is or isn’t better to read (I am assuming that it will be somewhat different from the inferred YAML you generated with that web tool). I think ‘readability’ is starting to come down to cognitive and linguistic / semiotic issues, which is very interesting. There may be no objective answer to this question; if there is, it will be interesting to know what the criteria are.

Nice work on the contest!

thomas

Heath_Frankel3 · 2 December 2011 01:35

Thanks Erik,

Interesting to see the line up. Can’t believe that XML wasn’t the longest file in the list, that kills one of the arguments for JSON vs XML.

For someone who is not familiar with YAML: is the white space significant? If so, this kind of kills it for me; otherwise, for a human reader it’s fairly natural to read without lots of brackets of various kinds.

Heath

system · 2 December 2011 09:36

Hi Erik,

is the Javascript Object Dump missing regexps for 'address' and
'electronic_communications'? Or is that irrelevant?

In the YAML, some comma separated key-value pairs are condensed into 1
line; it would be nicer if they could all be on their own line: makes
it lengthier, but more readable and a fairer comparison to the other
formats.

Cheers,

Roger

system · 2 December 2011 15:11

Hi!

http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html

is the Javascript Object Dump missing regexps for ‘address’ and
‘electronic_communications’? Or is that irrelevant?

Thanks for spotting that, obviously something went wrong in the object dump. I have now commented that on the web page.

In the YAML, some comma separated key-value pairs are condensed into 1
line; it would be nicer if they could all be on their own line: makes
it lengthier, but more readable and a fairer comparison to the other
formats.

I think this is the default way of nesting flow style within block style with limited line length, but we should double check that. I agree that one line per thing would be more readable. Perhaps that can be configured in serializers.

Interesting to see the line up. Can’t believe that XML wasn’t the longest file in the list, that kills one of the arguments for JSON vs XML.

Well, that depends on how you measure length or weight in bytes, in readable or in compact form.
Have a look at the bottom of http://www.imt.liu.se/~erisu/2011/AOM-beauty-contest.html where I have now added a length comparison of whitespace-compressed formats.
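To illustrate how much the measurement method matters, here is a small Python sketch that serialises the same object as readable and as whitespace-compressed JSON and compares the byte counts. The `doc` fragment is a hypothetical stand-in, not the data used on the comparison page.

```python
import json

# Hypothetical archetype fragment used only for the size comparison.
doc = {
    "rm_type_name": "PERSON",
    "node_id": "at0000",
    "attributes": [
        {"rm_attribute_name": "details", "is_multiple": False},
        {"rm_attribute_name": "identities", "is_multiple": True},
    ],
}

# Readable form: one item per line, indented.
pretty = json.dumps(doc, indent=4)
# Compact form: no whitespace between tokens.
compact = json.dumps(doc, separators=(",", ":"))

pretty_bytes = len(pretty.encode("utf-8"))
compact_bytes = len(compact.encode("utf-8"))
print(f"readable: {pretty_bytes} bytes, compact: {compact_bytes} bytes")
```

Both strings parse back to the same object, so the difference is purely presentational.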

For someone who is not familiar with YAML: is the white space significant?

Indentation level is significant when using YAML block style but not YAML flow style. See the YAML specification for details.

If so, this kind of kills it for me; otherwise, for a human reader it’s fairly natural to read without lots of brackets of various kinds.

Well, aren’t the most common ways of defining tree structures either to use brackets/tags/delimiters of some kind or to use indentation? Do you have any other obvious and still readable methods that avoid brackets etc. but where whitespace or indentation is not significant?

thomas.beale · 2 December 2011 19:07

Thanks Erik,

Interesting to see the line up. Can’t believe that XML wasn’t the longest file in the list, that kills one of the arguments for JSON vs XML.

For someone who is not familiar with YAML: is the white space significant? If so, this kind of kills it for me; otherwise, for a human reader it’s fairly natural to read without lots of brackets of various kinds.

Heath

Heath_Frankel3 · 4 December 2011 23:10

I think previously I had indicated I had no problem with the stringified interval approach in XML, but I am reverting my thinking on this and feel that it would be counterintuitive for those who want to use the XML schemas for code generation purposes. I think in this case the computable requirement outweighs the human readable requirement. I think we can come up with a much more concise representation of these intervals without compromising the computable requirement, something similar to XML schema maxOccurs/minOccurs.

Heath

Please remember, everyone, that the dADL, JSON and XML generated from AWB all currently use the stringified expression of cardinality / occurrences / existence. Now, these are usually the most numerous constraints in an archetype and, if expressed in the orthodox way, take up 6 lines of text, hence the giant files (e.g. the AOM 1.4 based XML we currently use) and the much reduced files you see on Erik’s page, where we are using ADL 1.5 flavoured serialisations rather than the ADL 1.4 one.

Now, I think we should probably go with the stringified form in all of these formalisms. The cost of doing this is a small micro-parser, but it is the same microparser for everyone, which seems attractive to me.
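As a rough idea of how small such a micro-parser could be, here is a Python sketch. It assumes the stringified form is either a single integer such as "1" or a range such as "0..1" or "0..*"; the exact grammar, the `Interval` type and the function name are assumptions made for illustration, not taken from the ADL Workbench.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    lower: int
    upper: Optional[int]  # None stands for an unbounded upper limit ("*")


def parse_multiplicity(text: str) -> Interval:
    """Parse strings such as "1", "0..1" or "0..*" into an Interval."""
    text = text.strip()
    if ".." in text:
        low_text, high_text = text.split("..", 1)
    else:
        # A single value is treated as a point interval, e.g. "1" means 1..1.
        low_text = high_text = text
    lower = int(low_text)
    upper = None if high_text.strip() == "*" else int(high_text)
    if upper is not None and upper < lower:
        raise ValueError(f"upper bound below lower bound: {text!r}")
    return Interval(lower, upper)


print(parse_multiplicity("1"))
print(parse_multiplicity("0..*"))
```

The same few lines would be needed once per implementation language, regardless of whether the surrounding document is dADL, JSON, YAML or XML.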

The alternative that Erik mentioned was more native, but still efficient, interval expressions, e.g. dADL has it built in (0..* is |>=0| in dADL), and YAML and JSON could probably be persuaded to use some sort of array of integer-like things. XML still doesn’t have any such support. In theory this approach would be the best if each syntax supported it properly, but XML doesn’t at all, and the others don’t support Intervals with an unbounded upper limit (i.e. the ‘*’ in ‘0..*’).

But Erik’s exercise certainly proved that efficient representation of the humble Interval is actually worthwhile. (Once again thanks for that page, it’s quite a good way to get a good feel for these syntaxes very quickly).

thomas

system · 4 December 2011 23:23

Hi All

I am going to say it once more:

If there is an expression on occurrences of ‘0..*’ anywhere in ADL then it is an error, for that is not a constraint – and can only be wrong (ie the RM may have a narrower constraint). We just need a max int and a min int – both optional.

I won’t say it again – but it does keep it simple and it is correct!

Cheers, Sam

yampeku · 4 December 2011 23:38

And if you want to express something like 'a set with all the past
test results for this patient' (that could have none)?
It would be a constraint, as you are only allowing some kinds of
entries (children of a certain Snomed code, for example).

thomas.beale · 5 December 2011 05:24

Hi All

I am going to say it once more:

If there is an expression on occurrences of ‘0..*’ anywhere in ADL then it is an error, for that is not a constraint – and can only be wrong (ie the RM may have a narrower constraint). We just need a max int and a min int – both optional.

I won’t say it again – but it does keep it simple and it is correct!

system · 5 December 2011 11:36

Hi!

I think previously I had indicated I had no problem with the stringified interval approach in XML, but I am reverting my thinking on this and feel that it would be counterintuitive for those who want to use the XML schemas for code generation purposes. I think in this case the computable requirement outweighs the human readable requirement.

You are probably right regarding XML, and maybe this is valid also for most JSON use-cases where the desire for an as simple as possible object-serialization-mapping outweighs human readability.

I think the openEHR community is best served by having different archetype serialization format categories with different priorities for different purposes. E.g.:

1a. An XML format optimized for mapping to XML-schema generated code.
1b. A JSON format optimized for mapping to AOM object models handcrafted or generated from AOM-specifications.

2. A cADL-variant wrapped in YAML optimized for human readability. It could be used for archetype files stored in version control systems (making version diffs readable) and as textual format when you need textual examples in documentation, teaching etc.

In 1a & 1b easy implementation should be prioritized over readability but in #2 human readability should be prioritized. Prioritizing both in the same format would likely fail. Things like default ordering of nodes and attributes could be recommended but optional for #1 but should be mandatory for #2 (otherwise readability suffers and diffs get messed up).
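The effect of a mandatory ordering on diffs can be sketched in Python, using JSON with sorted keys as a stand-in for whatever human-readable format is chosen. Alphabetical key order is only one possible rule, and the `canonical` helper is a hypothetical name.

```python
import json

# The same object, built with attributes in a different order.
first = {"rm_type_name": "PERSON", "node_id": "at0000"}
second = {"node_id": "at0000", "rm_type_name": "PERSON"}


def canonical(obj) -> str:
    """Serialise with a fixed key order so that equal objects give equal text."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False)


# Without a fixed order the two files differ textually and produce a noisy diff.
print(json.dumps(first) == json.dumps(second))
# With a fixed order they are byte-for-byte identical.
print(canonical(first) == canonical(second))
```

A format meant for version control would need such a rule in its specification, so that every serialiser produces the same text for the same archetype.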

I think we can come up with a much more concise representation of these intervals without compromising the computable requirement, something similar to XML schema maxOccurs/minOccurs.

Probably, but for #1 maybe being close to the AOM should be prioritized over being concise. After all, archetypes will not be sent over the wire at the same scale as patient data (RM instances).

By the way, is the AOM open for changes (like renaming attributes) if that would increase clarity?

If we would change subject and discuss RM instance serialization, then binary formats (like Protobuf and Thrift) could form a third category where message size and speed of conversion would be prioritized over ease of implementation or readability. XML and JSON would likely be good to have also for interoperability and debugging purposes. YAML for the RM would not be an obvious “over the wire”-format, but can be very useful for compact human readable long term EHR archiving storage as plain text files and for documentation examples.

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

Seref · 5 December 2011 12:32

Hi Erik,
I’ll repeat a point I’ve tried to make before, since it is relevant in the context of binary serialization.
I’ve used protocol buffers serialization of AOM in Bosphorus (I’ll put the source code under Opereffa’s svn soon, it appears I don’t even have time to clean it up)

These are very fast, but much more simplistic formalisms to represent data. You can use them to improve the performance of many things, but you’ll be writing a lot of code, and you’ll have to find non-standard ways of dealing with the simplicity of the formalism. Here is the simplest example from Bosphorus: Eiffel is an object oriented language, Java is also an object oriented language. openEHR specs use inheritance, which is reflected into type hierarchies of both Eiffel and Java classes. You have the protocol buffers language which does not support inheritance. How do you represent instances of abstract types in protocol buffers? How do you read/write them from/to Eiffel/Java? I’ve done these in my own way, but it will be a problem every time someone uses formalisms which are not designed for OO languages and frameworks.

In a way, it is a conceptual distance from OO. Every alternative mentioned here is at a particular position to a particular level of OO support (take it as a point in a multidimensional space). Every alternative has values higher than the rest in a particular dimension, but none of them is absolutely closer to the OO support point (represented by Java/Eiffel/C#/Python etc) In my opinion, without this evaluation of OO support, which is what we use in the actual languages of system development, other discussions are not really relevant. What if protocol buffers are fast? What if YAML, ADL, or JSON are easier to read, space efficient?

Maybe I’m being too rigid about this particular issue, but the programming language, and the tools and frameworks built on it, are what determine industry adoption more than anything else today. I don’t think this is being considered in these discussions, but that is just me.

Kind regards
Seref

system · 5 December 2011 14:52

Hi Seref!

I’ll repeat a point I’ve tried to make before, since it is relevant in the context of binary serialization.
I’ve used protocol buffers serialization of AOM in Bosphorus

Why do you use binary serialization for AOM? (Just curious, I thought text formats would cater for most AOM use cases.)

I have not looked deeply into protobuf so I’ll take your word on the lack of OO support. Looking at http://wiki.apache.org/thrift/ThriftTypes their “Structs” also seem to lack inheritance. So I’ll try to keep quiet about cross-platform binary formats at least until I have tried applying any of them to openEHR for real.

… you’ll have to find non standard ways of dealing with the simplicity of the formalism.

For JSON I would agree that the formalism is sometimes too simple and one may need to make an openEHR specification for how to convey object type when needed, perhaps inspired by something like

http://flexjson.sourceforge.net/ that adds a “class” attribute or
by exploring if introspection of the target object type like http://code.google.com/p/google-gson/ does is enough for openEHR data.
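A minimal Python sketch of the first option, carrying the object type in a "class" attribute so that a receiver can rebuild the right subclass. The simplified `DvText` / `DvCodedText` classes, the `REGISTRY` lookup and the helper names are hypothetical stand-ins, not an openEHR specification and not the flexjson API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DvText:
    value: str


@dataclass
class DvCodedText(DvText):
    defining_code: str  # simplified to a plain string for this sketch


# Only classes listed here can be instantiated from incoming data.
REGISTRY = {cls.__name__: cls for cls in (DvText, DvCodedText)}


def to_json(obj) -> str:
    """Serialise a dataclass instance with an explicit "class" attribute."""
    return json.dumps({"class": type(obj).__name__, **asdict(obj)})


def from_json(text: str):
    """Rebuild the object whose type is named by the "class" attribute."""
    payload = json.loads(text)
    cls = REGISTRY[payload.pop("class")]
    return cls(**payload)


restored = from_json(to_json(DvCodedText(value="Mother", defining_code="ac0000")))
print(type(restored).__name__, restored)
```

The explicit registry keeps the deserialiser from instantiating arbitrary classes named in untrusted input, which matters if such documents are exchanged between systems.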

Here is the simplest example from Bosphorus: Eiffel is an object oriented language, Java is also an object oriented language. openEHR specs use inheritance, which is reflected into type hierarchies of both Eiffel and Java classes. You have the protocol buffers language which does not support inheritance. How do you represent instances of abstract types in protocol buffers?

Sorry if I’m dense, but when do you need to instantiate abstract types in RM data?

In a way, it is a conceptual distance from OO. Every alternative mentioned here sits at a particular position relative to a particular level of OO support (take it as a point in a multidimensional space). Every alternative has values higher than the rest in a particular dimension, but none of them is absolutely closer to the OO support point (represented by Java/Eiffel/C#/Python etc.). In my opinion, without this evaluation of OO support, which is what we use in the actual languages of system development, other discussions are not really relevant. So what if protocol buffers are fast? So what if YAML, ADL, or JSON are easier to read or more space efficient?

Do you bundle YAML and XML into that opinion (lacking OO support in the same way as protobuf)?

Do you think that dADL can carry everything needed for openEHR (both AM and RM)? If so, why wouldn’t YAML? What in basic dADL semantics is missing in YAML? YAML (using a !-prefixed tag syntax) and partly XML (using e.g. xsi:type) have ways of conveying object type in cases where it cannot be inferred from the data.

Maybe I’m being too rigid about this particular issue, but the programming language, together with its tools and the frameworks built on it, is what determines industry adoption more than anything else today. I don’t think this is being considered in these discussions, but that is just me.

I guess language-specific binary formats (like serialized java objects) may be better for binary representation then. Thanks for the word of warning regarding protobuf.

Do you think that all openEHR instance serializations really need to be “object oriented” themselves or is it enough that the classes of the receiving application are object oriented and that the deserialization code (or the transfer format) is clever enough to put the data into the right objects?

There are some cases where different openEHR datatypes may have the same attribute signature, and for those cases even transport formats aiming to reduce verbosity will need to explicitly declare the class type, since it cannot be safely inferred.
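
To illustrate the ambiguity, here is a small Python sketch that tries to infer the class purely from the attribute names in a payload. `TypeA`, `TypeB` and `TypeC` are hypothetical classes invented for this example, not real openEHR datatypes:

```python
from dataclasses import dataclass, fields


@dataclass
class TypeA:
    value: str


@dataclass
class TypeB:
    value: str  # same attribute signature as TypeA


@dataclass
class TypeC:
    value: str
    code: str


CANDIDATES = (TypeA, TypeB, TypeC)


def infer_types(payload: dict) -> list:
    """Return every candidate class whose field names match the payload keys."""
    keys = set(payload)
    return [cls for cls in CANDIDATES
            if {f.name for f in fields(cls)} == keys]


ambiguous = infer_types({"value": "x"})
unambiguous = infer_types({"value": "x", "code": "y"})
```

Here `ambiguous` contains both `TypeA` and `TypeB`, so a format that omits type information cannot be decoded safely for that payload, while `unambiguous` contains only `TypeC`.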

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

Seref · 6 December 2011 11:44

A bunch of responses, most of which should actually go to a wiki page for Bosphorus

I’ve used binary serialization for AOM because although Eiffel is a very impressive language, I am not happy about its libraries. Some of them are mature, but for XML, I could not find anything that’d be guaranteed to be maintained. Protocol buffers is a technology that is used very heavily in Google, and has a large community.
Performance is the key aspect of protocol buffers. It is very, very fast. When I’m exchanging simple messages over ZeroMQ (a very fast queue framework that is used in Bosphorus) I can achieve microsecond level performance (not even millisecond!) for Java to Eiffel communication. For desktop tooling purposes, this is much faster than XML.

You need to instantiate concrete instances of abstract types every time you use single or multiple attributes in AOM. Both classes descend from CAttribute. So the AOM specification gives you a field of type CAttribute (abstract), and instances of this type always have either a single or a multiple attribute object assigned to this field. The Eiffel parser creates an AOM object when it parses an archetype. On the other side of the bridge, a Java object waits to be filled with the data in the Eiffel object. Both Java and Eiffel know the relationship between these types, but protocol buffers does not have inheritance. So when you’re defining a protocol buffer message with its language, you have a problem: what should be the type of the field that represents CAttribute? I’ve had to come up with a method of handling this case. Someone else may use another method, and that is my point: when we have to do these things, they become a source of bugs and obstacles to implementation. So we may benefit from the format and readability of JSON, but the type of issues I’ve been describing would introduce a lot more problems than bandwidth efficiency or human friendliness would solve. Hence, my priorities are slightly different when it comes to what makes a formalism convenient in openEHR implementation.
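
One possible workaround, sketched in Python rather than in the protocol buffers language, is a wrapper record with one optional slot per concrete subtype (similar in spirit to a protobuf `oneof`). This is not necessarily the method used in Bosphorus. The class names follow the AOM loosely, and the `*Msg` records, the string-valued `cardinality` and the helper functions are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Optional


# Object-oriented side: CAttribute is treated as abstract.
@dataclass
class CAttribute:
    rm_attribute_name: str


@dataclass
class CSingleAttribute(CAttribute):
    pass


@dataclass
class CMultipleAttribute(CAttribute):
    cardinality: str  # simplified to a string for this sketch


# Wire side: flat records with no inheritance between them.
@dataclass
class SingleAttributeMsg:
    rm_attribute_name: str


@dataclass
class MultipleAttributeMsg:
    rm_attribute_name: str
    cardinality: str


@dataclass
class AttributeMsg:
    """Wrapper standing in for the abstract type: exactly one slot is set."""
    single: Optional[SingleAttributeMsg] = None
    multiple: Optional[MultipleAttributeMsg] = None


def to_msg(attr: CAttribute) -> AttributeMsg:
    if isinstance(attr, CMultipleAttribute):
        return AttributeMsg(multiple=MultipleAttributeMsg(
            attr.rm_attribute_name, attr.cardinality))
    if isinstance(attr, CSingleAttribute):
        return AttributeMsg(single=SingleAttributeMsg(attr.rm_attribute_name))
    raise TypeError("unsupported CAttribute subtype")


def from_msg(msg: AttributeMsg) -> CAttribute:
    if (msg.single is None) == (msg.multiple is None):
        raise ValueError("exactly one of 'single' or 'multiple' must be set")
    if msg.single is not None:
        return CSingleAttribute(msg.single.rm_attribute_name)
    return CMultipleAttribute(msg.multiple.rm_attribute_name,
                              msg.multiple.cardinality)


original = CMultipleAttribute("items", "0..*")
round_tripped = from_msg(to_msg(original))
```

The cost is visible even in this small sketch: every new subtype needs a new slot and new branches on both sides of the bridge, and two implementers can easily choose incompatible conventions.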

With this view: I find XML seriously crippled for OO support, but at least there is some inheritance support and there is huge tooling and framework support. My job would be to find ways of working around issues using these frameworks. I’d prefer this to having less tooling and less OO support (as with JSON). I can’t speak for YAML, but in terms of maturity and support for mechanisms such as schemas, I’d be surprised if it ends up better than XML. For XML, I have JAXB, support in Java, Python, .NET, you name it…

dADL has the advantage of being designed in a strong openEHR context. I guess both YAML (based on the feature you’ve mentioned) and XML can match dADL to the extent that any required workarounds could be justified based on industry adoption. I do not know YAML well enough to compare it in detail, but I’d love to hear from someone the type of things I’ve been sharing here, only with YAML this time instead of JSON and XML.

Given this, if you or someone else thinks that YAML can be an alternative to dADL, there is nothing stopping anyone from implementing it and using it. Absolutely nothing. This is what I do. If I think that an XML form of ADL would help, then I take what is out there (Tom’s Eiffel code), use it, and move on.

I have a feeling that all these discussions about if this or that could replace dADL are too hypothetical. Most of the time they are academic discussions. There is nothing wrong with academic discussions, I am doing a PhD here, but if the openEHR community is spending its time and resources for academic discussions which do not necessarily connect to real life implementations in the near term, then I think we have a problem.

Thomas is heroically responding to all queries without judgement, and he is even implementing a lot of code to give grounded answers and to provide proofs. I guess I am not as mature and as dedicated as he is. I’d rather have him working on ADL 1.5 XSD schemas than proving to people that openEHR can do JSON if necessary, because having XSDs for ADL 1.5 is going to increase adoption of openEHR a lot more than having JSON output. If anybody out there does not agree, please come forward and talk about your JSON usage in your project which is about an actual information system that is running, or is supposed to run, in a clinical setting.

Please do not get me wrong, all the discussion we are having here is useful, it is just that in my humble opinion, some discussions are more useful than others if this standard into which I am heavily investing is to go forward.

Best regards
Seref

Stef_Verlinden1 · 6 December 2011 12:01

+1

Cheers,

Stef

Koray_Atalag · 6 December 2011 21:23

Yeah I was also wondering what is the driver/motivation/aspiration behind using JSON, YAML etc. instead of good old ADL?

Is this to do with making openEHR easier to digest for the ‘traditional’ IT community, because perhaps they don’t want to let go of everything at once and would rather leverage some existing skills like these? I also think that we as a community should look at getting more organised and get our efforts in tune, as I know that quite interesting and tough times are about to come…

system · 7 December 2011 10:29

Oh sigh…

Trying to be open minded, thinking a few steps ahead, sharing thoughts and regularly reevaluating design decisions does not seem to be appreciated by all on this list.

Perhaps we need to mark some discussions or sections with…
[Warning: may contain new thoughts]
…so that those of us that enjoy such discussions may continue to have them and those that get distracted by them or can’t stand them could filter out those parts.

Yeah I was also wondering what is the driver/motivation/aspiration behind using JSON, YAML etc. instead of good old ADL?

Good old which ADL? Please go back in the thread and note the difference between dADL and cADL in the reasoning: dADL is a reinvention of the wheel (object tree serialization), whereas cADL is an optimized DSL to which I have not seen any obvious widespread alternative if brevity and readability are sought.

Regarding the motivation you ask for, I would recommend going back in the thread again to the first message…

http://www.openehr.org/mailarchives/openehr-technical/msg06186.html

…under the boldface heading “Motivation:”, that you may have missed, and read the three bullet points. You may not agree but that and the rest of this current message might reduce your wondering about the discussion origins.

I also think that we as a community should look at getting more organised and get our efforts in tune

Yes, a bit of diversity is good in order to best explore design space, but duplicating work is a waste of time.
If we are allowed to discuss future-directed thoughts on this list (without people getting too upset) that may also help us tune our efforts. If we must implement first and then discuss it will be a lot harder to avoid duplication of work.

as I know that quite interesting and tough times are about to come…

Are you referring to the CIMI discussions, or is it a general observation about how the future usually is?

Regarding CIMI I think it is valuable to try to look upon openEHR with the eyes of newcomers. If there is unnecessary legacy in models or formats that we don’t easily see because we have gotten used to it, then now is a good time to try reducing it while the amount of patient data using openEHR is limited. It will be harder to change things later. Getting the template formalism integrated with the AOM 1.5 was great in this sense, and so is Tom’s experimentation with RM 2.0 constructs that may reduce the ITEM_STRUCTURE hierarchy.

From: … On Behalf Of Stef Verlinden

+1

+/- infinity

Yay, I love flame wars

Given this, if you or someone else thinks that YAML can be an alternative to dADL, there is nothing stopping anyone from implementing it and using it. Absolutely nothing.

Do you assume that if somebody is talking about a subject, then they can’t possibly be in the middle of implementing it and wanting to share thoughts at an early stage? Please try to be a bit more open minded; I did not ask you to be the first to implement YAML support. You are not the only one implementing openEHR stuff, but I will admit that you deserve credit for, and are great at, “release early, release often”, and I am not (yet).

Thomas is heroically responding to all queries without judgement…

I think that is an unfair description of Tom’s judgment.

I have a feeling that all these discussions about if this or that could replace dADL are too hypothetical. Most of the time they are academic discussions. There is nothing wrong with academic discussions, I am doing a PhD here, but if the openEHR community is spending its time and resources for academic discussions which do not necessarily connect to real life implementations in the near term, then I think we have a problem.

So if something is not on your personal implementation agenda in near time, then it is “academic” and a waste of resources since it can not possibly be on the implementation agenda of somebody else…

The reason I started looking into both JSON and YAML is that they are part of our current implementation (partly using JSON, Javascript etc., primarily for RM objects). In this process I happened to see that YAML might do the job of dADL and that we could then reuse the parser/serializer work of others (for many programming languages) instead of maintaining dADL frameworks. I wanted to share this thought at an early stage and I do appreciate that some have at least responded with positive interest and curiosity.

Sometimes time can be saved by discussion before implementation, especially by carefully considering what should or should not be implemented. People at UCL or Ocean Informatics can probably regularly speak in person to core openEHR decision makers and designers; the rest of us have the mailing lists as major channels, so please try to respect that too.

Please do not get me wrong, all the discussion we are having here is useful, it is just that in my humble opinion, some discussions are more useful than others if this standard into which I am heavily investing is to go forward.

You are not the only one having invested a lot of years and work in openEHR. I would ask you and others to please allow those that want to discuss things before and during implementation to do so if they wish to. Regarding YAML the p.s. on the start message of this thread said:

P.s. Tom Beale and I sort of started a brief off-list discussion about YAML, here is now an attempt to get input from more people.

I think it is better for the openEHR community to have things that are of potential interest to others, even things that are not yet tested, as on-list discussions rather than off-list discussions, but I am no longer sure everyone agrees and this is a bit worrying to me. I do still think there are enough people who appreciate early open discussions, and I will try to continue along that path but try to remember to tag such sections with [Warning: may contain new thoughts]

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

P.s. [Warning: may contain new thoughts] I suspect a current off-list discussion of scalable distributed alternatives to the CKM based on GIT might be unwelcome on the list too and it might be better to keep off-list for a long time until it has been at least partially tested some time in the distant future, since there are other things (like releasing other software) that need to be prioritized first before we have time to test anything.

yampeku · 7 December 2011 11:15

I have no problem with having different representations. In fact, having different representations means more happy people, not less (for example, people have been using RDF to describe archetypes for some time).
Anyway, I love this kind of thread, as they are great for seeing new perspectives and technologies.

P.s. I like your idea of a GIT based distributed concept repository. If you want to start an off-list discussion please count us in, as we are also working on a reference model independent concept repository.
