Could YAML replace dADL as human readable AOM serialization format?

Erik,
Add my sigh next to yours… Lots of misunderstandings; I will try to respond to the most obvious ones.

I have clearly expressed that all discussions here are useful. I’ve made no connection to my agenda. My academic work does not even require the things I’ve mentioned as high priority for openEHR. I’ve been enjoying the discussions, and will continue to do so.

Your comments about dADL below, as well as your original motivations, hint at what I’m opposed to. Your own words:
Having an archetype specific object-serialization language like dADL might make “archetyping” look more mysterious and suspect and might hide the fact that the semantics expressed in the AOM is the interesting thing that can be serialised in many different ways.

This is a negative statement about ADL, right? Nothing wrong with negative statements about ADL; I have a bunch of them in my pocket. But if this is your motivation to discuss YAML, and if the thread you’ve started is about “replacing ADL”, you’re talking about replacing something that has taken lots of time and effort to create. This is where we differ. I agree with many of the criticisms of ADL, and it is exactly at this point that I try to be open minded. I can see that there are also significant advantages to ADL, and rather than suggesting that it be replaced, I first hypothesize and then go ahead and prove that it can coexist with XML, JSON, YAML etc. My work is out there showing that ADL can coexist with these formalisms. From my point of view, this is quite an open-minded approach, at least more open minded than replacing it without considering what that would actually mean in other contexts.

This is not the first time I’m having these types of discussions, and won’t be the last. I make my point whenever I see a discussion that seems to suggest switching horses midstream. I’m sorry if I’m being a buzz killer, but I’m in favor of discussing things in a larger context, including consequences for the openEHR standard and its adoption.
Reminding people of these consequences does not mean I’m ruling out other options. I have been discussing them in light of all the proof I have (through my work) and I’ve asked others to do the same. I cannot know about your work in advance, can I?

Let us try to eliminate the misunderstanding at this point:

If this discussion concludes with the common view that YAML can be an alternative to dADL, do you think the openEHR specification should replace ADL?
If the answer to the previous question is yes, then do you realize that this would mean replacing all the software that uses ADL, both open source and proprietary?
If the answer to the previous question is yes, then do you have a suggestion for funding these changes?

I think this is the best I can do to explain what I’m trying to include in the discussions.

Best regards
Seref

Your comments about dADL below, as well as your original motivations, hint at what I'm opposed to. Your own words:
"Having an archetype specific object-serialization language like dADL might make "archetyping" look more mysterious and suspect and might hide the fact that the semantics expressed in the AOM is the interesting thing that can be serialised in many different ways."

This is a negative statement about ADL, right?

I don't think so, Seref. It's a negative about dADL ... not ADL per se.

Going back to Erik's original post ...
http://www.openehr.org/mailarchives/openehr-technical/msg06187.html
... it's pretty clear that he is _not_ suggesting that YAML should replace ADL:

"... Also note that the current suggestion only aims at looking for replacement of dADL not cADL. Also note that the AOM and XML serialisations of the AOM are not affected by this suggestion."

Now I think Erik made a typo in that last sentence. I don't know what an "AOM serialisation of the AOM" would be. I assume that Erik meant to say that "ADL and XML serialisations of the AOM are not affected by this suggestion."

Seref also wrote:

Let us try to eliminate the misunderstanding at this point:

If this discussion concludes with the common view that YAML can be an alternative to dADL, do you think the openEHR specification should replace ADL?
If the answer to the previous question is yes, then do you realize that this would mean replacing all the software that uses ADL, both open source and proprietary?

In response to the first question, I would say no. If YAML replaced dADL as a serialisation format, it wouldn't imply replacement of ADL too.

And so, in response to your second question, I'd argue that it wouldn't imply replacing any software at all that currently uses ADL. The only software that would have to be replaced is anything currently doing serialisation with dADL ... which would be nothing yet, as far as I'm aware.

- Peter

Thanks Peter,
In that case the suggestion I’m objecting to does not exist. Though I have to confess I don’t seem to clearly understand the suggestion here; I had better re-read the thread with more coffee at hand.

Best regards
Seref

Hi!

I think previously I had indicated I had no problem with the stringified interval approach in XML, but I am reverting my thinking on this and feel that it would be counterintuitive for those who want to use the XML schemas for code generation purposes. I think in this case the computable requirement outweighs the human readable requirement.

You are probably right regarding XML, and maybe this is valid also for most JSON use-cases where the desire for an as simple as possible object-serialization-mapping outweighs human readability.

I think the openEHR community is best served by having different archetype serialization format categories with different priorities for different purposes. E.g.:

1a. An XML format optimized for mapping to XML-schema generated code.
1b. A JSON format optimized for mapping to AOM object models handcrafted or generated from AOM-specifications.

2. A cADL-variant wrapped in YAML optimized for human readability. It could be used for archetype files stored in version control systems (making version diffs readable) and as a textual format when you need textual examples in documentation, teaching etc.

I had never thought of that but the AWB has a multi-part serialiser component, so it would be possible. When I get a bit of time :wink:

In 1a & 1b easy implementation should be prioritized over readability but in #2 human readability should be prioritized.

Erik, You didn’t answer the question a while ago - who are the ‘readers’? I am just asking to know if you are talking about some particular kind of educational usage, and what your criteria are for ‘readability’.

Prioritizing both in the same format would likely fail. Things like default ordering of nodes and attributes could be recommended but optional for #1 but should be mandatory for #2 (otherwise readability suffers and diffs get messed up).
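
To illustrate why a mandatory ordering keeps diffs clean, here is a minimal Python sketch. It is not part of any openEHR tool: `CANONICAL_ORDER`, `ordered` and `serialise` are hypothetical names, the attribute order is an arbitrary assumption, and JSON from the standard library stands in for whatever readable format is chosen. The attribute names are borrowed from the YAML example later in this thread.

```python
import json

# Hypothetical canonical attribute order; the real order would have to be
# agreed in the specification. Names are taken from the YAML example below.
CANONICAL_ORDER = ["rm_type_name", "node_i_d", "occurrences",
                   "any_allowed", "path", "attributes", "children"]
_RANK = {name: i for i, name in enumerate(CANONICAL_ORDER)}


def ordered(node):
    """Return a copy of a nested dict/list tree with keys in canonical order.

    Known attributes come first in CANONICAL_ORDER; unknown ones follow
    alphabetically, so the result never depends on insertion order.
    """
    if isinstance(node, dict):
        keys = sorted(node, key=lambda k: (_RANK.get(k, len(_RANK)), k))
        return {k: ordered(node[k]) for k in keys}
    if isinstance(node, list):
        return [ordered(item) for item in node]
    return node


def serialise(node):
    """Serialise with a stable key order so text diffs only show real changes."""
    return json.dumps(ordered(node), indent=2)
```

Two in-memory trees that hold the same data but were built in a different order then serialise to identical text, which is the property a version-controlled archetype file needs.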

good point, you reminded me I have to fix the order in the AWB serialisations.

I think we can come up with a much more concise representation of these intervals without compromising the computable requirement, something similar to XML schema maxOccurs/minOccurs.

Probably, but for #1 maybe being close to the AOM should be prioritized over being concise. After all, archetypes will not be sent over the wire at the same scale as patient data (RM instances).

how can a string like “1” or “2..*” be more concise? I think this is the most concise possible format (or some slight variation, e.g. the dADL interval syntax).
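
As a rough illustration of how a stringified interval such as "1" or "2..*" can stay computable, here is a small Python sketch. The names `Interval`, `parse_multiplicity` and `format_multiplicity` are hypothetical, and the accepted syntax (including treating a bare "*" as "0..*") is an assumption made for this example, not the dADL interval grammar.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    lower: int
    upper: Optional[int]  # None means unbounded, written "*"


def parse_multiplicity(text: str) -> Interval:
    """Parse strings such as "1", "0..1" or "2..*" into an Interval."""
    text = text.strip()
    if text == "*":
        # Assumption for this sketch: a bare "*" means "0..*".
        return Interval(0, None)
    if ".." in text:
        low, high = text.split("..", 1)
    else:
        low = high = text  # a single number means lower == upper
    lower = int(low)  # raises ValueError on malformed input
    upper = None if high == "*" else int(high)
    if lower < 0 or (upper is not None and upper < lower):
        raise ValueError(f"invalid interval: {text!r}")
    return Interval(lower, upper)


def format_multiplicity(interval: Interval) -> str:
    """Produce the shortest string form of an Interval."""
    if interval.upper is None:
        return f"{interval.lower}..*"
    if interval.lower == interval.upper:
        return str(interval.lower)
    return f"{interval.lower}..{interval.upper}"
```

The string form stays as short as it can be, while code that needs numeric bounds pays for one small parsing function. Whether that cost is acceptable for schema-generated code is exactly the trade-off being discussed here.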

By the way, is the AOM open for changes (like renaming attributes) if that would increase clarity?

well the AOM 1.5 is a draft, so in principle yes. But we need to assess the impact. Breaking archetype authoring tools probably does not matter so much - there are not many, so we can deal with that. Impacts on EHR system software will have to be more closely evaluated before we agree to any such changes. But let us know your fantasies anyway :wink:

- thomas

A bunch of responses, most of which should actually go to a wiki page
for Bosphorus

I've used binary serialization for AOM because although Eiffel is a
very impressive language, I am not happy about its libraries. Some of
them are mature, but for XML, I could not find anything that'd be
guaranteed to be maintained.

I don't think there is any problem with them being maintained, they are
part of the main Eiffel tool. The choice of Protocol Buffers (or maybe
there is another better variant?) makes sense on the basis of
performance....

Protocol buffers is a technology that is used very heavily in Google,
and has a large community.
Performance is the key aspect of protocol buffers. It is very, very
fast. When I'm exchanging simple messages over ZeroMQ (a very fast
queue framework that is used in Bosphorus) I can achieve microsecond
level performance (not even millisecond!) for Java to Eiffel
communication. For desktop tooling purposes, this is much faster than
XML.

orders of magnitude...

Thomas is heroically responding to all queries without judgement, and
he is even implementing a lot of code, to give grounded answers, to
provide proofs.

don't give me too much credit: my lightweight serialisation library
allowed me to implement JSON output in about 4 hours, plus two days
background debugging of the {[]} ....

I guess I am not as mature and as dedicated as he is. I'd rather have
him working on ADL 1.5 XSD schemas than proving to people that openEHR
can do JSON if necessary. Because having XSDs for ADL 1.5 is going to
increase adoption of openEHR a lot more than having JSON output. If
anybody out there does not agree, please come forward and talk about
your JSON usage in your project which is about an actual information
system that is running, or is supposed to run in a clinical setting.

yes, I think it is about time we posted a proposed AOM 1.5 XSD...

- thomas

Good old which ADL? Please go back in the thread and note the difference between dADL and cADL in the reasoning; dADL is a reinvention of the wheel (object tree serialization).

Erik,
out of academic interest: was either YAML or JSON around in 2000, when I made a first version of dADL (I’m in a plane typing this, can’t check)? If they were, I look silly :wink: If not… In any case, JSON is seriously semantically deficient for proper serialisation purposes and is in need of at least 2 basic enhancements to work correctly on any realistic data. I agree it is fairly readable, although why attribute names are in quotes is completely beyond me…I have not yet looked at YAML properly, but it looks like it probably does the job properly.

Yes, a bit of diversity is good in order to best explore design space, but duplicating work is a waste of time.
If we are allowed to discuss future-directed thoughts on this list (without people getting too upset) that may also help us tune our efforts. If we must implement first and then discuss it will be a lot harder to avoid duplication of work.

I don’t actually think there is any harm in messing around with variations on serialisation - it’s not hard to implement (XML being the hardest), but at some point I think a wiki page with a summary of real world requirements behind each variant would be useful.

Are you referring to the CIMI discussions or is it a general observation about how the future usually is? :slight_smile:

Regarding CIMI I think it is valuable to try to look upon openEHR with the eyes of newcomers. If there is unnecessary legacy in models or formats that we don’t easily see because we have gotten used to it, then now is a good time to try reducing it while the amount of patient data using openEHR is limited. It will be harder to change things later. Getting the template formalism integrated with the AOM 1.5 was great in this sense, and so is Tom’s experimentation with RM 2.0 constructs that may reduce the ITEM_STRUCTURE hierarchy.

I have to do a bit more work to get the first proposal defined properly - there is a half done wiki page on that. Should have it fixed in a couple of days, then we can discuss. (I’m not online but if others find the page, feel free to put your own RM 2.0 variations on there somewhere).

+/- infinity

Yay, I love flame wars :slight_smile:

you can’t win like that. Cantor showed that there are different sizes of infinity :slight_smile:

The reason I started looking into both JSON and YAML is that they are part of our current implementation (partly using JSON, Javascript etc., primarily for RM objects). In this process I happened to see that YAML might do the job of dADL and that we could then reuse the parser/serializer work of others (for many programming languages) instead of maintaining dADL frameworks. I wanted to share this thought at an early stage and I do appreciate that some have at least responded with positive interest and curiosity.

at some point I intend to finalise the ultimate dADL grammar and publish dADL as a standalone with at least C#, Java, Eiffel and possibly C/C++ fast & full parsers + serialisers. This is less work than you might think, and it would make dADL just as available as YAML. Well, ok it won’t be in Erlang or Haskell for a while, but I doubt if that will make much difference.

- thomas

all,

one of the good decisions I think we made early on in openEHR’s history was to have few mailing lists rather than many. One of the consequences is that discussions about new / fun ideas are on the same list and sometimes same thread as discussions about real world implementation priorities. Please continue to enjoy :slight_smile:

- thomas

According to http://en.wikipedia.org/wiki/JSON ...

"JSON was used at State Software, a company co-founded by Crockford, starting around 2001. The JSON.org website was launched in 2002."

And http://en.wikipedia.org/wiki/YAML ...

"YAML was first proposed by Clark Evans in 2001 ..."

Clearly you were not alone, ten years ago, in thinking that there had to be a better way than XML!

- Peter

Oh, just my personal thoughts without any sanity check – should have read the whole thread first! My reaction was just to what was written in the subject line of the thread and after reading Seref’s comments about the need to focus on outstanding/high priority issues. Sorry if I have offended – I can’t possibly be against free discussions here – even the most blue sky ones which I seldom broadcast myself :wink:

Cheers,

-koray

After reading Pablo's post on domain types I am curious about how
they should be represented in each of the different formats. I
feel they should be 'expanded' before trying to represent them in any
other format, but I might be wrong. Any ideas or opinions?

I have to say, the more I look at YAML, the more I wonder what the
designers were thinking. For example, in this section of the spec,
multi-line quoted strings are only allowed if the 'key' is also quoted
(the strange looking JSON approach); if the key is not quoted (i.e.
'simple') then the value can't be quoted either. That's just nonsense! I
am glad I am only implementing a serialiser, not a parser...

- thomas

Hi!

I have to say, the more I look at YAML, the more I wonder what the
designers were thinking. For example, in this section of the spec,

http://yaml.org/spec/current.html#id2532720

multi-line quoted strings are only allowed if the 'key' is also quoted
(the strange looking JSON approach);
if the key is not quoted (i.e.
'simple') then the value can't be quoted either. That's just nonsense!

Are you sure that is what it says?

"Double quoted scalars are restricted to a single line when contained
inside a simple key."

Is it not rather that you may not use a multiline double quoted string
as a KEY (at all)? It does NOT forbid you to use multiline double
quoted strings in the value, no matter if or how you quote your keys.

I have certainly seen double quoted values for unquoted keys coming
from serializers claiming to be specification conformant.

Are any of your keys so long and complicated that they would need
multiline quoted strings?

I am glad I am only implementing a serialiser, not a parser...

In many less exotic languages they are already implemented :slight_smile:
Then you configure them and then throw your object trees at them.

An example of very unfinished work in progress, using poorly readable
ordering and based on the openEHR java-ref-impl (and probably exposing
too many fields) is attached below.

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

!<http://www.openehr.org/releases/1.0.2/class/openehr.am.archetype.ARCHETYPE>
adl_version: '1.4'
archetype_id: openEHR-DEMOGRAPHIC-PERSON.person.v1
concept: at0000
original_language: ISO_639-1::pt-br
translations:
  en:
    language: ISO_639-1::en
    author: {email: sergio@lampada.uerj.br, organisation: Universidade do Estado do Rio de Janeiro - UERJ,
      name: Sergio Miranda Freire}
description:
  original_author: {email: sergio@lampada.uerj.br, organisation: Universidade do Estado do Rio de Janeiro - UERJ,
    name: Sergio Miranda Freire & Rigoleta Dutra Mediano Dias,
    date: 22/05/2009}
  other_contributors: ['Sebastian Garde, Ocean Informatics, Germany (Editor)',
    'Omer Hotomaroglu, Turkey (Editor)',
    'Heather Leslie, Ocean Informatics, Australia (Editor)']
  lifecycle_state: Authordraft
  details:
  - language: ISO_639-1::en
    purpose: Representation of a person's demographic data.
    keywords: [demographic service, person's data]
    use: Used in demographic service to collect a person's data.
    copyright: © openEHR Foundation
    original_resource_uri: {}
  - language: ISO_639-1::pt-br
    purpose: Representação dos dados demográficos de uma pessoa.
    keywords: [serviço demográfico, dados de uma pessoa]
    use: Usado em serviço demográficos para coletar os dados de uma pessoa.
    copyright: © openEHR Foundation
    original_resource_uri: {}
  other_details: {references: 'ISO/TS 22220:2008(E) - Identification of Subject of Care - Technical Specification - International
      Organization for Standardization.'}
definition:
  attributes:
  - rm_attribute_name: details
    children:
    - includes:
      - expression:
          left_operand: {item: archetype_id/value, reference_type: CONSTANT, type: STRING}
          right_operand:
            item: {pattern: '(person_details)[a-zA-Z0-9_-]*\.v1'}
            reference_type: CONSTANT
            type: String
          operator: OP_MATCHES
          precedence_overridden: false
          type: BOOLEAN
      rm_type_name: ITEM_TREE
      occurrences: [1, 1]
      node_i_d: at0001
      any_allowed: false
      path: /details[at0001]
    any_allowed: false
    path: /details
  - rm_attribute_name: identities
    children:
    - includes:
      - expression:
          left_operand: {item: archetype_id/value, reference_type: CONSTANT, type: STRING}
          right_operand:
            item: {pattern: '(person_name)[a-zA-Z0-9_-]*\.v1'}
            reference_type: CONSTANT
            type: String
          operator: OP_MATCHES
          precedence_overridden: false
          type: BOOLEAN
      rm_type_name: PARTY_IDENTITY
      occurrences: [1, 1]
      node_i_d: at0002
      any_allowed: false
      path: /identities[at0002]
    any_allowed: false
    path: /identities
  - rm_attribute_name: contacts
    children:
    - attributes:
      - rm_attribute_name: addresses
        children:
        - includes:
          - expression:
              left_operand: {item: archetype_id/value, reference_type: CONSTANT, type: STRING}
              right_operand:
                item: {pattern: '(address)([a-zA-Z0-9_-]+)*\.v1'}
                reference_type: CONSTANT
                type: String
              operator: OP_MATCHES
              precedence_overridden: false
              type: BOOLEAN
          - expression:
              left_operand: {item: archetype_id/value, reference_type: CONSTANT, type: STRING}
              right_operand:
                item: {pattern: '(electronic_communication)[a-zA-Z0-9_-]*\.v1'}
                reference_type: CONSTANT
                type: String
              operator: OP_MATCHES
              precedence_overridden: false
              type: BOOLEAN
          rm_type_name: ADDRESS
          occurrences: [1, 1]
          node_i_d: at0030
          any_allowed: false
          path: /contacts[at0003]/addresses[at0030]
        any_allowed: false
        path: /contacts[at0003]/addresses
      rm_type_name: CONTACT
      occurrences: [1, 1]
      node_i_d: at0003
      any_allowed: false
      path: /contacts[at0003]
    any_allowed: false
    path: /contacts
  - rm_attribute_name: relationships
    children:
    - attributes:
      - rm_attribute_name: details
        children:
        - attributes:
          - rm_attribute_name: items
            children:
            - attributes:
              - rm_attribute_name: value
                children:
                - attributes: []
                  rm_type_name: DV_TEXT
                  occurrences: [1, 1]
                  any_allowed: true
                  path: /relationships[at0004]/details/items[at0040]/value
                - attributes:
                  - rm_attribute_name: defining_code
                    children:
                    - reference: ac0000
                      rm_type_name: CodePhrase
                      occurrences: [1, 1]
                      any_allowed: false
                      path: /relationships[at0004]/details/items[at0040]/value/defining_code
                    any_allowed: false
                    path: /relationships[at0004]/details/items[at0040]/value/defining_code
                  rm_type_name: DV_CODED_TEXT
                  occurrences: [1, 1]
                  any_allowed: false
                  path: /relationships[at0004]/details/items[at0040]/value
                any_allowed: false
                path: /relationships[at0004]/details/items[at0040]/value
              rm_type_name: ELEMENT
              occurrences: [1, 1]
              node_i_d: at0040
              any_allowed: false
              path: /relationships[at0004]/details/items[at0040]
            any_allowed: false
            path: /relationships[at0004]/details/items
          rm_type_name: ITEM_TREE
          occurrences: [1, 1]
          any_allowed: false
          path: /relationships[at0004]/details
        any_allowed: false
        path: /relationships[at0004]/details
      rm_type_name: PARTY_RELATIONSHIP
      occurrences: [1, 1]
      node_i_d: at0004
      any_allowed: false
      path: /relationships[at0004]
    any_allowed: false
    path: /relationships
  rm_type_name: PERSON
  occurrences: [1, 1]
  node_i_d: at0000
  any_allowed: false
  path: /
ontology:
  term_definitions_list:
  - language: pt-br
    definitions:
    - code: at0000
      items: {text: Dados da pessoa, description: Dados da pessoa.}
    - code: at0001
      items: {text: Detalhes, description: Detalhes demográficos da pessoa.}
    - code: at0002
      items: {text: Nome, description: Conjunto de dados que especificam o nome da pessoa.}
    - code: at0003
      items: {text: Contatos, description: Contatos da pessoa.}
    - code: at0004
      items: {text: Relacionamentos, description: 'Relacionamentos de uma pessoa, especialmente laços familiares.'}
    - code: at0030
      items: {text: Endereço, description: 'Endereços vinculados a um único contato, ou seja, com o mesmo período de validade.'}
    - code: at0040
      items: {text: Grau de parentesco, description: Define o grau de parentesco entre as pessoas envolvidas.}
  - language: en
    definitions:
    - code: at0000
      items: {text: Person, description: Personal demographic data.}
    - code: at0001
      items: {text: Demographic details, description: A person's demographic details.}
    - code: at0002
      items: {text: Name, description: A person's name.}
    - code: at0003
      items: {text: Contacts, description: A person's contacts.}
    - code: at0004
      items: {text: Relationships, description: 'A person''s relationships, especially family ties.'}
    - code: at0030
      items: {text: Addresses, description: 'Addresses linked to a single contact, i.e. with the same time validity.'}
    - code: at0040
      items: {text: Relationship type, description: Defines the type of relationship between related persons.}
  constraint_definitions_list:
  - language: pt-br
    definitions:
    - code: ac0000
      items: {text: Códigos para tipo de parentesco, description: códigos válidos para tipo de parentesco.}
  - language: en
    definitions:
    - code: ac0000
      items: {text: Codes for type of relationship, description: Valid codes for type of relationship.}
  term_binding_list: []
  constraint_binding_list: []
is_controlled: false

well I read this to say:

  • if you double quote a long String containing line breaks (if you don’t yet get into different trouble) THEN

  • this scalar cannot be the value of a ‘simple key’;

  • a ‘simple key’ is defined as:

  • A simple key has no identifying mark. It is recognized as being a key either due to being inside a flow mapping, or by being followed by an explicit value. Hence, to avoid unbound lookahead in YAML processors, simple keys are restricted to a single line and must not span more than 1024 stream characters (hence the need for the flow-key context). Note the 1024 character limit is in terms of Unicode characters rather than stream octets, and that it includes the separation following the key itself.

maybe I misunderstood that a ‘simple key’ can’t have quotes, but in any case, the concept of a ‘simple key’, if the object of YAML is object data serialisation, is … pretty strange (if they are hash keys, then they are normal strings, and there should be no problem). Not distinguishing between hash keys and attribute names seems to be a problem in YAML as in JSON. Very odd design IMO. Why the syntactic structure of a ‘value’ should have any dependence on the syntactic structure of a ‘key’ is beyond me.
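
The point about hash keys and attribute names can be seen with a few lines of Python and the standard `json` module. This is only an illustrative sketch: `TermDefinition` and `to_json` are hypothetical names, and the data is borrowed from the ontology section of the YAML example above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TermDefinition:
    # Attribute names: fixed, and known from the class definition.
    text: str
    description: str


# Hash keys: arbitrary run-time strings (here, an archetype term code).
term_definitions = {
    "at0000": TermDefinition("Person", "Personal demographic data."),
}


def to_json(obj) -> str:
    """Serialise, turning dataclass instances into plain JSON objects."""
    return json.dumps(obj, default=asdict, sort_keys=True)
```

In the output of `to_json(term_definitions)` both the hash key `at0000` and the attribute names `text` and `description` appear as quoted strings in the same syntactic position, so a reader of the text alone cannot tell a hash table from an object without outside type information.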

Anyway, for the moment I will stick with the format (for Strings):

unquoted_key: "double quoted string"

this format passes the online parser tests, and handles multi-line strings better. Otherwise you have to use ‘|’, ‘>’ and/or ‘\’ markers all over the place.
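
A serialiser following this unquoted key plus double-quoted value convention could look roughly like the Python sketch below. It is not the Eiffel code discussed here: `yaml_string_line` is a hypothetical helper, and it assumes that keys are plain identifiers and that the escapes produced by `json.dumps` (such as `\n`, `\"` and `\\`) are acceptable in a YAML double-quoted scalar. Unusual characters, for example Unicode line separators, are not handled specially.

```python
import json
import re

# Assumption: keys are simple identifiers, so they never need quoting.
_PLAIN_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def yaml_string_line(key: str, value: str, indent: int = 0) -> str:
    """Emit one `key: "value"` line with the value double quoted.

    Line breaks inside the value become the escape sequence \\n, so the
    scalar always stays on one physical line and no '|' or '>' block
    markers are needed.
    """
    if not _PLAIN_KEY.fullmatch(key):
        raise ValueError(f"key needs quoting, not supported here: {key!r}")
    quoted = json.dumps(value, ensure_ascii=False)
    return f"{' ' * indent}{key}: {quoted}"
```

For example, `yaml_string_line("purpose", "Representation of a person's demographic data.", indent=4)` gives a line like the `purpose` entry in the example above, except that the value is always quoted.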

- thomas