Rules in archetypes - a migration path

thomas.beale · 25 January 2021 22:15

I’ve been mulling over the old expression syntax that was defined in the rules section of an archetype a few years ago, and that @pieterbos and @yampeku and I have been discussing. The following is an example.

rules
    $apgar_breathing_value: Integer := /data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value
    $apgar_heartrate_value: Integer := /data[id3]/events[id4]/data[id2]/items[id6]/value[id40]/value
    $apgar_muscle_value: Integer := /data[id3]/events[id4]/data[id2]/items[id14]/value[id41]/value
    $apgar_reflex_value: Integer := /data[id3]/events[id4]/data[id2]/items[id18]/value[id42]/value
    $apgar_colour_value: Integer := /data[id3]/events[id4]/data[id2]/items[id22]/value[id43]/value
    $apgar_total_value: Integer := /data[id3]/events[id4]/data[id2]/items[id26]/value[id44]/magnitude

    Apgar_total: $apgar_total_value = $apgar_breathing_value + $apgar_heartrate_value + $apgar_muscle_value + $apgar_reflex_value + $apgar_colour_value

In the modern syntax, this would be something like:

rules
    apgar_total_value: Integer
    apgar_heartrate_value: Integer
    apgar_breathing_value: Integer
    apgar_reflex_value: Integer
    apgar_muscle_value: Integer
    apgar_colour_value: Integer

    check apgar_total_value = apgar_heartrate_value + apgar_breathing_value + apgar_reflex_value + apgar_muscle_value + apgar_colour_value

bindings
    path_bindings = <
        ["apgar_breathing_value"] = <"/data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value">
        ["apgar_heartrate_value"] = <"/data[id3]/events[id4]/data[id2]/items[id6]/value[id40]/value">
        ["apgar_muscle_value"] = <"/data[id3]/events[id4]/data[id2]/items[id14]/value[id41]/value">
        ["apgar_reflex_value"] = <"/data[id3]/events[id4]/data[id2]/items[id18]/value[id42]/value">
        ["apgar_colour_value"] = <"/data[id3]/events[id4]/data[id2]/items[id22]/value[id43]/value">
        ["apgar_total_value"] = <"/data[id3]/events[id4]/data[id2]/items[id26]/value[id44]/magnitude">
    >

This is the style of bindings currently used in GDL2 guidelines, and correctly separates out the expression (which has instance-level semantics) from the bindings (which have multiple->single instance mapping semantics).

Now, if the data contained say 3 Apgar samples, i.e. 3 entire trees of instances to which these paths will map, the meaning of the expression check apgar_total_value = apgar_heartrate_value + apgar_breathing_value + apgar_reflex_value + apgar_muscle_value + apgar_colour_value is as follows:

for each top-level data instance (3 in this case)…
- traverse the expression tree, in the normal way, and for all vars in the expression tree,
  - find the node at the path mapped to each right-hand-side variable,
  - find the node at the path mapped to the left-hand side variable
- extract the data values into the parse tree and execute the expression

Or in normal words: execute the expression for each of the 3 Apgar instance trees.
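The per-instance evaluation above can be sketched in Python. This is only an illustration under assumptions: each dict stands for the values already resolved from one Apgar instance tree via the path bindings, and `COMPONENTS`, `check_apgar` and `check_all` are hypothetical names, not part of any openEHR tool.

```python
# Hypothetical data shape: one dict per Apgar instance tree, holding the values
# already resolved through the path bindings (path lookup itself is not shown).
COMPONENTS = [
    "apgar_heartrate_value",
    "apgar_breathing_value",
    "apgar_reflex_value",
    "apgar_muscle_value",
    "apgar_colour_value",
]


def check_apgar(instance):
    """Evaluate the check against the values from a single instance tree."""
    return instance["apgar_total_value"] == sum(instance[name] for name in COMPONENTS)


def check_all(instances):
    """Run the check once per instance tree; values are never mixed across trees."""
    return [check_apgar(instance) for instance in instances]


samples = [
    {"apgar_heartrate_value": 2, "apgar_breathing_value": 2, "apgar_reflex_value": 1,
     "apgar_muscle_value": 1, "apgar_colour_value": 2, "apgar_total_value": 8},
    {"apgar_heartrate_value": 1, "apgar_breathing_value": 1, "apgar_reflex_value": 1,
     "apgar_muscle_value": 1, "apgar_colour_value": 1, "apgar_total_value": 5},
    {"apgar_heartrate_value": 2, "apgar_breathing_value": 2, "apgar_reflex_value": 2,
     "apgar_muscle_value": 2, "apgar_colour_value": 2, "apgar_total_value": 9},
]
results = check_all(samples)
print(results)  # [True, True, False]
```

The third sample fails because its recorded total (9) does not equal the sum of its own five components (10); no value from another sample is ever involved.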

Notice that the symbolic variables get mapped only to data tree nodes within the same data hierarchy - and not to all possible permutations of mappings, which would be incoherent. The semantics of ‘expressions’ as applied to archetypes require some special processing, to ensure they are executed on coherent data shapes that match the archetype.

The relationship between each (instance-level) symbolic variable and its path (a constraint-level object which can match multiple instances) is clearly a ‘mapping’ or ‘binding’ one, whereby more than one value (in general) from the data will be mapped to the variable. The older expression

    $apgar_breathing_value: Integer := /data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value

is thus not really the assignment it looks like, but is stating this binding. Indeed, it doesn’t make sense as a typed assignment, because the path and the symbols are operating at different levels of abstraction.

I therefore propose that the bridge between older archetypes that may have this construction and proper type-safe rule expressions in archetypes is to interpret this syntax as a shorthand for the following more modern expression + binding:

rules
    ...
    apgar_breathing_value: Integer

bindings
    path_bindings = <
        ["apgar_breathing_value"] = <"/data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value">
    >

Ideally, a tool encountering the old pseudo-assignment expression would restructure it in the above form, i.e. silently upgrade the archetype.
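As a rough sketch of what such an upgrade step might look like, the Python below rewrites one old-style line into a declaration plus a binding entry. The regular expression and the `upgrade` function are assumptions made for illustration; a real tool would presumably work on the parsed rules rather than on raw text.

```python
import re

# Matches the old pseudo-assignment form: $name: Type := /path
OLD_RULE = re.compile(
    r"^\s*\$(?P<name>\w+)\s*:\s*(?P<type>\w+)\s*:=\s*(?P<path>/\S+)\s*$"
)


def upgrade(line):
    """Return (declaration, binding) for an old-style line, or None if it does not match."""
    match = OLD_RULE.match(line)
    if match is None:
        return None
    name, type_name, path = match.group("name", "type", "path")
    declaration = f"{name}: {type_name}"
    binding = f'["{name}"] = <"{path}">'
    return declaration, binding


old = "$apgar_breathing_value: Integer := /data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value"
declaration, binding = upgrade(old)
print(declaration)
print(binding)
```

Lines that are not pseudo-assignments, such as the `Apgar_total:` assertion, return None and would be left for a separate rewrite into a `check` statement.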

There are more details to take care of, but this would fix one of the major issues. Feedback welcome.

yampeku · 26 January 2021 15:31

I somewhat feel that this is not enough information to assume that. Why would it be the tree and not the Composition (or Cluster) level? This seems very related to the grouping and query semantics we were discussing for AQL.

pablo · 26 January 2021 16:18

Did we consider how FHIR profiles define their rules?

For instance, I don’t see that we have a way to say IF x THEN y.

Like in the example: apgar_total_value = apgar_heartrate_value + apgar_breathing_value + apgar_reflex_value + apgar_muscle_value + apgar_colour_value

To be able to do the assignment “apgar_total_value =”, first “apgar_heartrate_value.exists() AND apgar_breathing_value.exists() AND …” should be true.

Another thing is the meaning of “=”: is that an assignment, or the check of a rule that should be true (“==” in programming languages)?

I think what they do with the rules in the profiles is pretty neat, and very easy to load into a validator of complex conditions to check if an instance validates with those extra, non-structural, rules.

thomas.beale · 26 January 2021 16:42

The general case is in fact the top-level tree, i.e. Composition, in the usual openEHR case. Consider some expression like:

defined (xxx) implies defined (aaa) and aaa = bbb + ccc

bindings
    ["xxx"] = <"/protocol/xxx">
    ["aaa"] = <"/data/events[id2]/data[id4]/aaa">
    ["bbb"] = <"/data/events[id2]/data[id5]/bbb">
    ["ccc"] = <"/data/events[id2]/data[id6]/ccc">

Notice the xxx path is in the protocol, and the other 3 are inside an event. Now, what is most likely really needed in the bindings is wildcards like this:

bindings
    ["xxx"] = <"/protocol/xxx">
    ["aaa"] = <"/data/events[*]/data[id4]/aaa">
    ["bbb"] = <"/data/events[*]/data[id5]/bbb">
    ["ccc"] = <"/data/events[*]/data[id6]/ccc">

Then the rules I stated earlier would apply to map the expression to a coherent set of data points within each Event, plus the common /protocol element (i.e. the xxx variable).
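One way to picture this mapping is the Python sketch below. It is an illustration under assumptions: the observation is a plain dict, each event is flattened to the keys aaa, bbb and ccc (standing in for the data[id4], data[id5] and data[id6] paths), and `environments` and `rule` are hypothetical names.

```python
def environments(observation):
    """Yield one variable environment per Event: event values plus the shared protocol value."""
    xxx = observation.get("protocol", {}).get("xxx")
    for event in observation["events"]:
        yield {
            "xxx": xxx,  # same /protocol value for every Event
            "aaa": event.get("aaa"),
            "bbb": event.get("bbb"),
            "ccc": event.get("ccc"),
        }


def rule(env):
    """defined (xxx) implies (defined (aaa) and aaa = bbb + ccc)."""
    if env["xxx"] is None:
        return True  # antecedent is false, so the implication holds
    # Assumption of this sketch: a missing aaa, bbb or ccc makes the check fail.
    if None in (env["aaa"], env["bbb"], env["ccc"]):
        return False
    return env["aaa"] == env["bbb"] + env["ccc"]


observation = {
    "protocol": {"xxx": "some protocol value"},
    "events": [
        {"aaa": 5, "bbb": 2, "ccc": 3},
        {"aaa": 7, "bbb": 1, "ccc": 1},
    ],
}
results = [rule(env) for env in environments(observation)]
print(results)  # [True, False]
```

Each Event gets its own environment, so aaa, bbb and ccc always come from the same Event, while xxx is shared across all of them.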

There’s a bit of work to describe the general case, but I don’t think it is complicated.

thomas.beale · 26 January 2021 16:54

We do indeed have this in the Expression Language:

- attached predicate (null checking)
- defined predicate - check if tracked subject variable has any data from retrieval

(Still messing around with the syntax of these, but they are both in the meta-model).

For the syntax, I think it is better to be able to use ‘=’ to mean equality checking as in mathematics - this is more domain friendly. Programmer stuff like ‘==’ and ‘===’, ‘~=’ etc won’t make sense to domain experts. So writing a = b + c is an assertion, so it really should be something like assert a = b + c. To actually write the value b+c into the field a would require an assignment for which I used the common symbol :=. Other symbols like <- (F#, a few other languages) are nicer mathematically, but possibly too obscure for domain experts.

I have not looked at any FHIR profile rules - got a link to an example?

matijap · 27 January 2021 06:48

How does the interpreter distinguish, in the wildcard case, between “expression is valid inside each event” and "expression is valid inside entire Observation" and "expression is valid within the Composition"? The binding of aaa, bbb and ccc never states whether the same value ought to be used for the wildcard for all three variables or not necessarily.

pieterbos · 27 January 2021 08:56

What we do is much simpler:

Variables, or values from paths in the language, are actual variables, but they are always a list of values. This definition uses RM context only for lookups and execution, no archetype objects anywhere.
We call a list of one value a single value.

Then binary operators are defined on:

- two single values
- two lists of equal size
- a list and a single value

The result, in each case, is:

- a single value
- a list, where each item with index i is equal to applying the operator to the operands at item i in both lists
- a list, where each item with index i is equal to applying the operator to item i of the list operand and the single value

This is equal to Matlab vector arithmetic.
We then define similar semantics for functions.
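The operator semantics described above can be sketched in a few lines of Python. This is not the Archie implementation, just an illustration; `apply_binary` is a hypothetical name, and the null-handling special cases are left out.

```python
import operator


def apply_binary(op, left, right):
    """Apply op to two value lists; a one-item list acts as a single value."""
    if len(left) == len(right):
        # Covers two single values as well as two lists of equal size.
        return [op(a, b) for a, b in zip(left, right)]
    if len(right) == 1:
        return [op(a, right[0]) for a in left]
    if len(left) == 1:
        return [op(left[0], b) for b in right]
    raise ValueError("operands must be of equal size, or one must be a single value")


print(apply_binary(operator.add, [2], [3]))                 # [5]
print(apply_binary(operator.add, [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(apply_binary(operator.add, [1, 2, 3], [10]))          # [11, 12, 13]
```

Two lists of different sizes, neither of which is a single value, raise an error in this sketch; how a real evaluator should treat that case is left open here.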

This is not perfect, and needs some special cases for null-handling. We also store which exact unique paths from the RM are used in the calculation of each result value, so it is possible to do some fancy UI tricks.
We do not ‘bind values to the parse tree’ directly, as this causes all sorts of trouble with for all statements, and there is no way we are actually changing the parse tree during rule evaluation.

I think it would be better to add for-all blocks to the grammar, and to add variables within those for-all blocks that can be bound to paths, within the actual language. Keeping data binding separate from the execution makes this a very complex concept to fully understand and implement, and will always cause some trouble with unexpected cases.

We actually flatten the rules for OPT generation. Adding for-all blocks with statements would make that significantly better and the result easier to read. We keep the variables during flattening and apply the above semantics, and add for all statements where possible in the current grammar. This works, but it is not perfect.

An alternative to flattening rules would be to add something similar to what we do with component_terminologies.

pieterbos · 27 January 2021 09:02

Also, in the current grammar there is exists for null checking. There is no difference between attached and defined, it handles both cases just fine. I do not understand the difference between the two cases.
Why add that difference?

thomas.beale · 27 January 2021 09:40

I forgot to say that those bindings are in an Observation archetype, so they apply equally to every use of that Observation within a larger structure. So that takes care of the last two cases.

thomas.beale · 27 January 2021 09:55

Here I don’t agree, because we have to state a mapping from possible data items in whatever source (here data generated by archetypes, but in other situations, data resulting from AQL queries or HL7 messages or whatever) to an intelligible expression or rule. The latter always works unambiguously at the instance level, otherwise the semantics even of operators and function calls become muddied with questions of what data items are we even binding to? Having to handle paths within rules also means that such rules are essentially non-reusable (quite apart from being unreadable).

Now, in the ‘old’ EL, we have those assignment / binding statements and to make them intelligible you have to do something like you say - make them stand as ‘variables, but with multiple values’, i.e. some kind of vector. But this greatly complicates the inferred typing, since now in the statement $apgar_breathing_value: Integer := /data[id3]/events[id4]/data[id2]/items[id10]/value[id39]/value, the variable $apgar_breathing_value isn’t clearly of type Integer but presumably of some type like Array<Integer>.

How that functions is unclear in my /protocol example, where the Arrays are not even of equal length.

So I’m proposing that in the long run we get rid of that, because it results in a lot of hidden semantics that are not clear from the syntactic representation.

In the short term, it can be left, but we can add the ability to add bindings to archetypes, which means that a different implementation can use a more orthodox expression evaluation approach (i.e. where the types really are what is stated), but your current implementation can do those tricks (which hopefully generate the same results!) and not bother with bindings.

When bindings are in use, then all the questions of how real data are mapped to entities in expressions is pushed one level out from the expression language semantics and its evaluator, which I think is a better approach.

thomas.beale · 27 January 2021 11:16

There are two cases to consider:

- null-checking in the usual programming sense, i.e. if (myPerson.name != null) then xyz;
- for a variable that is mapped to an external datum (in GDL and Task Planning, not all variables are), is the data item found in the source system? The mapped data object will not be null, it will report an unavailable status and maybe an unavailable reason.

I’m still messing around with the best way to represent these things in the grammar, so open to suggestions. Note that the ‘defined’ or ‘exists’ predicate in archetype rule expressions could be understood either way - declaring a variable ‘blood_glucose’ mapped to a path that doesn’t exist in the data could:

- cause nothing to happen at runtime, i.e. the variable is never instantiated and is literally null
- cause instantiation of a proxy variable (what we will do in guidelines and task planning) that ‘tracks’ the bound data source - this object is not null, but will support a call like blood_glucose.is_unavailable, to which the syntax predicate defined (blood_glucose) can be mapped.

I would prefer it to be the second, since that represents what’s really going on - i.e. an attempt to map data instances to an expression variable.
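To make the second option concrete, here is a small Python sketch of a proxy variable. Everything in it is an assumption for illustration: the `TrackedVariable` class, the `bind` and `defined` helpers, the flat path-to-value dict standing in for the data source, and the /data/blood_glucose path are all hypothetical; only the is_unavailable call is taken from the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TrackedVariable:
    """Proxy that tracks a bound data source; the proxy object itself is never null."""
    name: str
    path: str
    value: Optional[Any] = None
    unavailable_reason: Optional[str] = None

    @property
    def is_unavailable(self) -> bool:
        return self.value is None


def bind(name, path, data):
    """Always return a proxy, whether or not the path is present in the data."""
    if path in data:
        return TrackedVariable(name, path, value=data[path])
    return TrackedVariable(name, path, unavailable_reason="path not found in data")


def defined(variable):
    """The 'defined' predicate, mapped onto the proxy's availability status."""
    return not variable.is_unavailable


data = {"/data/heart_rate": 72}  # hypothetical flat path-to-value store
blood_glucose = bind("blood_glucose", "/data/blood_glucose", data)
heart_rate = bind("heart_rate", "/data/heart_rate", data)
print(defined(blood_glucose), defined(heart_rate))  # False True
```

The point of the design is that blood_glucose is a real object even when no data was found, so a rule can ask about its status instead of tripping over a null.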

One thing to remember is that how expressions and rules work in archetypes is a fairly simplified subset of how they work in guidelines and plan structures. Having looked at some more complex application development requirements recently, I am of the growing opinion that we should limit the rule logic inside archetypes to just the original vision:

- assert relationships (including arithmetic) between multiple fields
- assert existence/defined checks and relationships, i.e. of the type exists is_smoker implies exists smoking_details or similar.

My impression is that in Nedap you are trying to do more with the rules than just this. Everyone is. But what can be achieved inside an archetype is pretty limited, which is why I have been working on the concept of the separated decision logic module, and also the subject proxy service.

So maybe it is a good moment to have the conversation about what the larger requirements are - e.g. rules and expressions in data capture forms, applications and so on.

thomas.beale · 27 January 2021 11:28

By the way, this logic is the equivalent of the mapping logic I am proposing, but doing it in the mapping means that the more complex inferred mixed vector/scalar types are kept away from the expression language, and limited to the particular binding mechanism. When the binding is to AQL results, this kind of vector / scalar thing can also occur, but it won’t be the same mappings. And from FHIR… who knows

pieterbos · 27 January 2021 11:52

What we are doing is interactive forms, with a simple to write language.
If you add assertions, this means a certain constraint must be valid for the data corresponding to the archetype to be valid. What we do is to help our users to enter valid data, by finding cases where we can automatically change the data so the assertions become valid, and applying those changes in the form before they save the data. This is even how the code is structured internally - first the assertions are validated, then the suggestions to change the data are generated after that for every assertion matching certain forms. See AssertionsFixer.java in Archie for the code.
If you would write ‘exists $is_smoker implies not exists $smoking_details’, we would help our end user by updating the form in case $is_smoker would have a non-null value, by removing what $smoking_details points to in the form.
If you would write your exact statement, without the ‘not’, we would make $smoking_details mandatory if the user wants to save a complete composition. And if the user had previously removed that part of the data in their form, or if it was not present by default, we would add an empty placeholder back for that user so they can enter it.
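The two cases above could be sketched roughly as follows in Python. This is not the code in AssertionsFixer.java, only an illustration under assumptions: the form is a flat dict of field name to value, and `fix_implies` with its parameters is a hypothetical helper.

```python
def fix_implies(form, antecedent, consequent, consequent_must_exist):
    """Return a copy of form adjusted so 'exists antecedent implies [not] exists consequent' holds."""
    fixed = dict(form)
    if fixed.get(antecedent) is None:
        return fixed  # antecedent absent: the implication already holds
    if consequent_must_exist:
        if fixed.get(consequent) is None:
            fixed[consequent] = ""  # empty placeholder for the user to fill in
    else:
        fixed.pop(consequent, None)  # remove what the consequent points to
    return fixed


form = {"is_smoker": "current", "smoking_details": "10 per day"}

# exists $is_smoker implies not exists $smoking_details
without_details = fix_implies(form, "is_smoker", "smoking_details", consequent_must_exist=False)

# exists $is_smoker implies exists $smoking_details
with_placeholder = fix_implies({"is_smoker": "current"}, "is_smoker", "smoking_details",
                               consequent_must_exist=True)
print(without_details)
print(with_placeholder)
```

The original form is left untouched and the adjusted copy is returned, so a UI could show the suggested change before applying it.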

Without auto-fixing there is still some, but rather limited use in using these for input validation. But it is hard to present understandable messages for them unless these assertions remain very very simple, since there is no way to add a hand-written human readable text explaining the assertion that must be valid.

Another thing in the current language is that it is very simple to use. The difference between null and unavailable is hard to understand for non-technical people - the data is not there in both cases, whatever the reason. I would avoid anything that makes the language more complex to users, unless it solves a functional need. I do not see one here.

thomas.beale · 27 January 2021 12:08

oops- I meant to write is_smoker != [|never|] implies exists smoking_details…

matijap · 27 January 2021 12:22

I don’t understand how it takes care of the second case, where one Observation has multiple events.

thomas.beale · 27 January 2021 12:22

We could potentially do something about this in the meta-model, or possibly by using annotations, which would avoid changing the meta-model - I agree it would be good to have a better solution here.

Agree - At least in archetypes I’ll try to avoid that complication.

thomas.beale · 27 January 2021 12:37

Well in the following…

bindings
    ["aaa"] = <"/data/events[*]/data[id4]/aaa">
    ["bbb"] = <"/data/events[*]/data[id5]/bbb">
    ["ccc"] = <"/data/events[*]/data[id6]/ccc">

The '*' in the above is meant to imply that those paths from within the same Event within an Observation instance are mapped to the variables named ‘aaa’ etc (i.e. the runtime binding doesn’t do anything crazy like generating nonsense permutations that map breathing@1min and heart_rate@5mins into the same expression evaluation). That clearly requires some rules about how those '*' are interpreted, which IMO are easy to write.

thomas.beale · 28 January 2021 18:06

BTW @pieterbos pointed out a while ago that there is no real need for literal wildcards, the above could just be achieved without any predicate, i.e.:

bindings
    ["aaa"] = <"/data/events/data[id4]/aaa">
    ["bbb"] = <"/data/events/data[id5]/bbb">
    ["ccc"] = <"/data/events/data[id6]/ccc">

The only downside of that is that you have to know the relevant RM to realise that ‘events’ is likely to be a contained object over which you want to repeat the binding attachments. I’m fine doing it either way, and the latter way is probably more correct technically.