AQL: Formal definition of FROM clause

thomas.beale · 24 February 2020 20:02

They have to because there are no Observations in an openEHR EHR not contained by those things, in each case. But, we should not really be creating queries that don’t mention the top-level container, i.e. the EHR.

We discussed a little while ago that we should have a notional top-top-level openEHR query space, of the following logical shape (with all the VERSIONing objects hidden):

openEHR {
    ehrs: EHR[*]
    demographics {
        parties: PARTY[*]
        party_relationships: PARTY_RELATIONSHIP [*]
    }
    plans: WORK_PLAN[*]
    etc
}

sebastian.iancu · 24 February 2020 21:05

As said before, I see where you are aiming with BMM Thomas, but the following:

is where I see it a bit differently, because beside an AQL processor (or something similar that needs BMM to construct/translate the logical intent of the query to a particular implementation ‘taste’), I don’t think a query-optimizer will use such BMM based information to build up its internal query-plan. At least from my experience and technological layers, the data-store implementation (the persistence architecture) is usually made with ‘hard-wires’; it is a design choice.

thomas.beale · 24 February 2020 23:05

You can do that. But that’s implementation, it’s not specification. There is nothing you can put in the AQL specification that is specific to the openEHR EHR or any other model. How those models are used in practice will give you your optimisation settings. Which would probably be different for HighMed compared to Code24 and the other patient-facing EHR systems based on the same model.

Note also that correctly processing ‘CONTAINS’ isn’t an optimisation; it’s either correct or it’s not. Whether it is processed fast at runtime is a whole other question.

Seref · 25 February 2020 08:19

you are, but where you’re getting is not what my questions are about

almost your entire response is the answer to “how should I implement aql”, which is not what I’m asking, really. It is entirely possible that I’m missing the answers to my questions and this would not be the first time so personally I’m not sure where to go from here.
Nonetheless, I appreciate the patience and light hearted tone, which is hard to keep when having a discussion on the internets.

thomas.beale · 25 February 2020 12:19

Actually I’m not saying anything about how to implement AQL; just about what can go in the spec versus what cannot…

AQL needs to know what ‘CONTAINS’ means, generally
An AQL processor needs to know where it can be used with respect to some particular model
It also needs to know how to navigate reference-by-id containment relationships (like EHR.compositions) and direct-reference relationships (e.g. COMPOSITION.content)
=> therefore there needs to be a way of an AQL processor interrogating a model representation that provides this information.
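
As a rough illustration of that last point, a model representation that an AQL processor could interrogate might look like the sketch below. The names `Containment`, `MODEL` and `containment_kind` are hypothetical, and the two entries are a hand-written fragment for illustration only, not the actual openEHR RM or a BMM schema.

```python
from enum import Enum
from typing import Optional


class Containment(Enum):
    BY_ID = "reference-by-id"     # e.g. EHR.compositions
    DIRECT = "direct-reference"   # e.g. COMPOSITION.content


# Hypothetical, hand-written model fragment (an assumption for illustration):
# (parent type, attribute) -> (child type, kind of containment).
MODEL = {
    ("EHR", "compositions"): ("COMPOSITION", Containment.BY_ID),
    ("COMPOSITION", "content"): ("CONTENT_ITEM", Containment.DIRECT),
}


def containment_kind(parent: str, child: str) -> Optional[Containment]:
    """Return how `parent` directly contains `child`, or None if it does not."""
    for (p, _attr), (c, kind) in MODEL.items():
        if p == parent and c == child:
            return kind
    return None
```

Whether such a table is loaded from BMM, a rules file or hard-wired code is left open here; the sketch only shows the kind of question the processor needs answered.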

ian.mcnicoll · 25 February 2020 13:02

@thomas.beale Let me try to re-phrase this …

AQL needs to know what ‘CONTAINS’ means, generally

An openEHR AQL implementer needs to know where it can be used with respect to the openEHR RM(s)

An implementer needs to know that CONTAINS (for the openEHR RM) will have to navigate reference-by-id containment relationships (like EHR.compositions) and direct-reference relationships (e.g. COMPOSITION.content), and this needs to be made explicit in documentation
~~=> therefore there needs to be a way of an AQL processor interrogating a model representation that provides this information.~~

This is an implementation decision - IMO we should be specifying the correct behaviour in a way that does not require direct interrogation of the model which implies a dependency on BMM (or something else) for which I detect little current appetite amongst CDR implementers. At the top of this thread, Seref made a start on a documentation approach that I think meets the current need to document the expected behaviour of an openEHR RM ‘profile’ of AQL, without needing to interrogate the underlying model UML or BMM. Both of the latter options feel much harder to grapple with and understand.

So my vote goes strongly with Seref’s proposed documentation approach - see top of this topic, and leave any discussion about a more abstract ‘model-driven approach’ for the future - that will require a lot more work from Thomas that we (openEHR) cannot really afford, and which I don’t sense is really being asked for from those here with implementation experience but please do tell me if I am wrong!

thomas.beale · 25 February 2020 13:26

It’s not really an option. If there is anything about ‘openEHR RM’ or any other model, in the AQL specification, we have done the wrong thing.

How the model is represented in any particular implementation at this point in time - BMM, rules file, hard-wired something-or-other - is an implementation question. But at runtime, an AQL processor has to be able to discover where the CONTAINS relationships are, not to mention the typing etc. of the underlying model’s types.

There is nothing we should be putting in the AQL spec that mentions anything about particular models or particular relationships, typing or anything else specific.

If someone wants to create some extra spec of ‘AQL profiles’ and how to write them, and also an ‘openEHR profile’, then fine I guess, but I don’t see the point.

I also apparently have not been clear enough about where the BMM work is or is going. It’s pretty much done, and it’s being completed to drive Expressions, Task Planning and probably GDL3, because you can’t do any of these things without proper model representation.

Being able to process expressions, queries, decision logic etc - all requires model representation. This is just standard mainstream IT, nothing special.

thomas.beale · 25 February 2020 14:07

BTW I don’t disagree with anything in @Seref’s original post, which is nice and clear, apart from possibly this:

If I read this literally, it implies that the use of ‘CONTAINS’ in an AQL query ‘establishes’ something about the model. But that isn’t right - a query based on a model can’t state truths about the model, only the model can do that. If it were the case, you could never validate a query’s use of ‘CONTAINS’, you would just have to trust it. So you’d have no way of knowing if a query was correct w.r.t. its underlying RM.
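
To make the validation point concrete, here is a minimal sketch of checking a query’s CONTAINS chain against a model rather than trusting the query. `CAN_CONTAIN`, `can_reach` and `validate_contains_chain` are hypothetical names, and the containment table is a simplified hand-written fragment (intermediate classes are omitted), not the real openEHR RM. It also assumes CONTAINS may skip levels, i.e. any-depth containment.

```python
from typing import Dict, List, Set

# Hypothetical containment graph: parent type -> types it can directly contain.
# A simplified fragment for illustration, not the real openEHR RM.
CAN_CONTAIN: Dict[str, Set[str]] = {
    "EHR": {"COMPOSITION"},
    "COMPOSITION": {"SECTION", "OBSERVATION"},
    "SECTION": {"SECTION", "OBSERVATION"},
    "OBSERVATION": {"CLUSTER", "ELEMENT"},
    "CLUSTER": {"CLUSTER", "ELEMENT"},
}


def can_reach(parent: str, child: str) -> bool:
    """True if `child` is reachable from `parent` at any depth in CAN_CONTAIN."""
    seen: Set[str] = set()
    stack = [parent]
    while stack:
        current = stack.pop()
        for nxt in CAN_CONTAIN.get(current, set()):
            if nxt == child:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def validate_contains_chain(chain: List[str]) -> List[str]:
    """Return one error message per adjacent pair the model does not allow."""
    return [
        f"{a} cannot contain {b}"
        for a, b in zip(chain, chain[1:])
        if not can_reach(a, b)
    ]
```

With this fragment, the chain EHR, COMPOSITION, ELEMENT validates cleanly, while CLUSTER, COMPOSITION is reported as an error because the model, not the query, decides the direction of containment.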

I am not sure if that is what Seref intended here, I may be reading it wrong…

ian.mcnicoll · 25 February 2020 14:27

We are not talking about the AQL specification; we are talking about how that AQL specification applies to a specific RM, in our case the openEHR RM. As you can see from Seref and Bjorn’s examples, there are legitimate varying interpretations of how it should be applied: is EHR e optional, what exactly does CONTAINS mean, how deep should CONTAINS go without a parent object, e.g. FROM EHR e CONTAINS ELEMENT.

That level of detail needs to be worked out, agreed and then documented. I think that is understood. What we are wrangling here is the best way to document the outcome of those discussions.

Being able to process expressions, queries, decision logic etc - all requires model representation. This is just standard mainstream IT, nothing special.

My reading of the discussion is that while openEHR has done a remarkably good job of model-driven representation in terms of data, this is far from standard mainstream IT in terms of implementation, especially in terms of profiled AQL. The pushback I am hearing from Seref and Sebastian suggests that, as implementers, they are not comfortable, at least right now, with having this kind of RM-specific behaviour documented in a model-driven formalism. Seref is telling us he has been down this road already and saw the limitations.

I am going to push very clearly that we do not adopt BMM for this purpose but go with Seref’s suggested approach - if nothing else that will allow us to make positive progress on addressing the kind of reasonable questions that Seref, Bjorn and Sebastian are asking.

I would like a much clearer answer from implementers on where/how BMM has value before committing much more resource in that direction.

Expressions, Task Planning and probably GDL3, because you can’t do any of these things without proper model representation.

I’d like to test that statement with experienced implementers - the success of openEHR to date has been because of the great work that you have done on model representation, but I know there has already been significant pushback in a similar way around Expressions. I feel we are in real danger of seeing everything in terms of abstract models. @thomas - you have a great handle on this, but every time we push further down this road, I feel we are losing understanding and support, certainly from people outside our community but increasingly from those working within.

But I am probably the least qualified person to make such a judgement, except from the position of perhaps representing something approaching the ‘great coding unwashed’

Ian

thomas.beale · 25 February 2020 14:44

If we are losing understanding, I am unaware of it. Formal model representation really is a mainstay of mainstream computing of all kinds. It comes in many concrete forms:

so-called ‘reflection’ classes in most programming languages these days
UML / XMI
Eclipse Ecore / Xcore
OMG IDL (around for 30y)
etc

I would not expect people doing clinical modelling, or EHR users to know or care about these things of course, they are for development. Only a small number of people in the overall openEHR community need to worry about them.

Without model representation, you do the same work but in a non-reusable way, in terms of hard-wired models all over the place. We have those as well, i.e. concrete RM implementations in Java, PHP, C# etc, for concrete EHR system implementations - that is as it should be. But for generic tooling and languages, the normal approach is formal model representation. The whole industry operates like this.

Once again, I am not saying anyone should start using BMM today in AQL (i.e. beyond where it is already in use, e.g. in ADL-designer, LinkEHR etc), what I am saying is that we should clearly understand the path forward to implementing certain things properly.

However I’ll just ask one simple question: how do implementers intend to validate AQL queries?

sebastian.iancu · 25 February 2020 15:30

Well, I said it before: I understand both @Seref’s and @thomas.beale’s perspectives, and I don’t think that they are clashing. @Seref’s text above has good human-readable info, while @thomas.beale’s BMM semantic description of AQL ‘contains’ provides machine-processable info. Together they should cover several things that are not covered now.

I was just invoking the (lack of) use of such BMM for implementations of the stage of running AQL (translating the AQL to a ‘internal’ query adapted to implementation, things elsewhere named query-optimizer) …at least in my opinion - but that’s also fine. I found it important to express my perspective, to give feedback.

Furthermore, I can imagine that on the stage of processing AQL query (meaning parsing AQL, all these things that should happen before you even run the query), BMM could be useful from our openEHR-SEC perspective (meaning that some implementors might choose this machine-way rather than having hard-wires). This also relates to my response to:

It could be hard-wired, or BMM, depends on implementation choices… but let it be a choice (don’t force BMM).

So once again, keep both: the text above and the BMM each have their potential usage and audience; just make sure they are aligned.

PS: hopefully, you’ll not find my feedback rubbish…

Seref · 25 February 2020 15:40

Leaving a typo aside (an EHR instance, not and EHR instance…), that sentence is not about the model.
Please note the repeated use of instances in that sentence. Not types, but instances, since the structural constraint established (use/assume defined instead) by CONTAINS is a constraint on data, which in this context is what I refer to as instances.

You seem to be reading it with BMM in your mind and probably thinking I’m verbally defining metadata which you’d normally put into a meta model. That’s not what I intend to do. I’m assuming RM (and potentially + demographics) and describing what CONTAINS means in the context of a query when that query is run on data (instances) that is based on openEHR types (RM+demographics). That assumption allows me to define query semantics without referring to any metadata.
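
A minimal sketch of this instance-level reading, where CONTAINS is evaluated purely against data and no model metadata is consulted. `Node`, `descendants` and `contains` are hypothetical names, the sample data is made up, and the any-depth interpretation is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a data instance: a type name plus child instances."""
    rm_type: str
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every instance nested under `node`, at any depth."""
    for child in node.children:
        yield child
        yield from descendants(child)


def contains(node: Node, rm_type: str) -> bool:
    """Instance-level reading of `X CONTAINS T`: some descendant of X has type T."""
    return any(d.rm_type == rm_type for d in descendants(node))


# Made-up sample data: one composition holding one observation with one element.
composition = Node("COMPOSITION", [Node("OBSERVATION", [Node("ELEMENT")])])
```

Here `contains(composition, "ELEMENT")` holds because such an instance exists in the data. A pairing the types would never allow simply matches nothing, without any model lookup.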

I don’t know if writing the AQL spec this way will deliver the balance between pragmatism and clarity but there is a certain motivation and thinking to it.

bna · 25 February 2020 17:08

We’ve had openEHR in production 6 years now and AQL for 5 years without using or for many years not even knowing about BMM. BMM is certainly not critical for us to understand how to execute AQL on openEHR data.

The last years we’ve implemented a BMM parser and the possibility to validate our models (C# classes) with the BMM definitions. That’s useful for verification towards versions of RM.

For AQL everything is more or less hand-coded. We’ve so far not faced serious issues related to the syntactical problems with AQL. What we have been struggling most with is how to interpret the semantic logic defined by the query editor and apply that to the underlying datasources. As you all know we use a combination of an RDBMS and a Lucene-based index in Apache Solr. The real technical innovation is related to how we design the Lucene index and how we apply the AQL logic onto the data. This is, of course, not represented in any openEHR specification.

We are currently rewriting the backend and query pipeline. This is part of the ordinary maintenance of such a critical component. As part of that we are re-visiting some of our previous assumptions on the understanding of how a query in AQL should be interpreted.

Most interesting so far:

How to handle order by for a few specific data values
The understanding of implicit and explicit contains
How to handle the permutation problem

I have shared our description of the problems and also proposed our understanding.

IMHO we need to work together at such a practical level. Most of the problems are raised by real-life issues. We need to solve these first. Then, secondly, we may put our shared consensus into the formal models.

I got a task from the management board to look into why developers are not involved in openEHR and modelling. I think many or most developers are problem solvers. They need to find simple answers to complex problems. To help and engage them we need to be concrete in our discussions, without losing our heritage as the best modeled specification ever (period).

thomas.beale · 25 February 2020 17:19

That seems to imply that AQL is a language specific to the openEHR RM. But it should be a query language applicable to any archetyped data based on any reference model - the semantics are the same. So I’m still not clear how, in your explanatory text, the AQL processor could know that (say) COMPOSITION contains CLUSTER (and not the other way around), or EHR contains (by reference) COMPOSITION.

Seref · 25 February 2020 17:41

I think it is quite clear that we disagree on this one. At this point in time, with my Ocean hat on, I see zero business value in using AQL on anything other than RM (and maybe demographics along with that). Any extra work in specs, in implementation and even discussions is spending Ocean money with no commercial returns in the foreseeable future. I’m OK with making it to openEHR history as the man who misunderstood AQL most, but I’d make this point nonetheless.

That is absolutely outside the scope and intention of my text because I see how AQL processor works as implementation detail and leave it to reader of AQL spec to decide on how to make that work, which is what DIPS, Better, Ocean, Ethercis and EhrBase have done without anybody putting this into any spec in the last 10(?) years.

bna · 25 February 2020 17:45

That’s a very good point Seref. You took the words out of my mouth.

ian.mcnicoll · 25 February 2020 17:45

That implies that people are implementing some kind of generic rm-neutral aql processors which is certainly not true for ethercis and I assume ehrbase. I think Bjørn and seref are both saying that this is not how their CDRs work.

thomas.beale · 25 February 2020 17:50

Well I can’t speak for Ocean, but there are two things I would say:

it is very likely that we will want to use AQL to do querying over other archetyped data, for the simple reason that we are already creating other archetyped data, namely Demographics and Task Plans.
there still isn’t any AQL semantic that I can identify that should cause AQL to be limited to just one particular model (the openEHR RM). So I am unclear why we would specify it as if it were.

matijap · 25 February 2020 20:36

We at Better do, and will continue to do, parsing and syntactic validation using off-the-shelf lexers and parsers with an ANTLR grammar having hard-coded some class names from openEHR RM. (Edit: well, as far as I’m aware, only the top-level EHR is really hard-coded for now.) If a need arises to add the few classes that occur in demographics, we will do that with fairly little effort.
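
For readers unfamiliar with this style, the sketch below shows what purely syntactic handling of a FROM clause with one hard-coded class name can look like. It is not ANTLR and not Better’s grammar: it is a hypothetical regex-based toy (`parse_from`, `starts_at_top_level` and `TOP_LEVEL` are made-up names) that ignores archetype predicates and everything else beyond `CLASS alias CONTAINS CLASS alias`.

```python
import re
from typing import List, Tuple

# Assumption for illustration: EHR is the only class name known to the parser.
TOP_LEVEL = "EHR"

# One term of the clause: an upper-case class name and an optional lower-case alias.
_TERM = re.compile(r"^([A-Z_]+)(?:\s+([a-z_][a-z0-9_]*))?$")


def parse_from(clause: str) -> List[Tuple[str, str]]:
    """Split a simple `FROM A a CONTAINS B b ...` clause into (class, alias) pairs."""
    body = clause.strip()
    if not body.upper().startswith("FROM "):
        raise ValueError("clause must start with FROM")
    terms = re.split(r"\s+CONTAINS\s+", body[5:].strip())
    parsed = []
    for term in terms:
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        parsed.append((match.group(1), match.group(2) or ""))
    return parsed


def starts_at_top_level(parsed: List[Tuple[str, str]]) -> bool:
    """True if the first class in the clause is the hard-coded top-level one."""
    return bool(parsed) and parsed[0][0] == TOP_LEVEL
```

Note that nothing here checks whether one class may contain another; every other class name is accepted as-is, which matches the purely syntactic approach described above.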

Whether we have machine-readable (i.e. BMM) information on the fact that EHR contains COMPOSITIONs (in a different way than COMPOSITIONs contain SECTIONs and such) or not, does not make any difference to us: we treat EHRs in a totally different way than COMPOSITIONs, etc., for a technical reason that will not be affected by any amount of formal modelling. (Whether OBSERVATION can contain SECTION or whether ACTION can contain COMPOSITION or some class from demographics in the model, we do not really care when querying: if it cannot, there will simply be no data and I see no use in validating such relationships.) Adding any kind of SYSTEM above EHR, or PARTY and whatnot below that, will require manual labor which we are gladly willing to do if we see value in it for our customers. We will not rewrite our stack to be more generic – not because it is hard or not worth it, but because we believe it can not be as fast and as flexible, from the customer’s point of view, as the market needs for its (non-generic) use cases.

We need a good (i.e. understandable and strict) human-readable specification so that most questions like the ones that @bna provides a constant stream of (and now he revealed why) can be answered simply by pointing out a sentence or paragraph in the specification (that can hopefully be interpreted only in one way).

Now two questions arise.

Is a machine-readable formal model (BMM) a necessary step towards such a human-readable specification? I think not, because while it might define behaviour strictly, what will happen is that we will find a vague sentence in specification, inquire the formal model to resolve the dilemma, and find out that we do not like the answer, and then we’ll have two problems instead of one. Maybe I’m wrong.
What amount of openEHR RM may appear in AQL specification: (a) not at all, (b) only in code examples, (c) also in clarifications in text, (d) also in the specification of the language features. I think (c) is the correct answer, others may prove me wrong. We’ve had some discussions in other threads lately whether AQL engines may contain some exceptions when querying EHR data (like not showing incomplete compositions, or even not showing data not belonging to PARTY_SELF, unless explicitly instructed to); if we went this way and would explain this in the AQL spec, that would be the (d) way, which is arguably wrong (but putting that information elsewhere is dangerous as well, so that’s another dilemma).

What I’m trying to say is that no matter how generic and loosely-coupled the specification will be, implementations will not be, so effort should be put into all this machinery only if it will aid understanding (and I’m pretty sure it will not) or if it will enable some kind of automated validation (which I do not see how).

pablo · 25 February 2020 21:14

Hi @bna I guess that request is then mapped into an AQL expression, for instance ehrIds and values, the rest I don’t think have support in an AQL expression.

Not sure if that is related to the FROM definition or maybe a proposal for the REST API.