AQL: Formal definition of FROM clause

BTW I don’t disagree with anything in @Seref’s original post, which is nice and clear, apart from possibly this:

If I read this literally, it implies that the use of ‘CONTAINS’ in an AQL query ‘establishes’ something about the model. But that isn’t right - a query based on a model can’t state truths about the model, only the model can do that. If it were the case, you could never validate a query’s use of ‘CONTAINS’, you would just have to trust it. So you’d have no way of knowing if a query was correct w.r.t. its underlying RM.

I am not sure if that is what Seref intended here, I may be reading it wrong…

We are not talking about the AQL specification; we are talking about how that AQL specification applies to a specific RM, in our case the openEHR RM - as you can see from Seref and Bjorn’s examples there are legitimate varying interpretations of how it should be applied - is EHR e optional, what exactly does CONTAINS mean, how deep should CONTAINS go without a parent object, e.g. FROM EHR e CONTAINS ELEMENT.

That level of detail needs to be worked out, agreed and then documented. I think that is understood. What we are wrangling over here is the best way to document the outcome of those discussions.

Being able to process expressions, queries, decision logic etc - all requires model representation. This is just standard mainstream IT, nothing special.

My reading of the discussion is that while openEHR has done a remarkably good job in model-driven representation in terms of data, this is far from the standard mainstream IT in terms of implementation, especially in terms of profiled AQL. The pushback I am hearing from Seref and Sebastian suggests that as implementers they are not comfortable, at least right now, with having this kind of RM-specific behaviour documented in a model-driven formalism. Seref is telling us he has been down this road already and saw the limitations.

I am going to push very clearly that we do not adopt BMM for this purpose but go with Seref’s suggested approach - if nothing else that will allow us to make positive progress on addressing the kind of reasonable questions that Seref, Bjorn and Sebastian are asking.

I would like a much clearer answer from implementers on where/how BMM has value before committing much more resource in that direction.

Expressions, Task Planning and probably GDL3, because you can’t do any of these things without proper model representation.

I’d like to test that statement with experienced implementers - the success of openEHR to date has been because of the great work that you have done on model representation, but I know there has already been significant pushback in a similar way around Expressions. I feel we are in real danger of seeing everything in terms of abstract models. @thomas - you have a great handle on this but every time we push further down this road, I feel we are losing understanding and support, certainly from people outside our community but increasingly from those working within.

But I am probably the least qualified person to make such a judgement, except from the position of perhaps representing something approaching the ‘great coding unwashed’ :slight_smile:

Ian

If we are losing understanding, I am unaware of it. Formal model representation really is a mainstay of mainstream computing of all kinds. It comes in many concrete forms:

  • so-called ‘reflection’ classes in most programming languages these days
  • UML / XMI
  • Eclipse Ecore / Xcore
  • OMG IDL (around for 30y)
  • etc

I would not expect people doing clinical modelling, or EHR users to know or care about these things of course, they are for development. Only a small number of people in the overall openEHR community need to worry about them.

Without model representation, you do the same work but in a non-reusable way, in terms of hard-wired models all over the place. We have those as well, i.e. concrete RM implementations in Java, PHP, C# etc, for concrete EHR system implementations - that is as it should be. But for generic tooling and languages, the normal approach is formal model representation. The whole industry operates like this.

Once again, I am not saying anyone should start using BMM today in AQL (i.e. beyond where it is already in use, e.g. in ADL-designer, LinkEHR etc), what I am saying is that we should clearly understand the path forward to implementing certain things properly.

However I’ll just ask one simple question: how do implementers intend to validate AQL queries?

Well, I said it before, I understand both @Seref’s and @thomas.beale’s perspectives, and I don’t think that they are clashing. @Seref’s text above has good human-readable info, while @thomas.beale’s BMM semantic description of AQL ‘contains’ provides machine-processable info. Together they should cover several things that are not covered now.

I was just pointing out the (lack of) use of BMM in implementations at the stage of running AQL (translating the AQL into an ‘internal’ query adapted to the implementation, something elsewhere called a query optimizer) … at least in my opinion - but that’s also fine. I found it important to express my perspective, to give feedback.

Furthermore, I can imagine that at the stage of processing an AQL query (meaning parsing AQL, all these things that should happen before you even run the query), BMM could be useful from our openEHR-SEC perspective (meaning that some implementors might choose this machine-readable way rather than hard-wiring). This also relates to my response to:

It could be hard-wired, or BMM, depends on implementation choices… but let it be a choice (don’t force BMM).

So once again, keep both: the text above and the BMM each have their potential usage and audience; just make sure they are aligned.

PS: :innocent: hopefully, you’ll not find my feedback rubbish…

Leaving a typo aside, (an EHR instance, not and EHR instance…), that sentence is not about the model.
Please note the repeated use of instances in that sentence. Not types, but instances, since the structural constraint established (use/assume defined instead) by CONTAINS is a constraint on data, which in this context is what I refer to with instances.

You seem to be reading it with BMM in your mind and probably thinking I’m verbally defining metadata which you’d normally put into a meta model. That’s not what I intend to do. I’m assuming RM (and potentially demographics) and describing what CONTAINS means in the context of a query when that query is run on data (instances) that is based on openEHR types (RM+demographics). That assumption allows me to define query semantics without referring to any metadata.
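To make the instance-level reading concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Node` class, the `contains` helper and the sample data are illustrative only and come from no openEHR library. It also assumes that CONTAINS matches at any depth, which is itself one of the open questions in this thread.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A hypothetical data instance: an RM type name plus its child instances."""
    rm_type: str
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every instance nested at any depth below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)


def contains(parent: Node, rm_type: str) -> List[Node]:
    """Instances of `rm_type` found anywhere below `parent` in the data."""
    return [n for n in descendants(parent) if n.rm_type == rm_type]


# Sample data (assumed shape): one EHR with two COMPOSITIONs, one of them empty.
ehr = Node("EHR", [
    Node("COMPOSITION", [
        Node("OBSERVATION", [Node("ELEMENT"), Node("ELEMENT")]),
    ]),
    Node("COMPOSITION"),
])

elements = contains(ehr, "ELEMENT")
```

Note that nothing here consults a model: if the data holds no ELEMENT under a given COMPOSITION, the result is simply empty.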

I don’t know if writing the AQL spec this way will deliver the balance between pragmatism and clarity but there is a certain motivation and thinking to it.

We’ve had openEHR in production 6 years now and AQL for 5 years without using or for many years not even knowing about BMM. BMM is certainly not critical for us to understand how to execute AQL on openEHR data.

The last years we’ve implemented a BMM parser and the possibility to validate our models (C# classes) with the BMM definitions. That’s useful for verification towards versions of RM.

For AQL everything is more or less hand-coded. We’ve so far not faced serious issues related to the syntactical problems with AQL. What we have been struggling most with is how to interpret the semantic logic defined by the query editor and apply that to the underlying datasources. As you all know we use a combination of an RDBMS and a Lucene-based index in Apache SOLR. The real technical innovation is related to how we design the Lucene index and how we apply the AQL logic on to the data. This is, of course, not represented in any openEHR specification :slight_smile:

We are currently rewriting the backend and query pipeline. This is part of the ordinary maintenance of such a critical component. As part of that we are re-visiting some of our previous assumptions on the understanding of how a query in AQL should be interpreted.

Most interesting so far:

  • How to handle order by for a few specific data values
  • The understanding of implicit and explicit contains
  • How to handle the permutation problem

I have shared our description of the problems and also proposed our understanding.

IMHO we need to work together at such a practical level. Most of the problems are raised by real-life issues. We need to solve these first. Then, secondarily, we may put our shared consensus into the formal models.

I got a task from the management board to look into why developers are not involved in openEHR and modelling. I think many or most developers are problem solvers. They need to find simple answers to complex problems. To help and engage them we need to be concrete in our discussions, without losing our heritage as the best modeled specification ever (period).

That seems to imply that AQL is a language specific to the openEHR RM. But it should be a query language applicable to any archetyped data based on any reference model - the semantics are the same. So I’m still not clear how, in your explanatory text, the AQL processor could know that (say) COMPOSITION contains CLUSTER (and not the other way around), or EHR contains (by reference) COMPOSITION.
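As an illustration of what "knowing" the containment direction could look like, here is a minimal Python sketch. The `CAN_CONTAIN` table is a hypothetical, heavily simplified stand-in for a model representation (hand-written here, but it could equally be loaded from a formal model); it is not the real openEHR RM, and `can_reach` and `validate_chain` are invented names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical, heavily simplified containment table; NOT the real openEHR RM.
CAN_CONTAIN: Dict[str, Set[str]] = {
    "EHR": {"COMPOSITION"},
    "COMPOSITION": {"SECTION", "OBSERVATION"},
    "SECTION": {"SECTION", "OBSERVATION"},
    "OBSERVATION": {"CLUSTER", "ELEMENT"},
    "CLUSTER": {"CLUSTER", "ELEMENT"},
    "ELEMENT": set(),
}


def can_reach(parent: str, child: str) -> bool:
    """True if `child` can occur at any depth below `parent` per the table."""
    seen: Set[str] = set()
    stack = [parent]
    while stack:
        current = stack.pop()
        for nxt in CAN_CONTAIN.get(current, set()):
            if nxt == child:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def validate_chain(chain: List[str]) -> List[Tuple[str, str]]:
    """Return the (parent, child) pairs of a CONTAINS chain that the table rules out."""
    return [(a, b) for a, b in zip(chain, chain[1:]) if not can_reach(a, b)]


ok = validate_chain(["EHR", "COMPOSITION", "CLUSTER"])   # no violations
bad = validate_chain(["CLUSTER", "COMPOSITION"])         # wrong direction
```

Whether such a check belongs in an AQL processor at all is exactly what is being debated here; the sketch only shows that the check needs some table of this kind, wherever it comes from.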

I think it is quite clear that we disagree on this one :slight_smile: At this point in time, with my Ocean hat on, I see zero business value in using AQL on anything other than RM (and maybe demographics along with that). Any extra work in specs, in implementation and even discussions is spending Ocean money with no commercial returns in the foreseeable future. I’m OK with making it to openEHR history as the man who misunderstood AQL most, but I’d make this point nonetheless.

That is absolutely outside the scope and intention of my text because I see how AQL processor works as implementation detail and leave it to reader of AQL spec to decide on how to make that work, which is what DIPS, Better, Ocean, Ethercis and EhrBase have done without anybody putting this into any spec in the last 10(?) years.

That’s a very good point Seref. You took the word out of my mouth.

That implies that people are implementing some kind of generic RM-neutral AQL processors, which is certainly not true for Ethercis and, I assume, EhrBase. I think Bjørn and Seref are both saying that this is not how their CDRs work.

Well I can’t speak for Ocean, but there are two things I would say:

  • it is very likely that we will want to use AQL to do querying over other archetyped data, for the simple reason that we are already creating other archetyped data, namely Demographics and Task Plans.
  • there still isn’t any AQL semantic that I can identify that should cause AQL to be limited to just one particular model (the openEHR RM). So I am unclear why we would specify it as if it were.

We at Better do, and will continue to do, parsing and syntactic validation using off-the-shelf lexers and parsers with an ANTLR grammar having hard-coded some class names from openEHR RM. (Edit: well, as far as I’m aware, only the top-level EHR is really hard-coded for now.) If a need arises to add the few classes that occur in demographics, we will do that with fairly little effort.
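For readers who want a feel for what "only EHR is hard-coded" can mean in practice, here is a small hand-written Python sketch. It is not ANTLR and not any vendor's parser; `parse_from`, `is_name`, `rooted_at_ehr` and `HARD_CODED_ROOT` are all invented for illustration. It handles only a bare chain of class names and aliases joined by CONTAINS, and deliberately rejects archetype predicates, parentheses and boolean operators.

```python
import re
from typing import List, Optional, Tuple

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
HARD_CODED_ROOT = "EHR"  # the only class name this sketch treats specially


def is_name(token: str) -> bool:
    """A class name or alias: an identifier that is not the CONTAINS keyword."""
    return IDENT.fullmatch(token) is not None and token.upper() != "CONTAINS"


def parse_from(clause: str) -> List[Tuple[str, Optional[str]]]:
    """Parse 'FROM A a CONTAINS B b ...' into [(class_name, alias_or_None), ...]."""
    tokens = clause.split()
    if not tokens or tokens[0].upper() != "FROM":
        raise ValueError("clause must start with FROM")
    pairs: List[Tuple[str, Optional[str]]] = []
    i = 1
    while True:
        if i >= len(tokens) or not is_name(tokens[i]):
            raise ValueError("expected a class name")
        cls = tokens[i]
        i += 1
        alias = None
        if i < len(tokens) and is_name(tokens[i]):
            alias = tokens[i]
            i += 1
        pairs.append((cls, alias))
        if i == len(tokens):
            return pairs
        if tokens[i].upper() != "CONTAINS":
            raise ValueError(f"unexpected token {tokens[i]!r}")
        i += 1


def rooted_at_ehr(pairs: List[Tuple[str, Optional[str]]]) -> bool:
    """True if the chain starts at the single hard-coded class."""
    return pairs[0][0].upper() == HARD_CODED_ROOT


chain = parse_from("FROM EHR e CONTAINS COMPOSITION c CONTAINS OBSERVATION o")
```

Every class name other than EHR passes through as an opaque identifier, so the parser needs no knowledge of which classes exist or what they may contain.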

Whether we have machine-readable (i.e. BMM) information on the fact that EHR contains COMPOSITIONs (in a different way than COMPOSITIONs contain SECTIONs and such) or not, does not make any difference to us: we treat EHRs in a totally different way than COMPOSITIONs, etc., for a technical reason that will not be affected by any amount of formal modelling. (Whether OBSERVATION can contain SECTION or whether ACTION can contain COMPOSITION or some class from demographics in the model, we do not really care when querying: if it cannot, there will simply be no data and I see no use in validating such relationships.) Adding any kind of SYSTEM above EHR, or PARTY and whatnot below that, will require manual labor which we are gladly willing to do if we see value in it for our customers. We will not rewrite our stack to be more generic – not because it is hard or not worth it, but because we believe it can not be as fast and as flexible, from the customer’s point of view, as the market needs for its (non-generic) use cases.

We need a good (i.e. understandable and strict) human-readable specification so that most questions like the ones that @bna provides a constant stream of (and now he has revealed why :slight_smile: ) can be answered simply by pointing out a sentence or paragraph in the specification (that can hopefully be interpreted in only one way).

Now two questions arise.

  1. Is a machine-readable formal model (BMM) a necessary step towards such a human-readable specification? I think not, because while it might define behaviour strictly, what will happen is that we will find a vague sentence in specification, inquire the formal model to resolve the dilemma, and find out that we do not like the answer, and then we’ll have two problems instead of one. Maybe I’m wrong.

  2. What amount of openEHR RM may appear in the AQL specification: (a) not at all, (b) only in code examples, (c) also in clarifications in text, (d) also in the specification of the language features. I think (c) is the correct answer, others may prove me wrong. We’ve had some discussions in other threads lately whether AQL engines may contain some exceptions when querying EHR data (like not showing incomplete compositions, or even not showing data not belonging to PARTY_SELF, unless explicitly instructed to); if we went this way and explained this in the AQL spec, that would be the (d) way, which is arguably wrong (but putting that information elsewhere is dangerous as well, so that’s another dilemma).

What I’m trying to say is that no matter how generic and loosely-coupled the specification will be, implementations will not be, so effort should be put into all this machinery only if it will aid understanding (and I’m pretty sure it will not) or if it will enable some kind of automated validation (which I do not see how).

Hi @bna, I guess that request is then mapped into an AQL expression, for instance ehrIds and values; the rest, I don’t think, has support in an AQL expression.

Not sure if that is related to the FROM definition or maybe a proposal for the REST API.

I really like the idea of adding something like that as a summary at the beginning of spec, it’s short and straightforward. I would suggest not to use the term “row” or “rowset” because that could indicate or suggest an implementation technology.

If we are not referring to the “row” in the sense of Relational Databases, and we want to keep using the terms, we should define our own “row” and “rowset” semantics in the AQL spec, which is also related to giving an idea of how an AQL processor /evaluator/execution should work.

BTW I like the idea of giving implementation hints in the spec, but don’t know if we need to separate things or create a more complete spec. With this I mean, to have a complete query spec we need to define:

  1. syntax
  2. query processing/evaluation/execution
  3. result set

We are very close to having a good “syntax spec”, but we are lacking on the rest.

@all I have committed an improvement of the FROM definition to my PR: https://github.com/openEHR/specifications-QUERY/pull/5/commits/9517f7b2dedf3dc083b23c656065a97d5114d14b

I tried to remove any reference to a CDR or a specific RM, this is still WIP. Rewriting that I realized we need to mention that AQL is for any RM, but the RM should comply with a couple of things:

  1. should be an OO RM (since the FROM uses class names)
  2. the RM should be used in a dual-model environment (without this we don’t have archetype IDs or paths)

In the current v1.0 spec we have “Archetype Query Language (AQL) is a declarative query language developed specifically for expressing queries used for searching and retrieving the clinical data found in archetype-based EHRs.”

We might be covered for point 2 with “…data found in archetype-based EHRs…”, but it is not so explicit. Also, there are still many references to the openEHR RM and to EHRs in general, constraining the scope of AQL to CDRs only, a constraint we need to remove.

Still I think both conditions should be explicitly mentioned in the AQL introduction (OOM + dual-model), and also mention AQL works on any RM that complies with those conditions.

What do you think?

This is a great answer @matijap and you truly show why you are Better :slight_smile:

Response to your two questions:

As I wrote earlier

What I meant by this is :

  1. To teach the (internal) developers using our backend to use it properly and explain what the expected output will be based on the query they propose. Most developers have very high competence and experience using SQL. They think AQL is the same - but it’s not. That is confusing and, for many, disappointing.
  2. For the core team who has been working with openEHR since 2010 to find out how a given AQL is expected to map into complex hierarchical datastructures and produce the resultset that both a “semi-clinical-tech” and the “high-competent-openEHR-expert” think is correct. Often we find a discrepancy here. E.g. @ian.mcnicoll had some issues accepting the Glasgow Coma Scale example given here. And we, as a community and SEC group, have not yet found a shared solution for the permutation problem as explained here. For both of the latter examples we, DIPS, are working on some assumptions and query logic which seem to solve it. I will share it as soon as I am able to understand what the developers are doing currently (it’s AFAIK heavy stuff, but I think/have heard rumours that the Better guys already have some solution to this).
  3. And of course the ORDER BY issues, like “should there be a default order if no order is given” and how to order data types. Similarly, how should NULL be handled when ordering? Do we need some operator to explicitly give the AQL engine hints about, e.g., NULL FIRST?

All the examples above are more informal and descriptive than formal modelling definitions. Others might have a different view on this. But I must say that for us, DIPS, what is important in the short term is to define the expected rules for the problems raised above. And I cannot see how to work with this kind of problem without discussing it. So far my questions related to ORDER BY have been answered with “this can be fixed in BMM by some infix operators”. That’s fine I think, but for AQL we simply don’t care because the AQL pipeline is extremely handcrafted and optimized for our specific implementation. All we need is an informal description of what we agree on as a SEC group.
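The NULL-ordering question raised above can be made concrete with a short Python sketch. This shows one possible rule, not an agreed AQL behaviour: the function `order_rows` and its `nulls_first` flag are hypothetical, and it assumes the non-null values of a column are mutually comparable.

```python
from typing import Iterable, List


def order_rows(
    rows: Iterable[dict],
    key: str,
    nulls_first: bool = False,
    descending: bool = False,
) -> List[dict]:
    """Sort rows by `key`, placing rows whose value is None first or last.

    Null placement is independent of sort direction in this sketch; that is
    an assumption, and exactly the kind of rule a spec would have to state.
    """
    rows = list(rows)
    present = [r for r in rows if r.get(key) is not None]
    missing = [r for r in rows if r.get(key) is None]
    present.sort(key=lambda r: r[key], reverse=descending)
    return missing + present if nulls_first else present + missing


rows = [{"v": 3}, {"v": None}, {"v": 1}]
nulls_last = order_rows(rows, "v")
nulls_first = order_rows(rows, "v", nulls_first=True)
```

A default order when no ORDER BY is given, and ordering across different data types, would need their own explicit rules; neither is covered here.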

Current use of AQL is limited to querying EHR RM based data. I agree with @matijap that we need some clarifications in text which cover the use-cases that customers or clients will face. If at some time in the future we do more work on DEMOGRAPHICS or TASKPLANNING then we may add text to clarify such use-cases. I think @sebastian.iancu will provide some good use-cases for DEMOGRAPHICS and we will be eager to learn about their experiences.

And as a final note to self: @Seref - I am sorry for not responding to your initial post in this topic. I think you made a really good start for the discussion. And it was so good that I didn’t have any specific comments to it. It made sense to me :slight_smile:

@pablo - sorry, I lost the context for this reply. Which of my posts are you referring to?

I was referring (hinting) to our (openEHR) row/result-set or whatever that type-name we will have in our specification (because we should have those types specified, including json-schema, xsd).

I agree. Rows and columns in this context are a description of the format of the result from the executed AQL. They are not implementation specific. The terms are AQL-specific definitions of the models/types/classes used in AQL.

That’s ok, my point is: if we use those terms, we need to define them in the spec or reference an external definition.
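As a sketch of what defining those terms could amount to, here is a minimal Python version. The `Column` and `ResultSet` classes, their field names and the sample paths are hypothetical illustrations, not the types of any openEHR specification.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Column:
    name: str  # alias given in the SELECT clause (assumed field)
    path: str  # path expression that produced the value (assumed field)


@dataclass
class ResultSet:
    columns: List[Column]
    rows: List[Tuple[Any, ...]]

    def __post_init__(self) -> None:
        # A "row" here is defined only as: one value per column, in column order.
        width = len(self.columns)
        for i, row in enumerate(self.rows):
            if len(row) != width:
                raise ValueError(f"row {i} has {len(row)} values, expected {width}")

    def as_dicts(self) -> List[Dict[str, Any]]:
        """View each row as a mapping from column name to value."""
        names = [c.name for c in self.columns]
        return [dict(zip(names, row)) for row in self.rows]


rs = ResultSet(
    columns=[Column("ehr_id", "e/ehr_id/value"), Column("name", "c/name/value")],
    rows=[("abc", "Encounter")],
)
```

Defined this way, "row" says nothing about relational storage: it is just an ordered tuple of values aligned with the declared columns.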

In fact, that is the point of this discussion and also the other thread about defining the operators for the simple types, because we build complex concepts without defining the basic semantics of their internal components.
