AQL: Formal definition of FROM clause

Seref · 7 February 2020 12:32

Here is the content I’d like to suggest for addition to the AQL spec. The spec becomes a bit repetitive if this replaces 3.10.2, but I need to explain various things to expand the argument, so I can live with it. Comments are welcome.

3.10.2 FROM

The FROM clause defines the scope of the query in terms of reference model (RM) types of data to be retrieved along with additional constraints that further narrow down the matching instances of data. These constraints can be constraints on attributes of the RM type in addition to structural constraints for data elements. Structural constraints can be in the form of logical relationships or instances of data directly containing other instances as nested attributes of an object. Both types of structural constraints are expressed with the same CONTAINS keyword.

All other clauses in the AQL query reference data instances defined in the FROM clause using their aliases and either express further constraints on these, or define new data instances using relative paths based on these aliases.

The simplest example of a FROM clause would be one in which only a single RM type is declared.

SELECT .... FROM EHR e ....

The e alias, used to refer to all instances of data that have the reference model type EHR, is required to refer to this set of data from other clauses, namely SELECT and WHERE. This alias can be used directly, such as:

SELECT e FROM EHR e... (select all EHR instances)

or as the root of a relative path, which allows the query to express constraints or select data items accessible from the root of the relative path, as in:

SELECT e FROM EHR e WHERE e/ehr_id/value='some_ehr_id' (select all EHR instances that have an ehr_id with value 'some_ehr_id')

In accordance with the XPath-like constraint syntax of AQL, the FROM clause can introduce attribute constraints on the data instances it defines, as in:

SELECT ... FROM EHR e[ehr_id/value='some_ehr_id']

Note that the example above is semantically equivalent to using the WHERE clause. The attribute constraint in the FROM clause is usually preferred.

As stated above, the FROM clause also allows AQL to define structural constraints. This feature is supported via the CONTAINS keyword, which expresses a containment relationship that can be logical or direct data instance containment. Therefore, the semantics of the CONTAINS keyword is overloaded for multiple types of relationships. An example FROM clause, which depicts both types of relationships, would be:

SELECT ... FROM EHR e[ehr_id/value='some_ehr_id'] CONTAINS COMPOSITION c[openEHR-EHR-COMPOSITION.report.v1] CONTAINS CLUSTER cls[at0018]

From a reference model point of view, an EHR instance does not directly contain COMPOSITION instances. Instead, it has references to them expressed via its compositions attribute. The CONTAINS keyword that establishes a structural constraint between EHR instances and COMPOSITION instances therefore implies COMPOSITION instances accessible from an EHR instance through resolving the values of its attribute, which in turn implies this is a logical structural constraint.
The second use of the CONTAINS keyword in the same query above establishes a structural constraint based on an instance of a COMPOSITION reference model type actually containing an instance of CLUSTER reference model type, where both instances individually must also satisfy the attribute constraints for their archetype node id, as defined on their aliases, c and cls. This structural constraint demonstrates the direct data containment semantics of the CONTAINS keyword, similar to the concept of composition in object oriented languages, which is different from aggregation, as employed by the EHR type to refer to COMPOSITIONs. AQL implementations deal with these different semantics internally so that the CONTAINS keyword can be used seamlessly to express constraints on data to be retrieved.
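
To make the two meanings of CONTAINS described above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Node` class, the `STORE` lookup table and the `contains` helper are hypothetical names invented for this example and are not part of the AQL or RM specifications. References (as in EHR.compositions) are resolved through `STORE`, nested objects are walked directly, and the caller sees a single containment relation.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy stand-in for an RM object (hypothetical, heavily simplified)."""
    rm_type: str
    node_id: str = ""
    children: list = field(default_factory=list)  # direct containment
    ref_ids: list = field(default_factory=list)   # logical references, resolved on demand


# Hypothetical object store used to resolve references such as EHR.compositions.
STORE = {}


def contained(node):
    """Yield everything `node` CONTAINS at any depth, via children or resolved references."""
    resolved = [STORE[r] for r in node.ref_ids if r in STORE]
    for child in node.children + resolved:
        yield child
        yield from contained(child)


def contains(node, rm_type, node_id=None):
    """Return contained nodes of `rm_type`, optionally filtered by archetype node id."""
    return [n for n in contained(node)
            if n.rm_type == rm_type and node_id in (None, n.node_id)]


cluster = Node("CLUSTER", "at0018")
comp = Node("COMPOSITION", "openEHR-EHR-COMPOSITION.report.v1",
            children=[Node("SECTION", "at0001", children=[cluster])])
STORE["comp-1"] = comp
ehr = Node("EHR", ref_ids=["comp-1"])  # the EHR only holds a reference

# EHR e CONTAINS COMPOSITION c[...] CONTAINS CLUSTER cls[at0018]
comps = contains(ehr, "COMPOSITION", "openEHR-EHR-COMPOSITION.report.v1")
clusters = [cl for c in comps for cl in contains(c, "CLUSTER", "at0018")]
```

The first `contains` call succeeds only because the reference is resolved, while the second walks nested objects; a real implementation would of course do this against its own persistence layer.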

The CONTAINS keyword is complemented with logical operators such as AND and OR to express structural constraints that go beyond a single ‘path’ in the EHR. The details of these are provided below.

An important point regarding the use of complex structural constraints in the FROM clause is that FROM clause always has a single root declaration and all CONTAINS keywords and logical operators describe constraints relative to this single root.

The data items defined in the FROM clause can be of any RM type; however, implementations of AQL usually support a subset of reference model types. This is usually done to ensure query performance and to reflect real-life access patterns to data.
To clarify, a query such as

SELECT d FROM EHR e CONTAINS DV_QUANTITY d

is perfectly valid from a syntactic point of view, but its result, all instances of data with quantity type across all data contained in all EHRs, is completely useless in real life.

Based on the syntax and intended functionality defined above, more formal semantics of the FROM clause can potentially be represented with various existing formalisms, probably via extensions. One such formalism, Tree Pattern Queries, is discussed in detail with regard to its use as a formalism for AQL in (Arikan 2016).
This particular formalism, presumably one of the many that could be used, defines the semantics of the FROM clause based on a single rooted tree pattern as introduced by the Tree Pattern Query (TPQ). This representation, with potential extensions borrowed from labelled property graphs (Green et al. 2018), can encode constraints defined in the FROM clause. Other relevant formalisms are discussed in the original work that led to the creation of AQL (Ma et al. 2007).
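
As a rough illustration of the single rooted tree pattern idea, the following Python sketch matches a pattern tree against a data tree, treating every pattern edge as "contains at any depth". It is a simplified toy and not the formalism as defined in (Arikan 2016): the `Data` and `Pattern` classes and the `match_at` function are hypothetical names, and predicates, logical operators and references are left out.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(eq=False)
class Data:
    """A node of the data tree, labelled with its RM type (toy model)."""
    label: str
    children: list = field(default_factory=list)


@dataclass(eq=False)
class Pattern:
    """A node of the tree pattern: an RM type plus the alias it binds."""
    label: str
    alias: str
    children: list = field(default_factory=list)  # each must match some descendant


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def match_at(pattern, node):
    """Return all bindings {alias: data node} that embed `pattern` rooted at `node`."""
    if pattern.label != node.label:
        return []
    per_child = []
    for sub in pattern.children:
        options = [b for d in descendants(node) for b in match_at(sub, d)]
        if not options:
            return []  # one unmatched branch rules out this root
        per_child.append(options)
    results = []
    for combo in product(*per_child):
        binding = {pattern.alias: node}
        for partial in combo:
            binding.update(partial)
        results.append(binding)
    return results


cl1, cl2 = Data("CLUSTER"), Data("CLUSTER")
comp = Data("COMPOSITION", [Data("SECTION", [cl1]), cl2])
ehr = Data("EHR", [comp])

# FROM EHR e CONTAINS COMPOSITION c CONTAINS CLUSTER cls
pattern = Pattern("EHR", "e", [Pattern("COMPOSITION", "c", [Pattern("CLUSTER", "cls")])])
bindings = match_at(pattern, ehr)
```

Each binding maps every alias in the pattern to one data node, which is the kind of explanation of "why this result" that a formal semantics is meant to provide.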

REFERENCES:

Arikan, S. S. 2016. ‘An Experimental Study and Evaluation of a New Architecture for Clinical Decision Support - Integrating the OpenEHR Specifications for the Electronic Health Record with Bayesian Networks’. Doctoral thesis, UCL (University College London). Available via UCL Discovery.

Green, Alastair, Martin Junghanns, Max Kießling, Tobias Lindaaker, Stefan Plantikow, and Petra Selmer. 2018. ‘OpenCypher: New Directions in Property Graph Querying.’ In EDBT, 520–23.

Ma, Chunlan, Heath Frankel, Thomas Beale, and Sam Heard. 2007. ‘EHR Query Language (EQL)-a Query Language for Archetype-Based Health Records’. Medinfo 129: 397–401.

sebastian.iancu · 7 February 2020 15:49

On first read, the text looks good to me.
Is it true that you now “made peace” on the use of CONTAINS by reference vs real contained structures?
I also see that it is now open for non-EHR RM types, but it is still not clear how it would look for demographic RM types - perhaps an example will come later.
I’m a bit concerned about the last part, where you refer to TPQ. I think the openEHR specification should state very clearly what should be supported, so that we can do conformance testing. Whatever is not specified does not exist from the perspective of these docs - therefore I think we should think carefully about everything we really want all implementations to support (so …what is the common set that all of us should support).

Seref · 7 February 2020 16:16

Thanks. Regarding CONTAINS, I’m merely making the point that it is used to imply different types of associations. As you can see, I have an explanation for the semantics of EHR CONTAINS COMPOSITION and COMPOSITION CONTAINS CLUSTER even though the underlying relationships are different. I’d be glad to discuss how another overload of it can be used to access demographics information, but I’d like to have an explanation for that too. Tom made some points about links to demographics information in the form of references from EHR subject etc, which would allow us to keep the conceptual integrity of the query semantics.
Shall we say I’m eager to have “peace negotiations”?

Re the TPQ: it is just a formalism that can be used to describe queries. As I said in the text, there can be others, but this is what I have to offer. It sits between the specs and the technology so even in the thesis I concluded by saying it may become a recommended approach and not necessarily the specification. So in the text above I’m not implying it is part of AQL spec, I thought I wrote that part carefully, as in “look at these things if you want to more formally interpret this thing…” but I may have missed the mark.

I insist that we must have an explanation for retrieving a result given a query and a CDR which shows exactly why we’re getting the results we’re getting. As I said in the SEC skype meeting, SQL has relational algebra with its operations (left/right joins etc) and extensions (order by, window functions etc). Every XQuery engine returns the same results given the same text because the W3C specified the matching semantics, Gremlin has graph walk… So it goes.

What do we have for AQL? That’s my overarching concern. I’d be very, very happy to hear suggestions regarding this. Or maybe I’m concerned about something that does not matter that much to everybody else, and if someone can explain that to me, I’ll leave everybody in peace regarding this matter.

sebastian.iancu · 10 February 2020 09:30

I think I get your concern: it would be better to have this explicitly described and stated, just to avoid unexpected behavior from AQL implementations. Although I’m not the right person to write such text, I will take it for a review.
My concern, on the other hand, is that such text might make assumptions or require something particular related to the underlying AQL implementation and related storage engine (e.g. relational db, noSql, xquery, etc) - so we should watch for and avoid such ‘traps’. But this should not block us from having a spec like you asked.

Seref · 10 February 2020 09:46

Regarding your concern:

So sure, not suggesting any particular technology is a core principle. The related chapter in my thesis is titled “persistence abstraction” for that very reason.

Given the same behaviour, each implementer would resort to their own intellectual comfort zone based on their expertise and know-how, which is already the case for all vendors anyway. What I’m trying to say is, health data in general and openEHR in particular push certain design patterns at the implementation level anyway; we would not worsen what is already dictated by the nature of this domain but would have a better spec.

I’ve been looking at CQL and they have a lot to answer in this department with what they’re trying to do.

pablo · 18 February 2020 03:08

For the record, I have sent a PR with some improvements for the FROM definition: Spec fixes round 2 SPECQUERY-15 by ppazos · Pull Request #5 · openEHR/specifications-QUERY · GitHub

Old text:

The FROM clause utilises class expressions and a set of containment criteria to specify the data source from which the query required data is to be retrieved. Its function is similar as the FROM clause of an SQL expression.

New text:

The FROM clause is used to specify the domain of the query, that is a subset of the universe of data that could be retrieve from a CDR. That universe is anything defined in the openEHR RM (that is a set of classes that models clinical records), and archetypes (that define specific data sets based on the openEHR RM). Because of that, the FROM clause uses class expressions to specify the subsets of data, giving context to the query.

And added a summary section with this: FROM: Defines the subset of data in which the query will be executed.

That definition might be a little broad (on purpose) since it allows stuff like FOLDER, LINK or ACTOR to be part of the AQL expression in the FROM clause. I think we propose a spec to query any RM, but even using openEHR we are just focusing on querying inside EHR and not in the demographic model, and we don’t mention querying LINKed structures, which might be really powerful.

Checking your definition @Seref, this seems confusing for me:

Maybe “reference types of data” should be defined. Should that be “RM type” or “RM class”? (“class” would be correct in an OO environment).

Rephrased to avoid using “constraints” twice: “these constraints can be applied to …”

I understand “directly containing” as something that is parent → child, but CONTAINS allows constraining containment at any level, so I would remove “directly” to avoid a wrong interpretation.

This is difficult to follow, lots of “instances”

Maybe add a comma after “query”.

Can that be rewritten to avoid “based” twice?

IMO that should be the first sentence of the definition for FROM. Another thing is to mention the scope “over what”, like mentioning the “universe” of all queryable data is the RM, then what is defined in the FROM is a subset of that (please check my definition above).

The example and description could be before mentioning types/classes and instances, since the example is more descriptive and gives context to the formal definition. Something like:

FROM defines the scope of the query bla bla bla
basic examples, mention classes and aliases that refer to instances
heavy definition
more examples

An idea, can we ask each implementer to come up with a definition for the main clauses in AQL? Then compare and improve based on good definitions. Would love to hear what others think since this is the core of all queries.

thomas.beale · 18 February 2020 15:14

On the question of CONTAINS…

Ideally, in AQL we would use CONTAINS wherever logical containment is understood in the relevant RM. Logical containment means deletion semantics, i.e. cascaded-delete in RDBMS thinking. Now in the openEHR RM, if we do a DELETE on an EHR, we would logically cascade that DELETE through to all referenced FOLDERs, COMPOSITIONs, EHR_STATUS and so on.

To specify that formally would require a kind of reference type that can be marked as ‘composition’ or ‘association’, in the same way UML does for direct association refs between objects (i.e. black diamond versus no diamond). But for concrete types that are intended to be physical references to sub-parts, for reasons of computational convenience or whatever, there is no way in UML or BMM to directly mark them as being composition or association.

I have implemented ‘smart references’ in the past (probably we all did at some point) that have this knowledge in them, but of course it’s just a specific class, it’s not built in to the language. We could take that path in the openEHR RM: add a data element to the XXX_REF types that marks them as composition or association, or subtype them.

Doing it properly means putting it in the BMM, where you could look up EHR.compositions and discover that the logical relationship between the EHR and the target objects of the references (COMPOSITIONs etc) was indeed composition (or not). With that info, CONTAINS in an AQL query could be correctly interpreted for both EHR[x] CONTAINS COMPOSITION[arch-id] and COMPOSITION CONTAINS CLUSTER.
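
The lookup described above could work roughly as in the following Python sketch. Everything in it is an assumption made for illustration: the `MODEL` table is a hand-written toy and not the real BMM schema format, the type chain is abbreviated (intermediate RM types are omitted), and `may_contain` is a hypothetical helper rather than an Archie or BMM API.

```python
from enum import Enum


class Kind(Enum):
    COMPOSITION = "composition"  # whole/part: target is logically a sub-part
    ASSOCIATION = "association"  # plain link: target lives independently


# Hypothetical, heavily simplified model metadata (NOT the real BMM schema).
# (owner type, attribute) -> (target type, stored as a reference?, relationship kind)
MODEL = {
    ("EHR", "compositions"): ("COMPOSITION", True, Kind.COMPOSITION),
    ("EHR", "ehr_status"): ("EHR_STATUS", True, Kind.COMPOSITION),
    ("COMPOSITION", "content"): ("SECTION", False, Kind.COMPOSITION),
    ("SECTION", "items"): ("CLUSTER", False, Kind.COMPOSITION),
    ("COMPOSITION", "composer"): ("PARTY_PROXY", False, Kind.COMPOSITION),
    ("PARTY_PROXY", "external_ref"): ("PARTY", True, Kind.ASSOCIATION),
}


def may_contain(parent: str, child: str) -> bool:
    """True if `child` is reachable from `parent` through composition edges only.

    Whether an edge is stored as a reference or as a nested object is ignored,
    which is the point: CONTAINS follows the logical whole/part relationship.
    """
    seen, stack = set(), [parent]
    while stack:
        current = stack.pop()
        for (owner, _attr), (target, _by_ref, kind) in MODEL.items():
            if owner != current or kind is not Kind.COMPOSITION:
                continue
            if target == child:
                return True
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return False
```

With such metadata, a query processor could accept EHR CONTAINS COMPOSITION and COMPOSITION CONTAINS CLUSTER on the same grounds, and reject containment across a plain association such as the link to a PARTY.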

This is something I have been thinking about, and indeed, it would be easy to specify and add to the current BMM schemas. If AQL processors were to use the Archie BMM lib to read the BMMs, then the info would be right there, and everything falls into place.

sebastian.iancu · 19 February 2020 12:46

Is that not implicitly the ‘type’ attribute of the OBJECT_REF?

About CONTAINS:

This is a key aspect that has to be clear from AQL spec about CONTAINS. Perhaps it deserves a small chapter.
If we all accept this as “design pattern”, then perhaps we don’t need (now) to further engineer this (with the BMM thing above)?! But, on the other hand it might be useful to have it, if AQL processor would use BMMs.

sebastian.iancu · 19 February 2020 13:17

I don’t feel this explanation makes it better. The “universe” term is (in my opinion) not very appropriate here, and putting the CDR there would also restrict the scope to EHR only, whereas, as I mentioned earlier, I really would like to consider the DEMOGRAPHIC domain for AQL as well.

As inspiration, I kind of like the simplicity of how wikipedia defines FROM (see From (SQL) - Wikipedia ), “… will provide the rowset to be exposed through a [Select] statement”. Perhaps @Seref will find a way of simplifying his original text there, using less (openEHR) words, and keeping CONTAINS explanation for a separate chapter?

thomas.beale · 19 February 2020 15:54

Agree on that - AQL itself doesn’t know anything about ‘CDRs’, ‘EHRs’ or any other specific kind of data.

I suppose concretely FROM specifies a ‘row-set’, but really it specifies the ‘database’, in the abstract sense, which is essentially the ‘universe’ of data to which the query applies. This is also sometimes called a ‘schema’, which is a DB word meaning ‘model’.

Seref · 20 February 2020 17:53

Good catch, fixed that bit.

I’d rather keep it as it is, happy to hear suggestions that give the same meaning.

Thx, reworded that part

Nope

Yep, done.

removed it, because it actually does not define the scope. It defines the source, and scope is defined by SELECT and WHERE potentially extending and narrowing based on the source/root.

I don’t think so. A formal definition is what was requested and IMHO a formal definition should not start with an example. The bits you expand on are the rest of the section for FROM. What I’ve written above is just the introduction based on the definition.

Seref · 20 February 2020 18:09

I respectfully, but strongly disagree. I’ve seen this point made before and I meant to respond then. AQL has the potential to be a generic query language to query any underlying model, but I’m in favour of defining it strictly based on openEHR terms, based on openEHR EHRs, data types, structures etc.

With my implementer hat on, I would like a query language spec to focus on the language and terms of data that I’m processing. This is why my suggestions for a formal definition of FROM above refer to RM types, their attributes, containment in EHR etc.

I am the one who raised the overloaded semantics of CONTAINS and I’m happy to further specify it but I would rather do that based on more openEHR words, not less.

I made a second pass to simplify my definition but I’m keen to address potential adopters using openEHR-specific terminology and language.

That’s my 2 pennies of course, happy to hear what @bna and @matijap would think.

bna · 21 February 2020 07:19

I agree on this. We need a query language which fits the RM as well as possible. My experience so far is that the match could have been better. Data defined by our RM is hierarchical like trees, and with the possibility of making references it goes into a multi hierarchical graph. This is when you get challenges with today’s AQL. It, kind of, assumes a flat database schema and a flat tabular row based resultset.
I think current AQL is really good for lots of use cases. And I will not be surprised if we some day make a new specification which covers the hierarchical data better.

This could happen through revolution or evolution. Anyway it has to be a domain specific language for the EHR.

thomas.beale · 21 February 2020 10:20

Well, query language semantics should not differ across data models the language processes. The optimisations that might be possible are another thing. If I were implementing an AQL engine, I would expect to have some bags of heuristic rules for processing queries against particular RMs in particular usages, e.g. openEHR RM in EHRs; openEHR RM in HighMed research; openEHR demographics in an MPI; openEHR Task Planning data.

But I can’t see how the formal language specification can have anything in it that is specific to any particular model. Indeed, I am not aware of anything in the current grammar that is specific to the openEHR RM.

There is also the question of ‘clinical safety’ as Ian has raised in the past. Whether some other layer(s) of semantics are needed over the top of AQL in particular contexts is something to explore as well. But again, if such layers can’t rely on general query language semantics, you’d never be able to write those other layers.

The CONTAINS semantics can be quite easily specified in the BMM (or other representation) of any model; right now they are not, and so, AQL engines/services don’t know that logically, an openEHR EHR object ‘contains’ (= has sub-part) COMPOSITION, FOLDER, EHR_STATUS etc. We need to fix that. But building it in to the language itself is not the correct approach - it has to be stated in the model definition semantics, and we can do that, indeed, it would not be hard to add it to today’s BMMs with a small amount of work. Tools like the Better ADL-designer already read BMM files; in future AQL processors can as well, and all will be well with CONTAINS.

Seref · 21 February 2020 11:14

I agree, but I also cannot see how I may have suggested that. We have one data model as far as I’m concerned: openEHR RM and data based on instances of RM.

Maybe I’m missing something here but these are all RM implementations, based on the single RM specification. I cannot see why they would be called ‘particular RMs’.

There is more than one way to skin a cat when it comes to formalising something. I’m in favour of formalising AQL on top of RM. It could also be formalised based on BMM, Tree Pattern Queries as I mentioned above, or with some other, well… formalism. My understanding of formalising is ‘specifying its behaviour’ and I suggest we do that based on references to data defined by the RM, which consequently implies using the concepts and terms of the RM, as in, “FROM clause defines data elements based on RM types and constraints on RM type attributes …” etc.

I am concerned about having to resort to other and especially more generic formalisms to define/formalise AQL unless there is no way to do this without using the RM subset of openEHR specifications. The execution semantics is one example of RM not being sufficient, where I suggested the use of TPQ or alternatives, but as I said to @sebastian.iancu above, I’d still try to see that more in the ITS space and not in the AQL specification.

Well, grammar is at the syntax/lexical level, and even there you’d have things specific to openEHR if we wanted to help implementers. For example, you cannot have an archetype id token in an AQL query that is not a valid archetype id according to the RM, as in ... COMPOSITION c[myLovelyComp]... which should not even be syntactically valid because we define valid archetype ID syntax in the RM.
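
A lexical check of this kind could look like the Python sketch below. The regular expression is a deliberately simplified approximation of the archetype id shape (originator-model-class, then concept, then version) and is an assumption made for this example, not the normative pattern from the openEHR specifications; `is_archetype_id` is a hypothetical helper name.

```python
import re

# Simplified approximation of an archetype id such as
# openEHR-EHR-COMPOSITION.report.v1 (NOT the normative grammar).
ARCHETYPE_ID = re.compile(
    r"[A-Za-z]\w*-[A-Za-z]\w*-[A-Za-z]\w*"  # rm originator - rm name - rm class
    r"\.[A-Za-z][\w-]*"                     # concept name, may carry specialisations
    r"\.v\d+(\.\d+)*"                       # version
)


def is_archetype_id(token: str) -> bool:
    """Return True if `token` has the rough shape of an archetype id."""
    return ARCHETYPE_ID.fullmatch(token) is not None


ok = is_archetype_id("openEHR-EHR-COMPOSITION.report.v1")
bad = is_archetype_id("myLovelyComp")
```

A parser applying such a check would reject c[myLovelyComp] at the token level. Node ids such as at0018 would need a separate token rule.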

My points above re the semantics of CONTAINS are explained in terms of RM as you can see, I don’t need to break the self-containment of RM spec to explain CONTAINS can mean both resolving an aggregation relationship and a composition one. I’m merely suggesting we follow that approach.

Completely agree, but your comment seems to imply you don’t think we can specify query language semantics without using another formalism. I’d say query semantics can be specified within RM, but execution is different and even then that’s ITS.

They do. The fact that we have > 1 working implementations of AQL proves that they do. BMM is another way for them to know it, but then again, we’re in the ITS space.

Are you suggesting we state query related aspects in the RM? Isn’t that what you and I consistently argued against so far, especially in case of GUI aspects, and most recently in Birger’s EHR subject concern?

I’m advocating we define what AQL does based on the RM specification, and how it may do it in the ITS, whether BMM or some other mechanism.

My attempt to follow the approach I’m suggesting is above. Maybe I’m failing to understand your suggestion and I’d be delighted to be corrected or shown the error of my ways because this stuff is bloody complicated!

thomas.beale · 21 February 2020 11:34

Maybe I was not clear by what I meant when I said ‘BMM’ - I don’t mean the BMM formalism, I mean actual BMM instances, i.e. model definitions. We already have BMMs for the whole of openEHR, right here. These are the files that are consumed by tools that require a model definition.

I also have BMMs for FHIR and can make one for any model in the world. We can do the same thing with some other meta-formalism, like XMI or (maybe) JSON-schema, or whatever; we just use BMM because it works and fixes a whole lot of problems of XMI.

So what I am advocating re: specifying logical whole/part relationships, is that this semantic be defined in the BMM. (It is already in the latest BMM spec, just not in the implementations.)

If we specify this kind of thing properly in the BMM for any concrete model, then an AQL processor always processes the CONTAINS statement correctly.

Currently, AQL implementations (quite reasonably) are hard-wired to the openEHR RM, in the same way CKM is, and ADL workbench once was. We need to move AQL (and CKM) to being model-driven, and define the model-specific semantics in the model definition (BMM files, or XMI, or whatever else takes your fancy), and define the query specific semantics in the query language.

Hopefully this is clearer!

Seref · 21 February 2020 11:53

Thank you, this is indeed helpful. Allow me to allow you to help me further

a) I just cannot see what is wrong here.
b) How can anything be hard wired to RM when RM itself is technology agnostic?

I have no objections to validity of this approach, but I’m concerned about its consequences, because unless I’m missing something, this makes BMM implementation a precondition for AQL implementation. The downsides of which to me would be:

The learning curve for potential openEHR implementers, who now also have to understand BMM to understand AQL
The increase in implementation costs for potential and existing implementers.

I guess you can help me a lot more if you could tell me why defining AQL based on RM is bad (in the way you describe as hard wired)

thomas.beale · 21 February 2020 13:12

Well, the openEHR RM, at the end of the day, is just a model of data. Naturally some of us think it’s quite good, but that’s subjective.
The openEHR Demographics part of the RM is separate in the sense of not being part of the EHR, but really querying should work with it as well.

The point for a query language isn’t to be technology agnostic, the point is to be model-agnostic.

If we make it specific to some model, we have to specify something different / new just to say how AQL works for openEHR Demographics, Task Planning, or indeed, any archetyped data - including in other domains.

The one thing AQL does need to know about that is ‘openEHR-ish’ is of course Archetypes, archetype ids etc. But that’s part of the formalism layer of openEHR, not any of the models. Hence the most recent arrangement of the components into groups that follow this idea:

Re learning curve:

well we are talking about a small number of people who are all engineers and/or scientists, so I don’t think BMM will be much of a challenge. Mainly they will experience it just by using Archie, which will make it easy to use.
model-driven is the future. If it’s not BMM, it will be Ecore, son-of-UML (SysML2 maybe) or something else. We just don’t use those things today because they are out of date (no functional stuff), broken (generics, property/association semantics) and impossible to read, in the case of XMI.

Better’s ADL-designer already uses BMM to know about models; LinkEHR also reads them. Nedap’s nascent ADL tool is BMM-driven. HL7 CIMI is (or at least was until recently) using BMM. CKM will go there at some point…

pablo · 21 February 2020 23:37

I guess that depends on the definition of “scope” and “source”.

As I understand it, “source” would be “all your data” (the thing I called universe because of the mathematical set theory term, which is the “given situation” or “given state”, that could also be “domain”).

Then “scope” would be the subset of the universe that you want to focus on (still thinking as set theory here).

The SELECT is to map a projection, I like these definitions:

"In relational algebra, a projection is a unary operation written as $\Pi_{a_1,\ldots,a_n}(R)$ where $a_1,\ldots,a_n$ is a set of attribute names. The result of such projection is defined as the set obtained when the components of the tuple $R$ are restricted to the set $\{a_1,\ldots,a_n\}$ – it discards (or excludes) the other attributes."

"Projection is one of the basic operations of Relational Algebra. It takes a relation and a (possibly empty) list of attributes of that relation as input. It outputs a relation containing only the specified list of attributes with duplicate tuples removed. In other words the output must also be a relation."

And the WHERE is for filtering data from the scope, only the data that passes the filters will appear in the projection.
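
The two operations can be sketched in a few lines of Python. This follows the relational algebra definitions quoted above (set semantics, duplicates removed) and is not a statement about how AQL engines behave; the `rows` data and the `project` and `select` function names are made up for the example.

```python
def select(relation, predicate):
    """Selection (WHERE): keep only the rows that pass the filter."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection (SELECT): keep only `attributes` and drop duplicate tuples."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attributes, key)))
    return out


# Hypothetical rows standing in for the data set defined by FROM.
rows = [
    {"ehr_id": "a", "name": "systolic", "magnitude": 150},
    {"ehr_id": "a", "name": "systolic", "magnitude": 160},
    {"ehr_id": "b", "name": "systolic", "magnitude": 120},
]

# WHERE magnitude > 140, then SELECT ehr_id
result = project(select(rows, lambda r: r["magnitude"] > 140), ["ehr_id"])
```

Here `result` holds a single row for EHR "a": the filter removes the third row and the projection collapses the two remaining rows into one.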

I know everyone here might have their own definition or idea of things. Maybe we need to go down to the basic definitions that we will agree on, because we might be talking about different things. Of course, it depends on how strict or “mathematically correct” we want to be in the spec. It’s also valid to define our own terms in the context of openEHR, but we need to have good definitions to avoid interpretation issues.

pablo · 21 February 2020 23:41

I agree, I shouldn’t mention CDR, I was thinking of data storage.

And I agree we should explicitly say AQL expressions could be used to query any archetyped RM, including openEHR EHR and DEMOGRAPHIC, but could be used for other RMs. That should also be extended to the examples, which are all focused on EHR.