I have some follow-up questions about AQL implementation, on the topic of CONTAINS formalism, which was heavily discussed 2019-2021. A lot of ‘teasers’ and example-based-debates where published here on discourse; it takes hours to read them and perhaps even days to evaluate and comprehend entirely.
For the following, I did some tests with EHRbase, and to keep it simple, you will have to assume that there is a single EHR with a single composition in the entire CDR. This composition contains 2 OBSERVATIONs. Something like this:
C
O1
O2
Questions:
The SELECT part generates rows whenever the path points to multiple objects?
The SELECT o FROM EHR CONTAINS OBSERVATION o returns two rows, as O1 and O2; this is equivalent to SELECT c/content FROM EHR CONTAINS COMPOSITION c, which returns also the same two rows, with O1 and O2.
Asking because reading the specs I would have been expecting that ‘processing’ candidate nodes/rows is done only by CONTAINS, followed by filtering step according to WHERE conditions, then extracting data as SELECT express it. The behavior/flow is anyways not in the specs - so I wonder if all AQL implementations works the same way?
Does the SELECT c FROM EHR e CONTAINS COMPOSITION c CONTAINS OBSERVATION oreturn two (duplicate) rows, containing the same c
This is related to discussion on AQL - the simplest possible question?
In relation with question on AQL - what do you expect as results for these example - #5 by ian.mcnicoll, an AQL like SELECT x, y FROM EHR e CONTAINS (OBSERVATION x OR OBSERVATION y) should return permutation with nulls? EHRbase returns 4 rows: [O1, O2], [O1,O1], [O2,O2], [O2,O1] and does not return things like [O1, null] or [O2, null]. Is this behavior settled and therefore it needs to be added to specs?
Is there any edge case where writing AQL with archetype predicates in CONTAINS is not equivalent to same AQL with those predicates expressed as WHERE conditions? in other words, is SELECT ... FROM EHR e CONTAINS (OBSERVATION x[pred1] OR OBSERVATION y[pred2]) always equivalent to SELECT ... FROM EHR e CONTAINS (OBSERVATION x OR OBSERVATION y) WHERE x/archetype_node_id="pred1" AND y/archetype_node_id="pred2", or are there exceptions?
In a haste, I will just comment that I remember those debates and I hoped at the time that the major implementers (which definitely includes DIPS (Bjorn Naess), but we haven’t heard from them for some time) would reach some consensus on how to handle this “cartesian product dilemma”. At least us and DIPS handled it differently, and both not very logically consistently, so we agreed in principle we need to define new rules (which would break compatibility for all of us, but ensure future cross-platform compatibility). Unfortunately we never had time to finish this work and I hoped the “querying mechanisms working group” would take that over, but we haven’t ever met yet (to the best of my knowledge).
This is the issue I’ve been raising literally for years now. AQL has no execution model or semantics in its definition.
We don’t have the language (i.e. formalism) to express this behaviour if it was to be settled. Ergo, we don’t have the language to specify it in the specs. How would we write it into the spec? We would need an abstraction, something that conceptualises the data, the operations etc. Gremlin has traversal as that abstraction, SQL has relational algebra, Xquery and XSLT have the XML data model (nodes, DOM etc). We don’t have that for AQL. Re EhrBase: if my memory is correct, EhrBase team took a pragmatic approach and chose to implement eliminating some of the rows to make it easy for the user. I.e. implicit over explicit.
Tricky one. Hard to answer without reading the AQL (E)BNF grammar and it’s been too long a day for me to read that so I’ll say it should be the same for declaring predicates within the scope of an RM object (OBSERVATION etc in your example) because both support boolean ops and precedence via parenthesis (that’s why I need to check the grammar). However, if WHERE clause now has the ability to define expressions that allow referring to other aliases, that’s something that is unlikely to be supported in [...] predicates. As in WHERE x/bla/.../value > y/bla.../value . You cannot express that constraint in a way that makes sense by using the [...] predicates in the CONTAINS clause. Some years ago, WHERE did not support the kind of expression I wrote but I suspect it may be able to do that now, so that would be your edge case.
In the end, IHMO we won’t be able to clarify, agree on, or even discuss these details unless we have a mutually agreed representation of query and query execution semantics. @chunlan.ma and @heath.frankel (primary and secondary authors of AQL) adopted a hybrid of XPath and SQL query models as far as I know, but they did not specify it, unless they have a paper published somewhere that I’m not aware of.
I can go back to my cave now, having yelled the same thing in the village square for the bazillionth time
I was not referring/asking about syntactical equivalency, not even quite semantic (as CONTAINS predicate is something else then WHERE filters) but more on processing part of the engine an the expected results. I have the impression that adding CONTAINS with predicates influences the results in a different way then defining equivalent condition as WHERE constraints - therefore my question above.
I know about those failed attempts from the past - but it should not stop us. My questions are hitting this gray zone of specs which we need to address in SEC. But here I’m just trying to name to a few concrete issues I’m facing, and hope to get more explanations/thoughts about how AQL works or should work.
Regarding how to specify this AQL formalism, I rather think at this time that publishing a set of examples (data+query+results+explanation) will just be enough. EHRbase has a couple of test-cases which are good candidates to start documenting the formalism by example. It is just that I don’t know if those applies to other AQL implementations or are strictly testing EHRbase features.
It has been hard to get that AQL discussion going, in part, I suspect because it is really pretty difficult and the very people who can engage are pretty busy with day jobs. However, as we get closer to formal conformance, federation and e.g SMARTon openEHR, the market is starting to demand alignment and clarity, so perhaps the business case to release time will become easier.
I guess the headline question for me ( well out of my hacker comfort-zone) is whether to focus on practical alignment, as and when the divergences become apparent, or whether it is worth making the case for that core piece of work on agreed execution principles, which might take longer but to get started but will pay dividends down the track.
I’d rather respond to this in a SEC discussion. It is hard to comment on this without introducing commercially sensitive aspects into the conversation. Every “should” comes with a price tag.
For the implementation of EHRbase v2 we had discussed AQL semantics within the team:
c/content will return all objects within c matching the path.
As selecting c/content will not return an array of objects, additional rows are produced.
The query returns two (duplicate) rows, containing the same c.
The FROM provides the objects that the query is operating on,
so not mentioning these objects in the SELECT does not remove them from the result set.
Otherwise, as an extreme example, select 1 from COMPOSITION would always return an empty result.
x and y are left-joined to e, with the constraint that at least one of the objects must exist.
Therefore, null will only be returned if one of the joins produces no results.
x/archetype_node_id=pred1 can only be true for an existing x.
Hence,
SELECT ... FROM EHR e CONTAINS (OBSERVATION x OR OBSERVATION y)
WHERE x/archetype_node_id=pred1 AND y/archetype_node_id=pred2
would be equivalent to
SELECT ... FROM EHR e CONTAINS (OBSERVATION x[pred1] AND OBSERVATION y[pred2])
At the moment, the OR cannot be emulated by WHERE constraints because EHRbase does not yet support the EXISTS operator (and null comparisons).
SELECT ... FROM EHR e CONTAINS (OBSERVATION x[pred1] OR OBSERVATION y[pred2])
would be equivalent to
SELECT ... FROM EHR e CONTAINS (OBSERVATION x OR OBSERVATION y)
WHERE
(x/archetype_node_id=pred1 AND y/archetype_node_id = pred2)
OR (x/archetype_node_id = pred1 AND NOT EXISTS y)
OR (NOT EXISTS x AND y/archetype_node_id = pred2)
Or inverted:
(x/archetype_node_id=pred1 OR y/archetype_node_id=pred2)
AND (x/archetype_node_id=pred1 OR NOT EXISTS x)
AND (y/archetype_node_id=pred2 OR NOT EXISTS y)
Given the premise from [SPECQUERY-33] - openEHR JIRA that x = null is true iff x is missing, instead of NOT EXISTS x, x=null could be used.
We are, however, not convinced of this interpretation because it would imply bivalent logic for AQL, not the three-valued logic of SQL, which AQL is based on.
It seems uncontroversial that null is not smaller than, larger than, or equal to an existing value. Many programming languages require explicit handling of null in these cases.
A comparison of the following queries illustrates that bivalent logic can easily become confusing:
SELECT h/events
FROM HISTORY h
WHERE h/events/sample_count >= 1;
SELECT h/events
FROM HISTORY h
WHERE h/events/sample_count < 1;
SELECT h/events
FROM HISTORY h
WHERE NOT (h/events/sample_count < 1);
With bivalent logic the first and third query would not return the same result.
The first query would return all INTERVAL_EVENTs with a sample_count of at least 1.
The second one would return all INTERVAL_EVENTs with a sample_count of less than 1.
The third query, however, would, in addition to all INTERVAL_EVENTs with a sample_count of at least 1,
return all INTERVAL_EVENTs without sample_count as well as all POINT_EVENTs.