AQL - the simplest possible question?

ian.mcnicoll · 30 March 2020 08:52

In that case. we have a shared understanding. I don’t think I am diverging meaningfully from what you said.

the FROM states the target object structures
the WHERE indicates which rows to keep
and the SELECT says which columns from the remaining set of rows (= reduced table/view) to give me back

All of that makes perfect sense but it still leaves out what for me is the critical breakthrough in my understanding…

Once you start nesting CONTAINS, the target object is actually the combination of all the structures within the CONTAINS tree, with a focus on the most granular object.

so CONTAINS COMPOSITION CONTAINS OBSERVATION means

give me all the OBSERVATIONS (each within its own parent COMPOSITION)

and not give me all the COMPOSITIONS which contain OBSERVATIONS (which is the ‘natural’ meaning, I suspect ro most people- and I think the source of most of the confusion (certainly mine).

I am asking for each individual OBSERVATION to be given a row of its own, in the context of its partent COMPOSITION.

Now that I understand this is how things work, that is quite useful because it may actually be that I want to tease out each individual observation , along with e.g a composiitonID.

SELECT c/uid , o FROM EHR e CONTAINS COMPOSITION C CONTAINS OBSERVATION o

though in general I will just want the a single composition, which is where Seref’s suggestion for EXISTS would work - is there a reason why a simple EXISTS might not be enough - it is important that the EXISTS clause accepts an ALIAS so that it can be used in WHERE clauses

SELECT c/uid , o FROM EHR e CONTAINS COMPOSITION EXISTS OBSERVATION o WHERE o/origin/value >= “2020-03-04”

So @Pablo, I don’t think there is actually a ‘technical view’ and a ‘non-technical view’ (as I also used to argue). The way that CONTAINS works is non-intuitive to a non-CDR developer audience but it is relatively easy to explain, once understood. Introducing something like EXISTS solves the problem for me (or at least part of it).

AFAIK all of the CDR developers have adopted the same approach to CONTAINS so the last thing I want to do is disrupt that, especially if it is correct from a comp science perspective.

I think we have solution to this part of the permutation problem.

Do we need

EXISTS(COMPOSITION C contains OBSERVATION o)
or is
EXISTS(OBSERVATION o)

sufficient?

Seref · 30 March 2020 09:07

first of all, thank you for putting the effort to think on this.

I’d advice against this version initially. EXIST should belong to WHERE, because you can reference all aliases from WHERE and you can define complex structural and object-constraint EXIST conditions based on that. Going back to my upstream, downstream distinction as adopted by other languages, EXIST is more appropriate as a construct at the WHERE level.

That being said, it can be used under FROM for some advanced cases but I’d suggest we don’t do that as the first step, hence the emphasis on initially.

Seref · 30 March 2020 09:08

oh, also note that if you select both c and o and use EXIST, you’re at the mercy of the implementation if there are more than 1 observations. Because implementations cannot give you a guarantee as to which observation amongst N will be retrieved so EXISTS is more useful when you want to get the c that fits the criteria and dive into it later, if you want to , after the results are returned.

ian.mcnicoll · 30 March 2020 09:33

As long as the OBSERVATION o alias declared in the WHERE clause is valid, I can see that makes sesne.

SELECT
c
FROM
EHR e
CONTAINS COMPOSITION c
WHERE EXISTS (COMPOSITION C CONTAINS OBSERVATION o) and o/origin/value = “2020-03-12”

Seref · 30 March 2020 09:34

almost there. just fix that /origin/value -> it should be based on on alias

ian.mcnicoll · 30 March 2020 09:35

Typo fixed for you

Seref · 30 March 2020 09:36

not yet, should go into the paranthesis and should be based on a nested WHERE. you just invented subqueries in AQL, congrats!!

ian.mcnicoll · 30 March 2020 09:36

Now you are just being annoying

Seref · 30 March 2020 09:38

sorry about that gotta go now, I shouldn’t be doing this on company time, will be back later

thomas.beale · 30 March 2020 09:46

That’s probably the best statement of the ‘permutation interpretation’ of the FROM part of an AQL query I have seen.

The ‘naive’ version of the semantics of AQL (or any query language) is that it is ‘the (distinct) COMPOSITIONs’, i.e. not the ‘OBSERVATIONs’ that form the result of the FROM part. We could call this view the ‘minimalist interpretation’ of the possible data source (I’d prefer to use the word ‘source’ rather than ‘target’ here by the way); the latter view is a ‘maximalist interpretation’.

In the minimalist interpretation, you will get the effect of individual OBSERVATIONs (or smaller things) being cloned anyway, when the query processor executes the SELECT part. So it seems to me that having the ‘cloning’ effect twice does not make sense.

But it may be that if you take the maximalist interpretation (i.e. cloning occurs due to the FROM, on the basis of the finest-grain particles mentioned), then in fact there is no cloning needed in the SELECT part (the WHERE part just reduces the individual instance space, so it doesn’t make any conceptual difference). It would be interesting to know if this is actually true, and also if that is what (some) current implementations are doing.

Another interesting question is why the maximalist interpretation generates unexpected / unwanted clones (i.e. the problem that has been discussed regularly for a few years now).

We could take two approaches to this situation:

whatever all the AQL implementations are doing, even though we think it is generating extra permutations in the result set, is what the AQL spec should state formally;
we might say that we now understand what the problem is with current implementations, and we can also fix it, by removing cloning in the FROM phase, and generating instead the minimalist proper set in that phase, before performing the other processing.

This group and others out there should debate this of course, but my nagging feeling, without having analysed this in fine enough detail, is that two phases of object cloning are occurring, and I think this is creating the ‘unexpected rows’ that we have been talking about for so long.

Seref · 30 March 2020 10:19

almost. that is a special case of matching a tree structure. Because following your line of thought, in a slightly more complex query, the query then becomes a set of statements that define paths starting from the leaf. If you had a logical operator, you’d then have > 1 branches and you’d have to write

give me all CLUSTERS, with parent OBSERVATIONS, with parent compositions
AND
give me all CLUSTER with parent INSTRUCTIONS, with parent compositions
WHERE
the parent of INSTRUCTIONs and OBSERVATIONs are the same composition.

You just defined a tree structure as the constraint for the FROM clause. That’s how I ended up with tree pattern queries around 2013 or so.

If you take a look at the graphical notation I’m using the teasers, with your thinking and what said above in mind, it would hopefully make more sense.

matijap · 10 April 2020 06:29

Sorry I’m late to the party; I’m reading this thread just now and haven’t gotten to the end yet, but I’d just like to say I, as an engineer and a mathematician, think that (2) is actually the correct answer and I am sad to report that our implementation actually returns (1). I haven’t researched why that is the case yet.

birger.haarbrandt · 7 September 2023 08:15

We currently revisited this topic and the vitagroup engineers agree that it should be (2). We might need to make this a bit more formal and add to the spec

sebastian.iancu · 7 September 2023 10:04

Do you also support (or have implemented) the DISTINCT (see Archetype Query Language (AQL)) ?

birger.haarbrandt · 7 September 2023 11:24

yes, this is implemented