AQL - the simplest possible question?

bna · 27 March 2020 07:14

Given a CDR with only two compositions and one EHR.
Composition 1 (C1) has one OBSERVATIONS.
Composition 2 (C2) has two OBSERVATIONS.

And you query with the following AQL:

SELECT c FROM COMPOSITION c CONTAINS OBSERVATION o

What is the expected result?

Alternative 1:
--------
| c |
--------
| C1 |
--------
| C2 |
--------
Alternative 2:
--------
| c |
--------
| C1 |
--------
| C2 |
--------
| C2 |
--------

Seref · 27 March 2020 09:43

Alternative 2. Happy to discuss any particulars you may have in mind.

bna · 27 March 2020 09:57

Interesting answer @Seref - your answer is the same as the developers. I and a few no-developers argued for alternative 1. Why?

I ask for a composition which has an observation and there is only two of them --> two rows.

What we agreed on was that “my intention” actually was some structure with “EXISTS observation o” or DISTINCT© .

Seref · 27 March 2020 10:24

You’re not wrong and neither your developers are. You’re hitting an interpretation problem that I was hoping to discuss and hopefully clarify/document.

The contains statement has two semantics which are related. The most fundamental one is that at least single instances of X and Y “exist” for X CONTAINS Y to have a match. You can (and you are doing so) think this as a “filter” that goes through all data and retrieves the ones that fit the criteria.

However, the operands of the CONTAINS are also placeholders for identifying actual data, this is the pattern matching semantics, which depends on filter semantics above to produce a true/positive result. If there are are not at least one instances of COMPOSITION and OBSERVATION, there is not point in thinking about selecting anything, hence the dependency. However, placeholder semantics introduces the permutative behaviour, because for the second composition, you have two OBSERVATIONs that can be used to satisfy the relationship you described, so you have to return two rows. It does not matter you’re not including “o” in your select clause, there are two matches for that pattern, if you choose to select only composition, it won’t change that fact. this is exactly what happens when joins produce nxm results and select clause of sql does not include the ‘differentiator’ column.

If you did not select the placeholders but used actual hits, you’d end up with alternative 1. At the moment, AQL does not support this semantics, but it would be something like this (in an imaginary version)

SELECT C FROM COMPOSITION C WHERE C CONTAINS OBSERVATION O

Once you define a data item in the FROM clause, you cannot avoid its permutative side effects, unless you introduce some keywords to explicitly do that, which is a whole black hole on its own

sebastian.iancu · 27 March 2020 12:08

It is indeed the 2.)
It wasn’t obvious for me, given that the example is quite simplistic and may put you on wrong track… but in the end your explanation @Seref explains why do you get 2.)

ian.mcnicoll · 27 March 2020 12:37

I haven’t read this closely enough to try to understand why (2) is what will happen, though I agree with Bjorn that (1) is what a ‘naive’ user might expect. One way or another I do think we need to get ‘the query layer’ to meet ‘the man on the bus’ perspective., but first it would be great if we could be sure that all current CDRs would produce (2) - that t least demostrates consistency.

Half of the battle right now is not knowing if an ‘unexpected’ AQL result is because basically I am ‘ignorant’ (in the nicest sense) or if it just the way that a particular vendor has implemented it. SiO iwill be pushing hard for something to simplify but happy to get the consistency sorted out first.

Ian

sebastian.iancu · 27 March 2020 13:08

At this time (current implementation) I will also return 1), but has be changed if specification will state otherwise and explain why.
So all these examples by @Seref and @bna are really important, and we should think about how to include/attach them to specifications

Seref · 27 March 2020 13:40

Your input is always hugely valuable, and I’m always happy for you to push hard: if you don’t think this makes sense to you as a user of AQL, say so.

However, as I said to you in the past; we may not be able to find the simplifications that feels right to you without breaking consistency or making things inconvenient/error-prone for another user profile. In that case, we should be ready to consider another layer on top of this, GraphQL being one suggestion that came from you and is worth looking at.

thomas.beale · 27 March 2020 22:21

A possible articulation of the ‘naive’ point of view:

If we think of the part that comes after the FROM as a filter statement on the total possible instance space (all possible instances of the first mentioned class, i.e. COMPOSITION) that selects instances from that space that match the CONTAINS, and uses the result as the basis for further processing (perform the WHERE and then return the SELECTed elements) … then I would not expect to see duplicate objects in that filtered data space … or are we saying the processing the SELECT generates the duplicates?

bna · 28 March 2020 07:56

I published the example here: https://github.com/bjornna/openehr-conformance/blob/master/aql/example_comp_obs.adoc

The example illustrates how contains has two different semantic meanings.

In the first example it is used as a filter on data. In example 2 it is used as an inner join on data. The interpretation depends on how the client defines the select.

One way to implement the datastructure above might be with a simple database model. Two tables where needed. COMPOSITION and OBSERVATION .

Simple databasemodel for the example

Table COMPOSITION
|----|
| ID |
|----|
| C1 |
| C2 |
|----|


Table OBSERVATION
|---------|
| ID | CID|
|---------|
| O1 | C1 |
| O2 | C2 |
| O3 | C2 |
|---------|

CID = Reference key to the COMPOSITION

SQL equivalent to example 1

select c.ID from COMPOSITION c inner join OBSERVATION o on o.CID = c.ID

SQL equivalent to example 2

select c.ID, o.ID from COMPOSITION c inner join OBSERVATION o on o.CID = c.ID

pablo · 28 March 2020 16:22

If the contains is considered as “contains any” (at least one), then it doesn’t matter how much OBSERVATIONS the COMPOSITION contains, the condition is true if there is at least one OBSERVATION, so I don’t think showing the all the internal permutations to the client is needed.

To check a “contains any” I guess an EXISTS or IN could be used, something like:

SELECT c
FROM composition c, observation o
WHERE c.id IN (
SELECT c1.id
FROM composition c1, observation o1
WHERE o1.parent = c1.id
)

OR

SELECT c
FROM composition c, observation o
WHERE EXISTS (
SELECT 1
FROM composition c1, observation o1
WHERE o1.parent = c1.id
)

Of course I don’t know what’s the mechanism implementers are using or the criteria they chosen to follow.

bna · 28 March 2020 18:41

I agree on this. But the real question is what current clients expect when running such a query. As we all know this is not a theoretical exercise. Several vendors have running systems supporting such queries.

We, DIPS, does disagree internally on how to interpret the query.

As far as I can see the current and previous Oceans guys seem to disagree as well.

Personally I tend to disagree with myself on this.

I need help. Someone to tell the world the one and only mathematical correct way to interpret this simple query.

pablo · 28 March 2020 20:02

The issue I see here is interpreting CONTAINS as a JOIN instead of an EXISTS or IN, because if “o” is not used elsewhere on the query there is no need to retrieve it with the JOIN, it would be sufficient to check if “o belongs to c”. Of course if the developer’s interpretation of CONTAINS is “that is always a JOIN”, then you will always get the permutations even without an explicit projection of the things that were JOINed.

Of course if the SELECT uses “o” in the projection, I agree with Seref, you need to retrieve the data for the OBSERVATION to replace the data placeholder defined by “o”.

If “o” is mentioned in the WHERE, it might also be possible to do EXISTS/IN instead of JOIN, but might depend on the condition. With some AQL like:

SELECT c
FROM COMPOSITION c contains OBSERVATION o
WHERE o/.../value > 140

Might be translated to some SQL like:

SELECT c
FROM composition c, observation o
WHERE c.id IN (
  SELECT c1.id
  FROM composition c1, observation o1
  WHERE o1.parent = c1.id AND o1.id IN (
    SELECT o2.id
    FROM OBSERVATION o2, DATA_VALUE d
    WHERE d.value > 140 AND d.path = ... and d.parent = o1.id
  )
)

Of course this depends on each implementation, I’m just thinking about what would be a possible interpretation + implementation.

So I think if the “o” is not in the SELECT, the results should be just two rows, and if the “o” appears, you need the permutations in the result.

bna · 29 March 2020 05:20

That’s my gut feeling on this as well.
And I like the way you think and communicate with expanded AQL. We’ve learned that most AQL are shorthand and convenient syntax for the more verbose and assembly version of AQL.

Which means is:

It must always be possible to expand an AQL to its full form,
And in the full form the rules of interpretation is clear when it comes to implementation logic.

Seref · 29 March 2020 10:03

Almost all the major/mainstream query languages have different clauses for defining the upstream set (FROM in SQL) and for defining the downstream set (SELECT)
Sql, Sparql, XPath, Xquery all follow this design. The downstream is usually a mapping of upstream, sometimes unmodified, sometimes with a transformation. AQL also follows this approach

From this perspective, your gut feeling and Pablo’s comments imply connecting the upstream and downstream semantically. I could probably come up with an example that explains why this is not a good design choice but I’ll be lazy and say that you’d be taking a different path than all the languages above.

In this situation, I prefer to ask myself, “do I think I got this right and all the others got it wrong?”. For me, personally, the answer is no in this case. that’s not the only decision component, I have more or less a brain of my own and I can see how mixing upstream and downstream semantics can be problematic, but as I said I’ll take the opportunity to be lazy here.

Now, this doesn’t mean you have a legitimate requirement, which is the select data based on a pure filter/exists semantics without having to deal with the permutative behaviour.
What I suggest is we look for a feature of the language (aql) that provides what you’re looking for, and add it if it doesn’t exist. That’s how it works in most languages above: “how do I do this with X”.

Sometimes the feature won’t exists as a first class member of the language, as is the case here, so we can either work around it (using functions etc) or add it as a first class member.

Please look at my from-the-top-of-my-head suggestion above, where I introduced some sort of exists semantics using the WHERE clause. I’ll re-write that with a slight twist:

SELECT C from EHR E CONTAINS COMPOSITION C WHERE EXISTS(C CONTAINS OBSERVATION O)

Now we can say that EXISTS changes the behaviour of CONTAINS and implies a first-hit-is-enough semantics. Which is what SQL does as well. Because SQL users also came up with the same requirement you have.

This keeps us from overloading the semantics of FROM based on what’s in the SELECT, which is a better design for a query language (as evidenced by all other langs following it etc…)

It is incredibly difficult to keep the correct balance between language design, language implementation, performance and user satisfaction/productivity. In a multidimensional optimisation like this someone is bound to walk away less than completely satisfied,but that’s for the good of the whole ecosystem.

Please consider my suggestion above, sleep on it, and discuss it with your team as you see fit. Happy to hear your feedback on it, and thanks for asking the question in the first place.

thomas.beale · 29 March 2020 12:38

I agree. When AQL was originally specified, the CONTAINS operator was understood as a filter for reducing the scope of the data on which the WHERE and then the SELECT would run.

From the current spec: Section 3.10.2.3 Containment:

Since archetypes are in hierarchical structure, AQL has a containment constraint which specifies the hierarchical relationships between parent and child data items. The FROM clause utilises this hierarchical constraint along with class expression to determine the data source to which the AQL query is applied.

Now it may be that people’s ideas of what CONTAINS means have changed over time, but if so, I’d say that’s different from the spec.

This doesn’t mean that the permutational view of the FROM part is wrong of course, but if we do want that view rather than just straight filtering, I think we will need to develop a plain language explanation that gives it some sense … which might be this:

This may also do people’s heads in, as it is a pretty weird idea that the SELECT projection will change the FROM logic.

I’d still prefer to work from the relatively simple idea that what is in the FROM part is a filter on the total possible data implied by the types mentioned in the FROM part, to generate the actual data source for the WHERE and SELECT processing. And if permutations are required, then we need a clear explanation of how to understand them. I suspect there is a gap between the work @Seref already did to prove this, and the ‘common sense’ explanation that would need to be comprehensible to ‘average AQL authors’?

ian.mcnicoll · 29 March 2020 22:01

I am creeping towards a much better understanding of why I, at least, have struggled to understand some of the principles here. Here is my current attempt …

The ‘guts’ of the query is actually in the FROM clause and below. This defines which object structures to return and the filters (WHERE clause) on any matching objects.

so FROM COMPOSITION C where c/start_time/value = ‘2020-03-03’ says “find me all composition objects with that start_time.”

then critically it means “each time you find a unique object matching those criteria, create a new row in the resultset.”

So far so good - we get a row back for each matching object

Now SELECT kicks in and says, from that row you have given me, here are the specific objects or scalar values that I really want to see.

Critically, SELECT is a filter on each object in the array of rows returned by the FROM clause, each of which holds an object.

If we work with Bjorn’s example we will get 2 compositions (assuming they have the same start_time)

We can apply the same principle with FROM OBSERVATION o where o/origin/value = ‘2020-03-03’

If we work with Bjorn’s example we will get 3 Observations (assuming they have the same origin)

In neither example is the WHERE clause necessary - just to demonstrate that the first phase of the process is the query into the FROM space and any WHERE clauses to return an array of matching objects - one per row. I’m also ignoring FROM EHR e for brevity.

So far again so good, no surprises in the returned content in either example?

but what happens when we start nesting the CONTAINS clause.

FROM COMPOSITION C CONTAINS OBSERVATION o where c/start_time/value = ‘2020-03-03’ and o/origin/value = ‘2020-03-03’

Until about an hour ago I would have said that this means

“find me all composition objects with that start_time which also contain OBSERVATION o with that particular origin value”

but I now understand that actually means “find me all Observation objects with that particular origin whose parent composition objects have that particular start_time” and return the whole structure - parent composition and matching observation.

i.e. executing a FROM clause will always generate matches at the lowest level of object matching - one row for each object match - if the matching object happens to share a parent composition with another matching object, it doesn’t care.

So in Bjorn’s example, we now get 3 rows - one for C1 that has one Observation match, 2 for C2 that has two observation matches.

So now the critical thought-error (for me at least) - by specifying SELECT c or SELECT c, o, or anything else, I am somehow controlling the object rows that are matched - I’m not, I am merely filtering the object that the FROM clause returns in each row and that will always look for unique matches at the furthest end of the data tree it has been asked to traverse.

FROM COMPOSITION c CONTAINS OBSERVATION o CONTAINS CLUSTER x where c/start_time/value = ‘2020-03-03’ and o/origin/value = ‘2020-03-03’

So this means…

Find me any Cluster objects which along with their parent structures, and generate an object row for each match,

if each of those OBSERVATIONS has 2 clusters we should end up with 6 rows because the ‘target’ object is a CLUSTER match, which (because of CONTAINS composition, expects to also return it within the context of the parent COMPOSITION.

Anything I express in SELECT has no effect on that number of matches, it merely filters what is returned in each row.

I’ll stop here for a lie down but, at least at one level now my head understands what is happening.

There also does appear to me to be a requirement to allow the ‘projections’ approach as well as the ‘single-composition approach’ so I like Seref;s suggestion of using EXIST as an alternative to CONTAINS to control that aspect, so that at very least we can use the EXIST alias as part of the WHERE clause.

There are further challenges around SELECT but if we can agree an approach

bna · 29 March 2020 22:19

Now we’re talking. This is the discussion I have wanted us to have!!
I have to sleep on the responses and come back to you tomorrow.
Thanks!

thomas.beale · 29 March 2020 23:19

Just so we know where we are diverging, the original spec would say the meaning is:

from the set of Compositions that contain Observation[matching some predicate], filter using the Where clause to generate a matching set of data objects; then apply the Select, which extracts the requested data items, to generate the result rows.

The usual understanding of SELECT FROM WHERE in SQL is
SELECT cols
FROM table/view
WHERE row match condition

i.e.:

the FROM states the initial table (which may be a view)
the WHERE indicates which rows to keep
and the SELECT says which columns from the remaining set of rows (= reduced table/view) to give me back

I’m not yet convinced that we want to go away from that basic understanding.

pablo · 30 March 2020 03:02

It depends on how you think about clauses. For instance, by looking at the MySQL documentation, there is no definition of FROM or WHERE clauses, the definition is for SELECT, and FROM, WHERE, etc. are all part of the SELECT clause. That is of course to differentiate the main top-level clauses: SELECT, INSERT, UPDATE, DELETE, … So if we think of SELECT being the top-level clause, it might not sound so weird to change the internal behavior in terms of how the projections are done in the SELECT. The issue here is AQL never specified the internal processing rules as an specification, so anything we specify would be OK, even the SELECT modifying other clause processing rules.

I think we should actually be looking at how DBMS define these things, since we are newbies in this area.

Considering the case exposed, there are two views: what the client wants and how the developer implements the AQL processing. The client creates the query and wants a specific result, if they say SELECT c, they only want the compositions, the rest is internal and the client shouldn’t be exposed to the internal processing rules, that is: for the user is weird to get multiple results with the same composition, because they don know the internal processing rules and they don’t know how many observations are in each compo. And if they need the observations, they will put that in the SELECT projection and actually see the permutations.

Until the internal rules are specified we will have that mismatch between users and developers, and between different implementations. And we need to create that specification based on what we think is correct, which sometimes can be subjective. What I tried to do above is to analyze this from both points of view, trying to be considerate with what users (query creators and who receive the query results) expect, and being strict with the rules for developers, if that implies that depending on how the SELECT is defined, other clauses need to behave a little different (generate different SQL or whatever underlying query language), I don’t see that as “weird” (either way I prefer “weird” over “wrong”).