I am creeping towards a much better understanding of why I, at least, have struggled to understand some of the principles here. Here is my current attempt …
- The ‘guts’ of the query is actually in the FROM clause and below. This defines which object structures to return and the filters (WHERE clause) on any matching objects.
so FROM COMPOSITION C where c/start_time/value = ‘2020-03-03’ says “find me all composition objects with that start_time.”
then critically it means “each time you find a unique object matching those criteria, create a new row in the resultset.”
So far so good - we get a row back for each matching object
Now SELECT kicks in and says, from that row you have given me, here are the specific objects or scalar values that I really want to see.
Critically, SELECT is a filter on each object in the array of rows returned by the FROM clause, each of which holds an object.
If we work with Bjorn’s example we will get 2 compositions (assuming they have the same start_time)
- We can apply the same principle with FROM OBSERVATION o where o/origin/value = ‘2020-03-03’
If we work with Bjorn’s example we will get 3 Observations (assuming they have the same origin)
In neither example is the WHERE clause necessary - just to demonstrate that the first phase of the process is the query into the FROM space and any WHERE clauses to return an array of matching objects - one per row. I’m also ignoring FROM EHR e for brevity.
So far again so good, no surprises in the returned content in either example?
- but what happens when we start nesting the CONTAINS clause.
FROM COMPOSITION C CONTAINS OBSERVATION o where c/start_time/value = ‘2020-03-03’ and o/origin/value = ‘2020-03-03’
Until about an hour ago I would have said that this means
“find me all composition objects with that start_time which also contain OBSERVATION o with that particular origin value”
but I now understand that actually means “find me all Observation objects with that particular origin whose parent composition objects have that particular start_time” and return the whole structure - parent composition and matching observation.
i.e. executing a FROM clause will always generate matches at the lowest level of object matching - one row for each object match - if the matching object happens to share a parent composition with another matching object, it doesn’t care.
So in Bjorn’s example, we now get 3 rows - one for C1 that has one Observation match, 2 for C2 that has two observation matches.
So now the critical thought-error (for me at least) - by specifying SELECT c or SELECT c, o, or anything else, I am somehow controlling the object rows that are matched - I’m not, I am merely filtering the object that the FROM clause returns in each row and that will always look for unique matches at the furthest end of the data tree it has been asked to traverse.
- FROM COMPOSITION c CONTAINS OBSERVATION o CONTAINS CLUSTER x where c/start_time/value = ‘2020-03-03’ and o/origin/value = ‘2020-03-03’
So this means…
Find me any Cluster objects which along with their parent structures, and generate an object row for each match,
if each of those OBSERVATIONS has 2 clusters we should end up with 6 rows because the ‘target’ object is a CLUSTER match, which (because of CONTAINS composition, expects to also return it within the context of the parent COMPOSITION.
Anything I express in SELECT has no effect on that number of matches, it merely filters what is returned in each row.
I’ll stop here for a lie down but, at least at one level now my head understands what is happening.
There also does appear to me to be a requirement to allow the ‘projections’ approach as well as the ‘single-composition approach’ so I like Seref;s suggestion of using EXIST as an alternative to CONTAINS to control that aspect, so that at very least we can use the EXIST alias as part of the WHERE clause.
There are further challenges around SELECT but if we can agree an approach