Safety features in AQL: subject

thomas.beale · 13 November 2019 23:56

As Seref said earlier, we don’t want to build special cases into a general language. And normal queries should return any matching data, that’s also basic.

Ideally we find a way to make it possible and easy to state subject = self or /= self in a way that is completely general (e.g. could be used to only get INTERVAL_EVENTs, not all EVENTs). Pablo’s suggestion of a type predicate is worth considering. If ‘obs’ identifies the OBSERVATION then something like

is_type (obs/subject , 'PARTY_SELF') -- or not is_type (...)

in the WHERE clause is not particularly onerous.

birger.haarbrandt · 14 November 2019 01:06

I hope this is not becoming repetitive but I’m not sure if my point here is fully considered by your answer. The issue is the following: everytime people use a query like this:

select
    e/ehr_id,
    a_a/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude,
    a_a/data[at0002]/events[at0003]/time/value
from EHR e
contains COMPOSITION a
contains OBSERVATION a_a[openEHR-EHR-OBSERVATION.body_temperature.v1]
where a_a/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude>38
AND  a_a/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/units = '°C'
AND e/ehr_id/value MATCHES {
    '849bf097-bd16-44fc-a394-10676284a012',
    '34b2e263-00eb-40b8-88f1-823c87096457'}

there is a risk of not receiving the expected result (I assume that everybody agrees that average developers or analysts would expect the body temperature measurements of THE patient when even the experienced colleagues from Better seem to have made this assumption) and not even knowing that the result set contains data that is not directly about the subject. Hence, I think it is fair to ask if average devs really expect the behavior you described and would be aware that they would additionally have to filter for subject on potentially multiple single entries. In a patient-centered record, average people, that did not work on the spec like you and Seref, might actually be surprised by not only receiving data about the subject from the query above. Putting an info box with a warning into the spec/docs would certainly be a first step (besides implementing Pablo’s suggestion regarding predicates. I was not even aware of this technical problem).

Maybe we need to collect some more real-world experience first from colleagues at Tieto, DIPS to learn if they are aware of this.

heather.leslie · 14 November 2019 06:02

I totally endorse Vebjorn’s comments. Storing data on behalf of another individual is fraught with all sorts of privacy, authorisation issues etc as well.
It is an absolute no brainer from a clinician’s point of view that querying should be able to clearly, safely and unambiguously identify health data contributions to a health record sourced from other individuals.
We definitely don’t want to have to hack this in models.

thomas.beale · 14 November 2019 11:09

Well there’s no technical problem; the query will just return whatever is in the system that conforms to it. If you have Body temps of another subject B in subject A’s EHR that you are querying, you will get subject B data as well. No developer should expect anything else

However, the semantic expectation is another question. It’s actually not that much different from the expectation that when you query for blood pressure and heart rate that you get actual measured BP and HR. However, in many systems today, you could easily pick up target BP or HR as well. This is a question of epistemic status, solved (in the great majority of cases) in openEHR by a) the distinct Entry subtypes and b) archetypes specific to the various epistemic categories.

The question of subject isn’t however an epistemic category, it’s just a contingent fact. The problem though is that we mentally think of everything in subject A’s EHR as being about subject A (including parts, systems, tissue etc). In fact, it contains everything relevant to the care of subject A, which can include foetal heart rates, candidate organ genetic match profile, etc.

No-one is generally surprised by the fact that foetal heart rates are included in the EHR of a pregnant mother, so I am not convinced that the contents of the EHR are contrary to the expectations of clinical professionals. It’s just that many of us often … forget.

We can make documentary additions to the specs to make it clearer, probably a good idea. But in terms of a technical fix, I suggest that @pablo’s type-checking approach I posted above, or something similar is probably the kind of thing to look at. Secondly, I can imagine some preset query ‘modes’ like the one @Seref posted above could be implemented in a generic way, so that if you did something like

SET SUBJECT_SELF_FILTER ON

it either modified the original query to include the type-check to force only PARTY_SELF Entries to be returned, or maybe it re-filters the result set (clearly the latter is not a great idea if you have SUBJECT_SELF_FILTER OFF, in terms of performance). Either way, some technical trick is performed that has the effect of adding this condition into the query, thus always creating the expected answer.

In the above, SET SUBJECT_SELF_FILTER could just be a symbolic name of a configured query filter / modifier that is used a lot, so that it saves the query author some typing. I would consider the true query to be the result of having applied any such filters, i.e. these filters are just a tooling convenience that saves time. If this is not the case, then a bunch of saved queries have to include these filter setting statements as well, to be complete. Not out of the question, but more complicated I would suggest.

Seref · 14 November 2019 11:48

@thomas.beale as I added to my response above, I regret having let that worm out of the can.
The problem with SET SUBJECT_SELF_FILTER ON is that it won’t be sufficient to address all instances of the case Birger is raising. @varntzen already gave another example which is related to ENTRY level. Adopting this approach would lead to lots of flags/variables and boolean flags become a nightmare to change behaviour for complicated cases. This is exactly the situation with SQL server for example.

I’m a strong proponent of pushing things towards the use of functions and keeping the core lang as small as possible as you know. @pablo’s suggestion, with your suggested function based approach is what I’d see as an acceptable solution and I’d prefer that over SET…ON.
If the semantics of execution we’d like to express becomes more complicated, which is bound to happen (again, see @varntzen 's example), we have more flexibility using functions, which we can add more variables to etc etc.

birger.haarbrandt · 14 November 2019 12:20

No-one is generally surprised by the fact that foetal heart rates are included in the EHR of a pregnant mother, so I am not convinced that the contents of the EHR are contrary to the expectations of clinical professionals. It’s just that many of us often … forget.

Thanks Thomas, that’s exactly the point why I think we need to have a safety net that enforces/motivates developers to make concious decision.

I think it is worth taking a look at other domains: of course you can say that C++ is safe to use because all behavior is clearly defined in the specifications. Still we see segmentation faults because C++ allows to shoot yourself in the food (and head). Now we have to think about if openEHR which is applied in safety critical environements actively helps us preventing doing harm to people. From my perspective, patient safety comes before technical consistency. This is one of the reasons why I think that CDA is badly designed with lots of flags in the middle of the paths inside large trees.

Now the question: does the WHERE condition solve the issue? I think it is not sufficient because AQL does not actively motivates its use. I would rather have an explicit statement enforced by the generator of the query to make sure there is a concious decision.

What do you think about including a mandatory predicate like (not sure about the actual syntax but just to explain the idea):

a_a/data[at0002, subject=self]/events[at0003]/data[at0001]/items[at0004]/value/magnitude

Seref · 14 November 2019 12:38

You cannot have patient safety without conceptual consistency. You’d be improving one case while doing harm in another, ironically arriving at inconsistent patient safety due to inconsistent behaviour.

A predicate like this would require aql type checking, because in order to have that mandatory subject constraint in the predicate you’d have to know that that path is indeed an RM object with the subject attribute. So this is an approach above the AQL, which is similar to how flow from Facebook deals with javascript’s dynamic type system. It is conceptually a solution at the tooling/CDR level (if it does the check during parsing), which is what I’ve been suggesting. Unless I’m missing something, it is not your initial suggestion; which is the aql execution changing behaviour based on subject’s type.

Did I get any of the above wrong? Happy to be corrected.

birger.haarbrandt · 14 November 2019 13:05

Hi Seref,

I quote myself:

Hence, I would like to re-open the discussion if subject self should be the default and other paths belonging to entries about the donor (or fetus etc.) are only retrieved when this is stated explicitly

If we had a predicate as suggested that is mandatory, this would somewhat fulfill one of my initial ideas. This is also not too far away from the idea of changing the path, even when technically the predicate is a bit different from that. I argued a bit later that always explicitly stating the subject (may it be within the predicate or a WHERE clause) would bloat the AQL statements. If this is accepted, I would be happy with this solution. Hence, it might be that I did not fully understand your suggestion and our ideas were not too far away. Does this make sense to you?

thomas.beale · 14 November 2019 13:10

Well doing anything like that is a tooling issue; it’s just an alternate shortcut to the SET SUBJECT_SELF_FILER ONor any other such thing.

I don’t disagree with any of your comments about patient safety (at all). But a basic rule of reliable, maintainable software, and tractable data is that you don’t build special cases into general languages / models; you take care of them in another layer dedicated to handling special cases. As soon as we start hacking AQL or the RM or anything else to have special semantics in certain places because of the subjective (at a technical level) importance of that particular model element, we are dead.

Now, having said that, we do actually have a general feature in BMM (and even in UML, via profiling) that enables specific classes and attributes to be classified in certain ways - currently you can classify model attributes in BMM as infrastructure, runtime, or (the default) semantic. These classifiers were originally proposed in order to alert tools as to which things not to allow the archetype author to constraint - e.g. most date/times in the models (only knowable at runtime). The ADL Workbench reads these markers, and other tools could as well. In theory, we could have a classifier that marks certain parts of the model as having relevance to e.g. ‘safety’ or somesuch. Personally, I think this method is far too obscure to achieve the desired effect, just mentioning it for completeness.

thomas.beale · 14 November 2019 13:19

Not sure what you mean here. AQL doesn’t describe how AQL should be used, only what AQL is (or it will when our super-heros @Seref and @pablo and others are done with it

What we are talking about really is that there is nothing in any of the tooling environments that alerts query authors, and one assumes report definers etc, to certain domain-level semantic issues, one being the correct understanding of what ‘subject’ any given query is targetting, and whether that is the intended one.

We can imagine a tool, something like Better’s EhrExplorer, with a friendly query builder, and one of the check boxes is:

□ include data with subject /= self

(or maybe pre-checked with the text ‘subject = self’).
The tool would then do something along the lines of what has already been suggested, and modify the queries being created by the author appropriately.

The whole question of patient safety really is a layer above the mechanics of AQL, the RM or anything else in the data-processing layer. We just have to work out where to handle it.

Having said that, adding some prominent NBs or whatever in certain parts of the RM spec may be a good idea to alert tool implementers to actually think about e.g. ENTRY.subject and Querying.

pablo · 14 November 2019 15:25

On that case, with what we currently have, you need to return the OBSERVATION, so it should be SELECT …, a_a FROM …

Then check a_a.subject on post-processing of the result set. I don’t think there will be any system just displaying on a screen directly what comes from the database without any processing in the middle, there should be a business logic layer between persistence and presentation.

birger.haarbrandt · 14 November 2019 17:37

Not sure what you mean here. AQL doesn’t describe how AQL should be used, only what AQL is (or it will when our super-heros @Seref and @pablo and others are done with it

Well, AQL gives its users a frame to express his or her information need. I consider AQL to be a domain specific language that makes certain assumptions about the Reference Model. For example, AQL does not allow to search on abstract entry classes, we need to be explicit about the Archetype. That’s a decision the designers may have made on purpose. Forcing the user to be explicit about the information need by requiring a predicate is imo just another design decision for the DSL’s syntax.

I have to say that for me it is not really clear if the tool layer is really the one that should fix this issue. If the RM was designed differently regarding the representation of other subjects, this would also affect AQL and potentually prevent the issue by design. Hence, I think this is more about fixing the issue on a different layer without introducing breaking changes to the RM. Though, I have to say that I’m not 100% sure if my thinking is correct.

thomas.beale · 14 November 2019 17:52

Actually, there is nothing preventing AQL searching on abstract ENTRY types; if there were archetypes based directly on ENTRY (we don’t have them AFAIK, but nothing prevents that), it should search all ENTRYs based on that archetype. Whether current implementations are using the machine representation of the RM (i.e. the BMM) to do that properly I don’t know.

In any case, if the FROM part just states ENTRY, with no archetype id, the querying will operate over all ENTRYs, within whatever kind of COMPOSITIONs are stated in the FROM (if any) - again, assuming this is implemented properly.

In general AQL doesn’t know anything specific about any specific RM; but it does need access to a meta-model of the RM so it can determine things like concrete child classes of an abstract class it should look for etc.

W.r.t. the subject issue, I don’t see why you think the RM is broken or should be defined differently - what should it look like?

birger.haarbrandt · 14 November 2019 19:06

Without being able to provide all the details, I think that having paths like this in compositions would make accessing the EHR safer:

[openEHR-EHR-OBSERVATION.body_temperature.v1]/party[subject=‘self’]/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude

Edit: gosh, I think I learned something today. Until 1 second ago, I had the idea in my mind that only the following expression should return a non-empty result in AQL:

a_a/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude

However, I really did not recognize that this is still valid AQL:

a_a/data/events/data/items/value/magnitude

Nice to feel very stupid

thomas.beale · 14 November 2019 21:09

Well that is going to break every path-based processor in the land, since there is no attribute called ‘party’ under class OBSERVATION…

pablo · 14 November 2019 22:39

I don’t like that solution: that rule needs to be added everywhere in the system and repository logic, will need several tweaks, and fragments the EHR information, as commented before: the context (or domain) of AQL is the whole EHR, and doesn’t depend on the type of certain attributes. If we need to check types, the proposed type checking operator would be better and implementation refactoring will be minimal.

I’m talking from the standpoint of someone that needed to implement a query DSL from scratch and did those refactors many times, and adding new operators or functions is always easier than messing with path internal structures.

ian.mcnicoll · 14 November 2019 23:17

Crikey, I forget to switch Discourse notifications on and AQL carnage ensues. I have not quite absorbed all the discussion but I think we can agree that it would be preferable to find away to protect ‘naive’ queries that could be performed in a way that is potentially unsafe. By naive I don’t imply stupidity, just that the underlying system may be using more advance constructs than the querier is used to e.g. mixed parent/child records. This will get worse as the scope and complexity of the records increases.

I’d agree with Heather that we probably should avoid partitioning non-subject data with a specialisation or separate archetype though I have done so on one occasion for a very obscure and potentially risky example of parental medication in the child record - yuck.

There are a few places in the RM where a naive query which ignores a related ‘status’ attribute might mangle the intended result - subject, current_state, careflow_step, but not any others that I can think of and of course, as @Seref says, us modellers can really go to town but we have to take responsibility there. The examples that Birger has highlighted are by-design in the RM - nothing to do with the clinical modelling layer. It is an elegant design and maximises re-use of archetypes but it does have this weakness around querying.

The various arguments for and against different solutions are all well made but I’m starting to come round to Thomas’s idea of a togglable ‘safe-search’ mode, on by default, such that somewhere in the AQL processing chain, subject is required to be PARTY_SELF, unless otherwise specified, or safe_search is off. Similarly for current_state, when retreiving an ACTION

Seref · 15 November 2019 08:01

Ok, I may have misunderstood you because I thought you were suggesting that aql execution changes behaviour and excludes data unlebss subject is explicitly stated
This suggestion of making an attribute mandatory is at least explicit and Sounds like a more promising direction for discussion than the one I thought you were suggesting Thanks for the clarification (on a train and writing on the phone so bad formatting is inevitable)

Seref · 15 November 2019 08:03

Seref:

as I added to my response above, I regret having let that worm out of the can.
The problem with SET SUBJECT_SELF_FILTER ON is that it won’t be sufficient to address all instances of the case Birger is raising. @varntzen already gave another example which is related to ENTRY level. Adopting this approach would lead to lots of flags/variables and boolean flags become a nightmare to change behaviour for complicated cases. This is exactly the situation with SQL server for example.

I’m a strong proponent of pushing things towards the use of functions and keeping the core lang as small as possible as you know. @pablo’s suggestion, with your suggested function based approach is what I’d see as an acceptable solution and I’d prefer that over SET…ON.
If the semantics of execution we’d like to express becomes more complicated, which is bound to happen (again, see @varntzen 's example), we have more flexibility using functions, which we can add more variables to etc etc

@ian.mcnicoll see above please

ian.mcnicoll · 15 November 2019 09:04

Thanks @Seref,

I did read it but have not fully absorbed all of the suggestions/ counter-suggestions. I’ll re-read.

and this is a cat that needed to be let out of the bag, and the worm out of the can… I think it is a great example of how this community. While on a rainy walk, it occurs to me that this is actually part of a wider question about potentially limit ing the query traversal. AQL gives us great power but we do need to limit its ability to dig out data, whether for semantic reasons (subject), privacy rules or Bjorn’s ‘report’ flag on compositions to indicate that these should not normally be found by queries.

So I’ll re-read the various comments but here is another suggestion - why do we not consider adding some kind of RM atttribute ?? on PATHABLE that says ‘do not traverse in queries’ - with perhaps a few flavours. It would be up to the system designer to set that attribute on compositions, gnerally aided by tooling and/or system defaults, but capable of being overriden as per Thomas’s example. That moves the problem to the system design space, outside of AQL.