Cyclic datatypes: OpenEHR virus

system · 12 May 2014 14:07

Hi,

I found a peculiarity which causes me some trouble. Not that my trouble is a problem; I can solve it, but not without breaking some rules, and the solution is quite arbitrary.
The solution is to check whether any cyclic recursion is going on and to break off at a certain arbitrary moment. But it is not a nice solution.

How many times do we see an archetype with an ELEMENT with DV_TEXT matches {*} in it?
So, there are almost no constraints at all on that value.

Condition: the validator always needs to check the parents of a node to find the parents' attributes, because the attributes of the parents are legal attributes too.
So, in a non-constrained DV_CODED_TEXT, the attributes of DV_TEXT are valid.

In DV_TEXT, a legal attribute is mappings -> TERM_MAPPING; because there are no constraints defined, it is legal to use the mappings attribute.
In TERM_MAPPING there is the attribute purpose -> DV_CODED_TEXT; DV_CODED_TEXT inherits from DV_TEXT, and there is our cycle.
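
To make the cycle concrete, here is a minimal Python sketch that searches a type graph for it. The `PARENT` and `ATTRIBUTES` tables are a hand-written, heavily simplified subset of the Reference Model, and the function names are hypothetical; nothing here is generated from the real specifications.

```python
# Simplified, hand-written subset of the RM (an assumption for illustration).
PARENT = {"DV_CODED_TEXT": "DV_TEXT"}
ATTRIBUTES = {
    "DV_TEXT": {"value": "String", "mappings": "TERM_MAPPING"},
    "DV_CODED_TEXT": {"defining_code": "CODE_PHRASE"},
    "TERM_MAPPING": {
        "match": "Character",
        "purpose": "DV_CODED_TEXT",
        "target": "CODE_PHRASE",
    },
    "CODE_PHRASE": {"code_string": "String"},
}


def flat_attributes(type_name):
    """Return the type's own attributes plus those inherited from its ancestors."""
    attrs = {}
    current = type_name
    while current is not None:
        for name, target in ATTRIBUTES.get(current, {}).items():
            attrs.setdefault(name, target)
        current = PARENT.get(current)
    return attrs


def find_cycle(start):
    """Depth-first search over attribute types.

    Returns the first attribute path that leads back to a type already
    on the current path, or None if no such path exists.
    """
    def visit(type_name, path, on_path):
        for attr, target in flat_attributes(type_name).items():
            step = path + [f"{type_name}.{attr}"]
            if target in on_path:
                return step + [target]
            found = visit(target, step, on_path | {target})
            if found:
                return found
        return None

    return visit(start, [], {start})


print(find_cycle("DV_TEXT"))
```

On this simplified table the search reports the path DV_TEXT.mappings, TERM_MAPPING.purpose, DV_CODED_TEXT.mappings, which arrives back at TERM_MAPPING.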

So, if this is the case, it is possible to bring any OpenEHR kernel to its knees and crash the system.

This is a situation which can of course be triggered by an evil person.
But more likely by an automated feeding system which breaks no rules.

I wonder, shouldn't it be necessary to have something in the Reference Model to avoid this situation?

Thanks for any suggestion.
Bert

Tim_Cook · 12 May 2014 14:44

It is an implementation issue, not a modelling issue.

yampeku · 12 May 2014 14:45

Well, isn't that the same case as having nested Sections or
Clusters? I think the way specialization is done in openEHR makes this
a similar issue. However, in your original case I cannot imagine the
use case for the automated feeding system you propose (or how it would work).
I would say that instances have fixed size, so even if you have to
process tons of cycles you can assure the process will end (the
instances are finite).
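
As a rough illustration of the finiteness argument, the Python sketch below walks an instance with an explicit stack instead of recursion, so even a very deep (but finite) instance is processed without a stack overflow. Representing instances as nested dicts with a `_type` key is an assumption made for this example, and `count_nodes` and `nested_text` are hypothetical helper names.

```python
def count_nodes(instance):
    """Count the object nodes of a nested dict/list instance iteratively."""
    stack = [instance]
    visited = 0
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            visited += 1
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
        # Primitive values (strings, numbers) are leaves and are skipped.
    return visited


def nested_text(depth):
    """Build a text value whose mappings/purpose chain repeats `depth` times."""
    node = {"_type": "DV_CODED_TEXT", "value": "leaf"}
    for _ in range(depth):
        mapping = {"_type": "TERM_MAPPING", "purpose": node}
        node = {"_type": "DV_CODED_TEXT", "value": "x", "mappings": [mapping]}
    return node


# 50000 repetitions: one leaf plus two objects per repetition.
print(count_nodes(nested_text(50000)))  # prints 100001
```

The loop ends because every node is pushed once and popped once; the cost is memory and time proportional to the size of the instance, not unbounded recursion.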

But probably I just don't understand well how the supposed malicious
system works.

system · 12 May 2014 14:56

Of course, like I wrote, you can interrupt the data-entry if you suspect recursion. But then you implement against the rules.

system · 12 May 2014 15:07

Yes, or a CLUSTER, which has the items attribute of type ITEM, which again allows a CLUSTER.

But you don't often see CLUSTER matches {*}, whereas you see DV_TEXT matches {*} a lot.

In my original DV_TEXT case there isn't really a legitimate use case; it would be a feeding system with an error.
The problem is: when and how does the receiving kernel automatically decide that it is an error?

In your example it is even harder for the receiving kernel to decide when to interrupt, because nesting can sometimes legitimately go deep. As said, that situation will not occur often in archetypes, but it is legal.

Anyway, interrupting is against the rules of the Reference Model.

I found the problem in a test case in which I was experimenting with how to validate a wildcarded DV_TEXT according to the rules.
It turns out that a 100% correct validation of a wildcarded DV_TEXT is not possible, because recursion is permitted.

I think this is a shortcoming in the Reference Model.

Bert

thomas.beale · 12 May 2014 15:25

Hi Bert,

Hi,

I found a peculiarity which causes me some trouble. Not that my trouble is a problem; I can solve it, but not without breaking some rules, and the solution is quite arbitrary.
The solution is to check whether any cyclic recursion is going on and to break off at a certain arbitrary moment. But it is not a nice solution.

How many times do we see an archetype with an ELEMENT with DV_TEXT matches {*} in it?

BTW, in ADL 1.5, this is no longer used, the constraint could just be something like

attr matches {
    DV_TEXT
}

if there are no other constraints at all. The semantics are the same however, so it doesn't change your question.

So, there are almost no constraints at all on that value.

Condition: the validator always needs to check the parents of a node to find the parents' attributes, because the attributes of the parents are legal attributes too.
So, in a non-constrained DV_CODED_TEXT, the attributes of DV_TEXT are valid.

the right way to see this is that the validator needs to be using the (inheritance-)flattened form of the node, as is available in the flattened template.

In DV_TEXT, a legal attribute is mappings -> TERM_MAPPING; because there are no constraints defined, it is legal to use the mappings attribute.
In TERM_MAPPING there is the attribute purpose -> DV_CODED_TEXT; DV_CODED_TEXT inherits from DV_TEXT, and there is our cycle.

I don't see the problem here; the DV_CODED_TEXT of the TERM_MAPPING.purpose is always a different instance from the root DV_TEXT or DV_CODED_TEXT instance. So how can a loop occur? Are you saying that it could occur due to buggy software or data? But that's the same possibility for any data processing framework where a type T has a property p of type V with a property q of type T... I have to agree with Tim here, this is probably just about implementation quality.

- thomas

grahamegrieve · 12 May 2014 15:47

There are usually edge cases in validation where you can crash the stack if you’re not really careful (been there many times). That doesn’t mean that the specs are wrong, but if you can clearly describe the case, it might be worth documenting the risks around it.

Grahame

ANASTASIOU_A · 12 May 2014 16:00

Hello everyone

If it is any more help, here is an earlier discussion on cyclic references:

http://lists.openehr.org/pipermail/openehr-technical_lists.openehr.org/2012q2/007015.html

I think that the ADL 1.5 modifications took care of this discrepancy at the model level as well.

All the best
Athanasios Anastasiou

system · 12 May 2014 21:20

Thomas Beale wrote on 12-5-2014 17:25:

I don't see the problem here; the DV_CODED_TEXT of the TERM_MAPPING.purpose is always a different instance from the root DV_TEXT or DV_CODED_TEXT instance. So how can a loop occur?

What I was doing was writing validation-rules for: DV_TEXT matches{*}
I am working on the final part of software that writes validation rules automatically: archetypes are translated to validation rules. I am doing the last bits, so I came to this item, which I had saved in a TODO list.
I had a stack overflow, and at first I thought it was a bug, but after investigating I found it was as designed.

For this situation I had to write a rule for the attribute mappings, which can be used because there is no constraint.
And I wanted to validate the expression completely, so every possible attribute had to be handled, with occurrences as defined in the XML Schema. The attribute mappings is optional, so it is allowed.
Every attribute that is not a simple type but a complex OpenEHR type needs to be treated the same way (recursively). So in the mappings attribute there is the purpose attribute, which again can have a mappings attribute, which again can have a purpose attribute, and so on.

A data set that looks like that recursive situation would still match the archetype definition.
Even if this repeated ten times inside the data set, it would still match the archetype.

I admit that the problem is a theoretical one, and I suggested an automated feeding system which could create such data, to make it less theoretical.
I have seen systems which go into every thinkable detail and every bit; automated systems sometimes go very deep.

The problem is: how can validation software distinguish erroneous nesting from legitimate nesting?
I don't think that is possible, so the validating software has to stop at a certain arbitrary level of depth.
At a certain point, the validating software will mark a part of a data set as erroneous ("too deeply nested"), even if it still fits inside the archetype.

Then I remembered a teacher from years ago, who said: don't write perfect software, write feasible software.

But OK, thank you all for your replies; I am now convinced that it is not a 100% solvable problem.

best regards
Bert

pablo · 13 May 2014 05:22

If the value is not constrained, the validator should return true without continuing to check in cascading, recursive mode. For this to work as expected, the data structure should be validated before the data validation. The easiest way of validating the structure is to serialize the instance to XML and use XSD.

system · 13 May 2014 06:47

That is the problem: I do not agree. It has to check in cascade, because required properties can be left out, or fantasized properties which make no sense can be put in. In my opinion, every class occurring in a data set needs to be validated against the Reference Model rules if there are no constraints.

By the way, you cannot validate OpenEHR data sets against an archetype by using XSD. You cannot create XSDs according to archetype constraints, not even by hand. I have been there, a few years ago.

Best regards
Bert

system · 13 May 2014 06:59

You promote writing feasible software. I agree that that is the way to go.

Bert

system · 13 May 2014 07:03

Yes, I agree with that, it just happened yesterday that I was wondering if it was something in the specs, but now I agree that specs cannot exclude every danger.

Bert

thomas.beale · 13 May 2014 08:12

Thomas Beale wrote on 12-5-2014 17:25:

I don't see the problem here; the DV_CODED_TEXT of the TERM_MAPPING.purpose is always a different instance from the root DV_TEXT or DV_CODED_TEXT instance. So how can a loop occur?

What I was doing was writing validation-rules for: DV_TEXT matches{*}
I am working on the final part of software that writes validation rules automatically: archetypes are translated to validation rules. I am doing the last bits, so I came to this item, which I had saved in a TODO list.
I had a stack overflow, and at first I thought it was a bug, but after investigating I found it was as designed.

For this situation I had to write a rule for the attribute mappings, which can be used because there is no constraint.
And I wanted to validate the expression completely, so every possible attribute had to be handled, with occurrences as defined in the XML Schema. The attribute mappings is optional, so it is allowed.
Every attribute that is not a simple type but a complex OpenEHR type needs to be treated the same way (recursively). So in the mappings attribute there is the purpose attribute, which again can have a mappings attribute, which again can have a purpose attribute, and so on.

A data set that looks like that recursive situation would still match the archetype definition.
Even if this repeated ten times inside the data set, it would still match the archetype.

this bit is true. Do you have such pathological data? The easiest way I can think of to deal with this would be to add a counter for how many times a particular type has been traversed in any tree descent, and to stop if it goes over some settable limit like 100.
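
A minimal Python sketch of that counter idea follows. It assumes instances are nested dicts carrying a `_type` key; `TooDeeplyNested`, `check_depth` and `nested_text` are hypothetical names, and the default limit of 100 is simply the figure suggested above.

```python
class TooDeeplyNested(Exception):
    """Raised when one type occurs too often on a single descent."""


def check_depth(node, limit=100, counts=None):
    """Descend through a nested dict/list instance.

    `counts` records how often each type occurs on the current
    root-to-node path; the descent stops once any type exceeds `limit`.
    """
    counts = {} if counts is None else counts
    if isinstance(node, list):
        for item in node:
            check_depth(item, limit, counts)
        return
    if not isinstance(node, dict):
        return  # primitive leaf value
    type_name = node.get("_type", "?")
    counts[type_name] = counts.get(type_name, 0) + 1
    try:
        if counts[type_name] > limit:
            raise TooDeeplyNested(f"{type_name} nested more than {limit} times")
        for value in node.values():
            check_depth(value, limit, counts)
    finally:
        counts[type_name] -= 1  # this node is no longer on the path


def nested_text(depth):
    """Build a text value whose mappings/purpose chain repeats `depth` times."""
    node = {"_type": "DV_CODED_TEXT", "value": "leaf"}
    for _ in range(depth):
        mapping = {"_type": "TERM_MAPPING", "purpose": node}
        node = {"_type": "DV_CODED_TEXT", "value": "x", "mappings": [mapping]}
    return node


check_depth(nested_text(10))       # passes silently
try:
    check_depth(nested_text(500))  # 501 DV_CODED_TEXT objects on one path
except TooDeeplyNested as err:
    print("rejected:", err)
```

Because the counter is decremented when a node is left, wide data (many sibling values of the same type) is not penalised; only the number of occurrences along one descent counts. The limit also bounds the recursion depth of the check itself.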

I admit that the problem is a theoretical one, and I suggested an automated feeding system which could create such data, to make it less theoretical.
I have seen systems which go into every thinkable detail and every bit; automated systems sometimes go very deep.

The problem is: how can validation software distinguish erroneous nesting from legitimate nesting?
I don't think that is possible, so the validating software has to stop at a certain arbitrary level of depth.
At a certain point, the validating software will mark a part of a data set as erroneous ("too deeply nested"), even if it still fits inside the archetype.

Agree - set an arbitrary (deep) limit.

Then I remembered a teacher from years ago, who said: don't write perfect software, write feasible software.

But OK, thank you all for your replies; I am now convinced that it is not a 100% solvable problem.

well it's solvable in a heuristic / ad hoc way...

- thomas

yampeku · 13 May 2014 08:15

Just curious, are you using schematron for validation? because you can
always define recursive rules to deal with such cases
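
One possible reading of "recursive rules", sketched in Python rather than Schematron: instead of expanding every complex attribute inline (which never terminates for mappings/purpose), generate one named rule per type and let rules refer to each other by type name. The `FLAT_ATTRIBUTES` table is a simplified, hand-written assumption, and `generate_rules` is a hypothetical name.

```python
# Inheritance-flattened attribute table; a simplified assumption, not the full RM.
FLAT_ATTRIBUTES = {
    "DV_TEXT": {"value": "String", "mappings": "TERM_MAPPING"},
    "DV_CODED_TEXT": {
        "value": "String",
        "mappings": "TERM_MAPPING",
        "defining_code": "CODE_PHRASE",
    },
    "TERM_MAPPING": {
        "match": "Character",
        "purpose": "DV_CODED_TEXT",
        "target": "CODE_PHRASE",
    },
    "CODE_PHRASE": {"code_string": "String"},
}


def generate_rules(root_type):
    """Generate one rule per reachable complex type.

    Each rule maps attribute names to the name of the rule (type) that
    validates the attribute's value, so a cycle in the type graph becomes
    a reference between rules instead of an endless expansion.
    """
    rules = {}
    pending = [root_type]
    while pending:
        type_name = pending.pop()
        if type_name in rules or type_name not in FLAT_ATTRIBUTES:
            continue  # already generated, or a primitive without attributes
        rules[type_name] = dict(FLAT_ATTRIBUTES[type_name])
        pending.extend(rules[type_name].values())
    return rules


for name, rule in sorted(generate_rules("DV_TEXT").items()):
    print(name, "->", rule)
```

Rule generation then always terminates, because the number of types is finite. Applying the rules to data still takes time proportional to the depth of the instance, so a depth limit on incoming data remains a separate decision.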

system · 13 May 2014 08:20

Just curious, are you using schematron for validation? because you can
always define recursive rules to deal with such cases

I did not know that. I will look into it.

Thanks for the tip

Bert

system · 13 May 2014 08:36

In an open system, or a SOA-environment, or when selling it, one does not always have control over data being offered to a system.

A software system lives in an ecosystem of software.

That is why nesting needs to be controlled. I have seen systems which, because of a bug, dive into a recursive loop, spit out data and feed it to another system. If it is an automated process, and it does so in a few separate threads, your OpenEHR kernel can crash; and if it runs on an operating system with a notorious reputation regarding process management, even that can crash.

I have seen server systems which did not even allow logging in, because there was no processor time left to handle the login process.

Thinking about precautions against malformed data entry in software is a good thing.

Bert

thomas.beale · 13 May 2014 08:58

Agree with all this. But you can't code for every possibility in existence, you still have to pick and choose, according to the feasibility criterion

pablo · 13 May 2014 14:48

Hi Bert, I’ll clarify because what you interpreted is not what I tried to say, but we’re on the same page.

Date: Tue, 13 May 2014 08:47:35 +0200
From: bert.verhees@rosa.nl
To: openehr-technical@lists.openehr.org
Subject: Re: Cyclic datatypes: OpenEHR virus

If the value is not constrained, the validator should return true without continuing to check in cascading, recursive mode. For this to work as expected, the data structure should be validated before the data validation. The easiest way of validating the structure is to serialize the instance to XML and use XSD.

That is the problem: I do not agree. It has to check in cascade, because
required properties can be left out, or fantasized properties which make
no sense can be put in. In my opinion, every class occurring in a data
set needs to be validated against the Reference Model rules if there are
no constraints.

What I meant by “structure validation” is validation against the IM (i.e. syntactic validation); when I say “data validation” I mean validation against archetypes (i.e. semantic validation).

If the constraint over a node is “not constrained at all”, then there are no required values defined by the archetype, but there might be required values defined by the information model.

By the way, you cannot validate OpenEHR data sets against an archetype by
using XSD. You cannot create XSDs according to archetype constraints, not
even by hand. I have been there, a few years ago.

The information model can be validated with the XSDs, because the XSDs define the IM constraints.

The XSDs are not for validating against archetypes (totally agree with you); it is the IM validation that validates the structure and some required fields (required by the IM!).

Once you receive a well-formed structure (it should be valid against the IM), you can validate it against archetypes.

If you have already checked the instance against the IM and it is valid, you’ll have all the required values (required by the IM). Then, when validating data (this is the archetype validation!), if you find a {*} constraint, why should the validator need to continue traversing the instance?
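
The two-phase idea can be sketched in Python as follows. This is only an illustration: a small hand-written `RM` table stands in for the XSDs, instances are assumed to be nested dicts with a `_type` key, archetype constraints are reduced to either a wildcard or a dict of attribute constraints, and all names (`rm_valid`, `archetype_valid`, `validate`, `WILDCARD`) are hypothetical.

```python
# Simplified stand-in for the IM/XSD rules (an assumption for illustration).
RM = {
    "DV_TEXT": {"required": {"value"}, "optional": {"mappings"}},
    "DV_CODED_TEXT": {"required": {"value", "defining_code"}, "optional": {"mappings"}},
    "TERM_MAPPING": {"required": {"match", "target"}, "optional": {"purpose"}},
    "CODE_PHRASE": {"required": {"terminology_id", "code_string"}, "optional": set()},
}
WILDCARD = "*"


def rm_valid(instance):
    """Phase 1: check every object in the instance against the RM table.

    Uses an explicit stack, so deep nesting cannot overflow the call stack.
    """
    stack = [instance]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            stack.extend(node)
            continue
        if not isinstance(node, dict):
            continue  # primitive value
        spec = RM.get(node.get("_type"))
        if spec is None:
            return False  # unknown type
        attrs = set(node) - {"_type"}
        if not spec["required"] <= attrs:
            return False  # a required attribute is missing
        if not attrs <= spec["required"] | spec["optional"]:
            return False  # an attribute the RM does not define
        stack.extend(v for k, v in node.items() if k != "_type")
    return True


def archetype_valid(node, constraint):
    """Phase 2: check against the archetype; a wildcard ends the descent."""
    if constraint == WILDCARD:
        return True
    if not isinstance(node, dict):
        return False
    return all(
        attr in node and archetype_valid(node[attr], sub)
        for attr, sub in constraint.items()
    )


def validate(instance, constraint):
    return rm_valid(instance) and archetype_valid(instance, constraint)


print(validate({"_type": "DV_TEXT", "value": "fever"}, WILDCARD))  # prints True
print(validate({"_type": "DV_TEXT"}, WILDCARD))                    # prints False
```

In this sketch the cascade over every object still happens, but only once, in phase 1; phase 2 can stop at a wildcard because nothing below it is constrained by the archetype.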

Hope that helps (or at least makes sense).

Kind regards,
Pablo.

system · 13 May 2014 23:40

why should the validator need to continue traversing the instance?

Hi Pablo, because the attributes often contain complex OpenEhr datatypes too, so the validator needs to check these complex datatypes in the attributes as well, and those datatypes can again contain complex datatypes. In the case of this example, Dv_Text matches {*}, you’ll need to check everything, every structure, until you reach the leaf nodes, which in this example can be anything. Only then can you be sure that the data set is OpenEhr compliant.

The thing is that a DvText can have the attribute mappings, in which one can find the attribute purpose, of type DvCodedText, which again can have an attribute mappings, which can again have an attribute purpose, etc.

So the leaf node can be far away, and the data can still be compliant with the statement DvText matches {*}, and a 100% compliant validator will need to follow all these steps. Of course this is not a normal situation, but it can happen. As said, we cannot always control incoming data sets. There may be buggy software in the ecosystem where a kernel runs.

To be safe, and with feasibility in mind, a validator would need to stop validating at some arbitrary point, although there is no error. So a validator which follows the rules 100% is dangerous! It can crash a system.

That was my point.

You are right in your statement that, when a part of an archetype is wildcarded, the XSD is the place to find the validation rules.

Best regards
Bert
