Cyclic datatypes: OpenEHR virus

pablo · 14 May 2014 01:04

Hi Bert,

Why should the validator need to continue traversing the instance?

Hi Pablo, because the attributes often contain complex openEHR datatypes too, so the validator needs to check those complex datatypes in the attributes as well, and those datatypes can in turn contain complex datatypes. In the case of this example, Dv_Text matches {*}, you’ll need to check everything, every structure, until you reach the leaf nodes, which in this example can be anything. Only then can you be sure that the data set is openEHR compliant.

That was my point. The validation that needs to reach the leaf nodes is not the archetype validation but the IM structure validation. That has nothing to do with the open constraint {*} in the archetype. In fact, that validation can be done without considering the archetype at all. What I said about using the XSD is just one way of implementing it; you can also do it in code.

The thing is that a DvText can have the attribute mappings, in which you can find the attribute purpose, of type DvCodedText, which again can have an attribute mappings, which again can have an attribute purpose, and so on.

I got it

So the leaf node can be far away, with the data still compliant with the statement DvText matches {*}, and a 100% compliant validator will need to follow all these steps. Of course this is not a normal situation, but it can happen. As said, we cannot always control incoming data sets. There may be buggy software in the ecosystem where a kernel runs.

That really depends on the implementation. Let’s say the system doesn’t control the input, so you can receive anything, for example binary data where you expect a dv_quantity. In that case, what I implicitly proposed is a two-phase validator: first syntactic (against the IM; yes, we need to reach the leaf nodes here!), then semantic (IMO we can prune the validation when we reach things like {*}). If the first phase returns invalid, there is no need to execute the second. If you do execute the second, you will never run into infinite recursion, because of the pruning.
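A minimal sketch of the two-phase idea described above, not anyone's actual implementation. It assumes the data has been parsed into a finite tree of plain dicts and lists (as from XML or JSON), and the tiny `RM` and `SUBTYPES` tables, the `_type` key and all function names are hypothetical stand-ins for the real reference model. Phase 1 walks to the leaves with an explicit stack, so deep nesting cannot overflow the call stack; phase 2 stops descending at the open constraint.

```python
# Hypothetical, heavily simplified reference model: type -> {attribute: expected type}.
RM = {
    "DV_TEXT": {"value": "str", "mappings": "TERM_MAPPING"},
    "DV_CODED_TEXT": {"value": "str", "mappings": "TERM_MAPPING"},
    "TERM_MAPPING": {"purpose": "DV_CODED_TEXT"},
}
# Hypothetical subtype table: a DV_CODED_TEXT may appear where a DV_TEXT is expected.
SUBTYPES = {"DV_TEXT": {"DV_TEXT", "DV_CODED_TEXT"}}
ANY = "*"  # stands in for the open constraint {*}


def conforms(actual, expected):
    """True if `actual` is the expected type or one of its listed subtypes."""
    return isinstance(actual, str) and actual in SUBTYPES.get(expected, {expected})


def validate_rm(node, expected):
    """Phase 1: check the whole instance against the RM, down to the leaves."""
    stack = [(node, expected)]  # explicit stack instead of recursion
    while stack:
        item, exp = stack.pop()
        if exp == "str":
            if not isinstance(item, str):
                return False
            continue
        if not isinstance(item, dict) or not conforms(item.get("_type"), exp):
            return False
        attrs = RM[item["_type"]]
        for name, value in item.items():
            if name == "_type":
                continue
            if name not in attrs:
                return False  # attribute unknown to the (simplified) RM
            children = value if isinstance(value, list) else [value]
            stack.extend((child, attrs[name]) for child in children)
    return True


def validate_archetype(node, constraint):
    """Phase 2: check archetype constraints, pruning at the open constraint."""
    if constraint == ANY:
        return True  # nothing is constrained below here, so do not descend
    if callable(constraint):
        return constraint(node)  # leaf constraint, e.g. a string check
    if not isinstance(node, dict) or not conforms(node.get("_type"), constraint["_type"]):
        return False
    for name, sub in constraint["attributes"].items():
        if name not in node:
            return False
        values = node[name] if isinstance(node[name], list) else [node[name]]
        if not all(validate_archetype(v, sub) for v in values):
            return False
    return True


def validate(node, rm_type, constraint):
    """Run phase 2 only if phase 1 succeeded."""
    return validate_rm(node, rm_type) and validate_archetype(node, constraint)
```

In this sketch the recursion in phase 2 is bounded by the depth of the constraint tree, not by the depth of the data, which is the pruning argument made above. Required-attribute checks and cardinalities are left out.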

Sorry, maybe I can’t explain myself clearly; it is difficult to show this in email. Maybe others can confirm or refute this.

To be safe, and with feasibility in mind, a validator would need to stop validating at some arbitrary point even though there is no error. So a validator that follows the rules 100% is dangerous! It can crash a system.

With a two-phase validator, I don’t know whether there is any case where you either fall short of 100% coverage and accept invalid data as valid, or cover 100% and end with a stack overflow. Finding one counterexample would be enough to invalidate my proposal.

That was my point.

You are right in your statement, that when a part of an archetype is wildcarded, the XSD is the place where to find the validation rules.

Maybe the problem is trying to validate against the archetype first and then against the IM. I think it should be IM first and AM second. But of course, I may have overlooked some pathological case, and this might not work in 100% of the cases.

Another thing that might be helpful is not to use archetypes directly but to use OPTs. I learned that the hard way. OPTs can contain the whole structure and constraints of specific compositions. So if someone specifies DV_TEXT in the OPT, my interpretation is that they don’t need a DV_CODED_TEXT there. Also, an OPT is all in one file, while with archetypes you have to deal with slots (argghhhh). In fact, right now I’m changing all my systems to add OPT support. Simpler to validate, simpler to query.

Cheers,
Pablo.

Best regards
Bert

system · 14 May 2014 09:05

I do not agree with this. An archetype constrains the structure of a dataset, and it constrains the contents of a leaf node. The archetype constrains, IMHO, which class attributes are to be used, and what and where they will lead to. Of course the archetype is modeled inside the boundaries of the reference model (possibly expressed in XSD), so that is always something that has to fit. If there are no constraints in an archetype (wildcarded), then everything that is possible in the reference model is legal in a dataset.

There is a lot of validation done by the libraries you probably use. For example, if you import XML, Xerces or other libraries check that the XML contains no errors. If it is a character stream, libraries check for illegal binary codes; many, many checks are done inside libraries. And before and after your libraries, the operating system does a lot of checking too, for example that you do nothing illegal with memory, and you have virus scanners that sit on your network stack. I guess half of the code on a computer, on all levels, does nothing but check for errors. The problem with this situation is that nothing illegal happens: there is no error in the dataset. The problem can occur inside a dataset which is fully compliant with the Reference Model.

I was thinking about something like that, but I could not imagine a pattern which would handle this. Maybe you can give us some pointers. As I said, the dataset which can cause the problem is fully openEHR compliant; there are no invalid data involved. If there were, it would be easy to handle.

I don’t know why you call it pathological, as if it has to do with humans, something with freaks who want to crash systems. That can surely be the case, but more likely is a system which has a bug and creates a faulty dataset. A “pathological” dataset can be the result of a buggy system. And you know Murphy: if a problem CAN occur, it WILL occur.

This does not conform to the specs. If a DV_TEXT is defined in an archetype, a DV_CODED_TEXT is legal in the dataset. Inheritance rules are valid in datasets.

I see that you dislike slots. That is another discussion, but in my view slots are the best way to make a data definition flexible and extensible. But let’s not distract from the original discussion theme. Maybe you can work this point out in a new thread. I would welcome a discussion about slots.

Best regards
Bert

thomas.beale · 14 May 2014 09:49

Both of these statements are correct, I guess, but I would suggest that what Pablo is saying is usable at a practical level. We could in theory (and I remember thinking about doing this 10 eras ago) add an invariant to the DV_TEXT class of the form:

    context DV_TEXT mappings.for_all (m: TERM_MAPPING | m.purpose /= self)

This invariant would be expressible in e.g. OCL and maybe even Schematron, and could be evaluated completely independently from the archetype-level validation. Admittedly these kinds of checks are not routinely built into programming languages or XML, so you have to go to some effort to implement them. But it is certainly doable, and I would say this kind of check belongs in a small library of data validation checks that are executed in various passes over the data, with the archetype checks coming later than the more basic checks.

I would recommend a 2 or 3 pass validator. It’s tempting to try to do everything in one pass, and it may be more efficient, but it’s much harder to get the logic right.

- thomas
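As an illustration only, the invariant above could be checked on in-memory objects roughly as follows. The class and method names are hypothetical, and reading `/=` as "is not the same object" is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False: compare by identity, never by deep equality
class TermMapping:
    purpose: Optional["DvCodedText"] = None


@dataclass(eq=False)
class DvText:
    value: str
    mappings: List[TermMapping] = field(default_factory=list)

    def invariant_holds(self) -> bool:
        """mappings.for_all (m | m.purpose /= self), read as object identity."""
        return all(m.purpose is not self for m in self.mappings)


@dataclass(eq=False)
class DvCodedText(DvText):
    pass
```

Note that this only rejects a mapping whose purpose is the very object that owns the mapping. A long but finite chain of distinct objects still satisfies it, so a check like this complements a depth or pass strategy and does not replace it.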

Tim_Cook · 14 May 2014 10:20

Precisely. This is why I said before that it is an implementation level
issue. Especially in openEHR or anywhere that you are using a DSL and
there are not a existing tools to choose from that have been tested across
thousands of use cases.

The implementation programming language, operating system and platform will
have an impact on your decisions about allowed recursion depth. Take a
look at a Google search for "patterns for data validation".

system · 14 May 2014 10:41

I have a one-phase validator, and because an archetype is strictly hierarchical, it is easy to go from the top down to the leaf nodes and, at every CAttribute or CObject, validate the constraints. I have no problem with this, except for some things like the one we are discussing, which I solved by using a recursion counter that starts counting as soon as a CComplexObject has no attributes in the AOM (then it is wildcarded).
But that is an arbitrary solution. It works, but it gives an unpleasant feeling because, in fact, it breaks the logic.
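The recursion counter described above might look roughly like this. It is a sketch under assumptions: the data below the wildcarded node is taken to be plain dicts and lists, and `MAX_WILDCARD_DEPTH` and `within_depth` are hypothetical names, with the limit chosen arbitrarily.

```python
MAX_WILDCARD_DEPTH = 50  # arbitrary cut-off; a sensible value depends on the platform


def within_depth(node, limit=MAX_WILDCARD_DEPTH, depth=0):
    """Return False once the nesting below a wildcarded node exceeds `limit`."""
    if depth > limit:
        return False  # give up here, although nothing is formally invalid
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return True  # leaf value
    return all(within_depth(child, limit, depth + 1) for child in children)
```

The counter would start at the first CComplexObject without attributes in the AOM, so data under constrained nodes is not affected. Because the walk stops at the limit, the interpreter's own recursion depth is never approached as long as the limit is small.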

So, just for learning. To get rid of that unpleasant feeling, which phases would you distinguish in validating a dataset?

system · 14 May 2014 10:56

Precisely. This is why I said before that it is an implementation level issue. Especially in openEHR or anywhere that you are using a DSL and there are not a existing tools to choose from that have been tested across thousands of use cases.

I guess, you wanted to say: "there are existing tools to choose from that have been tested across thousands of use cases" (without "not a")

In that case, I would be interested in an example use case, one of the thousands, and the corresponding tools. I will be happy to learn from you.

The implementation programming language, operating system and platform will have an impact on your decisions about allowed recursion depth.

So you advise stopping the recursion at an arbitrary point? In that case we have the same strategy, as I have already written a few times in the last days.

Take a look at a Google search for "patterns for data validation".

Sorry Tim, that is too easy an answer for a man of your qualities.
That is not useful advice; everyone, even children, knows you can google something.

Why do you bother typing this?

Bert

Tim_Cook · 14 May 2014 12:04

It is not an answer to anything. It is an illustration of how varied the
approaches can be based on the implementation situation.

system · 14 May 2014 12:15

I tried googling it, of course. I always try to find an answer by myself before discussing it on a mailing list.
I got 6 million hits for the search you proposed, and the first 30 were not very useful.

I would like to see an approach to the one specific problem I discussed under this subject line,
and not the approach that I already proposed and explained myself a few times over the last few days.

Can you help me there?

Thanks,
Bert

Tim_Cook · 14 May 2014 14:01

I cannot.

system · 14 May 2014 14:49

Dear Tim,

I don't think that MLHIM has a good base; I explained to you why, in an open discussion, more than once.
There is no need to be angry at me for that; it is just my opinion.
Don't take it so seriously. It is not important what I think about MLHIM.

So please stop breaking into subjects I start with the purpose of giving them a bullshit distraction.

It hinders me. For me, these discussions are serious matters; I do this for a living.

I don't walk in your way. I never do that. Please don't walk in my way.
Just ignore me, is that so hard?
Try again.

Best regards
Bert Verhees

Tim_Cook · 14 May 2014 15:53

Well, excuse the #$% out of me. I didn't know this was Bert's Q&A list. I
thought it was a discussion list for openEHR related technical issues.

My comments were certainly appropriate for the topic of data validation.
I did not mention MLHIM or even XML.

I frankly do not care what you think about MLHIM. Though you are free to
express your opinions.

I think that *I* am not the angry one here.

Kind Regards,
Tim

pablo · 14 May 2014 21:51

Just for fun: http://en.wikipedia.org/wiki/Pathological_(mathematics)

Maybe the problem is trying to validate against the archetype first and then against the IM. I think it should be IM first and AM second. But of course, I may have overlooked some pathological case, and this might not work in 100% of the cases.

I don’t know why you call it pathological, as if it has to do with humans, something with freaks who want to crash systems, it can surely be the case.
But more likely is a system which has a bug, and creates a faulty dataset. A “pathological” dataset can be the result of a buggy system.

system · 15 May 2014 06:20

Especially this section: “awareness of pathological inputs is important, as they can be exploited to mount a denial-of-service attack on a computer system”

pablo · 16 May 2014 01:55

I mentioned the phases, several times, in my previous messages

Maybe Thomas can break that up into more phases.

thomas.beale · 16 May 2014 07:16

I think Pablo has summarised some useful things:

system · 16 May 2014 08:28

On 16-05-14 09:16, Thomas Beale wrote:

I agree that there is no standard answer, and there never will be one. There will always be technical progress. That keeps us developers at work. I wish more people would discuss how they work out the standards technically, so we can learn.

----

I am not familiar with the term “OPT”. I assume this is opt-out. As in the example Pablo gave, if someone has a DV_TEXT in an archetype, he does not want a DV_CODED_TEXT at that point. I agree partly with this, but only if the DV_TEXT is not wildcarded. If it has an attribute mentioned in the archetype, DV_TEXT matches { value matches {*} }, then nothing can come there but a value attribute which is a constrained string, in this case without constraints (and, if applicable, other required attributes as well). But if the archetype has DV_TEXT matches {*}, then every attribute is allowed, and it is allowed to use types derived from DV_TEXT, such as DV_CODED_TEXT.

I had this discussion some years ago; at that time you agreed that inheritance is allowed according to the standard. Divergence from the theoretical standard for technical or practical reasons makes code in the context of that standard less flexible and less extensible, and once code has diverged it leads to more divergence. It may even lead to another standard on new premises, of which I know an example. But as they say in Fawlty Towers: Never mention the war.

If I misunderstood the term OPT, please forgive me; it was not with rhetorical intention.

----

The RM-validation pass is very easy in code. Just check it against the RM XSD and let it roll. Maybe we are not aware of it, because everything happens in a library. There may be a performance issue: is it more efficient to check beforehand what you are checking afterwards anyway? Also, you do not detect all pathological structures, because the one I mentioned is perfectly legal in the RM when DV_TEXT matches {*} is in the archetype. That makes this example dangerous: it cannot be detected by a basic RM check. But you can find other problems and have a quick way out, that is true. The penalty falls on data sets which are valid: they need more processor time to get accepted.

----

I don’t understand what is meant by “archetype (OPT) pass”, so I cannot comment on that.

----

The one-pass situation: the logical path through an archetype is very hierarchical and very easy. It is a kind of classical visitor pattern which is followed but, in my case, without unnecessary formalities. I have rewritten the AOM validation interpretation three times, every time to create an XML validation: first for XML Schema, then for RelaxNG, both combined with Schematron. XML Schema and RelaxNG have shortcomings which cause trouble in relation to the features of the AOM/ADL. Some developers have written workarounds for that, but that is divergence from the standards. Now I have rewritten it for Schematron only. But the base structure remained the same, just the simple one-pass validation. Schematron seems to be the best way for now. The asserts are sorted by context, so all tests for a specific node are done in one group. This avoids repeated retrieval of nodes by the XML interpreter at validation time.

By the way, I have a basic RM check in my validator. I have converted the XSDs to Schematron, which is not hard to do. I use it to check for the existence of required attributes which are not in the archetype, and to check for valid inheritance. But this basic RM validator runs in the same one-pass validation.

Thanks.
Bert
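Grouping asserts by context, as described above, could be sketched like this. It is not Bert's generator: the function name and the idea of feeding it (context, test, message) tuples are assumptions made for illustration, and any XPath strings passed in would be made up by the caller. Only the Schematron element names (`pattern`, `rule`, `assert`) and the ISO Schematron namespace are taken as given.

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"  # ISO Schematron namespace


def build_pattern(checks):
    """Build one <pattern> with a single <rule> per distinct context.

    `checks` is an iterable of (context_xpath, test_xpath, message) tuples.
    """
    grouped = {}  # context -> list of (test, message), in first-seen order
    for context, test, message in checks:
        grouped.setdefault(context, []).append((test, message))

    pattern = ET.Element(f"{{{SCH_NS}}}pattern")
    for context, asserts in grouped.items():
        rule = ET.SubElement(pattern, f"{{{SCH_NS}}}rule", {"context": context})
        for test, message in asserts:
            node = ET.SubElement(rule, f"{{{SCH_NS}}}assert", {"test": test})
            node.text = message
    return pattern
```

With one rule per context, all tests for a node sit together, so the context node needs to be located only once per rule.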

system · 16 May 2014 09:03

For those who don't know Fawlty Towers

https://www.youtube.com/watch?v=yfl6Lu3xQW0

thomas.beale · 16 May 2014 10:54

OPT = Operational Template - it's the fully compiled version of a template. See the Template Designer <http://www.openehr.org/downloads/modellingtools> for this - it generates them. Or else you can just do a fully flattened template in the ADL 1.5 workbench. I can provide details on this if you need.

- thomas
