Regex in Archetypes must include TYPE

Dear All,

The regex locator used to get an archetype from another archetype is too
loose. Given the Archetype ID is something like :

openEHR-EHR-CLUSTER.address.v1

instead of :

archetype_id/value matches {/address\.v1/}

we should have (at the very least):

archetype_id/value matches {/CLUSTER\.address\.v1/}

or (preferably)

archetype_id/value matches {/openEHR-EHR-CLUSTER\.address\.v1/}

A) On its own, address\.v1 does not match the full string, as the
Archetype ID is openEHR-EHR-CLUSTER.address.v1

As such you have to match on anything prior to address. i.e.

openEHR-EHR-CLUSTER.address.v1 matches

but so would

openEHR-EHR-CLUSTER.person_address.v1
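
A quick way to see the over-matching, sketched in Python. The use of an unanchored search is an assumption about how a tool would apply the slot pattern; the two IDs are the ones given above.

```python
import re

# The two archetype IDs from the example above.
archetype_ids = [
    "openEHR-EHR-CLUSTER.address.v1",
    "openEHR-EHR-CLUSTER.person_address.v1",
]

loose = re.compile(r"address\.v1")
full = re.compile(r"openEHR-EHR-CLUSTER\.address\.v1")

# Unanchored search: the loose pattern finds "address.v1" inside both IDs.
loose_hits = [a for a in archetype_ids if loose.search(a)]

# Matching the whole ID against the full pattern selects only the intended one.
full_hits = [a for a in archetype_ids if full.fullmatch(a)]

print(loose_hits)
print(full_hits)
```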

At the moment I am having to add a dot (\.) to each regex to bookend the
start of the string.

e.g.:

    [xslt] Archetype found from pattern
    [xslt] p_pattern = address\.v1
    [xslt] v_Dotpattern = \.address\.v1
    [xslt] Archetype name = openEHR-EHR-CLUSTER.address.v1

This is both dangerous and pointless as you either have a valid/useable
regex or you don't.

B) As a result I am now having to split bar-separated regexes such as:

checklist_item-general-cvs1.v1|checklist_item-general-cvs2.v1|checklist_item-general-cvs3.v2|checklist_item-general-cvs4.v2draft|checklist_item-general.v1|checklist_item-general.v2|checklist_item-general.v3

i.e. I am having to pre-process the regex instead of being able to apply
it directly as a regex.

C) It is long term dangerous anyway as the only thing guaranteeing
uniqueness of ID/Name is the filename in a given folder. This is fine
until you consider folders such as "structure" where you could have
ITEM_TREE.adam.v1 & SINGLE,LIST, or TABLE.adam.v1 ......

So if in the Structure folder then... /adam\.v1/ could possibly be
ambiguous, i.e. if I am adding a "structure" archetype & I am shown a
choice of adam.v1, adam.v1, adam.v1 or adam.v1, which one do I want/should
I add?
Seeing ITEM_TREE.adam.v1, SINGLE.adam.v1 etc would fix that potential
problem.
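
To illustrate the ambiguity, a small Python sketch. The four IDs are hypothetical and assume the structure types are named ITEM_TREE, ITEM_SINGLE, ITEM_LIST and ITEM_TABLE.

```python
import re

# Hypothetical IDs: one "adam.v1" archetype per structure type.
structure_ids = [
    "openEHR-EHR-ITEM_TREE.adam.v1",
    "openEHR-EHR-ITEM_SINGLE.adam.v1",
    "openEHR-EHR-ITEM_LIST.adam.v1",
    "openEHR-EHR-ITEM_TABLE.adam.v1",
]

# The short pattern cannot tell the four apart.
short_hits = [a for a in structure_ids if re.search(r"adam\.v1", a)]

# Including the RM type narrows the result to exactly one archetype.
typed_hits = [a for a in structure_ids if re.search(r"ITEM_TREE\.adam\.v1", a)]

print(len(short_hits), typed_hits)
```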

D) Given the Archetype ID/Filename is openEHR-EHR-CLUSTER.address.v1
then what is saved by not showing at the very least CLUSTER.address.v1
in the drop down list & putting in a regex of: "CLUSTER\.address\.v1" ?

E) Going for a more exact regex in the first place then allows one to be
more exact even in the looser cases e.g.

CLUSTER\.address\.v

or where you want to say "any CLUSTER exam".

F) If other sorts of Regex expression are required for use then I am not
prepared to keep adding to the "regex pre-processor" to in effect create
another regex processor.

Either the regex is correct or it's not. If it's not then don't use a
regex.

Summary / Conclusion

1) I would like to see the Archetype designer creating regexes which
include at the very least the RM_TYPE part of the name, e.g. CLUSTER or
ITEM_TREE.

Preferably it should be the entire Archetype ID as that is the actual
string being matched.

This should be a design time thing and not a publishing problem.

This will allow the correct use of regex for model location.

Either that or change all the Archetype ID/Filenames from :

openEHR-EHR-CLUSTER.address.v1

to

address.v1

Adam

Hi Adam

I know Tom has talked to you about this. I was involved in the original discussions about this many years ago and many points of view were expressed. The issue is that, as you know, the CLASS is specified in the slot definition and there is a regex for the remainder of the ID. What we have been doing is setting the regex to:

openEHR-EHR-CLASS_NAME\.REGEX_EXPRESSION

This seems very safe. There are problems if we have the regex including the class when it is already specified as they will have to be the same. I would be quite happy to just use a regex for the whole id constraint, or leave it as it is. Just including the class does not help much as far as I can see, and having both and a potential bug (different class specified) is also a problem.

Can you pick up the class name from the slot and append it to ‘openEHR-EHR-’ and the regex to get the full statement? This can be explicit in the spec. Or let's have a suggestion to change to a full regex so we get it completed that way.

Cheers, Sam


I would definitely say that it should be a full regex as:

(A) Otherwise it's not a regex, it's a partial regex & then that causes
problems further down e.g.

checklist_item-general-cvs1.v1|checklist_item-general-cvs2.v1|checklist_item-general-cvs3.v2|checklist_item-general-cvs4.v2draft|checklist_item-general.v1|checklist_item-general.v2|checklist_item-general.v3

Is a pseudo regex & thus it needs to be split by a "regex pre-processor"
& then each sub statement needs to have the "openEHR-EHR-CLASS_NAME"
appended to it & then put through the regex engine.

i.e. either have a regex or don't. A pseudo regex means creating an
entire "pseudo-regex" processor which is crazy & for what?

Already your own HTML XSLT fails for precisely this reason as you get:

Include entries
openEHR-EHR-CLUSTER.checklist_item-general-cvs1.v1|checklist_item-general-cvs2.v1|checklist_item-general-cvs3.v2|checklist_item-general-cvs4.v2draft|checklist_item-general.v1|checklist_item-general.v2|checklist_item-general.v3
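
A short Python sketch of why the naively prefixed string above misbehaves as a regex. Only two of the alternatives are used, with dots escaped for the sketch, and the ITEM_TREE ID is hypothetical.

```python
import re

# Two of the alternatives from the slot above (dots escaped for the sketch).
slot_pattern = r"checklist_item-general-cvs1\.v1|checklist_item-general\.v2"

# Naive prefixing: the prefix binds only to the first alternative.
naive = re.compile(r"openEHR-EHR-CLUSTER\." + slot_pattern)

wanted = "openEHR-EHR-CLUSTER.checklist_item-general.v2"
wrong_class = "openEHR-EHR-ITEM_TREE.checklist_item-general.v2"  # hypothetical ID

# Whole-ID matching rejects an archetype the slot was meant to allow ...
rejected_when_anchored = naive.fullmatch(wanted) is None

# ... while unanchored searching accepts an ID of the wrong class.
accepted_wrong_class = naive.search(wrong_class) is not None

print(rejected_when_anchored, accepted_wrong_class)
```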

B) You are then asking for repetitive code in every implementation, thus
introducing the possibility of bugs, again for no good reason.

I repeat...:

If you want to use a regex then use a regex which is useable as a regex.
At present it is not & for no good reason.

i.e. saying "take the pseudo-regex & append xyz to it to create the real
regex" is both error prone & means that you can't actually use the regex
as a regex.

Adam

+1

--Tim

Hi Adam

I take this point and in that case I would suggest that the resulting issue to discuss is:

Should we drop the class name from the Archetype Slot in ADL and just use the regex? There does not appear to be any reason in the AOM to include the class name. We do need the occurrences for the slot.

allow_archetype CLUSTER occurrences matches {0..5} matches {
    include
        archetype_id/value matches {/exam\.v1|exam-uterus\.v1|exam-fetus\.v1/}

might become:

allow_archetype occurrences matches {0..5} matches {
    include
        archetype_id/value matches {/openEHR-EHR-CLUSTER\.exam\.v1|openEHR-EHR-CLUSTER\.exam-uterus\.v1|openEHR-EHR-CLUSTER\.exam-fetus\.v1/}

This would have advantages in controlling ordering of included archetypes of mixed classes.

Interested in others' views.

Cheers, Sam


no - this is definitely wrong. The class name is always needed in all ADL object blocks. There is no reason to drop it. Why would we do that? That would be rewriting the formalism.


Hi Thomas

I had a look at the AOM and did not see anything, just include and exclude statements - didn’t read the ADL spec. The point here is that we could have a slot that allowed different classes which would simplify things for the archetype authors.

Could we have a slot that allows two different classes?

Cheers, Sam

Thomas Beale wrote:


Sam Heard wrote:

Hi Thomas

I had a look at the AOM and did not see anything, just include and
exclude statements - didn't read the ADL spec. The point here is that
we could have a slot that allowed different classes which would
simplify things for the archetype authors.

in the AOM all C_OBJECTs have the rm_type_name attribute.

Could we have a slot that allows two different classes?

We already have that: you can put more than one object constraint in a
slot, either as alternatives for a single-valued attribute or as
multiple co-existing items under a multiply-valued attribute.

- thomas

OK - brilliant. An example of how to add more than one class in a slot…?

Cheers, Sam

Thomas Beale wrote:


Sam Heard wrote:

... What we have been doing is setting the regex to:

openEHR-EHR-CLASS_NAME\.REGEX_EXPRESSION

This is not quite right because REGEX_EXPRESSION might contain patterns for
multiple concepts, as Adam mentioned. You would have to wrap it up in
parentheses:

openEHR-EHR-CLASS_NAME\.(REGEX_EXPRESSION)

Then it should work.

Adam Flinton wrote:

... thus it needs to be split by a "regex pre-processor"
& then each sub statement needs to have the
"openEHR-EHR-CLASS_NAME"
appended to it & then put through the regex engine.

No need to do that, Adam, just wrap the regex within parentheses. So taking
the example you gave, the correctly-wrapped regex would be:

Include entries
openEHR-EHR-CLUSTER\.(checklist_item-general-cvs1.v1|checklist_item-general-cvs2.v1|checklist_item-general-cvs3.v2|checklist_item-general-cvs4.v2draft|checklist_item-general.v1|checklist_item-general.v2|checklist_item-general.v3)

The code to generate this is trivial in any programming language (assuming
your programming language has the ability to concatenate strings, which I
reckon is a safe bet). Unless you're programming in assembly language, it's
probably one simple line of code.
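
For illustration, a minimal Python sketch of that concatenation. The helper name `full_slot_regex` is hypothetical, and only two of the alternatives are used, with dots escaped.

```python
import re

def full_slot_regex(rm_type: str, slot_pattern: str) -> str:
    """Hypothetical helper: prefix a short slot pattern with the class part."""
    # The parentheses keep the prefix attached to every alternative.
    return "openEHR-EHR-" + re.escape(rm_type) + r"\.(" + slot_pattern + ")"

pattern = full_slot_regex(
    "CLUSTER",
    r"checklist_item-general-cvs1\.v1|checklist_item-general\.v2",
)

# Every alternative now requires the full CLUSTER prefix.
ok = re.fullmatch(pattern, "openEHR-EHR-CLUSTER.checklist_item-general.v2")
bad = re.fullmatch(pattern, "openEHR-EHR-ITEM_TREE.checklist_item-general.v2")

print(pattern)
print(ok is not None, bad is None)
```

Note that this still assembles the regex at the point of use, so every implementation would need the same step.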

- Peter

Sam Heard wrote:

Hi Adam

I take this point and in that case I would suggest that the resulting issue to discuss is:

Should we drop the class name from the Archetype Slot in ADL and just use the regex? There does not appear to be any reason in the AOM to include the class name. We do need the occurrences for the slot.

allow_archetype CLUSTER occurrences matches {0..5} matches {
                                        include
                                            archetype_id/value matches {/exam\.v1|exam-uterus\.v1|exam-fetus\.v1/}

might become:

allow_archetype occurrences matches {0..5} matches {
                                       include
                                            archetype_id/value matches {/openEHR-EHR-CLUSTER\.exam\.v1|openEHR-EHR-CLUSTER\.exam-uterus\.v1|openEHR-EHR-CLUSTER\.exam-fetus\.v1/}

This would have advantages in controlling ordering of included archetypes of mixed classes.

Interested in others' views.

Cheers, Sam

That would be fine by me.

That would allow me to drop my "regex pre-processor" which would be nice & would give me some peace of mind wrt people using regex.

Adam

Thomas Beale wrote:


no - this is definitely wrong. The class name is always needed in all
ADL object blocks. There is no reason to drop it. Why would we do
that? That would be rewriting the formalism.

Either way is fine by me as the bit I care about is:

"archetype_id/value matches
{/openEHR-EHR-CLUSTER\.exam\.v1|openEHR-EHR-CLUSTER\.exam-uterus\.v1|openEHR-EHR-CLUSTER\.exam-fetus\.v1/}"

Adam

Peter Gummer wrote:

Sam Heard wrote:
  

... What we have been doing is setting the regex to:

openEHR-EHR-CLASS_NAME\.REGEX_EXPRESSION
      
This is not quite right because REGEX_EXPRESSION might contain patterns for
multiple concepts, as Adam mentioned. You would have to wrap it up in
parentheses:

openEHR-EHR-CLASS_NAME\.(REGEX_EXPRESSION)

Then it should work.

Adam Flinton wrote:
  

... thus it needs to be split by a "regex pre-processor"
& then each sub statement needs to have the
"openEHR-EHR-CLASS_NAME"
appended to it & then put through the regex engine.
    
No need to do that, Adam, just wrap the regex within parentheses. So taking
the example you gave, the correctly-wrapped regex would be:

Include entries
openEHR-EHR-CLUSTER\.(checklist_item-general-cvs1.v1|checklist_item-general-cvs2.v1|checklist_item-general-cvs3.v2|checklist_item-general-cvs4.v2draft|checklist_item-general.v1|checklist_item-general.v2|checklist_item-general.v3)

The code to generate this is trivial in any programming language (assuming
your programming language has the ability to concatenate strings, which I
reckon is a safe bet). Unless you're programming in assembly language, it's
probably one simple line of code.

Hum.

I agree that that would work but it's still wrong in principle IMHO i.e.
then the regex is still a pseudo/meta regex i.e. it requires processing
to turn it into a valid regex.

This would have to be duplicated in all the different implementations
etc.etc.

Adam


All,

I also agree with Adam. A regex should be able to be used over a
population of strings (identifiers in this case) and have the effect of
filtering out what you want. Having to put the regex together first is
inviting problems - some implementations will forget, others will do it
wrongly, the specifications of how to do it will change....

Practically speaking this does not change the specifications, but I
suspect we should put some guidance in to the effect that regexes based
on full identifiers should be used in archetype slots.
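
As a sketch of what "used over a population of strings" could look like in practice, here is a small Python filter. The helper name `allowed_archetypes` and the repository contents are assumptions for illustration; the regex is the full-identifier form from Sam's example.

```python
import re

def allowed_archetypes(slot_regex: str, archetype_ids: list[str]) -> list[str]:
    """Hypothetical helper: keep the IDs that the slot regex matches in full."""
    compiled = re.compile(slot_regex)
    return [a for a in archetype_ids if compiled.fullmatch(a)]

# Assumed repository contents, for illustration only.
repository = [
    "openEHR-EHR-CLUSTER.exam.v1",
    "openEHR-EHR-CLUSTER.exam-uterus.v1",
    "openEHR-EHR-CLUSTER.exam-fetus.v1",
    "openEHR-EHR-CLUSTER.address.v1",
]

# The regex is applied as written, with no prefixing or splitting beforehand.
slot_regex = (
    r"openEHR-EHR-CLUSTER\.exam\.v1"
    r"|openEHR-EHR-CLUSTER\.exam-uterus\.v1"
    r"|openEHR-EHR-CLUSTER\.exam-fetus\.v1"
)

selected = allowed_archetypes(slot_regex, repository)
print(selected)
```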

- thomas

I agree with Adam and Tom, if a REGEX is being used to specify the
constraints on the slot then it should be a valid regular expression which
allows each permissible archetype to be matched on the basis of its full
archetype id as specified by the Archetype ID Syntax.

John

Thomas Beale wrote:

I also agree with Adam. A regex should be able to be used over a
population of strings (identifiers in this case) and have the effect of
filtering out what you want. ...

Practically speaking this does not change the specifications, but I
suspect we should put some guidance in to the effect that regexes based
on full identifiers should be used in archetype slots.
  
Surely the specifications should be stronger than just guidance.
Existing tools that are massaging the regex will cease to work if they
are given a full regex. It would be a breaking change, so I think it
should be spelled out in the specification. Otherwise, tools are going
to have to try to do some clever guesswork to decide whether a given
pattern is intended to match the full archetype id or just the domain
concept part of it.

- Peter

Well the problem here is that the specifications don't actually say
anything about the regexes, or even that you have to use regexes to
identify archetypes in slots - it is just one way of doing it. So any
tools today that take a particular approach to regexes are already
outside the standard.

I think what we should probably do is to state that regexes, if used,
must be assumed to be usable as a filter on whole archetype ids without
prior modification. This still does not prevent some tool using the
short regexes now in use in the archetype editor, since they clearly can
be used at a technical level - it is just that they might create errors.
And there may be some short patterns which are actually correct. I'm not
sure how we can formally state this....

- thomas

Peter Gummer wrote:

Thomas Beale wrote:
  

I also agree with Adam. A regex should be able to be used over a
population of strings (identifiers in this case) and have the effect of
filtering out what you want. ...

Practically speaking this does not change the specifications, but I
suspect we should put some guidance in to the effect that regexes based
on full identifiers should be used in archetype slots.
  
Surely the specifications should be stronger than just guidance.
  
I agree.

Existing tools that are massaging the regex will cease to work if they
are given a full regex. It would be a breaking change, so I think it
should be spelled out in the specification. Otherwise, tools are going
to have to try to do some clever guesswork to decide whether a given
pattern is intended to match the full archetype id or just the domain
concept part of it.

I agree in that it should state that the regex is the regex, that's it,
nothing else is required etc.

Wrt existing tools

A) We already know some tools break with the current system e.g. the
XSLT for rendering a choice i.e. ABC | DEF | GHI etc.
B) The string you are actually matching on is the Archetype ID. As such
that should be the basis of the regex. doing a pseudo-meta regex will
hurt in the long run.

Quick example:

NB: These are simply examples & are not intended as a source of
discussion in & of themselves.

Imagine English speaking people want to use archetypes whose names have
meaning to them e.g. "clinician".

Now imagine a variety of English speaking jurisdictions all wanting to
have their own definition of "clinician".

You could have

openEHR-EHR-CLUSTER.clinician-AUS.v1, openEHR-EHR-CLUSTER.clinician-NZ.v1, openEHR-EHR-CLUSTER.clinician-UK.v1, openEHR-EHR-CLUSTER.clinician-US.v1 etc.

But then what happens if you then specialize one to show it's a surgeon
e.g. would it be

openEHR-EHR-CLUSTER.clinician-surgeon-AUS.v1 or openEHR-EHR-CLUSTER.clinician-AUS-surgeon.v1, etc.

Or to avoid that sort of problem you could namespace it at the other
end e.g.:

openEHR-EHR-AUS-CLUSTER.clinician.v1, openEHR-EHR-NZ-CLUSTER.clinician.v1, openEHR-EHR-UK-CLUSTER.clinician.v1, openEHR-EHR-US-CLUSTER.clinician.v1

or even

AUS-openEHR-EHR-CLUSTER.clinician.v1

Thus having {/clinician\.v1/} and adding "openEHR-EHR-CLUSTER." would not work.

If the archetype is chosen then someone would have chosen openEHR-EHR-CLUSTER.clinician.v1 if that is the archetype ID or openEHR-EHR-AUS-CLUSTER.clinician.v1 if that was.
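
The namespaced case can be sketched in Python as follows. The AUS-prefixed ID is the illustrative one from above, not a real archetype, and the assembly step assumes the parenthesised form suggested earlier in the thread.

```python
import re

# Short pattern plus the prefix a tool would add under the current convention.
short_pattern = r"clinician\.v1"
assembled = re.compile(r"openEHR-EHR-CLUSTER\.(" + short_pattern + ")")

# Illustrative namespaced ID from the example above.
namespaced_id = "openEHR-EHR-AUS-CLUSTER.clinician.v1"

# The assembled regex cannot match, because the fixed prefix no longer fits.
assembled_matches = assembled.fullmatch(namespaced_id) is not None

# A regex written against the full ID has no such problem.
explicit = re.compile(r"openEHR-EHR-AUS-CLUSTER\.clinician\.v1")
explicit_matches = explicit.fullmatch(namespaced_id) is not None

print(assembled_matches, explicit_matches)
```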

Fix it now & something like the above becomes a non-issue in the future.

Adam