(in)Valid TERMINOLOGY_IDs in templates

Because of HiGHmed there are a lot of new templates out there in German, which include all sorts of strange characters in template ids and terminology ids.

While doing some tests for EHRbase, I found this template (Clinical Knowledge Manager), which uses “§21 KHEntgG” as a terminology_id for coded text constraints.

Looking at the base spec, there is actually a grammar there that says how terminology_id values should be constructed (Base Types).

That grammar does not accept special characters or spaces (if I understood correctly):

terminology_id = name-str, [ '(', name-str, ')' ] ;
name-str     = letter, { letter | digit | '_' | '-' | '/' | '+' } ;
...
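To make the point concrete, here is a minimal sketch that transcribes the two quoted rules into a Python regex. The regex and the function name `is_valid_terminology_id` are my own illustration, not something taken from the spec or from any tool.

```python
import re

# Hand transcription of the quoted EBNF (an illustration, not an official regex):
#   terminology_id = name-str, [ '(', name-str, ')' ] ;
#   name-str       = letter, { letter | digit | '_' | '-' | '/' | '+' } ;
NAME_STR = r"[A-Za-z][A-Za-z0-9_\-/+]*"
TERMINOLOGY_ID = re.compile(rf"{NAME_STR}(?:\({NAME_STR}\))?")


def is_valid_terminology_id(value: str) -> bool:
    """Return True only if the whole string matches the quoted grammar."""
    return TERMINOLOGY_ID.fullmatch(value) is not None


for candidate in ["SNOMED-CT", "§21 KHEntgG"]:
    print(repr(candidate), is_valid_terminology_id(candidate))
```

With this transcription, “SNOMED-CT” is accepted and “§21 KHEntgG” is rejected, because “§” is not a letter and the space is not an allowed character.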

I guess current modeling tools are not enforcing that grammar, or checking anything related to the format or valid characters of terminology_id in C_CODE_PHRASE, which could cause processing issues later in the data.

Also, the OPT XSD could have a validator for those values, allowing only valid characters.

I’m sure there are other OPTs out there in many languages that might have similar violations.

Any ideas on how we can check / validate / enforce these things in modeling tools, CKM, etc.?

Note: the mentioned OPT is published in the HiGHmed CKM. I don’t know if the CKM has OPT validation rules like it has for ADL. OPT validation would be helpful to detect issues before downloading/using the OPTs.

Because of this:

terminology_id = name-str, [ '(', name-str, ')' ] ;
name-str     = letter, { letter | digit | '_' | '-' | '/' | '+' } ;

Terminology IDs mentioned in the same spec are invalid:

  1. ICD9(1999)
  2. ICD10AM(3rd_ed)
  3. ICD10AM(4th_ed)

Note that name-str requires the first character to be a letter, so a digit right after “(” is not allowed.

I think I need to stop reading the specs for today…

This might be related:

So, in the Archetype designer, you are allowed to set a “query” URI as the terminology for coded text elements:

@Athul_K_Nair found that templates don’t get saved with some query strings, while others work:

Doesn’t work:

SNOMED-CT::<123456
SNOMED-CT::404684003 |Clinical finding|
SNOMED-CT::<  404684003
SNOMED-CT::descendantOf  404684003 |Clinical finding|
SNOMED-CT::descendantOf  404684003
SNOMED-CT::<<  73211009 |Diabetes mellitus|
SNOMED-CT::<<73211009|Diabetes mellitus|
SNOMED-CT::>123456
SNOMED-CT:>123456
SNOMED-CT::^  700043003
SNOMED-CT::^12345678
SNOMED-CT::<<  138875005 |SNOMED CT concept|
SNOMED-CT:: << *
SNOMED-CT::ANY :  246075003 |Causative agent|  =  387517004 |Paracetamol|

Works:

SNOMED-CT::123456
SNOMED-CT::&lt123456
SNOMED-CT::descendantOf404684003
SNOMED-CT::*
SNOMED-CT::ANY

We figured out that almost all URL-encoded strings are valid here and get saved. They just need to follow the scheme: URI = scheme:[//authority]path[?query][#fragment].

EHRbase’s terminology validation also recommends using URLs that point to the FHIR ValueSet, and it works on referenceSetUri:
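As a rough sketch of what “URL encoded” means here, the snippet below percent-encodes the expression part of one of the failing strings with Python's standard library. The helper name `encode_query` is hypothetical, and I have not checked how the Archetype Designer or a terminology server decodes the result.

```python
from urllib.parse import quote


def encode_query(terminology: str, expression: str) -> str:
    """Percent-encode the expression part, keeping the 'TERMINOLOGY::' prefix as-is."""
    # safe="" makes quote() encode every reserved character, including '/'.
    return f"{terminology}::{quote(expression, safe='')}"


print(encode_query("SNOMED-CT", "<<73211009|Diabetes mellitus|"))
```

This prints `SNOMED-CT::%3C%3C73211009%7CDiabetes%20mellitus%7C`, which no longer contains “<”, “|” or spaces.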

<referenceSetUri>terminology://fhir.hl7.org/ValueSet/$expand?url=http://hl7.org/fhir/ValueSet/surface</referenceSetUri>

and a valid composition is supposed to have the terminology_id value like so:

"terminology_id": {
            "_type": "TERMINOLOGY_ID",
            "value": "http://hl7.org/fhir/ValueSet/surface"
        },
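Comparing the two snippets, the terminology_id value looks like the `url` query parameter of the referenceSetUri. A small sketch of that relationship using only the standard library follows; the function name `value_set_url` is made up, and this is not how EHRbase necessarily implements it.

```python
from urllib.parse import parse_qs, urlsplit

REFERENCE_SET_URI = (
    "terminology://fhir.hl7.org/ValueSet/$expand"
    "?url=http://hl7.org/fhir/ValueSet/surface"
)


def value_set_url(reference_set_uri: str):
    """Return the first 'url' query parameter of the URI, or None if absent."""
    parts = urlsplit(reference_set_uri)  # scheme, netloc, path, query, fragment
    values = parse_qs(parts.query).get("url", [])
    return values[0] if values else None


print(urlsplit(REFERENCE_SET_URI).scheme)
print(value_set_url(REFERENCE_SET_URI))
```

The second line printed is `http://hl7.org/fhir/ValueSet/surface`, the same string used as the terminology_id value in the composition.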

So are these valid terminology IDs according to the EBNF grammar rules?

@borut.fabjan the Archetype Designer doesn’t really warn the user that the terminology URI is invalid. It just throws an error while saving/exporting the template. It’ll be really hard to figure out where the error is coming from if the user has no idea that some terminology ids are not allowed, especially while designing a big template.

[Screenshots: error while saving; error while exporting]

I don’t see an issue with the URI, but rather with the terminology_id field value.

This isn’t a valid terminology id in openEHR. It might be the human name of some real terminology, but an id will need to be constructed for it for computational use, just like the string ‘snomed_ct’ or ‘snomed-ct’ etc. is used to mean Snomed, even though Snomed’s proper name is ‘SNOMED CT’ (containing a space).

Terminology Ids in openEHR should be understood as namespace names, that (sometimes) can be mapped to URIs, e.g. https://snomed.info etc.

Ideally they would be controlled and probably we will need to create our own registry, since I failed to convince IHTSDO to do this over a period of some years. Historically we relied on the NLM id list, which was reliable (and contained valid ids) but had disappeared last time I looked.

@sebastian.garde might be able to see how this passed validation in CKM.

That’s actually a paragraph from German hospital financing law. A bit of a stretch to call it a terminology.

This is from the OET:

                <includedValues>§21 KHEntgG::01::Krankenhausbehandlung, vollstationär</includedValues>
                <includedValues>§21 KHEntgG::02::Krankenhausbehandlung, vollstationär mit vorausgegangener vorstationärer Behandlung</includedValues>
                <includedValues>§21 KHEntgG::03::Krankenhausbehandlung, teilstationär</includedValues>
                <includedValues>§21 KHEntgG::04::vorstationäre Behandlung ohne anschließende vollstationäre Behandlung</includedValues>
                <includedValues>§21 KHEntgG::05::Stationäre Entbindung</includedValues>
                <includedValues>§21 KHEntgG::06::Geburt</includedValues>...

I assume the OPT generator just takes whatever is before the first double colon and assumes it is the terminology id.
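A minimal sketch of that assumed behaviour follows. This is only my guess at what a generator might do, not the actual code of any OPT generator, and the names `IncludedValue` and `parse_included_value` are invented for the example.

```python
from typing import NamedTuple


class IncludedValue(NamedTuple):
    terminology_id: str
    code: str
    text: str


def parse_included_value(raw: str) -> IncludedValue:
    """Split 'terminology::code::text' on the first two '::' separators only."""
    terminology_id, code, text = raw.split("::", 2)
    return IncludedValue(terminology_id, code, text)


value = parse_included_value("§21 KHEntgG::01::Krankenhausbehandlung, vollstationär")
print(value.terminology_id)
```

Nothing in such a split checks the first part against the TERMINOLOGY_ID grammar, so “§21 KHEntgG” would pass straight through into the OPT.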

:rofl: :rofl: :rofl:

Thanks @thomas.beale. Did you see the current grammar for TERMINOLOGY_ID? Terminology IDs mentioned in the same spec document are not compliant with the grammar in that doc (see my second message).

Note that newer templates are using URLs everywhere for terminology_id too, which is not supported by the current grammar either.

The (Antlr) grammar rules we are actually using are:

// e.g. [ICD10AM(1998)::F23]; [ISO_639-1::en]
TERM_CODE_REF : '[' TERM_CODE ']' ;
TERM_CODE : TERM_CODE_CHAR+ ( '(' TERM_CODE_CHAR+ ')' )? '::' TERM_CODE_CHAR+ ('|' ~[|\]]+ '|')?;
fragment TERM_CODE_CHAR: NAME_CHAR | '.' ;

fragment NAME_CHAR     : WORD_CHAR | '-' ;
fragment WORD_CHAR     : ALPHANUM_CHAR | '_' ;
fragment ALPHANUM_CHAR : ALPHA_CHAR | DIGIT ;

fragment ALPHA_CHAR  : [a-zA-Z] ;

Which looks the same to me. These should be expanded to allow more than just [a-zA-Z], although I don’t think allowing any symbol at all is a good idea. The ids you pasted above are (as @sebastian.garde points out) something like paragraph headings / markers from a document (hence the §, numbers and spacing).

Using URIs directly in archetypes etc. for terminology ids isn’t a great idea in my view: it just makes everything less readable, and there are no reliable URIs for a lot of terminologies. And what if the publishing orgs decide to change them? Instead we should host a register of unique, human-readable terminology identifiers like the NLM one and include a mapping to URIs where available.
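For anyone who wants to try ids against these lexer rules, here is a rough Python transcription of TERM_CODE (without the surrounding brackets). It is a sketch, not the Antlr grammar itself: the DIGIT fragment is not shown above, so `[0-9]` is assumed, and the function name `matches_term_code` is invented.

```python
import re

# TERM_CODE_CHAR = NAME_CHAR | '.', which expands to letters, digits, '_', '-' and '.'
# (DIGIT is assumed to be [0-9]; that fragment is not part of the quoted rules.)
TERM_CODE_CHAR = r"[A-Za-z0-9_.\-]"
TERM_CODE = re.compile(
    rf"{TERM_CODE_CHAR}+"              # terminology name
    rf"(?:\({TERM_CODE_CHAR}+\))?"     # optional (version)
    rf"::{TERM_CODE_CHAR}+"            # '::' followed by the code
    r"(?:\|[^|\]]+\|)?"                # optional |rubric|
)


def matches_term_code(value: str) -> bool:
    """Return True if the whole string matches the transcribed TERM_CODE rule."""
    return TERM_CODE.fullmatch(value) is not None


for candidate in ["ICD10AM(1998)::F23", "ISO_639-1::en", "§21 KHEntgG::01"]:
    print(repr(candidate), matches_term_code(candidate))
```

Under this transcription the two examples from the grammar comment match, including the version “1998” that starts with a digit, while the “§21 KHEntgG” string does not.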

As I mentioned, this is in the spec: Base Types

Check TERMINOLOGY_ID: it currently doesn’t accept “ICD10AM(1998)”.

Not sure what you mean by the Antlr grammar; we are referring to different things.

(* ------------------------- UID, OID, UUID -------------------------- *)
uid     = iso_oid | uuid | internet_id ;
iso_oid = number, { '.', number } ;
uuid    = hex-number, '-', hex-number, '-', hex-number, '-', hex-number, '-', hex-number ;

(* --------------------------- INTERNET_ID --------------------------- *)
(* According to IETF http://tools.ietf.org/html/rfc1034[RFC 1034] and  *)
(* http://tools.ietf.org/html/rfc1035[RFC 1035], as clarified by       *)
(* http://tools.ietf.org/html/rfc2181[RFC 2181] (section 11),          *)
(* and relaxation of https://tools.ietf.org/html/rfc1123[RFC 1123]     *)
(* The syntax of a domain name follows the grammar below. Slightly     *)
(* reduced for the purpose here, plus allows underscores.              *)
internet_id      = subdomain ;
subdomain        = label | subdomain, '.', label ;
label            = alphanum | alphanum-ext-str, alphanum ;

(* -------------------- HIER_BASED_ID, UID_BASED_ID ------------------ *)
hier_object_id = uid_based_id ;
uid_based_id   = root, [ '::', extension ] ;
root           = uid ;
extension      = ? any string ? ; (* any string *)

(* ------------------------- OBJECT_VERSION_ID ----------------------- *)
object_version_id  = object_id, '::', creating_system_id, '::', version_tree_id ;
object_id          = uid ;
creating_system_id = uid ;

(* ------------------------- VERSION_TREE_ID ------------------------- *)
version_tree_id = trunk_version, [ '.', branch_number, '.', branch_version ] ;
trunk_version   = number ;
branch_number   = number ;
branch_version  = number ;

(* -------------------------- ARCHETYPE_ID --------------------------- *)
archetype_id        = qualified_rm_entity, '.', domain_concept, '.', version_id ;
qualified-rm-entity = rm_originator, '-', rm_name, '-', rm_entity ;
rm-originator       = alphanum-str ;     (* id of org originating the RM on which this archetype is based *)
rm-name             = alphanum-str ;                      (* id of the RM on which the archetype is based *)
rm-entity           = alphanum-str ;                                       (* ontological level in the RM *)
domain-concept      = concept-name, { '-', specialisation } ;
concept-name        = alphanum-str ;
specialisation      = alphanum-str ;
version-id          = 'v', ( '0' | non-zero-digit, [ number ] ) ;            (* numeric version identifier *)

(* ------------------------- TERMINOLOGY_ID -------------------------- *)
terminology_id = name-str, [ '(', name-str, ')' ] ;

(* -------------------------- generic rules -------------------------- *)
alphanum     = letter | digit ;
name-str     = letter, { letter | digit | '_' | '-' | '/' | '+' } ;
alphanum-str = letter, { letter | digit | '_' } ;
alphanum-ext-str = letter, { letter | digit | '_' | '-' } ;
letter       = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G'
             | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N'
             | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U'
             | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b'
             | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i'
             | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
             | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w'
             | 'x' | 'y' | 'z' ;

number         = digit, { digit } ;
hex-number     = hex-digit, { hex-digit } ;
digit          = '0' | non-zero-digit ;
non-zero-digit = '1' | '2' | '3' | '4' | '5' | '6' | '7'| '8' | '9' ;
hex-digit      = digit | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' ;

In RM 1.0.2 the syntax is a little different:

4.3.12.1 Identifier Syntax
The syntax of the value attribute is as follows:

-------- production rules --------
terminology_id: name [ ‘(’ version ‘)’ ]
name: V_NAME
version: V_VERSION

-------- lexical patterns --------
V_NAME: [a-zA-Z][a-zA-Z0-9_-/+]+
V_VERSION: [a-zA-Z0-9][a-zA-Z0-9_-/.]+

Note it allows XXX(123) and XXX(1.2.3), which the new syntax doesn’t allow.
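The difference can be checked with a short sketch that transcribes both syntaxes into Python regexes. Two assumptions are mine: the hyphen inside the RM 1.0.2 character classes is read as a literal “-” (as written, “_-/” would be an invalid range in most regex engines), and the function names `accepts_old` and `accepts_new` are invented for the example.

```python
import re

# RM 1.0.2 lexical patterns, with '-' escaped so it is a literal hyphen (assumption).
V_NAME = r"[a-zA-Z][a-zA-Z0-9_\-/+]+"
V_VERSION = r"[a-zA-Z0-9][a-zA-Z0-9_\-/.]+"
OLD_SYNTAX = re.compile(rf"{V_NAME}(?:\({V_VERSION}\))?")

# Current Base Types EBNF: terminology_id = name-str, [ '(', name-str, ')' ] ;
NAME_STR = r"[A-Za-z][A-Za-z0-9_\-/+]*"
NEW_SYNTAX = re.compile(rf"{NAME_STR}(?:\({NAME_STR}\))?")


def accepts_old(value: str) -> bool:
    return OLD_SYNTAX.fullmatch(value) is not None


def accepts_new(value: str) -> bool:
    return NEW_SYNTAX.fullmatch(value) is not None


for tid in ["ICD9(1999)", "ICD10AM(3rd_ed)", "ICD10AM(4th_ed)", "XXX(1.2.3)"]:
    print(tid, "old:", accepts_old(tid), "new:", accepts_new(tid))
```

All four ids are accepted by the RM 1.0.2 patterns and rejected by the current grammar, in each case because the version part starts with a digit.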

Ah sorry, I see - because the ‘1998’ is not matched by ‘name-str’, which has to start with a letter. That’s an error :wink:

So we need to fix that.

I believe our aim should be to replace these grammars in the spec with Antlr lexer and parser grammars that we know work, i.e. extracted from the Github grammar files.


Yes, those syntaxes in the specs (new and old) might need to be fixed.

The same goes for the corresponding items in the ITS XML and JSON schemas. For instance, the TERMINOLOGY_ID type could have a regex that matches only valid values. This would prevent violations in OPTs and XML/JSON instances.

Then we need to consider tooling and current usage. CKM and modeling tools might not be enforcing “valid” terminology_ids. And current usage in published OPTs: 1. includes strange characters, 2. includes URLs as terminology ids.

This is not only about TERMINOLOGY_ID, but would apply to TEMPLATE_ID and ARCHETYPE_ID too: grammar, schemas, and tools. For instance, we had some issues with strange characters in German template ids.

Shouldn’t ‘letter’ allow UTF-8 characters?

I believe allowed characters and encoding are two different problems, i.e. the grammar doesn’t deal with encoding AFAIK.
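A tiny sketch of that distinction, using Python's `unicodedata` module: whether a character counts as a letter is a property of the character, while UTF-8 or Latin-1 only decide which bytes represent it. Treating Unicode general categories starting with “L” as “letter” is my assumption for the example, not something the spec defines.

```python
import unicodedata


def is_unicode_letter(ch: str) -> bool:
    """True if the single character has a Unicode general category L* (letter)."""
    return unicodedata.category(ch).startswith("L")


# "\u00e4" is a-umlaut, "\u00a7" is the section sign used in the KHEntgG ids.
for ch in ["K", "\u00e4", "\u00a7"]:
    print(repr(ch), is_unicode_letter(ch), ch.encode("utf-8"), ch.encode("latin-1"))
```

The a-umlaut is a letter in either encoding and the section sign is not, so widening ‘letter’ in the grammar would admit umlauts but still reject “§”.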


We’re thinking about this :wink: