Natural language based FLAT format and long container/element names

siljelb · 6 May 2025 09:24

Last year we were well on our way to creating a modelling style guide for patient reported outcome measure (PROM) archetypes. PROMs are usually copyrighted tools, and most of the time we will need the approval of the copyright holders to be able to publish archetypes. In addition, PROMs are really sensitive with regard to how they are worded and presented, and the questions are sometimes really long.

For these reasons, the first draft of the style guide recommended using the full and exact wording of each PROM section or question as the name of the relevant container or element. This however led to potential complications with how the natural language FLAT formats work, in that a FLAT format path generated from these long container and element names would in turn become very long.

For example, the canonical path

openEHR-EHR-OBSERVATION.expanded_prostate_cancer_index_composite.v0/data[at0001]/events[at0313]/data[at0003]/items[at0319]/items[at0046]

would in natural language become something like

expanded_prostate_cancer_index_composite_EPIC/any_event/how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks/need_to_urinate_frequently_during_the_day

Not a massive difference, but I’m sure we could find worse examples if we looked harder.
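
To make the mechanism concrete, here is a minimal sketch of how a natural language path could be derived from node names. The node names are reconstructed from the example path above and may not match the exact PROM wording, and the normalisation rule (lowercase, with runs of non-alphanumeric characters collapsed to one underscore) is an assumption. Real FLAT implementations may treat casing and non-ASCII letters differently.

```python
import re

def to_segment(name: str) -> str:
    """Turn a node name into a path segment (assumed rule: lowercase,
    runs of non-alphanumeric characters become a single underscore)."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

def natural_language_path(names: list[str]) -> str:
    """Join the segments of all nodes from the root down to the leaf."""
    return "/".join(to_segment(n) for n in names)

# Hypothetical node names, reconstructed from the example path above.
names = [
    "Expanded prostate cancer index composite (EPIC)",
    "Any event",
    "How big a problem, if any, had each of the following been for you during the last 4 weeks?",
    "Need to urinate frequently during the day",
]

path = natural_language_path(names)
print(path)
print(len(path), "characters")
```

Every character of the question text ends up in the path, so the path grows with the wording of each container and element along the way.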

Since this problem popped up, it was proposed to manually shorten each container/element name. This is problematic for several reasons:

the copyright (holder) issues mentioned above
increased workload on modellers to create shorter names for a potentially large number of questions, while keeping the semantic nuances and ability to tell similar questions apart
increased workload on editors to get every reviewer to agree on shortened names for a potentially large number of questions
having to replace the container/element names with the full-length questions again when building templates, leading to even longer canonical paths

So, since this at its root is a technical issue, we thought we’d bring it back here to discuss. How can we solve it without having to create cumbersome modelling workarounds?

Could it for example be an option to truncate “too long” (whatever that might be) names when generating the FLAT paths?

linforest · 6 May 2025 09:37

There seems to be a similar problem with the content of many questionnaires/surveys in LOINC:

For example, for the items/questions in the LOINC Panel 62710-9 PhenX domain - Psychiatric:

Names for its member question LOINC 65635-5 When you have sudden anxiety attacks…

Are there any openEHR-native mechanisms in place that allow us to learn from the experience and practices of the LOINC Committee?

Relevant Topic:

How to transfer as much clinical complexity as possible to clinical terminologies

thomas.beale · 6 May 2025 18:09

I thought about this exact question over the years, and my conclusion is that the description of the relevant model code should always be a semantic descriptor of the question, not the question itself. Achieving this would necessitate being able to describe all the questions in a questionnaire in a precise way. For the example above, it might be something like ‘self-assessed level of inconvenience 4 weeks’ or ‘inconvenience last 4 weeks’ or similar. This approach would a) result in shorter paths, and b) potentially force model designers to think a bit harder on what the true semantics of each question are.

The LOINC short names mentioned by @linforest above are an attempt to do this.

sebastian.garde · 6 May 2025 18:41

So, would text contain this short name and then there is an optional (new) full_text (or similar) which in this case contains the full question?

Any tooling could then display/use both or only one, or one prominent and the other less so, depending on context.

Best of both worlds, except for the additional onus on the modellers, which is probably unavoidable if you want shorter paths.

thomas.beale · 6 May 2025 19:43

I would have expected to model a ‘question’ as a Cluster, in which the question text was a child Element. So we are talking about the id-code of the Cluster that represents the whole question. This would allow other question items to be included (e.g. marking the question as optional or whatever). Then referencing the actual text is just something like .../items[at0004]/text, where the id-code at0004 is defined as something like the short(ish) semantic description I gave above.

siljelb · 7 May 2025 08:10

We do think hard about this when we design actual clinical archetypes. However, in this case modellers have zero influence on the semantics of the PROM tool, but are merely representing it as an archetype. Doing what you’re suggesting will only add a lot of extra work both for modellers, reviewers and editors, as outlined above. There will likely be hundreds of these archetypes, and spending precious modeller, editor and reviewer time on inventing short versions of element names sounds like bad resource management.

What’s the point of the shorter text, apart from making a shorter path if using a natural language path format? The way archetypes are usually used especially with form building tools, the element name will be the field label displayed to the user unless it has been actively changed in the template or form. Using the actual question you want the user to see as the element name then makes the archetype → template → form process very efficient.

Why is it unavoidable? Is there a reason truncating long names when creating a path won’t work?

sebastian.garde · 7 May 2025 10:31

Maybe I am missing something, but I don’t know how to reliably truncate the text/question into a short path that is guaranteed to be future-proof, unique and stable at least on the same level (which I assume is a requirement?).

You’ll have better examples, but the full question might just differ from another question at the very end, something like
<long question start>…within the last day?
<long question start>…within the last week?

That said, I personally don’t find a natural language path format with shorter paths so extremely compelling - if that is the only reason for a short name I would not do it. There may be other uses for it as @linforest has suggested, e.g. if some people prefer a more compact view of an archetype with the “typically” fairly short element names.

thomas.beale · 7 May 2025 16:37

Maybe I’m missing something but it seems to me that the current situation is that the semantic definition of the model node representing each question is… the question text? I don’t see the alternative to someone figuring out what the questions (= model nodes) actually mean. You would do that anyway for de novo models you create, I think?

Well the field label could be very variable and context-dependent (i.e. some short label that makes sense when visually presented within the larger form). I would have imagined the field label as another ELEMENT within the CLUSTER representing the question.

Well it will clearly work in a purely mechanical sense of achieving a string length of < N characters or whatever, but it will almost certainly generate nearly incomprehensible strings in some cases simply due to the fact that the first N characters in the question text might not on their own convey much useful info.

I’d even imagine that if there was a pre-existing way of identifying questions, e.g. question numbering / lettering, then that would be a better way to identify each one, e.g. ‘Section 1, question 4b’ or similar. Would that be possible?

Out of interest, what is the rough proportion of all questionnaires that are pre-published PROMs? I.e. what’s the size of the problem here?

BTW I know clinical modellers really do think hard about everything, I was more thinking about the upstream people, but of course we have no influence over them

siljelb · 8 May 2025 06:36

Maybe I’m being naive, but wouldn’t something like this work?

if a name is longer than some limit, truncate it to (for example) 15 characters
if two or more elements or containers within the same container become identically named when truncated, append the AT code to the end of each
if the original length of the question is less than 21 characters, leave it as the original (so as not to make the shortened question + added AT code longer than the original question)
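
A minimal sketch of these rules, under some assumptions: names are already in path-segment form, the AT code is appended with an underscore separator (which makes a shortened name one character longer than 15 + 6), and every AT code except at0046 and every sibling name except the first is invented for illustration.

```python
from collections import Counter

MAX_LEN = 15     # example truncation length from the rules above
KEEP_BELOW = 21  # names shorter than this are left untouched

def shorten_siblings(siblings: dict[str, str]) -> dict[str, str]:
    """Shorten the path segments of the nodes inside one container.

    `siblings` maps AT code -> full path segment.
    """
    # Step 1: truncate long names, keep short ones as they are.
    candidates = {
        at: seg if len(seg) < KEEP_BELOW else seg[:MAX_LEN].rstrip("_")
        for at, seg in siblings.items()
    }
    # Step 2: where siblings now share a name, append each one's AT code.
    counts = Counter(candidates.values())
    return {
        at: f"{seg}_{at}" if counts[seg] > 1 else seg
        for at, seg in candidates.items()
    }

# Hypothetical siblings within one container.
siblings = {
    "at0046": "need_to_urinate_frequently_during_the_day",
    "at0047": "need_to_urinate_frequently_during_the_night",
    "at0048": "dripping_or_leaking_urine",
    "at0049": "overall_problem",
}

for at_code, segment in shorten_siblings(siblings).items():
    print(at_code, segment)
```

The result is unique within the container, but it depends on which siblings exist: adding a new question later can introduce a collision and change the segment of an existing node, which touches on the stability concern raised earlier in the thread.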

Sure, but you could replace the end with something unique, like the AT code as suggested above?

Or maybe a better solution is to just leave longer questions as they are, and accept that long natural language paths are fine? As shown in my initial post, the difference from a canonical path isn’t necessarily that big, even when the questions are fairly long. And my canonical path example doesn’t even have any renamed elements or containers, which would make that path very long.

siljelb · 8 May 2025 07:06

That’s pretty much the situation, yep. Modellers are only replicating the question and response definitions from whichever PROM tool we’re modelling. We have no way to affect the intended semantics of the tool.

Faithfully representing an existing PROM/score/scale structure as an archetype is a very different exercise from making a model covering the wild world of clinical requirements from scratch.

I’m not sure what you’re thinking of here, but maybe we’re talking past each other? To clarify, this is the kind of thing we’re talking about, represented respectively as a paper form and as an archetype:

Not universally, I think. Some PROMs certainly have this, but very likely not all.

I don’t know an actual number as of this year (maybe @Kanthan_Theivendran or someone else knows?), but it’s at least in the hundreds of PROM tools, with an unknown number of data elements in each (Patient‐reported outcome measures (PROMs): A review of generic and condition‐specific measures and a discussion of trends and issues - PMC)

Yep, no influence at all over the PROM creators. We even struggle to explain to them what we’re doing when we try to get permission to publish archetypes representing the PROMs. But that’s another discussion.

yampeku · 8 May 2025 07:28

It shouldn’t be too hard to add comments to the simplified format, which would allow shorter paths for the data and longer paths for the longer/source questions if needed.
Having comment support would also allow the questions to be translated into different languages while still being able to pass around English paths, which is currently also a problem for non-English speaking countries.

sebastian.garde · 8 May 2025 07:29

Well, it always depends on the use case I would say. If it is just to show the paths somewhere, it may work. But I doubt the value of it. The complete path is long anyway, and typically it would be one element (the leaf with the actual question) where you can save a few chars in exchange for

additional complexity,
more potential for errors, and
less human readability.

Curated short names have more potential to be useful in various ways in my opinion - but I do get that they also are a large overhead and additional source of complexity.

siljelb · 8 May 2025 08:10

I’m sure it’s not a big technical issue to add them, but it’s a huge overhead for modellers to make up a shorter name for every single question, just to replace that shorter name with the original question again when making a template or user interface.

siljelb · 8 May 2025 08:27

So maybe the solution is just to accept that in the natural language flat format, paths could get really long?

linforest · 8 May 2025 09:19

Another relevant example might be FHIR R4 Resource Questionnaire’s elements Questionnaire.item.linkId and Questionnaire.item.text

Snapshot of FHIR R4 Resource Questionnaire’s Structure tab.

The elements used to compose the path should be independent of the natural language textual expressions.
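
As a rough illustration of that separation, here is a Questionnaire fragment written as a Python dict. The linkId values are invented, and the question texts are reconstructed from the EPIC path earlier in the thread, so they may not match the exact wording. Only the linkId, text, type and item elements of Questionnaire.item are used.

```python
# Hypothetical Questionnaire fragment; linkId values are invented.
questionnaire = {
    "resourceType": "Questionnaire",
    "status": "draft",
    "item": [
        {
            "linkId": "urinary-problems",
            "type": "group",
            "text": "How big a problem, if any, had each of the following "
                    "been for you during the last 4 weeks?",
            "item": [
                {
                    "linkId": "urinary-problems.frequency-day",
                    "type": "choice",
                    "text": "Need to urinate frequently during the day",
                }
            ],
        }
    ],
}

def index_by_link_id(items: list[dict]) -> dict[str, str]:
    """Flatten nested items into {linkId: text}."""
    index = {}
    for item in items:
        index[item["linkId"]] = item.get("text", "")
        index.update(index_by_link_id(item.get("item", [])))
    return index

labels = index_by_link_id(questionnaire["item"])
print(labels["urinary-problems.frequency-day"])
```

The identifier stays the same if the text is reworded or translated, while the full question remains available for display.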

siljelb · 8 May 2025 10:09

Agree, and this is the default for openEHR formats without the natural language variation.

(for reference, see the locatable class, with its name and archetype_node_id attributes)

ian.mcnicoll · 8 May 2025 10:47

One of the lessons from FHIR is that having human language node identifiers is really helpful for developers, even if these are language-dependent. Our language-neutral atCode approach is of course the correct thing to do, but it does impose a barrier. The STRUCTURED and FLAT formats have, in my experience, been of huge value in lowering the barrier to implementers, and a big part of that is the human-language paths, which are automatically generated from the archetype text.

I know not everyone is a fan of FLAT etc, but I suspect there will be future demand for these kinds of human language labels in other formats and contexts such as AQL.

I understand that we have to keep copyright holders happy but IMO as long as we carry the ‘long form text’ somewhere in the archetype, it does not have to be in the node name, which has until now been seen as a short ‘meaning label’ for the node.

So, would text contain this short name and then there is an optional (new) full_text (or similar) which in this case contains the full question?

The investigation that Sebastian and I did suggested, to me at least, that we could use an annotation/directive at the top level of the archetype to tell tooling/viewers to substitute the node Description for the node Name if the Description was populated, e.g. in copyright views or form tooling.

As part of the PROMs discussion, we did come up with some ideas that would allow e.g. CKM or other tools to display the full text in place of the short form, if needed for copyright holders. Our node names are equivalent to technical database column names, and these would never be of interest to copyright holders.

So I feel that the copyright issue, whilst important, is manageable.

The issue of burden on Editors/Reviewers etc. of creating and reviewing shortened forms is more significant, but a few mitigations/guidelines could reduce the burden.

expanded_prostate_cancer_index_composite_EPIC/any_event/how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks/need_to_urinate_frequently_during_the_day

In that example, I would say that need_to_urinate_frequently_during_the_day (41 chars) is fine and does not need to be truncated, but

how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks (87 chars)

should be truncated to

how_big_a_problem_during_the_last_4_weeks (41 chars)

Roughly speaking, any question under 50 characters is fine. We really only need to consider truncating questions longer than approx 50 characters
The aim for a short question is to retain meaning and context - ‘How big a problem’ and ‘in past 4 weeks’. I’m not sure that exactly equates to ‘semantics’ - just enough to avoid confusion.
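
A quick way to find the candidates, sketched under the assumption that 50 characters is the cut-off and that we check the path segments rather than the original question text (the function name is made up):

```python
LIMIT = 50  # approximate threshold suggested above

def segments_over_limit(path: str, limit: int = LIMIT) -> list[tuple[str, int]]:
    """Return (segment, length) for every path segment longer than `limit`."""
    return [(seg, len(seg)) for seg in path.split("/") if len(seg) > limit]

path = (
    "expanded_prostate_cancer_index_composite_EPIC/any_event/"
    "how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks/"
    "need_to_urinate_frequently_during_the_day"
)

flagged = segments_over_limit(path)
for segment, length in flagged:
    print(length, segment)
```

Run over a whole set of PROM archetypes, a check like this would also give a rough idea of how many items actually exceed the limit.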

having to replace the container/element names with the full-length questions again when building templates, leading to even longer canonical paths.

@siljelb I’m not sure why you would need to do this in templates - in forms yes but that could be triggered by the ‘use long-form question’ directive mentioned above.

siljelb · 28 May 2025 06:45

Sure, I have no problem seeing the pros of natural language based paths.

Are they really equivalent to db column names though? As you say column names aren’t of interest to copyright holders, and I’d extend that to anyone except implementers. We regularly ask clinicians to review node names.

So if I understand you correctly, the proposal is to

Add a new attribute to the RM (PATHABLE class?) to capture the original questions, just for PROM tools with long questions
Implement support for this new attribute in a fairly long toolchain from a lot of different vendors
Add workload for modellers to identify element names longer than some defined number of characters, make up a shortened version which is still unique, and add the original question to the new attribute mentioned above (and for shorter questions, duplicate the element name into this new attribute?)
Get archetype reviewers to look at this other attribute when reviewing PROM archetypes, but the element names for all other archetypes
Convince copyright holders that this is fine
Profit?

To me this looks like a whole lot of extra work for a lot of people, which we don’t need. Could I suggest that we instead accept that sometimes natural language paths will be rather long?

thomas.beale · 28 May 2025 15:27

Just as an aside to the main debate: this is not something we should be doing… that will have the effect of adding that data attribute on every node in every archetype, and in all data in any CDR, and mostly it will be null / blank etc. This is an anti-pattern (satisfying a narrow use case with an attribute in a ‘god class’).

ian.mcnicoll · 28 May 2025 18:16

That was not our thinking. The simple solution is just to use the Description field in the archetype definition, the rule being that if there is a Description for the element, use that in the UI, otherwise use the Node name. The only risk there is that a Description is being used for a different purpose, so we could add an archetype annotation which says specifically ‘Use the Description in the UI’.
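
Roughly, the rule could look like this in tooling. This is only a sketch: the Node class and the use_description_in_ui flag are made-up stand-ins for an archetype node and the proposed annotation, not existing openEHR or CKM constructs, and the question texts are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str              # short node name, also used to build paths
    description: str = ""  # may carry the full question text

def ui_label(node: Node, use_description_in_ui: bool) -> str:
    """Pick the label to show in a form or copyright view.

    `use_description_in_ui` stands in for the proposed archetype-level
    annotation; without it the Description is never used as a label.
    """
    if use_description_in_ui and node.description.strip():
        return node.description
    return node.name

question = Node(
    name="How big a problem during the last 4 weeks",
    description="How big a problem, if any, had each of the following "
                "been for you during the last 4 weeks?",
)

print(ui_label(question, use_description_in_ui=True))
print(ui_label(question, use_description_in_ui=False))
```

Paths would still be built from the short name, so the long text never reaches the FLAT format.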

So no changes to the RM are required. Clearly there would be an impact on tooling, e.g. CKM or Form builder tooling that would have to understand those rules, but it’s not a massive change IMO.

I agree that it does add a burden to modellers/reviewers but if we keep the ‘allowed character limit’ fairly high, the burden might be lower than seems the case now.

I agree that our node names are not ‘column names’ as such, but what I meant was that other people who are building on copyrighted scores will have their own internal field names for technical purposes, and the copyright owners will not normally be bothered about these internal names, as long as the UI is described correctly.

I was not quite sure about the profit bit!!

There is one other option which might solve this conundrum: we could possibly use the short names at template level, i.e. long-form questions in the archetype but a substituted short form in templates where the system needs them for whatever technical reason, including FLAT format.

Do we have any sense of the various lengths of ‘long questions’ - maybe the number of really problematic items is lower than we think?