# Natural language based FLAT format and long container/element names

**Category:** [ITS](https://discourse.openehr.org/c/its/41)
**Created:** 2025-05-06 09:24 UTC
**Views:** 289
**Replies:** 22
**URL:** https://discourse.openehr.org/t/natural-language-based-flat-format-and-long-container-element-names/6765

---

## Post #1 by @siljelb

Last year we were well on our way to creating a modelling style guide for patient reported outcome measure (PROM) archetypes. PROMs are usually copyrighted tools, and we will most of the time require the approval of the copyright holders to be able to publish archetypes. In addition, PROMs are really sensitive with regard to how they are worded and presented, and the questions are sometimes really long. For these reasons, the first draft of the style guide recommended using the full and exact wording of each PROM section or question as the name of the relevant container or element.

This however led to potential complications with how the natural language FLAT formats work, in that a FLAT format path generated from these long container and element names would in turn become very long. For example, the canonical path

```
openEHR-EHR-OBSERVATION.expanded_prostate_cancer_index_composite.v0/data[at0001]/events[at0313]/data[at0003]/items[at0319]/items[at0046]
```

would in natural language become something like

```
expanded_prostate_cancer_index_composite_EPIC/any_event/how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks/need_to_urinate_frequently_during_the_day
```

Not a massive difference, but I'm sure we could find worse examples if we looked harder.

Since this problem popped up, it was proposed to manually shorten each container/element name. This is problematic for several reasons:

1. the copyright (holder) issues mentioned above
2. increased workload on modellers to create shorter names for a potentially large number of questions, while keeping the semantic nuances and ability to tell similar questions apart
3. increased workload on editors to get every reviewer to agree on shortened names for a potentially large number of questions
4. having to replace the container/element names with the full-length questions again when building templates, leading to even longer canonical paths

So, since this at its root is a technical issue, we thought we'd bring it back here to discuss. How can we solve it without having to create cumbersome modelling workarounds? Could it for example be an option to truncate "too long" (whatever that might be) names when generating the FLAT paths?

---

## Post #2 by @linforest

There seems to be a similar problem with the content of many questionnaires/surveys in LOINC.

For example, for the items/questions in the [LOINC Panel 62710-9 PhenX domain - Psychiatric](https://loinc.org/62710-9):

Names for its member question [LOINC 65635-5 When you have sudden anxiety attacks...](https://loinc.org/65635-5/)

![image|690x301](upload://wcK7J3D21mRHS16mMeLfswDX7j7.png)

Are there any openEHR-native mechanisms in place that allow us to learn from the experience and practices of the LOINC Committee?

**Relevant Topic**:

* [How to transfer as much clinical complexity as possible to clinical terminologies](https://discourse.openehr.org/t/how-to-transfer-as-much-clinical-complexity-as-possible-to-clinical-terminologies/5668/13)

---

## Post #3 by @thomas.beale

[quote="siljelb, post:1, topic:6765"]
So, since this at its root is a technical issue, we thought we’d bring it back here to discuss. How can we solve it without having to create cumbersome modelling workarounds? Could it for example be an option to truncate “too long” (whatever that might be) names when generating the FLAT paths?
[/quote]

I thought about this exact question over the years, and my conclusion is that the description of the relevant model code should always be a *semantic descriptor of the question*, not the question itself. Achieving this would necessitate being able to describe all the questions in a questionnaire in a precise way. For the example above, it might be something like 'self-assessed level of inconvenience 4 weeks' or 'inconvenience last 4 weeks' or similar.

This approach would a) result in shorter paths, and b) potentially force model designers to think a bit harder on what the true semantics of each question are.

The LOINC short names mentioned by @linforest above are an attempt to do this.

---

## Post #4 by @sebastian.garde

So, would *text* contain this short name and then there is an optional (new) *full_text* (or similar) which in this case contains the full question?

Any tooling could then display/use both or only one, or one prominent and the other less so, depending on context.

Best of both worlds, except for the additional onus on the modellers, which is probably unavoidable if you want shorter paths.

---

## Post #5 by @thomas.beale

[quote="sebastian.garde, post:4, topic:6765"]
So, would *text* contain this short name and then there is an optional (new) *full_text* (or similar) which in this case contains the full question?
[/quote]

I would have expected to model a 'question' as a Cluster, in which the question text was a child Element. So we are talking about the id-code of the Cluster that represents the whole question. This would allow other question items to be included (e.g. marking the question as optional or whatever). Then referencing the actual text is just something like `.../items[at0004]/text`, where the id-code `at0004` is defined as something like the short(ish) semantic description I gave above.

---

## Post #6 by @siljelb

[quote="thomas.beale, post:3, topic:6765"]
force model designers to think a bit harder on what the true semantics of each question are
[/quote]

We do think hard about this when we design actual clinical archetypes. However, in this case modellers have zero influence on the semantics of the PROM tool, but are merely representing it as an archetype. Doing what you're suggesting will only add a lot of extra work both for modellers, reviewers and editors, as outlined above. There will likely be hundreds of these archetypes, and spending precious modeller, editor and reviewer time on inventing short versions of element names sounds like bad resource management.

[quote="sebastian.garde, post:4, topic:6765"]
So, would *text* contain this short name and then there is an optional (new) *full_text* (or similar) which in this case contains the full question?

Any tooling could then display/use both or only one, or one prominent and the other less so, depending on context.
[/quote]

What's the point of the shorter text, apart from making a shorter path if using a natural language path format? The way archetypes are usually used especially with form building tools, the element name will be the field label displayed to the user unless it has been actively changed in the template or form. Using the actual question you want the user to see as the element name then makes the archetype -> template -> form process very efficient.

[quote="sebastian.garde, post:4, topic:6765"]
Best of both worlds, except for the additional onus on the modellers, which is probably unavoidable if you want shorter paths.
[/quote]

Why is it unavoidable? Is there a reason truncating long names when creating a path won't work?

---

## Post #7 by @sebastian.garde

Maybe I am missing something, but I don't know how to reliably truncate the text/question into a short path that is guaranteed to be future-proof, unique and stable at least on the same level (which I assume is a requirement?).

You'll have better examples, but the full question might just differ from another question at the very end, something like

\...within the last day?

\...within the last week?

That said, I personally don't find a natural language path format with shorter paths so extremely compelling - if that is the only reason for a short name I would not do it. There may be other uses for it as @linforest has suggested, e.g. if some people prefer a more compact view of an archetype with the "typically" fairly short element names.

---

## Post #8 by @thomas.beale

[quote="siljelb, post:6, topic:6765"]
We do think hard about this when we design actual clinical archetypes. However, in this case modellers have zero influence on the semantics of the PROM tool, but are merely representing it as an archetype. Doing what you’re suggesting will only add a lot of extra work both for modellers, reviewers and editors, as outlined above. There will likely be hundreds of these archetypes, and spending precious modeller, editor and reviewer time on inventing short versions of element names sounds like bad resource management.
[/quote]

Maybe I'm missing something but it seems to me that the current situation is that the semantic definition of the model node representing each question is... the question text? I don't see the alternative to someone figuring out what the questions (= model nodes) actually mean. You would do that anyway for de novo models you create, I think?

[quote="siljelb, post:6, topic:6765"]
What’s the point of the shorter text, apart from making a shorter path if using a natural language path format? The way archetypes are usually used especially with form building tools, the element name will be the field label displayed to the user unless it has been actively changed in the template or form. Using the actual question you want the user to see as the element name then makes the archetype → template → form process very efficient.
[/quote]

Well the field label could be very variable and context-dependent (i.e. some short label that makes sense when visually presented within the larger form). I would have imagined the field label as another ELEMENT within the CLUSTER representing the question.

[quote="siljelb, post:6, topic:6765"]
Why is it unavoidable? Is there a reason truncating long names when creating a path won’t work?
[/quote]

Well it will clearly work in a purely mechanical sense of achieving a string length of < N characters or whatever, but it will almost certainly generate nearly incomprehensible strings in some cases simply due to the fact that the first N characters in the question text might not on their own convey much useful info.

I'd even imagine that if there was a pre-existing way of identifying questions, e.g. question numbering / lettering, then that would be a better way to identify each one, e.g. 'Section 1, question 4b' or similar. Would that be possible?

Out of interest, what is the rough proportion of all questionnaires that are pre-published PROMS? I.e. what's the size of the problem here?

BTW I know clinical modellers really do think hard about everything, I was more thinking about the upstream people, but of course we have no influence over them ;)

---

## Post #9 by @siljelb

[quote="sebastian.garde, post:7, topic:6765"]
Maybe I am missing something, but I don’t know how to reliably truncate the text/question into a short path that is guaranteed to be future-proof, unique and stable at least on the same level (which I assume is a requirement?).
[/quote]

Maybe I'm being naive, but wouldn't something like this work?

* if longer, truncate to (for example) 15 characters
* if two or more elements or containers within the same container become identically named when truncated, append the AT code to the end of each
* if the original length of the question is less than 21 characters, leave it as the original (so as not to make the shortened question + added AT code longer than the original question)

[quote="sebastian.garde, post:7, topic:6765"]
You’ll have better examples, but the full question might just differ from another question at the very end, something like

…within the last day?

…within the last week?
[/quote]

Sure, but you could replace the end with something unique, like the AT code as suggested above?

Or maybe a better solution is to just leave longer questions, and that the long natural language paths are fine? As shown in my initial post, the difference from a canonical path isn't necessarily that big, even when the questions are fairly long. And my canonical path example doesn't even have any renamed elements or containers, which would make *that* path very long.

---

## Post #10 by @siljelb

[quote="thomas.beale, post:8, topic:6765"]
Maybe I’m missing something but it seems to me that the current situation is that the semantic definition of the model node representing each question is… the question text?
[/quote]

That's pretty much the situation, yep. Modellers are only replicating the question and response definitions from whichever PROM tool we're modelling. We have no way to affect the intended semantics of the tool.

[quote="thomas.beale, post:8, topic:6765"]
I don’t see the alternative to someone figuring out what the questions (= model nodes) actually mean. You would do that anyway for de novo models you create, I think?
[/quote]

Faithfully representing an existing PROM/score/scale structure as an archetype is a very different exercise from making a model covering the wild world of clinical requirements from scratch.

[quote="thomas.beale, post:8, topic:6765"]
Well the field label could be very variable and context-dependent (i.e. some short label that makes sense when visually presented within the larger form). I would have imagined the field label as another ELEMENT within the CLUSTER representing the question.
[/quote]

I'm not sure what you're thinking of here, but maybe we're talking past each other? To clarify, this is the kind of thing we're talking about, represented respectively as a paper form and as an archetype:

https://medicine.umich.edu/sites/default/files/content/downloads/EPIC-2.2002.pdf

https://ckm.openehr.org/ckm/archetypes/1013.1.7439

[quote="thomas.beale, post:8, topic:6765"]
I’d even imagine that if there was a pre-existing way of identifying questions, e.g. question numbering / lettering, then that would be a better way to identify each one, e.g. ‘Section 1, question 4b’ or similar. Would that be possible?
[/quote]

Not universally, I think. Some PROMs certainly have this, but very likely not all.

[quote="thomas.beale, post:8, topic:6765"]
Out of interest, what is the rough proportion of all questionnaires that are pre-published PROMS? I.e. what’s the size of the problem here?
[/quote]

I don't know an actual number as of this year (maybe @Kanthan_Theivendran or someone else knows?), but it's at least in the hundreds of PROM tools, with an unknown number of data elements in each ([Patient‐reported outcome measures (PROMs): A review of generic and condition‐specific measures and a discussion of trends and issues - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC8369118/))

[quote="thomas.beale, post:8, topic:6765"]
BTW I know clinical modellers really do think hard about everything, I was more thinking about the upstream people, but of course we have no influence over them :wink:
[/quote]

Yep, no influence at all over the PROM creators. We even struggle to explain to them what we're doing when we try to get permission to publish archetypes representing the PROMs. But that's another discussion.

---

## Post #11 by @yampeku

It shouldn't be too hard to add comments to the simplified format, which would make it possible to have shorter paths for the data and longer paths for the longer/source questions if needed. Having comment support would also make it possible to translate the questions into different languages while still being able to pass around English paths, which is currently also a problem for non-English speaking countries.

---

## Post #12 by @sebastian.garde

[quote="siljelb, post:9, topic:6765"]
* if longer, truncate to (for example) 15 characters
* if two or more elements or containers within the same container become identically named when truncated, append the AT code to the end of each
* if the original length of the question is less than 21 characters, leave it as the original (so as not to make the shortened question + added AT code longer than the original question)
[/quote]

Well, it always depends on the use case I would say. If it is just to show the paths somewhere, it may work. But I doubt the value of it. The complete path is long anyway, and typically it would be one element (the leaf with the actual question) where you can save a few chars in exchange for

- additional complexity,
- more potential of errors, and
- less human readability.

Curated short names have more potential to be useful in various ways in my opinion - but I do get that they also are a large overhead and additional source of complexity.

---

## Post #13 by @siljelb

[quote="yampeku, post:11, topic:6765"]
It shouldn’t be too hard to add comments to the simplified format, which would allow to have shorter paths for the data and longer paths for the longer/source questions if needed.
[/quote]

I'm sure it's not a big technical issue to add them, but it's a huge overhead for modellers to make up a shorter name for every single question, just to replace that shorter name with the original question again when making a template or user interface.
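
For readers who want to try the truncation rules from Post #9 (quoted again in Post #12), here is a minimal Python sketch. It is not how any existing FLAT format implementation works. The `slug` normalisation, the `_` separator before the AT code, and the function names are assumptions made for illustration, and the two sibling names other than `at0046` are invented to trigger a collision.

```python
import re
from collections import Counter

MAX_LEN = 15     # truncation length suggested "for example" in Post #9
KEEP_BELOW = 21  # names shorter than this are left untouched


def slug(name: str) -> str:
    """Assumed normalisation: lower-case, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def path_segments(siblings: dict[str, str]) -> dict[str, str]:
    """Map AT code -> path segment for the nodes of one container.

    `siblings` maps each node's AT code to its full name.
    """
    full = {code: slug(name) for code, name in siblings.items()}
    # Rules 1 and 3: truncate only names of 21 characters or more.
    short = {
        code: seg if len(seg) < KEEP_BELOW else seg[:MAX_LEN].rstrip("_")
        for code, seg in full.items()
    }
    # Rule 2: if several siblings end up identical, append the AT code to each.
    counts = Counter(short.values())
    return {
        code: f"{seg}_{code}" if counts[seg] > 1 else seg
        for code, seg in short.items()
    }


if __name__ == "__main__":
    example = {
        "at0046": "Need to urinate frequently during the day",
        "at0047": "Need to urinate frequently during the night",  # invented sibling
        "at0048": "Weak urine stream",                            # invented sibling
    }
    for code, segment in path_segments(example).items():
        print(code, segment)
```

With these inputs the two long names both truncate to `need_to_urinate` and are disambiguated by their AT codes, while the short name is left as it is. The sketch also makes the readability concern visible: the part that distinguishes day from night is exactly the part that gets cut.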

---

## Post #14 by @siljelb

[quote="sebastian.garde, post:12, topic:6765"]
Well, it always depends on the use case I would say. If it is just to show the paths somewhere, it may work. But I doubt the value of it. The complete path is long anyway, and typically it would be one element (the leaf with the actual question) where you can save a few chars in exchange for

* additional complexity,
* more potential of errors, and
* less human readability.

Curated short names have more potential to be useful in various ways in my opinion - but I do get that they also are a large overhead and additional source of complexity.
[/quote]

So maybe the solution is just to accept that in the natural language flat format, paths could get really long?

---

## Post #15 by @linforest

Another relevant example might be [FHIR R4 Resource Questionnaire](https://hl7.org/fhir/R4/questionnaire.html)'s elements `Questionnaire.item.linkId` and `Questionnaire.item.text`

![image|690x340](upload://uRPnc1MwYihyHZR6oCoLo7arJxs.png)

Snapshot of FHIR R4 Resource Questionnaire's **[Structure](https://hl7.org/fhir/R4/questionnaire.html#tabs-struc) tab**.

The elements used to compose the path should be independent of the natural language textual expressions.

---

## Post #16 by @siljelb

[quote="linforest, post:15, topic:6765, full:true"]
Another relevant example might be [FHIR R4 Resource Questionnaire](https://hl7.org/fhir/R4/questionnaire.html)’s elements `Questionnaire.item.linkId` and `Questionnaire.item.text`

The elements used to compose the path should be independent of the natural language textual expressions.
[/quote]

Agree, and this is the default for openEHR formats without the natural language variation. (for reference, see the [locatable class](https://specifications.openehr.org/releases/RM/latest/common.html#_locatable_class), with its `name` and `archetype_node_id` attributes)

---

## Post #17 by @ian.mcnicoll

[quote="siljelb, post:1, topic:6765"]
* the copyright (holder) issues mentioned above
* increased workload on modellers to create shorter names for a potentially large number of questions, while keeping the semantic nuances and ability to tell similar questions apart
* increased workload on editors to get every reviewer to agree on shortened names for a potentially large number of questions
* having to replace the container/element names with the full-length questions again when building templates, leading to even longer canonical paths
[/quote]

One of the lessons from FHIR is that having human language node identifiers is really helpful for developers, even if these are language-dependent. Our language-neutral atCode approach is of course the correct thing to do but it does impose a barrier.

The STRUCTURED and FLAT formats have, in my experience, been of huge value in lowering the barrier to implementers, and a big part of that is the human-language paths, which are automatically generated from the archetype text.

I know not everyone is a fan of FLAT etc, but I suspect there will be future demand for these kinds of human language labels in other formats and contexts such as AQL.

I understand that we have to keep copyright holders happy but IMO as long as we carry the 'long form text' somewhere in the archetype, it does not have to be in the node name, which has until now been seen as a short 'meaning label' for the node.

> So, would *text* contain this short name and then there is an optional (new) *full_text* (or similar) which in this case contains the full question?

The investigation that Sebastian and I did suggested to me, at least, that we could use an annotation/directive at the top level of the archetype to indicate to tooling/viewers to substitute the node Description for the node Name if the Description was populated, e.g. in copyright views or form tooling.

As part of the PROMS discussion, we did come up with some ideas that would allow e.g. CKM or other tools to display the full text in place of the short form, if needed for copyright holders. Our node names are equivalent to technical database column names, and these would never be of interest to copyright holders.

So I feel that the copyright issue, whilst important, is manageable.

The issue of burden on Editors/Reviewers etc. of creating and reviewing shortened forms is more significant but a few mitigations/guidelines could reduce the burden.

`expanded_prostate_cancer_index_composite_EPIC/any_event/how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks/need_to_urinate_frequently_during_the_day`

In that example, I would say that

`need_to_urinate_frequently_during_the_day` (41 chars) is fine and does not need to be truncated

but

`how_big_a_problem_if_any_had_each_of_the_following_been_for_you_during_the_last_4_weeks` (87 chars) should be truncated to

`how_big_a_problem_during_the_last_4_weeks` (41 chars)

1. Roughly speaking, any question under 50 characters is fine. We really only need to consider truncating questions longer than approx 50 characters.
2. The aim for a short question is to retain meaning and context - 'How big a problem' and 'in past 4 weeks'. I'm not sure that exactly equates to 'semantics' - just enough to avoid confusion.

> having to replace the container/element names with the full-length questions again when building templates, leading to even longer canonical paths.

@siljelb I'm not sure why you would need to do this in templates - in forms yes but that could be triggered by the 'use long-form question' directive mentioned above.

---

## Post #18 by @siljelb

[quote="ian.mcnicoll, post:17, topic:6765"]
having human language node identifiers is really helpful for developers
[/quote]

Sure, I have no problem seeing the pros of natural language based paths.

[quote="ian.mcnicoll, post:17, topic:6765"]
Our node names are equivalent to technical database column names , and these would never be of interest to copyright holders.
[/quote]

Are they really equivalent to db column names though? As you say column names aren't of interest to copyright holders, and I'd extend that to anyone except implementers. We regularly ask clinicians to review node names.

[quote="ian.mcnicoll, post:17, topic:6765"]
I’m not sure why you would need to do this in templates - in forms yes but that could be triggered by the ‘use long-form question’ directive mentioned above.
[/quote]

So if I understand you correctly, the proposal is to

1. Add a new attribute to the RM (PATHABLE class?) to capture the original questions, just for PROM tools with long questions
2. Implement support for this new attribute in a fairly long toolchain from a lot of different vendors
3. Add workload for modellers to identify element names longer than some defined number of characters, make up a shortened version which is still unique, and add the original question to the new attribute mentioned above (and for shorter questions, duplicate the element name into this new attribute?)
4. Get archetype reviewers to look at this other attribute when reviewing PROM archetypes, but the element names for all other archetypes
5. Convince copyright holders that this is fine
6. Profit?

To me this looks like a whole lot of extra work for a lot of people, which we don't need. Could I suggest that we instead accept that sometimes natural language paths will be rather long?
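
The label substitution rule and the 50-character guideline from Post #17 can be sketched in a few lines of Python. This is only an illustration of the idea under discussion: the `Node` class, the `use_description` flag (standing in for the proposed archetype-level directive, which has no agreed name or location), and both function names are hypothetical and do not correspond to any existing openEHR tooling.

```python
from dataclasses import dataclass

# Guideline from Post #17: only names above roughly 50 characters need attention.
LONG_NAME_LIMIT = 50


@dataclass
class Node:
    """Hypothetical stand-in for the text and description of an archetype node."""
    name: str              # short node name, used when generating paths
    description: str = ""  # would carry the full question wording


def ui_label(node: Node, use_description: bool) -> str:
    """Return the text a form or viewer would display for `node`.

    `use_description` models the proposed 'use long-form question' directive.
    """
    if use_description and node.description.strip():
        return node.description
    return node.name


def needs_review(segment: str, limit: int = LONG_NAME_LIMIT) -> bool:
    """Flag path segments longer than the guideline limit."""
    return len(segment) > limit


if __name__ == "__main__":
    node = Node(
        name="How big a problem during the last 4 weeks",
        description="Full question text as worded by the PROM",  # placeholder
    )
    print(ui_label(node, use_description=True))
    print(ui_label(node, use_description=False))
    print(needs_review("need_to_urinate_frequently_during_the_day"))
```

A node without a description falls back to its name even when the directive is set, so archetypes that never populate the description would be unaffected.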

---

## Post #19 by @thomas.beale

[quote="siljelb, post:18, topic:6765"]
Add a new attribute to the RM (PATHABLE class?) to capture the original questions, just for PROM tools with long questions
[/quote]

Just as an aside to the main debate: this is not something we should be doing... that will have the effect of adding that data attribute on every node in every archetype, and in all data in any CDR, and mostly it will be null / blank etc. This is an anti-pattern (satisfying a narrow use case with an attribute in a 'god class').

---

## Post #20 by @ian.mcnicoll

[quote="siljelb, post:18, topic:6765"]
Add a new attribute to the RM (PATHABLE class?) to capture the original questions, just for PROM tools with long questions
[/quote]

That was not our thinking. The simple solution is just to use the Description field in the archetype definition, the rule being that if there is a Description for the element, use that in the UI, otherwise use the Node name.

The only risk there is that a Description is being used for a different purpose, so we could add an archetype annotation which says specifically 'Use the Description in the UI'.

So no changes to the RM are required. Clearly there would be an impact on tooling, e.g. CKM or Form builder tooling that would have to understand those rules, but it's not a massive change IMO.

I agree that it does add a burden to modellers/reviewers but if we keep the 'allowed character limit' fairly high, the burden might be lower than seems the case now.

I agree that our node names are not 'column names' as such but what I meant was that other people who are building on copyrighted scores will have their own internal field names for technical purposes, and the copyright owners will not normally be bothered about these internal names, as long as the UI is described correctly.

I was not quite sure about the profit bit!!

One other option which might solve this conundrum is that we could possibly use the short names at template level, i.e. long-form questions in the archetype but substitute a short form in templates where the system needs them for whatever technical reason, including FLAT format.

Do we have any sense of the various lengths of 'long questions' - maybe the number of really problematic items is lower than we think?

---

## Post #21 by @linforest

[quote="ian.mcnicoll, post:20, topic:6765"]
the Description field in the archetype definition
[/quote]

Does this field have a kinda "`purpose`" property like the FHIR [`CodeSystem.concept.designation.use`](https://hl7.org/fhir/R4/codesystem-definitions.html#CodeSystem.concept.designation.use)?

---

## Post #22 by @siljelb

Since I had misunderstood the initial proposed step, let me try again:

1. Add an archetype annotation (to the AOM?) specifying that tools are to use the Description field instead of the element name for labelling a UI field. Would this annotation be set per archetype, per element, or perhaps per container?
2. Implement support for this new annotation in relevant tools such as CKM, form builders and possibly form renderers
3. Add workload for modellers to identify element names longer than some defined number of characters, make up a shortened version which is still unique, add the original question to the Description field, and potentially set the annotation mentioned above
4. Get archetype reviewers to look at the Description when reviewing PROM archetypes, but the element names for all other archetypes
5. Convince copyright holders that this is fine

And if I understand correctly, the alternative is:

1. Slightly inconvenience PROM implementers who are using the natural language FLAT format paths

I don't have a strong impression about the percentage of PROM questions which are longer than 50 characters. Perhaps people who have worked more actively with implementing PROM tools would know?

I've had a quick look at the following archetypes. Some of them may not be PROMs, but they're all standardised questionnaires intended to be filled out by the patient themselves. Numbers in brackets are the number of element and container names over 50 characters in length, and the total number of elements and containers.

* EPIC (32 of 70)
* MAP-hand (2 of 23)
* EQ-5D-5L (0 of 7)
* Birth Satisfaction Scale-Revised (0 of 14)
* Western Ontario and McMaster Universities Arthritis Index (0 of 28)
* Simplified Endoscopic Disease Severity Score for Crohn’s Disease (0 of 57)
* Short Inflammatory Bowel Disease Questionnaire (2 of 16)

---

## Post #23 by @retwet

This question came up again while I was modelling around 20 questionnaires for a German use case. I wasn’t aware of this thread before, so I had posted a related question here: [Long question names from validated instruments](https://discourse.openehr.org/t/long-question-names-from-validated-instruments/11574/2).

From what I’ve read so far, it seems like there isn’t a clear consensus yet. My current (and possibly incomplete) understanding is that the modelling approach should focus less on how an instrument is used in real-world settings, and more on how to store the resulting data in a structured, semantic way.

Some of these instruments can be quite complex: questions and items can be several sentences long, include images, or make formatting suggestions. Also, some questionnaires or interviews come with a guidance manual that defines the exact questions, context, rules for orders based on previous answers, and even a separate data capture sheet. It feels like this level of detail can’t really be represented in the element name, or even in an Archetype or Template?

I’m also not sure if it’s a good idea to try and convert a paper-based tool directly, one-to-one, into an archetype and then just use that archetype for data capture. Is there a way to define rules about hiding and showing elements based on previous answers?

I’d really appreciate any advice on how to approach this, especially as we’d like to propose our new archetypes for CKM review in the future. I just want to make sure we’re following best practices and not missing something fundamental around element naming.

---

**Canonical:** https://discourse.openehr.org/t/natural-language-based-flat-format-and-long-container-element-names/6765
**Original content:** https://discourse.openehr.org/t/natural-language-based-flat-format-and-long-container-element-names/6765