# `oct` - a new open clinical terminology

**Category:** [General Discussion](https://discourse.openehr.org/c/general-discussion/132)
**Created:** 2025-10-27 21:27 UTC
**Views:** 302
**Replies:** 25
**URL:** https://discourse.openehr.org/t/oct-a-new-open-clinical-terminology/11556

---

## Post #1 by @marcusbaw

Well, I do like to give you all a laugh now and then, so here goes.

I have started a project to build an open clinical terminology. I don't like SNOMED-CT's licensing, or its structure, or even its massive and stupid numeric identifiers.

I realise that proposing a project like this is at the extreme fringes of sanity, but I'm up for having the resultant discussions. I would love to collaborate with any other lunatics that think an open clinical terminology can be achieved.

Here's the project: https://github.com/pacharanero/open-terminology

The idea is to crowdsource the clinical content over... well, however long it takes to do that.

Early design decisions are being made right now, so if you have views on this kind of thing then you should join in, either in the issues or by discussing the project on Open Health Hub: https://openhealthhub.org/c/oct/58

---

## Post #2 by @joostholslag

Hahah first wishes that come to mind:

1. A standard language for defining terms (I guess RDF)
2. a strict(er) ontological approach
3. easy migration paths for existing terminologies
4. Ability to replace internal codes in ADL on a per archetype/template basis
5. Sets of codes (refsets) as FHIR valuesets
6. A permanent uri per code
7. A defined namespace for the terminology.
8. Political alignment with IHTSDO

@grahamegrieve mentioned his most desired wish is to fix snomed. He probably has some ideas too.

maybe it’s not too crazy after all :p

edit: and most important of all, don’t try to solve everything!
So don’t try to be an information model, so no default context, no codes for ‘family history of infected breast implant with conservative treatment after a dog bite by a chihuahua’ or whatever exotic combination snomed ct introduced in the latest release.

---

## Post #3 by @grahamegrieve

1. down with RDF. It’s where practical projects go to die
2. I’d go for simple mono-hierarchical tables with a strict grammar
3. mappings to existing terminologies
4. easy to deploy to FHIR terminology servers

The really hard thing to resolve is extensibility/customization aka post-coordination and/or distributed governance. It’s easy to think that you don’t need that in the early phases where you can be both nimble and methodical, but later it becomes a growing issue and if you don’t have a solid solution from the start, you’ll be doomed by the time it bites.

---

## Post #4 by @siljelb

It's probably not necessary to bring in Cimino, but I'll do it anyway: [Desiderata for Controlled Medical Vocabularies in the Twenty-First Century - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC3415631/)

---

## Post #5 by @SevKohler

Second Grahame's comment in general:

1. I agree, no RDF please, it's only useful if using existing ontologies, which comes with other problems.
2. this has to be carefully designed from the start

Input:

1. i like snomed expressions, i would like to see them here too (should OCT be a classification or a nomenclature like snomed?)
2. can we align it somehow to openEHR internal codings and terminology?
3. Maybe we can bring omop vocabs in somehow.

---

## Post #6 by @ian.mcnicoll

If I have read @grahamegrieve correctly, I would agree with the idea of seeing this as a simplification layer on top of existing terminologies, rather than ‘starting again’. There are other options like ICPC too.
The licensing issue with SNOMED does seem to be ‘going away’ albeit slower than we’d like, and building/maintaining an international terminology with all the challenges of translation etc is massive.

---

## Post #7 by @marcusbaw

Thanks all for the replies and (what I'm interpreting as) encouragement. At EHRCON25 Grahame said you have to be "naive and optimistic" to do this kind of thing. Bloodyminded is also a Yorkshire trait I will be leaning into.

### Replies

[quote="joostholslag, post:2, topic:11556"]
Political alignment with IHTSDO
[/quote]

I think I'm more likely to get legal letters than political alignment. The existence of an open terminology might scare them into making SNOMED more open (it worked with Microsoft and .docx in the face of OpenOffice/LibreOffice and Apple iWork)

[quote="grahamegrieve, post:3, topic:11556"]
down with RDF. It’s where practical projects go to die
[/quote]

Agreed! Trying not to make this project hurt anyone's brain too much, least of all mine.

[quote="siljelb, post:4, topic:11556, full:true"]
It’s probably not necessary to bring in Cimino, but I’ll do it anyway: [Desiderata for Controlled Medical Vocabularies in the Twenty-First Century - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC3415631/)
[/quote]

Thanks - agree - I read this as part of my planning and made notes. I will be incorporating the bits of the Desiderata that I think are sensible and still relevant in 2025.

[quote="grahamegrieve, post:3, topic:11556"]
mappings to existing terminologies
[/quote]

Definitely, although these will have to be crowd-sourced

---

## Post #8 by @marcusbaw

#### Notes:

* I am listening to/reading carefully all your feedback and will bring into the project what I think makes sense. However I am determined not to make OCT some maddeningly complex superset of SNOMED
* **But** the only way to get your views implemented in OCT is to join in, become part of the team. There are no pundits, only players. The field of play is the GitHub repo.
* I am **hard separating** the Namespace from any Hierarchy/Ontology/Expression Languages. Bundling both in the one system is what makes the dominant terminologies completely unfathomable except to about 14 people worldwide. Namespace first, then 'linking' layers on top of that (layers which need not all come from within the OCT project)
* I will incorporate any bits of existing terminologies (eg ICPC, [GPS](https://www.snomed.org/gps), any parts of Read that are public domain...) IF they are license-compatible with OCT.

---

## Post #9 by @linforest

More or less, it looks like yet another *UMLS-like* (or *OHDSI Athena database etc.*) *initiative.*

---

## Post #10 by @grahamegrieve

I think it’s quite different to those, which seek to be unification / mapping projects. This is a project to come up with a genuine open source terminology for clinical use

---

## Post #11 by @grahamegrieve

BTW, as much as I respect Jim Cimino and his desiderata, he is wrong in one respect:

> 2.4 Nonsemantic Concept Identifier

That section is incoherent. The concept identifier (= code) is not the display name. And it shouldn’t (as stated in the text) include the hierarchy in the concept identifier (I completely agree with that). But neither of those things is a defense of why the code should be non-semantic. And if you endorse concept-permanence, then the code can’t be redefined.

That’s why FHIR code systems have semantic (but not structural) codes. But it does get harder as the code system gets bigger.

---

## Post #12 by @joostholslag

[quote="grahamegrieve, post:11, topic:11556"]
That’s why FHIR code systems have semantic (but not structural) codes. But it does get harder as the code system gets bigger.
[/quote]

This is a debate in openEHR as well. openEHR has non semantic codes (atXXXX) which in a specific implementation artefact (webtemplate) get replaced with the default language’s name for that field.
I don’t want to have that debate in this thread, just trying to understand what you’re saying because it seems relevant to that (other) debate. So FHIR names elements but doesn’t number them (‘structural codes’ element1, element2 etc, nor hierarchy element1.1, like ICD10 does), right? What gets harder: using semantic codes in bigger (than FHIR) code systems, or bigger code bases that implement structural codes? I think both are an issue/challenge, but I’m trying to understand your point and whether it’s also a recommendation?

---

## Post #13 by @marcusbaw

[quote="grahamegrieve, post:11, topic:11556"]
The concept identifier (= code) is not the display name. And it shouldn’t (as stated in the text) include the hierarchy in the concept identifier (I completely agree with that). But neither of those things is a defense of why the code should be non-semantic.
[/quote]

I think you're right @grahamegrieve but in the end I went with nonsemantic IDs because it's more friendly to internationalisation.

However there is another way to have an ID that *has meaning*, but which is still *non-linguistic* - you could make the ID the SHA1 hash of the description (and you could probably get away with a shortened version, say the first 7 chars for convenience). So the ID would have meaning but not to humans. It would enable very simple checking of the ID: you just hash the description.

Is this feature actually **valuable** though, or just a bit too smartypants?

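
A minimal sketch of the hashed-ID idea from this post, assuming the ID is simply the first 7 hex characters of the SHA-1 digest of the description's UTF-8 bytes. The function names `hashed_id` and `id_matches` are hypothetical and not part of the OCT project, and no normalisation of case or whitespace is applied, which a real scheme would probably have to decide on. It also shows the trade-off that comes with the approach: any edit to the description yields a different ID.

```python
import hashlib


def hashed_id(description: str, length: int = 7) -> str:
    """Return the first `length` hex characters of the SHA-1 digest.

    Assumption: the description is hashed exactly as given (UTF-8 bytes),
    with no trimming or case folding.
    """
    digest = hashlib.sha1(description.encode("utf-8")).hexdigest()
    return digest[:length]


def id_matches(description: str, candidate_id: str) -> bool:
    """Check an ID by re-hashing the description, as the post suggests."""
    expected = hashed_id(description, len(candidate_id))
    return expected == candidate_id.lower()


if __name__ == "__main__":
    term = "Haemochromatosis"
    term_id = hashed_id(term)
    print(term_id, id_matches(term, term_id))
    # Rewording the description gives a different digest, so the old ID
    # no longer verifies against the new text.
    print(id_matches("Bronze diabetes", term_id))
```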

---

## Post #14 by @marcusbaw

[quote="linforest, post:9, topic:11556"]
yet another *UMLS-like* (or *OHDSI Athena database etc.*) *initiative.*
[/quote]

It isn't

[quote="grahamegrieve, post:10, topic:11556"]
This is a project to come up with a genuine open source terminology for clinical use
[/quote]

Yes exactly this

[quote="joostholslag, post:2, topic:11556"]
A permanent uri per code
[/quote]

I'm working on the current namespace being something like `https://openterminology.org/terms/en-GB/R9J3JQ` (not working yet but not far off)

What would you want such an endpoint to return?

Don't terminology servers just add a huge TLS overhead to everything? Can't we make it so that the terms can just be built into software directly like a library?

---

## Post #15 by @siljelb

[quote="marcusbaw, post:13, topic:11556"]
you could make the ID the SHA1 hash of the description (and you could probably get away with a shortened version, say the first 7 chars for convenience)
[/quote]

So this is basically a further truncated UUIDv5?

---

## Post #16 by @grahamegrieve

catching up

[quote="grahamegrieve, post:11, topic:11556"]
So FHIR names elements but doesn’t number
[/quote]

I was talking about codes in code systems, not elements in resources

[quote="grahamegrieve, post:11, topic:11556"]
whether it’s also a recommendation?
[/quote]

I would use codes linked to the definitions, yes.

[quote="grahamegrieve, post:11, topic:11556"]
I think you’re right @grahamegrieve but in the end I went with nonsemantic IDs because it’s more friendly to internationalisation.
[/quote]

can be wrong in all languages at once, instead of right in the most common language?

[quote="grahamegrieve, post:11, topic:11556"]
you could make the ID the SHA1 hash of the description
[/quote]

but you will be adjusting and clarifying the description, that is certain

And codes as short semantic signifiers really help people visualise the structure, which is challenging.
[quote="grahamegrieve, post:11, topic:11556"]
What would you want such an endpoint to return?
[/quote]

combined terminology service and web site (depending on the `Accept:` header)

> Don’t terminology servers just add a huge TLS overhead to everything? Can’t we make it so that the terms can just be built into software directly like a library?

indeed, you want to be able to do a library, but the point of a service is to decouple between full software upgrade and changing terminology content. So the question is how you decouple UI from content. There’s two models, which should both be supported: the standard terminology server approach, and the software max approach, where only the raw tables are input.

Grahame

---

## Post #17 by @joostholslag

[quote="marcusbaw, post:14, topic:11556"]
I’m working on the current namespace being something like `https://openterminology.org/terms/en-GB/R9J3JQ` (not working yet but not far off) What would you want such an endpoint to return?
[/quote]

Well, to me it’s more about a consistent identifier (so not necessarily a locator) available to different implementations (openehr, FHIR etc). It would be helpful if the url shows some basic info about the code to the user (I’m thinking modeller, not developer per se). What snomed does in this regard with the sct.info/xxxx works well for me.

Does that help?

---

## Post #18 by @linforest

[quote="marcusbaw, post:13, topic:11556"]
However there is another way to have an ID that *has meaning*, but which is still *non-linguistic* - you could make the ID the SHA1 hash of the description
[/quote]

The HASH value would change as the description changes.

[quote="grahamegrieve, post:11, topic:11556"]
That’s why FHIR code systems have semantic (but not structural) codes. But it does get harder as the code system gets bigger.
[/quote]

For humans, the semantic codes within the FHIR code systems indeed appear very user-friendly and straightforward / easy to understand.
However, non-semantic concept identifiers/codes are a fundamental principle, although human-readable/understandable concept identifiers/codes may seem feasible for many smaller code systems, especially when the few concepts involved have clear and unambiguous meanings. Because the linguistic expression of concepts is prone to change, especially when their meanings shift, the semantic codes might have to be altered.

---

## Post #19 by @mjlawley

What do you do when the code name has to change? For example bronze_diabetes → haemochromatosis, or renaming away from Nazi associations (Asperger syndrome, …)

The other problem is that words change in meaning over time. Not a great example, clinically, but easy to comprehend: “gay”

---

## Post #20 by @marcusbaw

[quote="linforest, post:18, topic:11556"]
The HASH value would change as the description changes.
[/quote]

Yes, I think this is why we'll steer clear of hashed-content IDs. Hashing (of a random source) can still be a way to get an ID (as @siljelb points out, that is how UUIDv5 works)

[quote="grahamegrieve, post:16, topic:11556"]
can be wrong in all languages at once, instead of right in the most common language?
[/quote]

I can see how English-language semantic IDs in FHIR code systems work well. Here is an example shared by a colleague recently on openhealthhub.org: https://nw-gmsa.github.io/CodeSystem-NWGMSA.html

![image|690x275](upload://2tQoiGREekRxgsPrlQsMtwViiRc.png)

They seem to be a PascalCase concatenation of the content of the Display (Description)

Wouldn't this get cumbersome for very long Descriptions? However I can see the value in the Code being pretty obvious in its meaning.

---

## Post #21 by @marcusbaw

[quote="mjlawley, post:19, topic:11556"]
What do you do when the code name has to change. For example bronze_diabetes → haemochromatosis or renaming away from Nazi associations (Asperger syndrome, …)
[/quote]

I am proposing that nothing is ever deleted from the Namespace.
New codes will be added to represent the new way to describe that concept. A term which is no longer appropriate to use will be made inactive. We will support ways to discover **only** active concepts (eg an index of active concepts)

[quote="mjlawley, post:19, topic:11556"]
words change in meaning over time.
[/quote]

This is true, but more of a problem for those extracting data from the system than for those inputting it at the time they input it (which is generally 'now', historically). The correct solutions for extracting meaningful data over a long period of time (eg a retrospective longitudinal study) will vary according to the type of project, the question being asked, and the clinical situation.

---

## Post #22 by @grahamegrieve

[quote="marcusbaw, post:20, topic:11556"]
Wouldn’t this get cumbersome for very long Descriptions?
[/quote]

well, it forces you to be very clear on which parts of the descriptions are differentiating and which are clarifying. But yes, as the set of codes gets longer, and the granularity gets finer, the codes get longer, and harder to manage. Unique numbers per SCT don’t have this problem, but they have obvious disadvantages.

One chooses one’s poison :frowning:

---

## Post #23 by @erik.sundvall

Seems people here don't like RDF, does that mean you don't like OWL (in any form/dialect) either?

Whatever you pick, it would be good to be able to run a [_reasoner_](https://en.wikipedia.org/wiki/Semantic_reasoner) (see [also this](https://arxiv.org/pdf/2309.06888)) to find inconsistencies etc., otherwise things will be hard to maintain. Reasoner capabilities come for free with OWL and some other formalisms with a sufficiently strong logical foundation.

If not using OWL, at least going for some graph that can be queried via [GQL](https://en.wikipedia.org/wiki/Graph_Query_Language) would be good. Having the terminology system and the EHR in the same GQL-queryable database would be nice.
Also of course starting a new Snomed CT competitor would be madness yesterday, maybe today too or maybe not with clever use of AI for information gathering and structuring. [Same madness level goes for starting a new Wikipedia, [right Elon](https://grokipedia.com/)? (But Grokipedia does not necessarily contribute to more openness in the world...)]

P.S. Also have a chat with @Daniel_Karlsson

---

## Post #24 by @erik.sundvall

Regarding unique identifiers I would go for non-semantic short (alphanumeric) ones, but also allow an optional canonical (unique) alias (likely in English) for those that prefer to maintain and use that for some common codes. And make sure APIs etc can accept both and cross-translate.

---

## Post #25 by @linforest

[quote="erik.sundvall, post:23, topic:11556"]
Seems people here don’t like RDF, does that mean you don’t like OWL (in any form/dialect) either? …
[/quote]

OWL and OBO are my favorite. And also SPARQL.

---

## Post #26 by @marcusbaw

[quote="erik.sundvall, post:24, topic:11556"]
also allow an optional canonical (unique) alias
[/quote]

I like this idea. Kind of gives us the best of both worlds. If we have a vast, immutable namespace, the 'graphs' built on top of that can be anything we want. They can add hierarchies, groupings/refsets, or convenience alternative naming layers.

At the moment I am leaning towards using [Crockford base32](https://www.crockford.com/base32.html) for the identifier, which gets us 5 bits per character of ID length, while being URL-safe and case-insensitive-filesystem-safe. A base32 ID 7 chars long would give us 32^7, about 34 billion, terms. Is this enough for the time being? SNOMED has a 10^9 = 1 billion namespace size, of which it is currently using ~360k.

Alternatively, a truncated hash of the Description (ie a UUIDv5) might work; this would look a little bit like a Git commit hash. Yes, we won't be able to change the Description without changing the ID.
But since we would have an insanely large namespace this wouldn't matter. We'd just keep track of the changes in the graphs.
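
A rough check of the namespace arithmetic in post #26, assuming a fixed-width 7-character ID drawn from the Crockford base32 alphabet. The helper names `encode_crockford` and `decode_crockford`, and the idea of encoding a sequential integer, are illustrative assumptions rather than the project's actual allocation scheme; the optional check symbol and hyphen handling from the Crockford description are left out.

```python
# Crockford base32 alphabet: digits plus letters, omitting I, L, O and U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
ID_LENGTH = 7  # assumption taken from the post: 7 characters, 5 bits each
NAMESPACE_SIZE = 32 ** ID_LENGTH  # 34359738368, i.e. 2**35


def encode_crockford(number: int, length: int = ID_LENGTH) -> str:
    """Encode a non-negative integer as a zero-padded Crockford base32 string."""
    if not 0 <= number < 32 ** length:
        raise ValueError("number does not fit in the requested ID length")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, 32)
        chars.append(CROCKFORD_ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_crockford(text: str) -> int:
    """Decode a Crockford base32 string, ignoring case.

    Following Crockford's decoding rules, O is read as 0 and I or L as 1,
    which is what makes hand-typed IDs forgiving.
    """
    normalised = text.upper().replace("O", "0").replace("I", "1").replace("L", "1")
    value = 0
    for char in normalised:
        value = value * 32 + CROCKFORD_ALPHABET.index(char)
    return value


if __name__ == "__main__":
    print(NAMESPACE_SIZE)            # 34359738368
    print(NAMESPACE_SIZE // 10**9)   # how many 10^9-sized namespaces fit
    print(encode_crockford(360_000))
```

Because decoding is case-insensitive, `decode_crockford("r9j3jq")` and `decode_crockford("R9J3JQ")` give the same integer, which is the property the post is relying on for case-insensitive filesystems.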