With the assistance of my trusty ‘Sancho’, even incredibly tedious and forbidding tasks eventually become tractable…
I’ve had the archives from openEHR’s old email lists for a while, but hadn’t got around to processing them for import, or to shaving all the yaks required to get them looking acceptable. Over the last few days I found the time to get the job done. I hope you’ll agree that the result is in good shape, and I’m pleased that the entire archive is now together on Discourse, covering around 24 years of openEHR.
Content added
9 archive categories containing 4,734 topics and 20,156 posts from 649 unique authors, spanning May 2002 to January 2020. The largest lists are Technical (9,663 posts), Clinical (4,242 posts), and Implementers (2,408 posts). The Eiffel and Java categories are mostly SVN commit notifications. All archive topics are closed and archived.
Announcements (archive)
Clinical (archive)
Decision Support (archive)
Implementers (archive)
ISO 13606 (archive)
openEHR.org Website (archive)
Reference Implementation: Eiffel (archive)
Reference Implementation: Java (archive)
Technical (archive)
Import process
The source material was Thunderbird .sbd mbox archives from 9 openEHR public mailing lists. Discourse’s built-in mbox importer handled the initial parsing and post creation. Threading required extra work: mailing list software often strips or corrupts the In-Reply-To and References headers that link replies to their parent messages, which initially caused over 60% of posts to appear as isolated single-post topics. To reconstruct the missing reply chains, a subject-line-based fallback was applied to the importer’s index database, with cycle detection to avoid circular references. This reduced single-post topics in the conversation-heavy categories (Technical, Clinical, Implementers) from over 30% down to around 24%.
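For anyone curious how a subject-line fallback with cycle detection can work, here is a minimal sketch in Python. It is not the code that ran against the importer’s index database; the message dictionaries, the function names and the list of reply prefixes are all assumptions made for illustration. It also assumes that any message sharing a normalised subject with an earlier one is a reply to it, which is a simplification.

```python
import re

# Strip any run of leading reply/forward prefixes such as "Re:", "RE:", "Fwd:", "AW:".
# The prefix list is an assumption; real archives may need more variants.
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)


def normalise_subject(subject):
    """Drop reply prefixes, collapse whitespace and lower-case the subject."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def would_create_cycle(parents, child_id, parent_id):
    """Return True if linking child_id to parent_id would close a loop."""
    seen = set()
    current = parent_id
    while current is not None and current not in seen:
        if current == child_id:
            return True
        seen.add(current)
        current = parents.get(current)
    return False


def rebuild_threads(messages):
    """Map each message id to its parent id (None for a topic starter).

    messages: list of dicts with 'id', 'subject', 'date' and 'in_reply_to'
    (hypothetical field names; 'date' can be any sortable value).
    """
    known_ids = {m["id"] for m in messages}
    parents = {}
    first_by_subject = {}
    for msg in sorted(messages, key=lambda m: m["date"]):
        key = normalise_subject(msg["subject"])
        parent = msg.get("in_reply_to")
        if parent not in known_ids:
            # Header missing or pointing outside the archive: fall back to
            # the earliest message seen so far with the same subject.
            parent = first_by_subject.get(key)
        if parent == msg["id"] or would_create_cycle(parents, msg["id"], parent):
            parent = None  # refuse links that would make the thread circular
        parents[msg["id"]] = parent
        first_by_subject.setdefault(key, msg["id"])
    return parents
```

In this sketch a reply whose header was stripped is attached to the first message with the same subject rather than to its true parent, so the topic is recovered even though the exact reply nesting is not.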
Cleanup
Post formatting cleanup included stripping legacy <big> HTML tags that rendered poorly in Discourse. Email notifications for all staged import users were disabled to prevent unintended mail.
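As an illustration of the tag stripping, a small Python sketch follows. The regular expression and the function name are my own assumptions rather than the actual cleanup script; the idea is simply to remove the tags while keeping the text they wrapped.

```python
import re

# Matches opening and closing <big> tags, with optional attributes, in any case.
# The \b stops it from matching other tag names that merely start with "big".
BIG_TAG = re.compile(r"</?big\b[^>]*>", re.IGNORECASE)


def strip_big_tags(raw):
    """Remove <big> and </big> tags but keep the text between them."""
    return BIG_TAG.sub("", raw)
```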
Users
User attribution needed care, because many contributors posted from multiple email addresses over the 18-year span as they changed organisations. A fuzzy name-matching pass identified 70 staged import accounts that corresponded to 42 existing forum members, and these were merged so that archive posts appear under the correct current user profiles. Matches were reviewed manually by @marcusbaw before execution, and only high-confidence matches (exact or near-exact name matches after accent normalisation and title stripping) were merged.
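The matching idea can be sketched roughly as below, using only the Python standard library. The title list, the 0.92 threshold and the function names are assumptions for illustration, not the values actually used. The output is a list of candidates for manual review, not something to merge automatically.

```python
import difflib
import re
import unicodedata

# Hypothetical list of titles to drop; extend as needed.
TITLES = {"dr", "prof", "professor", "mr", "mrs", "ms", "phd", "md"}


def normalise_name(name):
    """Strip accents, punctuation and titles; return a lower-case name."""
    # NFKD splits accented letters into a base letter plus combining marks.
    # Letters with no decomposition (for example "ø") are not handled here.
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z]+", base.lower())
    return " ".join(w for w in words if w not in TITLES)


def candidate_matches(staged_names, member_names, threshold=0.92):
    """Yield (staged, member, score) for pairs that clear the threshold."""
    members = {m: normalise_name(m) for m in member_names}
    for staged in staged_names:
        key = normalise_name(staged)
        if not key:
            continue  # nothing left to compare after normalisation
        for member, member_key in members.items():
            score = difflib.SequenceMatcher(None, key, member_key).ratio()
            if score >= threshold:
                yield staged, member, score
```

A high threshold keeps the candidate list short and biased towards exact or near-exact names, which suits a workflow where every proposed merge is checked by a person before it is executed.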
Final checks
I had the archive scanned for anything offensive or otherwise problematic, knowing that this was exceedingly unlikely because it was an internet-facing professional mailing list. Of course, if anything has escaped my attention, please do let me know and I’ll remedy it immediately.
Using it
Search: It’s all of course searchable, so conversations from way back when are now findable again. Our Discourse has AI-powered search and I will continue to enable better LLM-driven tools to surface knowledge in ways that keep it easily consumable despite its vast size.
MCP: Discourse has an MCP server which you can run locally. This is another way to let your agents access and process this vast trove of information.
llms.txt: All of the content will be reflected in the Discourse llms.txt that is set up here. See also: AI support on specifications.openehr.org - #2 by marcusbaw
Please let me know if there is anything else I can do to improve the forum and its knowledge base!