openEHR Mailing Lists Archive - Imported to Discourse

With the assistance of my trusty ‘Sancho’, even incredibly tedious and forbidding tasks eventually become tractable…

I’ve had the archives from openEHR’s old email lists for a while, but hadn’t got around to processing them for import and going through the pain of shaving all the yaks required to get it looking acceptable. Over the last few days I found the time to look at it and get the job done. I hope you’ll agree that it’s in good shape, and I’m pleased that the entire archive is now together on Discourse, covering around 24 years of openEHR.

Content added

9 archive categories containing 4,734 topics and 20,156 posts from 649 unique authors, spanning May 2002 to January 2020. The largest lists are Technical (9,663 posts), Clinical (4,242 posts), and Implementers (2,408 posts). The Eiffel and Java categories are mostly SVN commit notifications. All archive topics are closed and archived.

Announcements (archive)
Clinical (archive)
Decision Support (archive)
Implementers (archive)
ISO 13606 (archive)
openEHR.org Website (archive)
Reference Implementation: Eiffel (archive)
Reference Implementation: Java (archive)
Technical (archive)

Import process

The source material was Thunderbird .sbd mbox archives from 9 openEHR public mailing lists. Discourse’s built-in mbox importer handled the initial parsing and post creation. Threading required extra work: mailing list software often strips or corrupts the In-Reply-To and References headers that link replies to their parent messages, which initially caused over 60% of posts to appear as isolated single-post topics. A subject-line-based fallback was used to reconstruct missing reply chains in the importer’s index database, with cycle detection to avoid circular references. This reduced single-post topics in the conversation-heavy categories (Technical, Clinical, Implementers) from over 30% down to around 24%.
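
For anyone curious what a subject-line fallback with cycle detection can look like, here is a minimal sketch. It is not the code used for the import: the message fields (`id`, `subject`, `date`, `parent`), the prefix patterns and the function names are all assumptions made for illustration, and it works on plain dictionaries rather than the importer’s index database.

```python
import re

# Assumed patterns: a leading list tag such as "[openEHR-technical]" and
# common reply/forward prefixes. Real archives may need more variants.
_TAG = re.compile(r"^\s*\[[^\]]*\]\s*")
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|aw|sv)\s*:\s*)+", re.IGNORECASE)


def normalise_subject(subject):
    """Strip list tags and reply prefixes so replies share a key with their root."""
    s = subject
    previous = None
    while previous != s:  # repeat until nothing more can be stripped
        previous = s
        s = _TAG.sub("", s)
        s = _PREFIX.sub("", s)
    return " ".join(s.split()).lower()


def creates_cycle(parent_of, child, candidate):
    """Return True if making `candidate` the parent of `child` would form a loop."""
    seen = set()
    node = candidate
    while node is not None and node not in seen:
        if node == child:
            return True
        seen.add(node)
        node = parent_of.get(node)
    return False


def link_orphans(messages):
    """Attach parentless messages to the earliest message with the same subject."""
    parent_of = {m["id"]: m["parent"] for m in messages}
    first_by_subject = {}
    for m in sorted(messages, key=lambda m: m["date"]):
        key = normalise_subject(m["subject"])
        root = first_by_subject.setdefault(key, m["id"])
        if parent_of[m["id"]] is not None or root == m["id"]:
            continue  # already threaded by headers, or this is the thread root
        if not creates_cycle(parent_of, m["id"], root):
            parent_of[m["id"]] = root
    return parent_of
```

A real pass would probably add further guards, such as a time window or requiring a reply prefix, so that unrelated messages that happen to share a subject are not merged into one topic.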

Cleanup

Post formatting cleanup included stripping legacy <big> HTML tags that rendered poorly in Discourse. Email notifications for all staged import users were disabled to prevent unintended mail.
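
As an illustration of the tag stripping, a sketch along these lines would remove `<big>` and `</big>` while leaving the enclosed text alone. It assumes the raw post text is available as a plain string; the function name is hypothetical and this is not necessarily how the cleanup was actually run.

```python
import re

# Matches opening and closing <big> tags, with or without attributes,
# in any letter case. The \b stops it matching other tag names.
_BIG_TAG = re.compile(r"</?big\b[^>]*>", re.IGNORECASE)


def strip_big_tags(raw):
    """Remove <big> and </big> tags but keep the text they wrapped."""
    return _BIG_TAG.sub("", raw)
```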

Users

For user attribution, many contributors posted from multiple email addresses over the 18-year span as they changed organisations. A fuzzy name-matching pass identified 70 staged import accounts that corresponded to 42 existing forum members, and these were merged so that archive posts appear under the correct current user profiles. Matches were reviewed manually by @marcusbaw before execution, and only high-confidence matches (exact or near-exact name matches after accent normalisation and title stripping) were merged.
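
The name matching can be sketched roughly as follows. This is an illustration under assumptions rather than the script that was used: the title list, the 0.92 threshold and the function names are hypothetical, and the output is only a list of candidates for manual review, not an automatic merge.

```python
import difflib
import re
import unicodedata

# Hypothetical list of titles to ignore when comparing names.
_TITLES = {"dr", "prof", "professor", "mr", "mrs", "ms"}


def normalise_name(name):
    """Lowercase a name, drop accents and remove common titles."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"\w+", no_accents.lower())
    return " ".join(t for t in tokens if t not in _TITLES)


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalised names."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def find_matches(staged_names, member_names, threshold=0.92):
    """Return (staged, member, score) candidates at or above the threshold."""
    matches = []
    for staged in staged_names:
        key = normalise_name(staged)
        scored = [(similarity(key, normalise_name(m)), m) for m in member_names]
        if not key or not scored:
            continue
        score, member = max(scored)  # best-scoring existing member
        if score >= threshold:
            matches.append((staged, member, score))
    return matches
```

Keeping the threshold high and reviewing every candidate by hand trades a few missed merges for a much lower risk of attributing posts to the wrong person.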

Final checks

I had the content scanned for anything offensive or otherwise problematic, although that was exceedingly unlikely given that this was an internet-facing professional mailing list. Of course, if anything has escaped my attention, please do let me know and I’ll remedy it immediately.

Using it

Search: It’s all of course searchable, so conversations from way back when are now findable again. Our Discourse has AI-powered search and I will continue to enable better LLM-driven tools to surface knowledge in ways that keep it easily consumable despite its vast size.

MCP: Discourse has an MCP server which you can run locally. This is another way to let your agents access and process this vast trove of information.

llms.txt: All of the content will be reflected in the Discourse llms.txt that is set up here. See also: AI support on specifications.openehr.org - #2 by marcusbaw

Please let me know if there is anything else I can do to improve the forum and its knowledge base!

Lovely work. Thanks for this.

Who is Sancho? :upside_down_face:

Claude, but I call him Sancho.
Sancho Panza was Don Quixote’s squire.

I’m stealing that.

This is great! Fantastic work @marcusbaw and Sancho :wink:

I propose a little game: try to find your earliest post in the lists. For me, I found this one. Although it is not correctly attributed, my email still appears in the following response.

Jun 16th, 2006

I love the game idea @damoca! I noticed many familiar names during the import process, so a LOT of you are in there!

It is possible to re-attribute any misattributed posts, but obvs I’d rather not have to do this en masse! I might be able to automate some of that process if there is need. Sancho might do it if asked nicely.

The exception is - if you’ve ended up with a second ‘staged’ account, and all your imported email list posts are under that account, it’s relatively easy and quick to merge the accounts so everything is back to one account.

Wonderful!

Love the game idea! I’m pretty sure this is my earliest post: Archetypes for "Body fluid or substance" and "Bodily output"

This one is not the first post I was involved in, but brings back old memories from 2013: Erik Sundvall's PhD Defence - Online Edition

In glorious 480px YouTube format with opponent Dipak Kalra grounded in Amsterdam due to snow storms - but the internet solved such problems already then…

And the paper version is described in this old thread PhD thesis online: Scalability and Semantic Sustainability in Electronic Health Record Systems

This was fun times too in 2005: Problem with ADL_parser - #11 by erik.sundvall - we did not have AI writing code back then, but we had students!