I’d like to suggest considering a Distributed SQL database for this project. If you’re not familiar with this database category, it covers databases that are built natively as distributed systems. As “distributed-native” systems they have that in common with NoSQL databases like Cassandra and MongoDB, mentioned in earlier comments here. But this category of databases also supports “SQL”, as the name indicates; specifically, “SQL” refers to relational algebra and support for ANSI-standard SQL.

In the past, we as an industry moved to NoSQL systems with the idea that relational databases could not scale cost-effectively, or at all in some cases. For the RDBMS technology of that time (~2007) that assertion was true, but Distributed SQL databases have since solved that problem and can effectively scale relational joins in a distributed system.

Distributed SQL databases are easy to adopt partly because they use existing common database protocols, viz. MySQL and PostgreSQL, so they work with existing drivers in any programming language, such as Java, and with ORMs like Hibernate, mostly out of the box. They are also multi-model these days, meaning they support documents and JSON as first-class citizens just as well as traditional relational tables and rows. Because of these capabilities, Distributed SQL databases are replacing the last era of NoSQL databases like MongoDB and Cassandra, as well as RDBMSs like Oracle, SQL Server and IBM DB2. Some examples of Distributed SQL databases are SingleStoreDB, CockroachDB and YugabyteDB. (Disclosure: I work for SingleStore.)
Thank you Domenic for your thoughts. Do you have any experience, insights or results (that you can share) about implementing openEHR on SingleStoreDB?
I hope it’s ok to bump this one back up. @ian.mcnicoll and I have spoken about it before, and although I’m no expert by anyone’s measure, it’s an area that has both interested and concerned me about OpenEHR.
I’ve seen the comments on the scalability of so-called XML databases. My info is well out of date, but in the early shared care records we used XML and had to separately index the metadata, or pull it out and re-parse it. None of this was ideal and certainly not scalable. In the end we had a separate analytics database with a lot of indexing, and an operational database with a few searchable fields.
Then I spoke to various people about how the thing is database agnostic, but I thought that was something of a sham: effectively you’re just storing big blobs of data and metadata, and probably aren’t using many of the tuning features of the database, as it doesn’t really understand the data or have a map of it.
I remained sceptical.
However, I recently had a conversation with someone about vector databases, and although it’s yet another thing I know nowt about, I thought: aha, is this the thing we need to make OpenEHR a scalable technology for some of the bigger real-time datasets I envisaged?
In the end then, a simple question: has anyone considered using these vector technologies [or others I don’t know about] to host an OpenEHR resource?
Am I on the right lines or off beam?
I am going to the event in Munich in Feb so am just doing a bit of research prior to that.
And apologies if opening an old thread is poor form
Ade
If you want to answer the question:
“Which patients are smokers and have systolic pressure ≥ 140 in the last year?”
then you need to:
- Find the smoking information for each patient and filter on the condition “is a smoker.”
- Find all blood pressure measurements for each patient and filter on the conditions “systolic pressure ≥ 140” and “creation date ≥ current time − 1 year.”
- Create a join between them on the relationship that both belong to the same patient.
- Add a “distinct” operator on patients.
This process is independent of any database technology and is purely logical.
Legacy Approach (Relational Database)
If you use a legacy approach, a database engineer would design your table structure in a relational database. You would probably end up with two tables:
- Smoking Status:
- patId (UID)
- status (Boolean)
- Blood Pressure:
- patId (UID)
- systolic (Integer)
- creation_date (Timestamp)
Finding where the data is stored and performing a join is thus straightforward. The process can be optimized by using table statistics. If performance is still insufficient, a database engineer can analyze the query plan and add indexes where necessary.
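To make this concrete, here is a minimal, PostgreSQL-flavoured sketch of that query against the table layout above (the exact table names and types are assumptions based on the layout, not a prescribed schema):

```sql
-- Assumed tables matching the layout above:
--   smoking_status(patId, status)
--   blood_pressure(patId, systolic, creation_date)
SELECT DISTINCT s.patId
FROM   smoking_status s
JOIN   blood_pressure b ON b.patId = s.patId            -- both rows belong to the same patient
WHERE  s.status = TRUE                                   -- "is a smoker"
  AND  b.systolic >= 140                                 -- systolic pressure >= 140
  AND  b.creation_date >= NOW() - INTERVAL '1 year';     -- within the last year
```

If this join turns out to be slow, an index such as `CREATE INDEX ON blood_pressure (patId, creation_date)` is the kind of tuning step the query plan would typically point to.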
Document-Based Data Standards
When using document-based standards like openEHR, FHIR, or CDA, the idea is that data modeling is done by medical specialists based on an abstract base model.
This can lead to challenges if one uses a naive approach, such as storing all data in large, unstructured “blobs.” As a result, even identifying where smoking data and blood pressure measurements are stored becomes difficult, since documents tend to differ considerably from one another.
But there is a solution: you can leverage the semantic bindings of these documents, such as:
- Archetypes
- Bindings to terminologies
For example, EHRbase uses a hybrid approach in which documents are not stored as one large blob but are split along their paths, with just the leaves stored as JSON. Specific archetypes can then be indexed, allowing quick retrieval and search.
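As a rough illustration of the idea (a deliberate simplification, not the actual EHRbase schema; the table and column names here are invented), assuming PostgreSQL with jsonb:

```sql
-- Hypothetical simplification of path-based leaf storage (not the real EHRbase schema)
CREATE TABLE composition_leaf (
    composition_id  UUID   NOT NULL,
    archetype_id    TEXT   NOT NULL,   -- e.g. 'openEHR-EHR-OBSERVATION.blood_pressure.v2'
    path            TEXT   NOT NULL,   -- path of the leaf within the composition
    value           JSONB  NOT NULL    -- just the leaf node, stored as JSON
);

-- Indexes make it cheap to locate leaves by archetype or by content
CREATE INDEX idx_leaf_archetype ON composition_leaf (archetype_id);
CREATE INDEX idx_leaf_value     ON composition_leaf USING GIN (value);

-- e.g. all systolic readings >= 140, without scanning whole-document blobs
-- (assumes the leaf JSON carries a 'magnitude' field, as a quantity value would)
SELECT composition_id, value
FROM   composition_leaf
WHERE  archetype_id = 'openEHR-EHR-OBSERVATION.blood_pressure.v2'
  AND  path LIKE '%systolic%'
  AND  (value ->> 'magnitude')::numeric >= 140;
```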
Vector Databases
Vector databases are particularly useful for semantic or similarity searches. They can quickly answer questions like:
- “Which documents (or parts of a document) concern smoking?”
- “Which documents (or parts of a document) concern blood pressure?”
- “Which documents (or parts of a document) concern both smoking and blood pressure?”
This is interesting for the medical domain, where much data remains unstructured. However, semantic searches are distinct from analytical queries such as:
- “Which patients are smokers and have systolic pressure ≥ 140 in the last year?”
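As a small sketch of that distinction, assuming PostgreSQL with the pgvector extension and embeddings computed by the application (the table and column names are hypothetical):

```sql
-- Hypothetical table of document fragments with pre-computed text embeddings
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_fragment (
    fragment_id  BIGSERIAL PRIMARY KEY,
    patient_id   UUID NOT NULL,
    content      TEXT NOT NULL,
    embedding    VECTOR(384)            -- produced by an embedding model outside the database
);

-- Semantic search: the fragments closest (by cosine distance) to the embedding
-- of a query like "smoking". This finds relevant text; it does not answer
-- "systolic >= 140 in the last year", which remains an analytical join.
SELECT patient_id, content
FROM   doc_fragment
ORDER  BY embedding <=> $1::vector      -- $1 = query embedding supplied by the application
LIMIT  10;
```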
Operational vs. Analytical Databases
It’s important to distinguish between operational and analytical databases, as their requirements differ:
Operational Database:
- High availability
- CRUD operations and single-patient queries (focused on the current patient)
- Strong consistency
Analytical Database:
- Complex queries on historical data
- Eventual consistency is acceptable
Using separate operational and analytical databases with different technologies is often a good architectural pattern.
Excellent post.
BTW I see that EHRbase is now using the YugabyteDB ‘wrapping’ of Postgres. I had a look at the current DB schema init files - as far as I can see you are using a jsonb (binary JSON) representation for Composition.content, path-based index fields, and mostly orthodox tables for the other high-level structures. Is this correct? If so … very interesting, it’s an approach I proposed 20y ago
Performance test results would be very interesting…
The hybrid approach mentioned, with EHRbase using YugabyteDB and JSONB for path-based indexing, is promising and reflects the evolution of these ideas over decades.
It’s also intriguing to see how vector databases might fit into this landscape, particularly for semantic searches in unstructured medical data. However, there’s still a clear distinction between such innovative methods and traditional analytical queries that require operational consistency.
On another note, the cost of implementing an EHR in a hospital remains a significant barrier, especially for smaller institutions. While these advanced technologies offer exciting possibilities, their adoption will depend heavily on balancing technical benefits with affordability and long-term ROI. Performance test results from implementations like these could help build the case for broader adoption.
Thank you
The examples usually given are “find me a bunch of patients with this and that”, which is something where I’m generally not too bothered about performance and could even wait overnight. As you say, there are differences between operational and analytical designs, and I’m always wondering whether an OpenEHR system can satisfy the requirement here.
I don’t necessarily agree that these are single-patient queries. In outpatients, for example, you might be working with some really complex lists that need to refresh in real time. I’ve seen well-designed databases floored by the queries that support this.
I don’t really have a problem with the fact that well-designed archetypes etc. can support the data queries. I’m still wondering whether I’m going to need a database analyst to decide how to map data and index/tune a platform for these operational purposes.
I’ve already made clear I’m no expert. Not a DBA and not a professional analyst by any stretch. My Codd and Date is well rusty.
I have, however, been around when operational databases have been flattened by patterns of use and hit walls when being scaled up. It’s quite a scary place. The people I worked with were brilliant, and always seemed to optimise things and plan ahead pretty well. I just keep coming back to trying to see if I can be confident in OpenEHR for these purposes, and was pondering vector databases as a way to support it, but from a very amateur standpoint.
Ade