Best Database for openEHR

Potentially interesting, going by the documentation. Judging by the query language pages and the storage architecture, however, it does seem to be a relational table system. It does have collections, though, for small amounts of 0…N data attached to a primary data object.

For openEHR, the main question would be how to design a schema that best reconciles the OO data view with the mainly table-based concrete representation, which doesn’t seem much different from using Postgres or Oracle etc. So the main unique points seem to be scalability and fault tolerance.

Thank you for your comment. On second thought, I find Cassandra disappointing in the openEHR context. Apart from scalability and fault tolerance, it does not offer what we need.

I was thinking about locking: most databases put a lot of effort into smooth and granular locking. But however well it is done, there is always a price to pay.

But in openEHR we don’t change data. Never, although there are exceptions, which would need to be stored separately.
For 99.9% of data actions we only read and add. We should choose a database engine that is specially designed for this.

And there is another thing: only very few indexes are needed in normal practice. An openEHR database should never be used for research or data mining. For that purpose it is possible to run a separate database in another thread or module, perhaps connected to the main store via messaging.

There is the idea of CQRS; it is worked out in Axon, but I am not talking about Axon, only about the idea. CQRS stands for Command Query Responsibility Segregation: separate responsibilities for storage and for querying. You need a database for storage, which needs to be very fast and has only a few indexes, and a database for querying. The second one receives data over messaging, so the main thread never has to wait for the second one to update its indexes. The second one can have message queues, and it may take a few seconds before freshly stored data are available for querying. A sketch of the idea follows below.
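Here is a minimal sketch of that idea with assumed names (this is not Axon’s API or any real openEHR interface): the command side only appends to the fast store and publishes an event; a background consumer updates the query-side indexes, so the write path never waits for them.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: all names are illustrative, not a real openEHR or Axon API.
public final class CqrsSketch {

    record Composition(String ehrId, String versionUid, String body) {}

    interface WriteStore {          // the fast store: append-only, few indexes
        void append(Composition c);
    }

    interface ReadStore {           // the query store: heavily indexed
        void index(Composition c);
        List<Composition> query(String predicate);
    }

    private final WriteStore writeStore;
    private final ReadStore readStore;
    // Stands in for a message queue between the two databases.
    private final BlockingQueue<Composition> events = new LinkedBlockingQueue<>();

    public CqrsSketch(WriteStore w, ReadStore r) {
        this.writeStore = w;
        this.readStore = r;
        // Consumer thread: indexes new data without blocking the write path,
        // so queries may lag a few seconds behind the latest commit.
        Thread consumer = new Thread(() -> {
            try {
                while (true) readStore.index(events.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    // Command side: data is only ever added, never updated in place.
    public void commit(Composition c) {
        writeStore.append(c);
        events.add(c);   // publish; returns immediately
    }
}
```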

In this way we keep the openEHR database lean and fast, doing only its core business.

So we need to find a database that stores data in an object-oriented way (just dump the data into it) and that is scalable and secure in its storage.
And we need a second database, suited for querying, connected to it via messaging.

I agree with this, and for this reason, DBs that write once (and later on, compress data on disk) are interesting. I also agree with your other comments.

Cassandra has adopted a query language and a table schema that might look a bit like a relational database on top of its storage engine, but I would say it is not one.
It is one of the databases built on the ideas of the Google Bigtable paper, and as such it is a sort of distributed, eventually consistent key-value store, where the values themselves are structured somewhat like a table.
Each of these column families (the relational equivalent is a table) is a key-value store, where each value can have a different set of columns. This means there is no strictly enforced table schema.
What you can and cannot query is very much limited by the way you model your column families and their keys. This means you need to store data multiple times for efficient querying, or create indexes. Creating indexes has some potential scalability issues, so it should be done with caution.
There is no locking and there are no ACID transactions. Instead there is eventual consistency: the most recent write eventually wins, and there is the possibility to wait until data is written to a specified number of copies/servers. This comes with some new design challenges for any application.

I have used Cassandra before outside of OpenEHR (we contributed a bit to it as well) and for a few use cases it solved quite a lot of problems with very little work. Mainly when you need to store many terabytes of data and need to have it available for predefined queries all the time.
Its main advantages are scalability and availability beyond those of relational databases, but this comes at a cost. You could build an openEHR implementation on it, especially since you can usually split the data into separate EHRs, one per patient, and use the EHR id as part of the key for the column families (see the sketch below).
However, this also means you can shard traditional databases relatively easily, so a relational database can scale further than one might think at first sight. I would only choose Cassandra if traditional alternatives do not provide what I need in terms of scalability.
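To make the keying idea concrete, here is a hedged sketch using the DataStax Java driver, with the EHR id as the partition key so each patient’s data lands in a single partition. The table layout, column names, and keyspace are assumptions for illustration, not an established openEHR-on-Cassandra schema; it also shows the tunable consistency mentioned above (waiting for a quorum of replicas).

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.uuid.Uuids;
import java.util.UUID;

// Sketch only: assumes a reachable local Cassandra node and an existing
// "ehr" keyspace; the schema is illustrative, not an openEHR standard.
public final class CassandraSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // One partition per EHR (patient); versions clustered by commit
            // time, newest first, so reading a patient's record is a
            // single-partition scan.
            session.execute(
                "CREATE TABLE IF NOT EXISTS ehr.compositions ("
              + "  ehr_id uuid, committed_at timeuuid,"
              + "  composition_uid text, serialized text,"
              + "  PRIMARY KEY ((ehr_id), committed_at))"
              + " WITH CLUSTERING ORDER BY (committed_at DESC)");

            // Tunable consistency: block until a quorum of replicas has the write.
            SimpleStatement insert = SimpleStatement.newInstance(
                    "INSERT INTO ehr.compositions"
                  + " (ehr_id, committed_at, composition_uid, serialized)"
                  + " VALUES (?, ?, ?, ?)",
                    UUID.randomUUID(), Uuids.timeBased(),
                    "composition-uid::1", "{... serialized composition ...}")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(insert);
        }
    }
}
```

Queries that specify the ehr_id (the clinical one-patient-at-a-time case) stay cheap; anything population-wide would need the data stored again under a different key, which is exactly the trade-off described above.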

Thanks Pieter, for this useful and excellent explanation.

Best regards
Bert Verhees

(Readers more interested in NoSQL openEHR storage solutions than “reading issues” can skip all the way to the bullet list of three papers near the end of this post.)

Hi Bert!

Did you actually read through our XML-related papers or just look at the graphs?

What would you say the papers conclude that you actually do not agree with? If you claim the publication to be a mistake, rather than the mistake being in your way of reading it, then you need to first read it and then motivate your claims. Or just re-read it and be happy and satisfied to find that it actually does not bash XML databases in general.

Our first XML paper - the one you are likely referring to:

  • Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf
  • A reader who does not note that what we primarily explore is ad-hoc epidemiological population-wide querying will likely miss most of the points of the entire paper. (We follow through in later papers, showing solutions for that use case; solutions that simultaneously cater for “normal” clinical one-patient-at-a-time EHR usage.)
  • From abstract: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanisms for openEHR-based systems in production if population-wide ad hoc querying is needed.” (Note the word “if” in the last sentence.)
  • The paper does not primarily attempt to compare openEHR XML DBs to non-openEHR RDBMSs in general. The paper primarily explores size and speed, comparing different XML approaches. The RDBMS with real patient data is primarily a data source used to generate large amounts of realistic/real test data. The measurements from the simpler RDBMS (also without query-specialized indexing) are just a baseline. Quote from the paper: “More information is also added to the openEHR data such as context, auditing, archetype ids and so on, which was not present in the anonymized SISCOLO [RDBMS] database. The size of the three sets of XML documents are respectively 556 MBytes, 2.8 GBytes and 23 GBytes. Therefore it is not a surprise that the sizes of the XML databases are much larger than the corresponding SQL database. However it is interesting to notice that the XML database systems differ greatly in the size of the generated databases with BaseX being the most space saving of all…”

That we said “nothing” about indexes is plain wrong. The paper says:

  • “No indexes besides those that are already built-in in the XML databases were created, because we were most interested in ad hoc queries for which it is not known in advance which indexes should be used, and which is a very common use case in health care research. Thus, in the XML databases, no assumptions were made about the kinds of query that would be made.”
  • "The way the openEHR archetypes are designed and the nature of data values that are stored in the database make the automatically generated indexes in the databases inefficient. The archetypes usually have many attributes with the same value, for instance almost all archetypes have an archetype node id equal to “at0001” and the database used in this study has mainly coded values with few options to choose from. This makes xml text and attribute indexes point to a huge number of entries in the database, leading to long inspection of documents in order to return the results. How to best handle querying of the relatively deep openEHR tree structures, often with repeated path segment identifiers, is an interesting topic for future research. "

A proper reading would come to the useful conclusion that BaseX was the most interesting of the tested databases and that it works fine for many one-patient-at-a-time use cases, but is slow for the tricky use case of ad-hoc epidemiological queries.

Later XML and NoSQL-related papers:

To avoid mistakes, please read the mentioned papers in their entirety BEFORE commenting on or drawing conclusions about them.

“Did you actually read through our XML-related papers or just look at the graphs?”

Really, is that the tone in which you want to have this discussion? Is my criticism of a report of yours still eating at you after all this time? Are you taking this tone to challenge me? Okay, I will give you what you wish.

Continued via private email.

Have a nice day

Hi again Bert!

I understand that, for example, the “…or did you just look at the graphs” remark and similar things can be provocative; sorry about that and about whatever else in my response annoyed you. I apologize.

Most of my questions were serious, though. You wrote “he concluded things about XML databases had many short comings”, and yes, that actually triggered me, since the paper, even though hampered by the small number of pages allowed, is still fairly careful and specific about what limited things can be concluded from the research. I think you claimed the paper concludes things it does not; that is why I asked you to re-read it and to specify if something was wrong in the conclusions.

This is how the “conclusion” section and the whole paper ends: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanism for openEHR-based systems in production if population-wide ad hoc querying is needed.”

If you don’t want to spend more energy on this, it’s fine. Now you at least have my apology and an explanation of a central thing I found unfair in your critique and got triggered by.

Happy new year!

Same to you and all other readers

Hi Eric, I responded to your paper many years ago; in the context of that discussion and era, my views still stand. But it is an old discussion with no relevance to current technical developments in databases.

Searching for a solution in SQL only or in XML only is an unnecessarily restrictive way to think about a good architecture, so comparing the two is not very useful.

Nowadays I think a NoSQL database for clinical use, optionally supported by a CQRS structure for data mining, is the best way to go. And I prefer to run it in the cloud, e.g. AWS, because there is native support for this architecture and also many ways to implement security, encryption at rest and in flight, and fine-grained authorization structures.

Best regards
Bert Verhees

Hi Bert, I like your suggestion of using two databases, one for clinical use and one for querying. It should be possible to keep them in sync, writing to both.
The initialisation of one from the other after a failure would be interesting.

Colin

Hi Colin,

I think it depends on the use case. So far, storing and querying data works quite well for us using Postgres only. If we wanted to introduce high-performance cross-patient queries, I think we would have to separate the data anyway (because we don’t want to lock the database in production), and then we would have to decide whether we still need AQL for analytics (which I think makes sense in cross-enterprise scenarios with distributed data, but might not be the optimal solution for local needs).

Cheers,

Birger

Hi Colin, this happens in situations where the data themselves are hard to query. Imagine an environment where, for legal reasons for example, it is very important to store data in exactly the same form and order as they arrive. Healthcare data.

But at other times we need databases for population-wide research, which has completely different technical requirements and may not need the semantics stored in the archetypes.

Do the separate databases need to run in sync? That depends on the use case. Mostly they do not need to be in sync within the same second, and often not even within the same day.

The technical solution for keeping the databases in sync is to propagate data by messaging to the other databases. A sketch of how a query database could be (re)built this way follows below.
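As a hedged illustration, touching also on Colin’s earlier question about initialisation after a failure: if the primary store keeps an ordered, append-only commit log, a query database can be resynced by replaying that log from its last checkpoint while live messages queue up. All interface names below are illustrative assumptions, not openEHR APIs.

```java
import java.util.Iterator;

// Sketch only: illustrative interfaces, not a real openEHR or broker API.
final class ReadModelRebuilder {

    interface CommitLog {
        // All commits at or after the given sequence number, in commit order.
        Iterator<byte[]> replayFrom(long sequence);
    }

    interface QueryDatabase {
        // Checkpoint persisted together with the read model.
        long lastAppliedSequence();
        void apply(byte[] serializedCommit);
    }

    // Replays every missed commit; normal message-driven updates
    // resume once the read model has caught up.
    static void resync(CommitLog log, QueryDatabase queryDb) {
        Iterator<byte[]> missed = log.replayFrom(queryDb.lastAppliedSequence() + 1);
        while (missed.hasNext()) {
            queryDb.apply(missed.next());
        }
    }
}
```

The same replay mechanism can initialise an empty query database: start from sequence 0 and let it catch up while the primary keeps serving reads and appends.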

I sincerely think Node.js could do better than you are predicting. On the platform I am working on, I am betting on Node.js for the high-traffic transactions and on Python (at the moment) for the CPU-intensive work, e.g. medical image/DICOM processing. Kubernetes is good for Node.js and Python as well. Yes, I am relying on microservices as a broad strategy.

Note: mine is a smallish, data-science-driven PHR application aimed at individuals, not at organisations/providers.

I agree; regarding performance it no longer matters much which programming ecosystem you use. If it does not perform well, just add or rent some extra hardware.

The choice of a programming ecosystem is a cultural one, and a matter of which tooling and libraries are available. Some people just like JavaScript, some hate it. I hate it, so it will not be my choice. But I like TypeScript better, which gets transpiled to JavaScript.

But in the end I would write it in Java and run it in a serverless cloud architecture like AWS Lambda.

AWS Lambda supports most programming languages, including Python and JavaScript.
And I deliberately stay away from Java (though I am keen on Rust and am learning it for device programming)! :slight_smile:

Completely true, and that is the good thing about it. There is nothing wrong with JavaScript or Python; they are just not my favourites.

Back in my day, in 1993-94, people copied JavaScript animations for their websites; I hated that and started ignoring JavaScript. Today, I mostly use Python for quick/overnight POC design and for my data science work: NLP and computer vision. I realised JavaScript’s V8 engine had become very fast and powerful in the years I stayed away from IT. Now I have started using JavaScript for transaction-heavy work on Node.js and for interface design. Take a fresh look at JavaScript. The web is here to stay, along with JavaScript. In fact I am using JavaScript outside the web as well, on devices (I am still very raw in Rust).

Good advice, thank you

@pablo, I struggle with this. I am new to microservices and am still learning, but my current concept of an EHR doesn’t seem to fit the general microservice model. From my understanding, each microservice (bounded by a business context determined by behavior/function) owns its own data. The openEHR model of having one massive data repository would seem to violate this? How can one conform to the core tenet of microservice architecture (services that have independent behavior and control their own data) and still maintain the outward appearance of a consolidated data store built from openEHR archetype components? What are your thoughts on this?

Thanks for your thoughts,

@woodpk