Best Database for openEHR

My point was: portability shouldn’t be used as an adoption argument.

I’m the biggest critic of AQL, and also the biggest proposer of fixes and improvements to its spec. At the same time, raising awareness of weaknesses and problems, and talking openly about them, creates the environment needed to fix the issues in an open way. The same goes for discussing alternative ways of querying in openEHR, which generates a rich discussion from which we, as SEC members, can learn.

All this is very far from any flame war, and much closer to a friendly conversation between peers. In any case, flame wars would only be fuelled by business interests, of which I have none, not by the need to improve the spec, which is my personal goal.

2 Likes

Hi Pablo.

Ok, I misunderstood your position; it sounded more negative, coming from an implementer, than it apparently is. My bad. Always happy to have these discussions, to be clear.

I was referring to the title of the topic, “best…”, which is no longer what the discussion is about, when I said “prone to” flame wars. I wasn’t referring to my exchange with you. We’re good here :slight_smile:

1 Like

In the context of the title of this discussion, maybe Apache Cassandra is an option:

http://cassandra.apache.org/

Potentially interesting, going by the documentation. Judging by the query language pages and the storage architecture, however, it does seem to be a relational table system. It does have collections, though, for small amounts of 0…N data attached to a primary data object.

For openEHR, the main question would be schema design: how best to map the OO data view onto the mainly table-based concrete representation, which doesn’t seem that much different from using Postgres or Oracle etc. So the main unique points seem to be scalability and fault tolerance.

Thank you for your comment. On second thought, I find Cassandra disappointing in the openEHR context. Apart from scalability and fault tolerance, it does not offer what we need.

I was thinking about locking; most databases put a lot of effort into smooth and granular locking. But however well it is done, there is always a price to pay.

But in openEHR we don’t change data. Never, although there are exceptions, which would need to be stored separately.
For 99.9% of data actions we only read and add. We should choose a database engine which is specifically designed for this.

And there is another thing. Only very few indexes are needed in normal practice. An openEHR database should never be used for research or data mining. For that purpose there is the possibility to run a separate database in another thread or module, perhaps connected to the main stream via messaging.

There is the idea of CQRS; it is worked out in Axon, but I am not talking about Axon specifically, just about the idea. It stands for Command Query Responsibility Segregation: storage and querying are separated responsibilities. You need a database for storage, which must be very fast, with only a few indexes, and a database for querying. The second one receives data over messaging, so the main thread never has to wait for the second to update its indexes. The second can have message queues, and it may take a few seconds before freshly stored data are available for querying.

In this way we keep the openEHR database lean and fast, doing only its core business.

So we need to find a database which stores data in an object-oriented way: just dump the data into it. It needs to be scalable and secure in storage.
And we need a database suited for querying, connected to it via messaging.
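
The split described above can be sketched as a toy model. This is a hypothetical illustration of the CQRS idea, not any particular product (the names `WriteStore` and `ReadModel` are invented): the write side only appends and pushes events onto a queue, and the query side builds its own indexes from those messages, a little behind.

```python
from collections import defaultdict
from queue import Queue

class WriteStore:
    """Write side: append-only event log, no query indexes to maintain."""
    def __init__(self, bus: Queue):
        self.log = []      # ordered, immutable log of committed events
        self.bus = bus     # messaging channel toward the query side

    def commit(self, ehr_id, composition):
        event = {"ehr_id": ehr_id, "composition": composition}
        self.log.append(event)   # fast path: append only
        self.bus.put(event)      # propagate asynchronously

class ReadModel:
    """Query side: indexes built from messages, possibly seconds behind."""
    def __init__(self):
        self.by_ehr = defaultdict(list)

    def drain(self, bus: Queue):
        while not bus.empty():
            event = bus.get()
            self.by_ehr[event["ehr_id"]].append(event["composition"])

    def query(self, ehr_id):
        return self.by_ehr[ehr_id]

bus = Queue()
store = WriteStore(bus)
reads = ReadModel()
store.commit("ehr-1", {"type": "blood_pressure", "systolic": 120})
stale = list(reads.query("ehr-1"))   # snapshot: [] until the queue is drained
reads.drain(bus)
fresh = reads.query("ehr-1")         # now contains the stored composition
```

The gap between `stale` and `fresh` is exactly the "few seconds before fresh stored data are available" trade-off mentioned above.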

4 Likes

I agree with this, and for this reason, DBs that write once (and later on, compress data on disk) are interesting. I also agree with your other comments.

1 Like

Cassandra has adopted a query language and a table schema that might look a bit like a relational database on top of their storage engine, but I would say it is not one.
It’s one of the databases built on the ideas of the Google Bigtable paper, and as such it is a sort of distributed, eventually consistent key-value store, where the values themselves are structured somewhat like a table.
Each of these column families (relational equivalent is a table) is a key-value store, where the values each can have a different set of columns. This means there is no strictly enforced table schema.
What you can and cannot query is very limited by the way you model your column families and the keys to the column families. This means you need to store data multiple times for efficient querying, or create indexes. Creating indexes has some potential scalability issues, so should be done with some caution.
There is no locking and there are no ACID transactions. Instead there is eventual consistency: the most recent write eventually wins, and there is the possibility to wait until data is written to a specified number of copies/servers. This brings some new design challenges for any application.
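
The "most recent write eventually wins" rule can be illustrated with a small sketch (plain Python, not Cassandra code; the keys and timestamps are made up): each replica keeps a `(timestamp, value)` pair per key, and merging two replicas keeps whichever write is newer.

```python
def merge(replica_a, replica_b):
    """Last-write-wins merge: per key, keep the (timestamp, value) pair
    with the newest timestamp, regardless of which replica holds it."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas that saw writes in different orders converge after merging:
a = {"patient1/weight": (100, "80 kg")}
b = {"patient1/weight": (105, "81 kg"), "patient1/height": (90, "180 cm")}
state = merge(a, b)   # the ts=105 write to weight wins
```

Because the merge only compares timestamps, it is order-independent, which is what lets replicas converge without locking.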

I have used Cassandra before outside of OpenEHR (we contributed a bit to it as well) and for a few use cases it solved quite a lot of problems with very little work. Mainly when you need to store many terabytes of data and need to have it available for predefined queries all the time.
Its main advantage is scalability and availability beyond that of relational databases, but this comes at a cost. You could build an OpenEHR implementation on it, especially since you can usually split the data into the different EHRs, one for each patient, that you can use as part of the key for the column families.
However, this also means you can use traditional databases with sharding relatively easily, which means a relational database can scale further than one might think at first sight. I would only choose to use Cassandra if traditional alternatives do not provide what I need in terms of scalability.
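
The per-patient partitioning point above can be sketched in a few lines (a generic hashing illustration, not tied to Cassandra or any specific sharding product): because openEHR data splits naturally into one EHR per patient, the EHR id can deterministically route each record to a shard, whether the shards are Cassandra nodes or separate relational databases.

```python
import hashlib

def shard_for(ehr_id: str, n_shards: int) -> int:
    """Stable routing: the same EHR id always maps to the same shard,
    so all of one patient's data stays together."""
    digest = hashlib.sha256(ehr_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

shard = shard_for("8fb1c9e2-ehr", 8)   # always the same shard for this EHR
```

Cross-patient (population-wide) queries then have to fan out over all shards, which is exactly where this scheme, like Cassandra's keying, gets expensive.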

2 Likes

Thanks Pieter, for this useful and excellent explanation.

Best regards
Bert Verhees

(Readers more interested in NoSQL openEHR storage solutions than “reading issues” can skip all the way to the bullet list of three papers near the end of this post.)

Hi Bert!

Did you actually read through our XML-related papers or just look at the graphs?

What would you say the papers conclude that you actually do not agree with? If you claim the publication to be a mistake, rather than your possibly mistaken way of reading it, then you need to first read it and then motivate your claims. Or just re-read it and be happy and satisfied to find that it actually does not bash XML databases in general.

Our first XML paper - the one you are likely referring to:

  • Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf
  • A reader that does not note that it is ad-hoc epidemiological population-wide querying we primarily explore will likely miss most points of the entire paper. (We follow through in later papers showing solutions for that use case, solutions that simultaneously cater for “normal” clinical one-patient-at-a-time EHR usage).
  • From abstract: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanisms for openEHR-based systems in production if population-wide ad hoc querying is needed.” (Note the word “if” in the last sentence.)
  • The paper does not primarily attempt to compare openEHR XML DBs to non-openEHR RDBMSs in general. The paper primarily explores size and speed, comparing different XML approaches. The RDBMS with real patient data is primarily a data source to generate large amounts of realistic/real test data. The measurements from the simpler RDBMS (also without query-specialized indexing) are just a baseline. Quote from the paper: “More information is also added to the openEHR data such as context, auditing, archetype ids and so on, which was not present in the anonymized SISCOLO [RDBMS] database. The size of the three sets of XML documents are respectively 556 MBytes, 2.8 GBytes and 23 GBytes. Therefore it is not a surprise that the sizes of the XML databases are much larger than the corresponding SQL database. However it is interesting to notice that the XML database systems differ greatly in the size of the generated databases with BaseX being the most space saving of all…”

That we said “nothing” about indexes is plain wrong. The paper says:

  • “No indexes besides those that are already built-in in the XML databases were created, because we were most interested in ad hoc queries for which it is not known in advance which indexes should be used, and which is a very common use case in health care research. Thus, in the XML databases, no assumptions were made about the kinds of query that would be made.”
  • “The way the openEHR archetypes are designed and the nature of data values that are stored in the database make the automatically generated indexes in the databases inefficient. The archetypes usually have many attributes with the same value, for instance almost all archetypes have an archetype node id equal to ‘at0001’ and the database used in this study has mainly coded values with few options to choose from. This makes xml text and attribute indexes point to a huge number of entries in the database, leading to long inspection of documents in order to return the results. How to best handle querying of the relatively deep openEHR tree structures, often with repeated path segment identifiers, is an interesting topic for future research.”
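
The low-selectivity problem that quote describes can be shown with a toy example (hypothetical data, not from the paper): when almost every entry shares the same indexed value, such as node id "at0001", the index points at nearly the whole database and excludes nothing.

```python
# 10,000 toy entries that all share the same archetype node id:
docs = [{"node_id": "at0001", "value": i % 7} for i in range(10_000)]

# A value index maps each distinct node_id to the documents containing it:
index = {}
for i, doc in enumerate(docs):
    index.setdefault(doc["node_id"], []).append(i)

# Selectivity of an index lookup on "at0001":
selectivity = len(index["at0001"]) / len(docs)   # 1.0: every document matches
```

An index with selectivity near 1.0 forces the engine to inspect almost every document anyway, which is the long-running behaviour the paper reports.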

A careful reading leads to the useful conclusion that BaseX was the most interesting of the tested databases and that it works fine for many one-patient-at-a-time use cases, but is slow for the tricky use case of ad-hoc epidemiology queries.

Later XML and NoSQL-related papers:

To avoid mistakes, please read the entire mentioned papers BEFORE possibly commenting or concluding things about them.

“Did you actually read through our XML-related papers or just look at the graphs?”

Really, is that the tone you want for this discussion? Is my criticism of a report of yours still eating at you after all this time? Are you using this tone to challenge me? Okay, I’ll give you what you wish.

On private email

Have a nice day

Hi again Bert!

I understand that, for example, the “…or did you just look at the graphs” and similar things can be provoking; sorry about that and whatever else in my response annoyed you. I apologize.

Most of my questions were serious though. You wrote “he concluded things about XML databases had many short comings”, and yes, that actually triggered me, since the paper, even though hampered by the small number of pages allowed, is still fairly careful and specific about what limited things can be concluded from the research. I think you claimed the paper concludes things it does not; that’s why I asked you to re-read it and to specify if something was wrong in the conclusions.

This is how the “conclusion” section and the whole paper ends: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanism for openEHR-based systems in production if population-wide ad hoc querying is needed.”

If you don’t want to spend more energy on this, it’s fine. Now you at least have my apology and an explanation of a central thing I found unfair in your critique and got triggered by.

Happy new year!

Same to you and all other readers

Hi Eric, I responded to your paper many years ago; in the context of that discussion and era, my views still stand. But it is an old discussion with no relevance to current technical developments in databases.

Searching for a solution in SQL only or XML only is an unnecessarily restrictive way to think about a good architecture. So comparing the two is not very useful.

Nowadays I think a NoSQL database for clinical use, maybe optionally supported by a CQRS structure for data mining, is the best way to go. And I prefer to run it in the cloud, e.g. on AWS, because there is native support for this architecture, and also many ways to implement security, encryption at rest and in flight, and fine-grained authorization structures.

Best regards
Bert Verhees

Hi Bert, I like your suggestion of using two databases, one for clinical use and one for querying. It should be possible to keep them in sync, writing to both.
The initialisation of one from the other after a failure would be interesting.

Colin

Hi Colin,

I think it depends on the use-case. So far, storing and querying data works quite well for us using postgres only. If we would like to introduce high-performance cross-patient queries, I think we would have to separate the data anyway (because we don’t want to lock the database in production) and then we would have to decide if for analytics we still need AQL (which I think makes sense in cross-enterprise scenarios with distributed data but might not be the optimal solution for local needs).

Cheers,

Birger

Hi Colin, it happens in places where the data themselves are hard to query. Imagine an environment where it is very important to store data in exactly the same way or order as they arrive, for legal reasons, for example. Healthcare data.

But at other times we need databases for population-wide research, which has completely different technical requirements and maybe does not need the semantics stored in the archetypes.

Do the separate databases need to run in sync? That depends on the use case. Mostly not within the same second, and often not even on the same day.

The technical solution for keeping databases in sync is to propagate data by messaging to the other databases.

I sincerely think Node.js could do better than what you are predicting. I am betting on Node.js for the high-traffic transactions and on Python (at the moment) for the CPU-intensive work, e.g. medical image/DICOM processing, on the platform I am working on. Kubernetes works well for Node.js and Python too. Yes, I am relying on microservices as a broad strategy.

Note: mine is a smallish data-science-driven PHR application aimed at individuals, not at organisations/providers.

I agree; regarding performance it hardly matters anymore which programming ecosystem you use. If it does not perform well, just add or rent some extra hardware.

The choice of a programming ecosystem is a cultural one, and depends on which tooling and libraries are available. Some people just like JavaScript, some hate it. I hate it, so it will not be my choice. But I like TypeScript more, which gets transpiled to JavaScript.

But in the end I would write in Java and run it in a serverless cloud architecture like AWS Lambda.

AWS Lambda supports most programming languages, including Python and JavaScript.
And I deliberately stay away from Java (though I am learning Rust for device programming)! :slight_smile:

Completely true, and that is the good thing about it. There is nothing wrong with JavaScript or Python; they are just not my favourites.