Best Database for openEHR

erik.sundvall · 30 December 2019 09:34

(Readers more interested in NoSQL openEHR storage solutions than “reading issues” can skip all the way to the bullet list of three papers near the end of this post.)

Hi Bert!

Did you actually read through our XML-related papers or just look at the graphs?

Which of the papers' conclusions do you actually disagree with? If you claim that the publication itself is a mistake, rather than your possibly mistaken way of reading it, then you need to first read it and then give reasons for your claims. Or just re-read it and be happy and satisfied to find that the papers do not bash XML databases in general.

Our first XML paper - the one you are likely referring to:

Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf
A reader who does not notice that we primarily explore ad hoc, population-wide epidemiological querying will likely miss most of the paper's points. (We follow through in later papers showing solutions for that use case, solutions that simultaneously cater for “normal” clinical one-patient-at-a-time EHR usage.)
From abstract: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanisms for openEHR-based systems in production if population-wide ad hoc querying is needed.” (Note the word “if” in the last sentence.)
The paper does not primarily attempt to compare openEHR XML DBs to the non-openEHR RDBMS in general. The paper primarily explores size and speed, comparing different XML approaches. The RDBMS with real patient data is primarily a data source used to generate large amounts of realistic/real test data. The measurements from the simpler RDBMS (also without query-specialized indexing) are just a baseline. Quote from the paper: “More information is also added to the openEHR data such as context, auditing, archetype ids an so on, which was not present in the anonymized SISCOLO [RDBMS] database. The size of the three sets of XML documents are respectively 556 MBytes, 2.8 GBytes and 23 GBytes. Therefore it is not a surprise that the sizes of the XML databases are much larger than the corresponding SQL database. However it is interesting to notice that the XML database systems differ greatly in the size of the generated databases with BaseX being the most space saving of all…”

That we said “nothing” about indexes is plain wrong. The paper says:

“No indexes besides those that are already built-in in the XML databases were created, because we were most interested in ad hoc queries for which it is not known in advance which indexes should be used, and which is a very common use case in health care research. Thus, in the XML databases, no assumptions were made about the kinds of query that would be made.”
“The way the openEHR archetypes are designed and the nature of data values that are stored in the database make the automatically generated indexes in the databases inefficient. The archetypes usually have many attributes with the same value, for instance almost all archetypes have an archetype node id equal to “at0001” and the database used in this study has mainly coded values with few options to choose from. This makes xml text and attribute indexes point to a huge number of entries in the database, leading to long inspection of documents in order to return the results. How to best handle querying of the relatively deep openEHR tree structures, often with repeated path segment identifiers, is an interesting topic for future research.”

A proper reading would come to the useful conclusion that BaseX was the most interesting of the tested databases, and that it works fine for many one-patient-at-a-time use cases but is slow for the tricky use case of ad hoc epidemiological queries.
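
To make the index problem from the quote above a bit more concrete, here is a minimal Python sketch. It is not taken from the paper: the toy documents, the attribute names and the simple in-memory value index are all made up for illustration, and real XML databases build their indexes differently. The point is only that a lookup on a value shared by almost every document (such as “at0001”) or on a coded value with few options narrows the search very little, while a lookup on a patient ID narrows it to a single document.

```python
from collections import defaultdict

def make_docs(n):
    """Build n hypothetical documents as lists of (attribute, value) pairs."""
    docs = []
    for i in range(n):
        docs.append([
            ("archetype_node_id", "at0001"),  # same value in every document
            ("result_code", f"code{i % 3}"),  # coded value with only 3 options
            ("patient_id", f"p{i}"),          # unique per document
        ])
    return docs

def build_value_index(docs):
    """Map each (attribute, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for attr, value in doc:
            index[(attr, value)].add(doc_id)
    return index

def selectivity(index, key, total):
    """Fraction of all documents that still must be inspected after the lookup."""
    return len(index.get(key, ())) / total

docs = make_docs(900)
index = build_value_index(docs)
for key in [
    ("archetype_node_id", "at0001"),
    ("result_code", "code1"),
    ("patient_id", "p42"),
]:
    print(key, f"{selectivity(index, key, len(docs)):.3f}")
```

In this toy setup the first lookup returns every document and the second a third of them, so the index saves little work for a population-wide query, whereas the patient ID lookup returns exactly one document.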

Later XML and NoSQL-related papers:

Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150069 (Here a NoSQL approach, Couchbase, including full openEHR data model content, beats an RDBMS in many ad hoc population query cases.)
ORBDA: An open EHR benchmark dataset for performance assessment of electronic health record servers https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190028 (Here you get a huge dataset in openEHR format to play with, and as a side finding a hint that ElasticSearch is worth exploring further.)
Querying Archetype-Based Electronic Health Records Using Hadoop and Dewey Encoding of openEHR Models http://ebooks.iospress.nl/publication/46372 (How to optimize archetype-aware indexing in systems based on relational algebra, for example RDBMSs and Hadoop.)
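
For readers who have not met Dewey encoding before, here is a small Python sketch of the general idea. It shows the generic textbook labelling (each node gets its parent's label plus its position among its siblings), which is not necessarily the exact scheme used in the third paper, and the XML element names are hypothetical rather than real openEHR XML. The useful property is that an ancestor/descendant test becomes a simple prefix comparison on labels, which fits relational-style processing.

```python
import xml.etree.ElementTree as ET

def dewey_labels(root):
    """Return (label, tag) pairs; a label is the parent's label plus the
    1-based position of the node among its siblings."""
    labels = []

    def walk(elem, label):
        labels.append((label, elem.tag))
        for pos, child in enumerate(elem, start=1):
            walk(child, label + (pos,))

    walk(root, (1,))
    return labels

def is_ancestor(a, b):
    """True if the node labelled a is a proper ancestor of the node labelled b."""
    return len(a) < len(b) and b[:len(a)] == a

# Hypothetical, heavily simplified document tree.
xml_text = (
    "<composition>"
    "<content><observation><data/></observation><observation/></content>"
    "<context/>"
    "</composition>"
)
labels = dewey_labels(ET.fromstring(xml_text))
for label, tag in labels:
    print(".".join(map(str, label)), tag)
```

With labels stored as plain values next to each node, a query like "all nodes below this content node" can be answered by comparing label prefixes instead of walking the tree.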

To avoid mistakes, please read the mentioned papers in their entirety BEFORE commenting on them or drawing conclusions about them.