Which database would you advise for openEHR, and is there documentation with more details about that choice?
I think all we can say is that we know what works…
- O-R mapping (e.g. Hibernate) works, but may not scale.
- XML databases do not scale
- Simple use of an RDBMS with smart indexing and/or JSONB works well - virtually all of the CDRs in production use this approach: ThinkEhr, DipsEhr, OceanEHR, Base24, Ethercis, EhrBase
- Graph DBs, Mumps etc. should work, but there are no real-world examples.
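As a hedged illustration of the "RDBMS with smart indexing and/or JSONB" pattern mentioned above - my own sketch, using SQLite's JSON1 functions as a stand-in for PostgreSQL JSONB; the table layout and JSON paths are invented, not any production CDR's schema:

```python
# Sketch: store compositions as JSON in a relational table and put an
# expression index on a frequently queried path, so lookups avoid full scans.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE composition (uid TEXT PRIMARY KEY, data TEXT)")
db.execute("""CREATE INDEX idx_bp_systolic ON composition (
    json_extract(data, '$.content.systolic.magnitude'))""")

db.execute("INSERT INTO composition VALUES (?, ?)", (
    "c1", json.dumps({"content": {"systolic": {"magnitude": 142, "units": "mm[Hg]"}}})))

rows = db.execute("""SELECT uid FROM composition
    WHERE json_extract(data, '$.content.systolic.magnitude') > 140""").fetchall()
print(rows)  # [('c1',)]
```

In PostgreSQL the same idea is usually done with a JSONB column plus GIN or expression indexes; the sketch only shows the shape of the approach.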
The question is, why would anyone want to build another, rather than license a proprietary CDR or contribute to one of the OSS examples?
I agree with most of it, but you left out one important possibility: proven, widely used technology that is not yet common in openEHR land.
Technically, I don't consider openEHR to be that unique a database problem. It is about storing documents. The validation of archetyped objects is unique, but after that stage it is just low-level technical database handling.
Object/document-oriented databases are widely tested, and there is tooling to make them scalable. But that is a technical matter: check the Apache projects, or google for sharding.
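The sharding idea can be sketched in a few lines - generic hash-based routing, not any specific product's mechanism; the shard names and EHR ids here are invented:

```python
# Sketch of hash-based sharding: route each EHR id deterministically
# to one of N shards, so data for one patient always lands in one place.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(ehr_id: str) -> str:
    # A stable hash ensures the same EHR id always maps to the same shard.
    digest = hashlib.sha256(ehr_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("ehr-123"))                          # some fixed shard name
print(shard_for("ehr-123") == shard_for("ehr-123"))  # True: deterministic routing
```

Real systems (MongoDB, for instance) add rebalancing and config servers on top of this basic routing idea.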
You ask why anyone would choose their own solution instead of contributing to an existing one.
Maybe one does not experience any existing solution as the best solution. Remember that is also how openEHR was born: not by contributing to one of the existing DB solutions built on legacy models, of which there were and still are many, but by disagreeing and seeing opportunities to do better. A remarkable achievement by you, Thomas, Sam, Heather and others.
But now the same is happening again: what others do is used as an argument not to try something else, for which there may be good reasons that I would like to discuss.
The advantage of using a document structure is the simplicity of the code that is needed, not only for storing and retrieval but also for path-based querying.
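To make the path-based querying point concrete, here is a minimal sketch over a stored document - the nested structure is illustrative, not the actual openEHR reference model:

```python
# Minimal path resolver: walk a '/'-separated path through nested
# dicts (objects) and lists (repeated elements) of a stored document.
def resolve_path(node, path):
    for segment in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(segment)]   # numeric segment indexes into a list
        else:
            node = node[segment]        # otherwise look up a named attribute
    return node

composition = {"content": [{"name": "BP", "systolic": {"magnitude": 120}}]}
print(resolve_path(composition, "/content/0/systolic/magnitude"))  # 120
```

With the document stored whole, this kind of retrieval needs no joins or object-relational mapping layer, which is the simplicity being argued for here.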
Remember the saying of the well-known programming guru Venkat Subramaniam:
Don’t walk away from complexity…
Simplicity means large cost reductions, fewer risks and better maintainability.
I don't have the figures at hand, but a database like MongoDB is scalable, simple to use and used in large projects.
Typical for openEHR is that it never needs to change or delete existing data. We see this more often nowadays, in other fields too, because storage has never been so cheap and fast; we can afford this kind of solution.
These characteristics make document databases like MongoDB a strong potential candidate for openEHR storage.
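The append-only point can be sketched as a tiny versioned store - my own illustration of the general idea, not openEHR's actual versioning model:

```python
# Sketch of append-only versioning: a correction creates a new version
# rather than overwriting, so the full history is always preserved.
class VersionedStore:
    def __init__(self):
        self._versions = {}  # uid -> list of versions, oldest first

    def commit(self, uid, data):
        # Never update in place; always append a new version.
        self._versions.setdefault(uid, []).append(data)

    def latest(self, uid):
        return self._versions[uid][-1]

    def history(self, uid):
        return list(self._versions[uid])

store = VersionedStore()
store.commit("c1", {"systolic": 120})
store.commit("c1", {"systolic": 125})   # a correction: appended, not overwritten
print(store.latest("c1"))               # {'systolic': 125}
print(len(store.history("c1")))         # 2
```

Because nothing is ever mutated or deleted, such a store sidesteps many update-concurrency problems, at the cost of extra storage.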
Thanks Ian, I appreciate your reply. Sorry for the many edits, but while writing I kept arriving at better wording for what I think.
And now I would also like a more technical reply.
Is this from personal experience or from the paper published some years ago? Because I’m pretty sure it didn’t define indexes etc.
I think it would need further evaluation.
In any case I think, of the available implementations, RDB+JSONB is the way to go. Neo4j works surprisingly well and is probably really useful for research purposes (being able to query the data easily from different parts of the graph).
Yes I should have made that clear - XML databases ‘out of the box’ do not scale.
My comments (as a General Hacktitioner - thx @marcusbaw) were made from knowing what has been tried and has publicly worked (and from a little insider information). Other options such as Mumps, Neo4j and MongoDB have been tried, but for whatever reason these have not really come into the public eye, other than as research reports. That's not to say that they could not be made to work, but my takeaway, as a clinical hacker, is that nothing will just work well out of the box. openEHR data has a document commit paradigm but a document-neutral retrieval paradigm. It also demands full access to all of the data tree, and a no-engineering approach to data model upgrades. All of these tend to confound some of the potential advantages of e.g. Mumps or MongoDB. What seems like a good and natural fit may not be so in practice.
Having said that, I would love to see openEHR CDRs tried on these non-RDBMS options, but I suspect it might be harder, and less advantageous, than it first appears.
I’m certainly not wanting to discourage anyone from trying but I would caution anyone new to this space to strongly consider making use of one of the existing CDRs, at least initially, as you can burn a huge amount of time and resource trying to build a CDR. Get your project up and running and get your team familiar with using a CDR before embarking on building one.
It is true. I have great respect for Eric Sundvall; he often says very wise and clever things. But the paper he wrote, in which he drew conclusions about XML databases, had many shortcomings.
He said nothing about indexes, nothing about the model of the MySQL database it was compared against, and the query he ran was a data-mining query, not a normal practice query.
So we cannot take that too seriously. I regard it as a mistake of the kind we all make now and then.
I am quite sure that every database - Mongo, XML, relational, hierarchical, etc. - delivers almost instantly if you query on an indexed field or primary key.
And for atypical queries it is always possible to optimize the circumstances under which the query has to run. That is also very common in data mining on big data.
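The point about indexing can be shown in miniature - a toy in-memory example of my own, not a benchmark of any real engine:

```python
# Sketch: a one-off "data-mining" query can be made fast on any storage
# engine by building an index for it first.
records = [{"id": i, "code": "I10" if i % 100 == 0 else "Z00"}
           for i in range(10_000)]

# Without an index: a full scan over every record.
scan_hits = [r["id"] for r in records if r["code"] == "I10"]

# Build an index once; the same query becomes a dictionary lookup.
by_code = {}
for r in records:
    by_code.setdefault(r["code"], []).append(r["id"])
index_hits = by_code["I10"]

print(scan_hits == index_hits)  # True: same answer
print(len(index_hits))          # 100
```

The difference between a full scan and an index lookup, not the choice of database family, is usually what dominates query time.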
There is a lot of information about performance tests on large databases; many people want to know.
I am looking for arguments against document databases, because the pros are too many to ignore.
Have you seen this paper in the past?
I may have shared it with you before…
I find it helps understand the world of databases and why some old & unpopular tech has done so well in healthcare over the decades…
(PS don’t want to get into a debate about the pros/cons of the related language, just to make the point that the related database engine is a v interesting one)
I must have missed it; I have never seen it before. A quick look makes an interesting impression - I will read it later.
Just some quick feedback from my personal experience: in the past, we tried two approaches/technologies. One was InterSystems Caché. In principle it looks like a good fit, providing good handling of hierarchies and performance at the same time. The biggest issue was the tooling for handling Java classes; the framework is/was called Jalapeño. However, there were lots of issues (e.g. handling of generics) that demanded workarounds. Inserts took forever and we finally abandoned the approach.
For data warehousing, we used Microsoft SQL Server 2012. This was merely for prototyping purposes. In general it worked reasonably well, because you can provide XML schemas and do some indexing. We were able to shred compositions into different, automatically generated entry tables, which allowed for some optimization. When the whole composition was needed, I could reassemble it using stored procedures. T-SQL also provides very nice capabilities for combining XPath and SQL. So generally I think this is a good candidate for openEHR.
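The shredding idea described here can be sketched roughly as follows - the XML is invented for illustration, not the official openEHR XSD, and real shredding is schema-driven rather than hand-written:

```python
# Sketch: split ("shred") a composition document into flat per-entry rows
# while keeping the original fragment, so it can be reassembled later.
import xml.etree.ElementTree as ET

xml = """<composition uid="c1">
  <entry archetype="observation.blood_pressure"><systolic>120</systolic></entry>
  <entry archetype="observation.body_weight"><weight>81.5</weight></entry>
</composition>"""

root = ET.fromstring(xml)
# One row per entry: (composition uid, archetype id, raw entry fragment).
rows = [(root.get("uid"), e.get("archetype"), ET.tostring(e, encoding="unicode"))
        for e in root.findall("entry")]

print(len(rows))      # 2
print(rows[0][1])     # observation.blood_pressure
```

Each archetype then gets its own table, which is what makes targeted indexing and optimization possible; reassembly concatenates the stored fragments back under the composition root.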
Tony, I have read the PDF. Twenty years ago I worked with Mumps, so it was easy to understand and I already knew most of it. The idea of storing JSON structures in such a database had never occurred to me; it is well explained and easy to do, but I don't see much advantage.
A disadvantage is that extra code layers are needed, even extending to query engines, which is not what I am looking for; on the contrary, I think there is an advantage in simplicity.
(I wrote an SQL engine to access a medical MUMPS database in 2004 for a product I created called HISlink)
It is a solution for hard-core Mumps fans to give it new life.
So it did not answer my question of why not to use MongoDB or a similar database.
But thanks anyway.
Thanks, Birger, for your story. I have heard that InterSystems Caché is a fast-performing database, but you have had other experiences. My objection is that the price is too high, and for that reason there is no big, open community.
Such a community is, for me, a requirement when choosing a database.
MSSQL is a good database, as are Oracle and MySQL. But these are relational databases, and that is not the kind of database I want to discuss, because using a relational database for document-based storage requires a complex software layer, and that layer becomes even more complex when a path-based query engine is needed.
So that is not what I want to discuss. I started this discussion to find good reasons not to use MongoDB or similar, and I have not yet seen an argument against it.
But thanks anyway for your contribution.
See this recent & related presentation… as followup to that earlier paper
“Background to Qewd: Data Persistence as a Language Feature”
I think you’ll find it of interest, Tony
just on the notion of ‘documents’… openEHR doesn’t specify anything to do with documents as such. The current specifications are all about data.
If an implementer chooses to serialise openEHR data first into XML, JSON or whatever else ‘documents’ and then store it, rather than storing atomised (or mostly atomised) data, then that’s fine, it’s an implementation choice. It may well be just as fast as other kinds of storage, I don’t have a personal opinion right now, because I don’t think we have enough current scientific studies showing the performance of various persistence technologies for complex data.
But there is no natural conclusion from the openEHR specifications that storing documents is the right way to implement persistence.
In my view we should be very careful using the word ‘document’. For me a serialisation of data is not a document; a ‘document’ should be understood as an authored entity that is standalone and may be signed (off) by its author. But the Compositions containing vital signs data generated by bed monitors are not ‘documents’ in this sense. Neither are most structured clinical recordings, other than where the author is consciously writing a report or summary of some kind.
I think your argument is really this: the advantages of using DBs based on storing serial formats rather than atomic data are being under-estimated and should be revisited. This may well be true. One thing to note though: ‘complex’ data tends to have a lot of cross-linking, and this is increasingly the case with openEHR data. So it has to be solved as part of any serial-persistence method.
Right, Thomas, they aren't really documents; it is just a way of speaking. Databases like Mongo or Couch are sometimes (not quite correctly) called document-based DBs.
I wasn't aware of cross-linking - are you referring to references like Locatable-Ref?
If so, they do indeed need extra data-access logic in software, but that has nothing to do with the kind of DB you use.
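The extra data-access logic for cross-links might look something like this in outline - an invented two-object store, not openEHR's actual reference mechanism:

```python
# Sketch: cross-links are stored as ids and resolved by application code
# in a second pass, independently of the underlying database.
store = {
    "obj-1": {"name": "medication order", "links": ["obj-2"]},
    "obj-2": {"name": "diagnosis", "links": []},
}

def resolve_links(uid, store):
    """Replace reference ids with the referenced objects (one level deep)."""
    obj = store[uid]
    return {**obj, "links": [store[ref] for ref in obj["links"]]}

resolved = resolve_links("obj-1", store)
print(resolved["links"][0]["name"])  # diagnosis
```

Whether the objects live in a relational table, a document collection or a key-value store, this resolution step is the same, which is the point being made above.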
I think that is good news. I remember the days when I spent many months, even more than a year, experimenting and writing complex code for an implementation; now I could do the same in maybe one month.
This basic kernel would then consist of functionality for validating, storing, retrieving and querying datasets; transforming to FHIR can also become much easier.
An even better idea: if I had the time, I could create a Spring module so that implementing openEHR in a Java project is only a matter of a few annotations connected to HTML or app forms. Spring really scales, especially in the cloud and with Kubernetes.
But I have no time; maybe a good person with money reads this and can buy time from me. Otherwise, I am happy with what I do now.
The Lego-box becomes reality;-)
Tony, I think that a language with a built-in data-access feature is a disadvantage. We had a lot of them in the nineties: programming languages more or less tied to database handling.
Because I want software to be modular, languages, paradigms and databases need to be interchangeable - no product lock-in. I don't really see a role for M in this.
Best of luck with your further research Bert, regards, Tony
Thanks. Same for you
Not sure about the details of your intended use of Spring, but EHRbase uses Spring, and in a client library we automatically transform OPTs into JPA-compatible classes to allow easier handling of openEHR in clients. It would be interesting to learn whether this is aligned with your thinking. See https://github.com/ehrbase/ehrbase_client_library.
That is very interesting. I am also thinking about having basic openEHR functionality in Spring modules, so that openEHR becomes a kind of Hibernate - of course not the same, but similar, and only for clinical purposes.
I don't have it all worked out, but I was thinking about mapping annotated archetype representations to data representations in forms, messaging, etc.
Building an openEHR application would then become a short project in which most functionality is ready to use.
But at the moment I have no ready architecture, and I never will until someone is ready to talk about financing. My career does not depend on openEHR; it is something I like to do, but I can do other things as well.
As always, there is no "best" in the abstract. The best depends on specific requirements, needs, preferences and criteria: it could be personal preference, or it could be the option that, after comparing many, adapts best to the current project. Some projects require multiple approaches because of multiple competing requirements. I've been in this area for a long time.
First we need to classify requirements and understand that some of them are not compatible: a transactional DB is one thing, an analytical DB another, a data warehouse another, an epidemiology DB yet another. Then there is the scale factor: is this for a small practice, a hospital, a network, a country? Again, different needs, different solutions.
I developed my first openEHR DB for a generic EMR system back in 2009, then in 2013 released the first open source openEHR CDR, the EHRServer. I tried to focus on some of the requirements, knowing I could not cover all of them with only one approach; basically I focused on Clinical Decision Support and Shared EHR. Here is a summary: https://cloudehrserver.com/learn
The EHRServer might not be a good solution for people requiring AQL, since AQL support needs a complete representation of the data trees at the database level, and we don't have that. AQL is a spec that places that hard requirement on the DBMS technology: if you don't have the trees, you can't implement AQL, period (mainly because of how the CONTAINS clause works).
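Why CONTAINS needs the full tree can be sketched as a recursive descent - the nested structure and archetype ids below are illustrative, not a real AQL engine:

```python
# Sketch: CONTAINS asks "does any descendant of this node carry the given
# archetype?", which requires walking the whole data hierarchy.
def contains(node, archetype_id):
    if isinstance(node, dict):
        if node.get("archetype_node_id") == archetype_id:
            return True
        return any(contains(v, archetype_id) for v in node.values())
    if isinstance(node, list):
        return any(contains(v, archetype_id) for v in node)
    return False  # leaf values cannot contain anything

composition = {
    "archetype_node_id": "openEHR-EHR-COMPOSITION.encounter.v1",
    "content": [{"archetype_node_id": "openEHR-EHR-OBSERVATION.blood_pressure.v2",
                 "data": {}}],
}
print(contains(composition, "openEHR-EHR-OBSERVATION.blood_pressure.v2"))  # True
```

If the store only keeps flattened or partial structures, this descent is impossible, which is exactly the hard requirement described above.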
I have also mentioned many times that AQL is not needed in every single project, and that Domain Specific Languages are a better solution for querying openEHR data: https://docs.google.com/document/d/1pGJXIWHCgjyiofLmNHRTG7UFuB-tVWzm625X5rkiHiw/edit?usp=sharing
So some projects need relational DBs for transactional data and document DBs to support AQL and analytics; and for scaling, the key is in the architecture more than in the implementation technology: redundancy, synchronization and load balancing are certainly required at the repository/data-management level.
It is key to understand the core requirements of each project in order to: 1. know which DB technology or technologies are needed; 2. check whether AQL is really required or adds any value; 3. not be afraid of experimenting - there is no one-size-fits-all solution, and your project might require some custom tools to do the job.
So far I have tested Ethercis/EHRBASE, EHRSCAPE and EHRServer/Atomik, and each has its own design approach, satisfies certain sets of requirements and focuses on specific use cases and types of users. Not understanding the differences leads to not selecting the right tool for the job. My best advice is to try what is out there, compare, and choose based on your needs. There is no best.