# Best Database for openEHR

**Category:** [Platform](https://discourse.openehr.org/c/platform-implem/7)
**Created:** 2019-11-09 09:56 UTC
**Views:** 8147
**Replies:** 86
**URL:** https://discourse.openehr.org/t/best-database-for-openehr/117

---

## Post #1 by @Bert_Verhees

Which database would you advise for OpenEhr, and is there documentation with more details about that choice?

---

## Post #2 by @ian.mcnicoll

I think all we can say is that we know what works:

1. O-R mapping like Hibernate works, but may not scale.
2. XML databases do not scale.
3. Simple use of an RDBMS with smart indexing and/or JSONB works well - virtually all of the CDRs in production use this approach: ThinkEhr, DipsEhr, OceanEHR, Base24, Ethercis, EhrBase.
4. Graph DBs, Mumps etc. should work, but there are no real-world examples.

The question is why anyone would want to build another, rather than license a proprietary CDR or contribute to one of the OSS examples.

---

## Post #3 by @Bert_Verhees

I agree with most of it, but you left out one important possibility, one that is proven technology and widely used, just not in OpenEhr land.

Technically, I don't consider OpenEhr a very unique DB. It is about storing documents. The validation of archetyped objects is unique, but after that stage it is just low-level technical database handling. That kind of object/document-oriented database processing is widely tested, and there is tooling to make it scalable. But that is technical; check the Apache sites, Google for sharding.

You ask why anyone would choose their own solution instead of contributing to an existing one. Maybe one does not consider an existing solution the best solution. Remember, that is also how OpenEhr was born: not by contributing to one of the many existing DB solutions built on legacy models, which were and still are many, but by disagreeing, by seeing opportunities to do better. A remarkable achievement by you, Thomas, Sam, Heather and others.
But now the same is happening again: what others do is used as an argument not to try something else, even though there may be good reasons to, which I would like to discuss.

The advantage of using a document structure is the simplicity of the code that is needed, not only for storage and retrieval but also for path-based querying. Remember the saying of the well-known programming guru Venkat Subramaniam: "Don't walk away from complexity... Run!!!" Simplicity means a large cost reduction, fewer risks, better maintainability.

I don't have the figures at hand, but a database like MongoDb is scalable, simple to use, and used in large projects. Typical for OpenEhr is that it never needs to change or delete existing data. This is something we see more nowadays, also in other fields, because storage has never been so cheap and fast; we can afford these kinds of solutions. These features bring document databases like MongoDb forward as a good potential candidate for openEHR storage.

Thanks Ian, I appreciate your reply. Sorry for the many edits, but while writing I arrive at a better wording of what I think. And now I would also like a more technical reply.

Best regards
Bert Verhees

---

## Post #4 by @yampeku

[quote="ian.mcnicoll, post:2, topic:117"]
XML databases do not scale
[/quote]

Is this from personal experience or from the paper published some years ago? Because I'm pretty sure it didn't define indexes etc. I think it would need further evaluation. In any case, I think that of the available implementations, RDB+JSONB is the way to go. Neo4j works surprisingly well and is probably really useful for research purposes (being able to query the data easily from different parts of the graph).

---

## Post #5 by @ian.mcnicoll

Yes, I should have made that clear - XML databases 'out of the box' do not scale. My (as a General Hactitioner - thx @marcusbaw) comments were made from knowing what has been tried and publicly worked (and from a little insider information).
Other options such as Mumps, Neo4J and MongoDb have been tried, but for whatever reason these have not really come into the public eye, other than as research reports. That's not to say that they could not be made to work, but my takeaway, as a clinical hacker, is that nothing will just work well out of the box. openEHR data has a document commit paradigm but a document-neutral retrieval paradigm. It also demands full access to all of the data tree, and a no-engineering data model upgrade approach. All of these tend to confound some of the potential advantages of e.g. Mumps or MongoDb. What seems like a good and natural fit may not be so in practice.

Having said that, I would love to see openEHR CDRs tried on these non-RDBMS options, but I suspect it might be harder, and less advantageous, than it first appears. I'm certainly not wanting to discourage anyone from trying, but I would caution anyone new to this space to strongly consider making use of one of the existing CDRs, at least initially, as you can burn a huge amount of time and resource trying to build a CDR. Get your project up and running and get your team familiar with using a CDR before embarking on building one.

---

## Post #6 by @Bert_Verhees

It's true. I have great respect for Eric Sundvall, he often says very wise and clever things, but the paper he wrote in which he drew conclusions about XML databases had many shortcomings. He said nothing about indexes, he said nothing about the model of the MySql database which was the opponent, and the query he ran was a data-mining query, not a normal-practice query. So we cannot take that too seriously. I regard it as a mistake of the kind we all make every now and then.

I am quite sure that every database - Mongo, XML, relational, hierarchical, etc. - delivers almost instantly if you query for an indexed field or primary key. And for atypical queries it is always possible to optimize the circumstances in which the query has to run.
That is also very common to do in data mining on big data. There is a lot of information about performance tests on large databases; many people want to know. I am looking for arguments against document databases, because the pros are too many to ignore.

---

## Post #7 by @tonyshannon

Hi Bert

Have you seen this paper in the past? http://www.mgateway.com/docs/universalNoSQL.pdf

I may have shared it with you before.. I find it helps understand the world of databases and why some old & unpopular tech has done so well in healthcare over the decades.. (PS don't want to get into a debate about the pros/cons of the related language, just to make the point that the related database engine is a very interesting one)

regards
Tony

---

## Post #8 by @Bert_Verhees

Thanks Tony, I must have missed it; I have never seen it before. A quick look makes an interesting impression. I will have to read it later.

Thanks
Bert Verhees

---

## Post #9 by @birger.haarbrandt

Hi Bert,

just some quick feedback from my personal experience: in the past, we tried two approaches/technologies. One was Intersystems Caché. In principle, it looks like a good fit, providing good handling of hierarchies and performance at the same time. The biggest issue was the tooling for handling Java classes. The name of the framework is/was Jalapeño. However, there were lots of issues (e.g. handling of generics) which demanded workarounds. Inserts took forever and we finally abandoned the approach.

For data warehousing, we used Microsoft SQL Server 2012. This was merely for prototyping purposes. In general, this worked somewhat well because you can provide XML schemas and do some indexing. We were able to shred compositions and put them into different entry tables (automatically generated), which allowed for some optimization. When the whole composition was needed, I was able to put it back together using stored procedures. T-SQL also provides very nice capabilities for combining XPath and SQL.
So generally, I think this is a good candidate for openEHR.

Cheers,
Birger

---

## Post #10 by @Bert_Verhees

Tony, I have read the PDF. 20 years ago I worked with Mumps, so it was easy for me to understand and I already knew most of it. The idea of storing JSON structures in such a database had never come to my mind; it is well explained and easy to do, but I don't see much advantage. A disadvantage is that extra code layers are needed, even extending to query engines, which is not what I am looking for - on the contrary, I think there is advantage in simplicity. (I wrote an SQL engine to access a medical MUMPS database in 2004, for a product I created called HISlink.) It is a solution for the hard-core Mumps fans to give it new life. So it did not answer my question of why not to use MongoDb or a similar database. But thanks anyway.

Best regards

---

## Post #11 by @Bert_Verhees

Thanks Birger for your story. I have heard InterSystems Caché is a fast-performing database, but you have had other experiences. My objection is that the price is too high, and for that reason there is no big and open community. Such a community is, for me, a requirement when choosing a database.

MSSQL is a good database, like Oracle or MySQL. But these are relational databases, and that is not the kind of database I want to discuss, because using a relational database for document-based storage requires a complex software layer, and that layer becomes even more complex when a path-based query engine is needed. So that is not what I want to discuss. I started this discussion to find good reasons not to use MongoDb or similar, and I have not seen an argument against it. But thanks anyway for your contribution.

Best regards

---

## Post #12 by @tonyshannon

Hi Bert,

Just to follow up on your point, you may be interested in an OS implementation of the ideas in that DB engine paper, aka QewdJS.. it swaps the M language out & swaps Javascript in, while leveraging the same DB/Globals engine.
See this recent & related presentation, a follow-up to that earlier paper: "Background to Qewd: Data Persistence as a Language Feature" http://ec2.mgateway.com/qewd-background.pdf

I think you'll find it of interest,
Tony

---

## Post #13 by @thomas.beale

Hi Bert,

just on the notion of 'documents'... openEHR doesn't specify anything to do with documents as such. The current specifications are all about data. If an implementer chooses to serialise openEHR data first into XML, JSON or whatever other 'documents' and then store it, rather than storing atomised (or mostly atomised) data, then that's fine - it's an implementation choice. It may well be just as fast as other kinds of storage; I don't have a personal opinion right now, because I don't think we have enough current scientific studies showing the performance of various persistence technologies for complex data. But there is no natural conclusion from the openEHR specifications that storing documents is the right way to implement persistence.

In my view we should be very careful using the word 'document'. For me a serialisation of data is not a document; a 'document' should be understood as an authored entity that is standalone and may be signed (off) by its author. But the Compositions containing vital signs data generated by bed monitors are not 'documents' in this sense. Neither are most structured clinical recordings, other than where the author is consciously writing a report or summary of some kind.

I think your argument is really this: the advantages of using DBs based on storing serial formats rather than atomic data are being under-estimated and should be revisited. This may well be true. One thing to note though: 'complex' data tends to have a lot of cross-linking, and this is increasingly the case with openEHR data. So that has to be solved as part of any serial-persistence method.
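Thomas's distinction between serialised 'documents' and atomised data can be made concrete. Below is a minimal sketch, assuming plain Python dicts stand in for a document store such as MongoDB; the composition structure and paths are illustrative only, not a faithful openEHR RM serialisation:

```python
import json

# A toy "composition" serialised as a JSON document (illustrative structure,
# not real openEHR RM output).
composition = {
    "uid": "c1",
    "archetype_node_id": "openEHR-EHR-COMPOSITION.encounter.v1",
    "content": [
        {
            "archetype_node_id": "openEHR-EHR-OBSERVATION.blood_pressure.v2",
            "data": {"systolic": {"magnitude": 120, "units": "mm[Hg]"},
                     "diastolic": {"magnitude": 80, "units": "mm[Hg]"}},
        }
    ],
}

# "Document" persistence is just serialise-on-write...
stored = json.dumps(composition)

def get_path(node, path):
    """Follow a /-separated path through nested dicts and lists."""
    for part in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(part)]
        else:
            node = node[part]
    return node

# ...but retrieval by path still needs code that walks the whole tree,
# which is the "full access to all of the data tree" demand Ian mentions.
doc = json.loads(stored)
print(get_path(doc, "/content/0/data/systolic/magnitude"))  # → 120
```

The point of the sketch: storing the serial form is trivial, but path-based access and cross-linking still have to be solved on top of it, whichever engine holds the bytes.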
---

## Post #14 by @Bert_Verhees

Right, Thomas, they aren't really documents, it is just a way of speaking; databases like Mongo or Couch are sometimes (not correctly) called document-based DBs.

I wasn't aware of cross-linking - are you referring to references like Locatable-Ref? If so, they do indeed need extra data-access software logic, but that has nothing to do with the kind of DB you use.

My argument really is that an OpenEhr implementation can be really simple, because, when using the right tools, the kernel can be generated for large parts, and it could even run in Javascript/Typescript frameworks like Node.js. I would not advise that, but also not advise against it; for a small-scale application (max 100 concurrent users) this would be fine. I think that is good news. I remember the days when I worked many months, even more than a year, experimenting, writing complex code for an implementation; now I could do the same in maybe one month. This basic kernel would then consist of functionality for validating, storing, retrieving and querying datasets; transforming to FHIR can also become much easier.

An even better idea, if I had time: I could create a Spring module so that implementing OpenEhr in a Java project is only a matter of a few annotations connected to HTML or app forms. Spring really scales, especially in the cloud and with Kubernetes. But I have no time; maybe a good guy with money reads this and can buy time from me. Or else, I am happy with what I do now. The Lego box becomes reality ;-)

---

## Post #15 by @Bert_Verhees

Tony, I think that a language with a data-access feature is a disadvantage. We had a lot of them in the nineties: programming languages more or less tied to database handling. Because I want software to be modular, languages/paradigms/databases need to be interchangeable - no product lock-in. I don't really see a role in this for M.
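The 'Universal NoSQL' idea in the paper Tony linked earlier - projecting a JSON tree onto Mumps-style global nodes, i.e. ordered subscript tuples mapping to leaf values - can be sketched in a few lines. This is an illustration of the projection only, not QewdJS's actual storage code:

```python
def to_global(node, subscripts=()):
    """Flatten a nested structure into (subscript-tuple, value) pairs,
    the way a JSON tree maps onto Mumps global nodes."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from to_global(value, subscripts + (key,))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from to_global(value, subscripts + (i,))
    else:
        yield subscripts, node  # leaf: one global node

# Hypothetical patient record, purely for illustration.
patient = {"name": "Smith", "obs": [{"sys": 120}, {"sys": 118}]}

for subs, value in to_global(patient):
    print(subs, "=", value)
# ('name',) = Smith
# ('obs', 0, 'sys') = 120
# ('obs', 1, 'sys') = 118
```

In a real globals engine the subscript tuples would be the collating keys of `^Patient(...)` nodes; the flattening itself is the whole trick, which is why the paper can present globals as a "universal" NoSQL substrate.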
---

## Post #16 by @tonyshannon

Best of luck with your further research Bert,
regards,
Tony

---

## Post #17 by @Bert_Verhees

Thanks. Same to you

---

## Post #18 by @birger.haarbrandt

Not sure about the details of your intended use of Spring, but EHRbase uses Spring, and in a client library we automatically transform OPTs to JPA-compatible classes to allow easier handling of openEHR in clients. It would be interesting to learn whether this is aligned with your thinking. See https://github.com/ehrbase/ehrbase_client_library.

---

## Post #19 by @Bert_Verhees

That is very interesting. I am also thinking about having basic OpenEhr functionality in Spring modules, so that OpenEhr becomes a kind of Hibernate - of course not the same, but similar, and only for clinical purposes. I don't have it all in mind, but I was thinking about annotated archetype representation mapped to data representation in forms, messaging, etc. Building an OpenEhr application would become a short trajectory in which most functionality is ready to use. But at the moment I have no ready architecture, and I never will until someone is ready to talk about financing. My career does not depend on OpenEhr. It is something I like to do, but I can do other things as well.

---

## Post #20 by @pablo

As always, there is no "best" in the abstract. The best depends on specific requirements, needs, preferences and criteria. It could be personal preference, or it could be the best after comparing many options and choosing the one that adapts best to the current project. Some projects require multiple approaches because of multiple competing requirements, etc., etc. I've been in this area for a long time :slight_smile:

First we need to classify requirements, and understand that some of them are not compatible: a transactional DB is one thing, an analytical DB another, a data warehouse another, an epidemiology DB yet another. Then we have the scale factor: is this for a small practice, for a hospital, for a network, for a country?
Again, different needs, different solutions.

I developed my first openEHR DB for a generic EMR system back in 2009, then in 2013 released the first openEHR open source CDR, the EHRServer, and I tried to focus on some of the requirements, knowing I could not cover all of them with only one approach. Basically I focused on Clinical Decision Support and Shared EHR. Here is a summary: https://cloudehrserver.com/learn

The EHRServer might not be a good solution for people requiring AQL, since for AQL support you need a complete representation of the data trees at the database level, and we don't have that. On the other hand, AQL is a spec that places that hard requirement on the DBMS technology: if you don't have the trees, you can't implement AQL, period (mainly because of how the CONTAINS clause works). I have also mentioned many times that AQL is not needed in every single project, and that Domain Specific Languages are a better solution for querying openEHR data: https://docs.google.com/document/d/1pGJXIWHCgjyiofLmNHRTG7UFuB-tVWzm625X5rkiHiw/edit?usp=sharing

So some projects need relational DBs for transactional data and document DBs to support AQL and analytics; and for scaling, the key lies in the architecture more than in the implementation technology - redundancy, synchronization and load balancing are certainly required at the repository/data management level.

It is key to understand the core requirements of each project in order to know 1. which DB technology or technologies are needed, 2. whether AQL is really required or adds any value, and 3. not to be afraid of experimenting: there is no one-size-fits-all solution, and your project might require some custom tools to do the job.

So far I have tested Ethercis/EHRBASE, EHRSCAPE and EHRServer/Atomik, and each has its own design approach, satisfies certain sets of requirements and focuses on specific use cases and types of users. Not understanding the differences leads to not selecting the right tool for the job.
My best advice is to try what is out there, compare, and choose based on your needs. There is no best.

---

## Post #21 by @Bert_Verhees

I will take a look at the project and see how it goes. Using a JPA interface makes it modular, which is good. I wonder if you are able to separate the DB layer from the data access code in a way that makes it possible to use different kinds of DBs. This has the consequence that table structures are invisible at the data access level. I think this is necessary for creating a modular OpenEhr engine.

---

## Post #22 by @Bert_Verhees

It is true, Pablo, there is no best way; there are advantages and disadvantages to any architecture. And legacy is also important; this counts extra when modularity is lost, which often happens as a product gets older. I agree with the lessons you teach us, and I want to add one: all software gurus teach us to keep the architecture modular, even if it costs extra money, extra time or extra refactoring. Extensive tests are also important, to make refactoring possible.

---

## Post #23 by @birger.haarbrandt

In theory, this might work. In (our) practice, EtherCIS and EHRbase use lots of specific functions from Postgres. I think that especially AQL is the hard part here that requires some tweaking. So migrating to a different database technology would actually be very expensive right now.

---

## Post #24 by @Bert_Verhees

It is often that way: keeping things strictly modular costs more than letting the different layers become visible in each other. But in the longer term, strictly modular code always pays you back, and strictly modular code is also better testable. However, who am I to lay that upon you? You are building a great project. Congratulations on that, and good luck.

---

## Post #25 by @pablo

I don't think dependency on one specific DBMS is an issue, since most openEHR vendors actually use one specific DBMS - I guess because of all the tweaks needed to actually support AQL.
In the EHRServer we don't support AQL; we have a Simplified Domain Specific Language for querying that works similarly to AQL but doesn't need customization at the database level. Also, because we use the Grails Framework, which uses Hibernate, the DBMS can be changed from configuration, and any relational DB is supported. This was also done by design, knowing different clients might prefer one DBMS over others, and might already have licenses for a specific DBMS.

On the other hand, if we think of microservices, this is just another microservice in the architecture, like a black box working independently from the rest of the components, so the specific DBMS might not matter - it is just that for some it does. For instance, open source projects might choose MySQL/MariaDB over PostgreSQL or vice versa.

---

## Post #26 by @Bert_Verhees

Hi Pablo, you have written a very interesting paper. I did not click on it the first time I saw it. https://docs.google.com/document/d/1pGJXIWHCgjyiofLmNHRTG7UFuB-tVWzm625X5rkiHiw/edit?usp=sharing

Like you, I think that AQL is not necessarily needed to build a good CDR, also because it is not a standard and not portable. Path-based query languages like XQuery are more portable and maybe a better choice. Google, for fun, XQuery and mongodb. Writing an AQL engine is of course possible, but what is the point - what is it worth if there are high-quality portable alternatives? One can also consider that, in the context of interoperability, one should avoid proprietary languages or languages with a small user base.

---

## Post #27 by @thomas.beale

Well, AQL makes for portable queries, which are very interesting, especially to the CDS industry. No EMR systems have this at the moment. Secondly, there is no reason why an openEHR system shouldn't support more than one query language. There is no reason in principle not to add a specification corresponding to Pablo's query language, for example - we've just never tried to do it.
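To illustrate why path-based filtering need not require a full tree representation at the DBMS level, here is a hypothetical mini-DSL in the spirit of (but not identical to) the EHRServer's simplified query model: data points are indexed as flat path→value rows, and a query is just a match over those rows. The row layout and operator names are invented for the example:

```python
# Flat path->value index rows, one per data point (hypothetical layout:
# a real CDR would persist these in any relational table).
index = [
    {"uid": "c1", "path": "/blood_pressure/systolic/magnitude", "value": 150},
    {"uid": "c1", "path": "/blood_pressure/diastolic/magnitude", "value": 95},
    {"uid": "c2", "path": "/blood_pressure/systolic/magnitude", "value": 118},
]

# Comparison operators of the mini-DSL.
OPS = {"gt": lambda a, b: a > b, "eq": lambda a, b: a == b}

def query(index, path, op, operand):
    """Return uids of compositions with a value at `path` matching the criterion."""
    return sorted({row["uid"] for row in index
                   if row["path"] == path and OPS[op](row["value"], operand)})

print(query(index, "/blood_pressure/systolic/magnitude", "gt", 140))  # → ['c1']
```

Because every criterion resolves to a scan or index lookup over flat rows, this style of querying ports to any relational backend - which is precisely the trade-off against AQL's CONTAINS, which needs the trees.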
---

## Post #28 by @Bert_Verhees

In my opinion it would have been better if OpenEhr had chosen XQuery as its preferred query language. It is an acknowledged W3C standard and, I guess, capable of the same queries, and like many systems using XQuery, OpenEhr has a path-based storage system. But time will tell which query language will be used most. It is encouraging that more people are considering other solutions.

---

## Post #29 by @pablo

My paper states that some AQL capabilities actually work against portability; for example, using name-based predicates makes queries impossible to use on multi-language implementations. Also, some vendors have their own extensions, which generate portability issues - just as MySQL and PostgreSQL queries are slightly different, and so not 100% portable.

---

## Post #30 by @pablo

IMHO for some types of projects there might be a specific way of querying that is easier/simpler to implement than AQL and that actually satisfies the project's requirements very well. I'm not against AQL at all; I'm just saying it is not the only solution, it is not the best solution for all requirements, and sometimes it is not possible to implement. So I thought of the idea of having many domain-specific languages for querying, designed to satisfy specific requirements, like supporting clinical decisions, or helping with analytics or population health.

Another nice touch in the openEHR REST API is that for queries there is actually a "formalism" parameter that allows sending things other than AQL to the query services. In the near future this will be implemented in the EHRServer, using this simplified querying model.

---

## Post #31 by @thomas.beale

Well, if literal strings are used in names, that's true, but that's going to happen in queries in any language; it's not specific to AQL. It's really about what's in the data. Vendor differences are hopefully a temporary thing.

---

## Post #32 by @thomas.beale

I would have been quite happy with XQuery, if it worked.
But it doesn't, directly. XQuery works fine for expressing paths and predicates over serial XML documents. It doesn't know about archetypes or two-layer modelling. AQL is about as close to XQuery as you can get for the more sophisticated data we have.

---

## Post #33 by @Bert_Verhees

True, working with XQuery requires a technical approach instead of a more conceptual one. But on the other hand, shouldn't querying always be something for technicians rather than clinicians?

The target user group AQL focuses on is very small, limited on one side by people who cannot understand that an archetype identifier is nothing more than a technical property of a data structure, indicating the structure and RM class to expect, and on the other side by people who cannot understand that blood pressure has a measurement unit and a location in a data structure - people who are not aware of data structures in general. In fact, the focused group consists of non-technical clinical information analysts.

Good query editors therefore hide the dirty technical details from this group and generate queries in whatever language is required. So AQL should not be a database-oriented query language but a GUI-oriented language which then, under the hood, creates the needed technical query - an intermediate language used in the GUI. It should not be implemented in a data access layer, but be abstracted, independent of database brand or type.

---

## Post #34 by @thomas.beale

[quote="Bert_Verhees, post:33, topic:117"]
shouldn’t querying always be something for technicians instead of clinicians.
[/quote]

AQL is exactly not a 'database-oriented query language' - it's a model-based query language. For that reason, it is eminently suitable for clinician use. I am assuming the use of tools that hide ugly technical things like path codes etc. - as Better (Marand) have done in their tools, and probably others by now.
Having a query language at the domain-model level, in a model-driven editor with intellisense, makes building queries pretty easy for anyone. It is our tooling that is lagging behind.

---

## Post #36 by @pablo

If you consider just the syntactical part, I agree, but I think the underlying problem is in how archetypes are modeled. There are two cases: 1. a free-text search where the text value used in the comparison is partial and probably comes from user input, which I think will be common, though we are not doing much of that in AQL; and 2. the text is hardcoded into the AQL expression as a filter for specific data, which I've seen a lot in AQL - but this case uses the text as if it were a code, which I don't think is reliable, and it affects portability.

The issue with case 2 could be avoided if the archetype modelled the text as a coded text, since that is how the text-based criterion is actually being used. Another point is that translations in archetypes can change - they are fixed and improved more frequently than the internal structure - so if just one character changes in the archetype, all queries that used that string as a filter will break.

IMO using hardcoded text in AQL is bad design of queries and bad design of archetypes, which could be fixed by using codes. Note this is not a problem of AQL as such - it is just that AQL is affected by how archetypes are designed - but we end up with queries we can't move between *any* openEHR systems compliant with AQL.

---

## Post #37 by @pablo

I agree; in fact I recommend that my students and customers delegate query creation to clinicians. We just need to provide tools they can use.

---

## Post #38 by @thomas.beale

Well, the choice of codes rather than text is always a runtime application choice, since any DV_TEXT can always be instantiated as a DV_CODED_TEXT. Maybe certain templates should force DV_TEXT to DV_CODED_TEXT more often, but that's a local choice.
Generally though, it would be better if free-text search keys were substituted via parameters rather than hard-wired in. I'm not sure any of this is a problem of AQL as such (let's assume we agree that we need better general text/terminology operators, semantics etc.).

---

## Post #39 by @ian.mcnicoll

As in everything, this is largely about the art of the possible and having to fit your design decisions into existing tooling, client budget and the exact use-case. I think what you are referring to, Pablo, is the very common practice of using templated name/values, i.e. renaming an existing archetype node. We generally do this for two purposes:

1. To localise the node name so that it fits better with local requirements. Of course the disadvantage here is that this is not (in ADL 1.4) language-independent. This is sort of worked around in the Better Archetype Designer, which uses ADL2 under the hood, but once we get into proper ADL 2.0 we will get language independence. It would actually be possible to add language-independent terms to name/value, and I have done this on some occasions, but in general this is overkill for the particular client requirement, as they will need to manage that terminology. However, this is not really an issue in querying, as the local name should not differ semantically from the parent.

2. To distinguish cloned nodes, whether whole archetypes, clusters or individual elements. This is more problematic, since the cloned name is often subtly semantically different from the parent (though normally a sub-type), and we are definitely reliant on that name/value override to distinguish nodes with identical at-codes but slightly different semantics. Again we could use coded_text here, though the tools do not make that easy, and ADL 2.0 should further mitigate this - but there may also be other situations where we need to distinguish cloned nodes at runtime, vis-à-vis the other conversation.
I guess that so far, most of the time, we have not had to worry about universal portability, since it depends on a huge number of things - common external terminologies, common archetypes, common templating practice. This is just another design trade-off that will become easier to manage over time.

---

## Post #40 by @pablo

Yes @ian.mcnicoll, I'm referring to that practice of using node names to define what data is inside, and keeping those as DV_TEXT. The issue would be solved by using DV_CODED_TEXT, since the problem is really using free text as some kind of code to define which information is recorded. IMO there should be a recommendation somewhere to specialize archetypes that have a DV_TEXT used in that way, and define a DV_CODED_TEXT in the specialization. The problem then is that the archetype that should appear in the data and queries is the specialized one, not the one (maybe) downloaded from the CKM.

I would say that AQL is as portable between implementations as SQL is portable between implementations. In SQL, portability ends when you use some custom implementation feature, or when the query is written in a way that is not compatible with other SQL implementations.

---

## Post #41 by @Seref

[quote="pablo, post:40, topic:117"]
In SQL portability ends when you use some custom implementation feature or when the query is written in a way that is not compatible with other SQL implementations
[/quote]

which is the case for pretty much every computing standard out there. SQL is a hugely successful query language in terms of adoption. What you seem to point out as a weakness is more of a misuse, and pales if you consider the fact that SQL achieved and delivered more than any other query language. I don't want to comment further in this particular thread, which is way too subjective and flamewar-prone, but a humble suggestion from one implementer to another: stick to AQL, it'll go places.
Implement alternatives if you want to, but you'll find those hard to sell compared to AQL. Whatever shortcomings you can come up with will mostly be handled within the standard in time, just as SQL did with window functions etc.

---

## Post #42 by @pablo

My point was: portability shouldn't be used as an adoption argument. I'm the biggest critic of AQL, and also the biggest proposer of fixes and improvements to its spec. At the same time, raising awareness of weaknesses and problems, and talking openly about them, creates the environment needed to fix the issues in an open way - as does talking about and considering alternative ways of querying in openEHR, which generates a rich discussion from which we, SEC members, can learn. This is all very far from any flame war, and much closer to a friendly conversation between peers. In any case, flame wars would only be fired by business interests, of which I have none, not by the need to improve the spec, which is my personal goal.

---

## Post #43 by @Seref

Hi Pablo. Ok, I misunderstood your position; it sounded more negative from an implementer's view than it apparently is. My bad. Always happy to have these discussions, to be clear. When I said "prone to flame wars", I was referring to the title of the topic, "best...", which is not what the discussion is about anymore - I wasn't referring to my exchange with you. We're good here :)

---

## Post #44 by @Bert_Verhees

In the context of the title of this discussion, maybe Apache Cassandra is an option:
http://cassandra.apache.org/
https://dzone.com/asset/download/346431

---

## Post #45 by @thomas.beale

Potentially interesting, going by the documentation. It does seem, however, to be a relational table system, going by the [query language pages](http://cassandra.apache.org/doc/latest/cql/index.html), and also the [storage architecture](http://cassandra.apache.org/doc/latest/architecture/storage_engine.html).
However, it does have [collections](http://cassandra.apache.org/doc/latest/cql/types.html#collections), for small amounts of 0..N data attached to a primary data object. For openEHR, the main question would be schema design to best optimise the OO data view with the mainly table-based concrete representation, which doesn't seem that much different from using Postgres or Oracle etc. So it seems the main unique point is scalability and fault-tolerance. --- ## Post #46 by @Bert_Verhees Thank you for your comment. On second thought, I find Cassandra disappointing in the OpenEhr context. Apart from scalability and fault tolerance, it does not offer what we need. I was thinking about locking: most databases put a lot of effort into smooth and granular locking. But however well it is done, there is always a price to pay. In OpenEhr, though, we don't change data. Never, although there are exceptions, which would need to be stored separately. For 99.9% of data-actions we only read and add. We should choose a database engine which is specially designed for this. And there is another thing: only very few indexes are needed in normal practice. An OpenEhr database should never be used for research or datamining. For that purpose there is the possibility to run a separate database in another thread or module, perhaps connected to the main stream via messaging. There is the idea of CQRS; it is worked out in Axon, but I am talking about the idea, not about Axon. It stands for separated query/storage responsibility. You need a database for storage, which needs to be very fast, with only a few indexes, and a database for querying. The second one will receive data over messaging, so the main thread never has to wait for the second to update indexes. The second can have message queues, and it may take a few seconds before freshly stored data are available for querying. In this way we keep the OpenEhr database lean and fast, doing only its core business.
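The storage/query split described above can be sketched in a few lines of Python. Everything here (class names, the in-process queue standing in for a real message broker) is illustrative, not any particular product's API:

```python
import queue

# Write side: an append-only log, matching the "read and add, never change"
# pattern described above. All names are made up for illustration.
class WriteStore:
    def __init__(self, bus):
        self._log = []   # the append-only log; nothing is ever updated in place
        self._bus = bus  # messaging channel to the query side

    def commit(self, ehr_id, composition):
        version = sum(1 for r in self._log if r["ehr_id"] == ehr_id) + 1
        record = {"ehr_id": ehr_id, "composition": composition, "version": version}
        self._log.append(record)  # fast: no query indexes to update here
        self._bus.put(record)     # propagate asynchronously to the query side
        return version

# Query side: builds its own indexes from the message stream, a few seconds
# behind the write side, without ever slowing writes down.
class QueryStore:
    def __init__(self, bus):
        self.by_ehr = {}
        self._bus = bus

    def drain(self):
        # In a real system this would run in its own thread or process.
        while True:
            try:
                record = self._bus.get_nowait()
            except queue.Empty:
                return
            self.by_ehr.setdefault(record["ehr_id"], []).append(record)

bus = queue.Queue()
writes, reads = WriteStore(bus), QueryStore(bus)
writes.commit("ehr-1", {"archetype": "openEHR-EHR-OBSERVATION.blood_pressure.v2"})
writes.commit("ehr-1", {"archetype": "openEHR-EHR-EVALUATION.problem_diagnosis.v1"})
reads.drain()
```

In production the queue would be a real broker and the query side would maintain proper indexes; the point is only that the write path never waits for them.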
So we need to find a database which stores data in an object-oriented way - just dump the data in it. And it needs to be scalable and secure in storage. And we need a database suited for querying, connected via messaging. --- ## Post #47 by @thomas.beale [quote="Bert_Verhees, post:46, topic:117"] But in OpenEhr we don’t change data. Never, although there are exceptions which would need to be stored separately. But for 99.9% data-actions we only read and add. We should choose a database engine which is specially designed for this. [/quote] I agree with this, and for this reason, DBs that write once (and later on, compress data on disk) are interesting. I also agree with your other comments. --- ## Post #48 by @pieterbos Cassandra has adopted a query language and a table schema that might look a bit like a relational database on top of their storage engine, but I would say it is not one. It is one of the databases built on the ideas of the Google Bigtable paper, and as such it is a sort of distributed, eventually consistent key-value store, where the values themselves are structured somewhat like a table. Each of these column families (the relational equivalent is a table) is a key-value store, where the values each can have a different set of columns. This means there is no strictly enforced table schema. What you can and cannot query is tightly constrained by the way you model your column families and their keys. This means you need to store data multiple times for efficient querying, or create indexes. Creating indexes has some potential scalability issues, so it should be done with some caution. There is no locking and there are no ACID transactions. Instead there is eventual consistency - the most recent write eventually wins - and the possibility to wait until data is written to a specified number of copies/servers. This comes with some new design challenges for any application.
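The "write to several replicas, wait for a quorum, last write wins" idea in the previous paragraph can be modelled in a few lines. This is a toy simulation of the consistency concept, not Cassandra's actual machinery; all names are illustrative:

```python
import itertools

# One replica: keeps (timestamp, value) per key, last write wins.
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, ts, value):
        current = self.data.get(key)
        if current is None or ts > current[0]:  # last-write-wins by timestamp
            self.data[key] = (ts, value)

    def read(self, key):
        return self.data.get(key)

class Cluster:
    def __init__(self, replica_count=3, quorum=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.quorum = quorum
        self._clock = itertools.count(1)  # deterministic stand-in for timestamps

    def write(self, key, value, reachable=None):
        ts = next(self._clock)
        targets = self.replicas if reachable is None else reachable
        if len(targets) < self.quorum:
            raise RuntimeError("not enough replicas reachable for a quorum write")
        for replica in targets:
            replica.write(key, ts, value)

    def read(self, key):
        # Read from a quorum of replicas and return the newest value seen.
        versions = [r.read(key) for r in self.replicas[: self.quorum]]
        versions = [v for v in versions if v is not None]
        return max(versions)[1] if versions else None

c = Cluster()
c.write("ehr-1/composition-7", "v1")
# This write only reaches two of the three replicas, but quorum=2 so it
# succeeds; the stale third replica would be repaired later in a real system.
c.write("ehr-1/composition-7", "v2", reachable=c.replicas[:2])
```

Because write quorum + read quorum exceed the replica count here (2 + 2 > 3), a quorum read always overlaps the latest quorum write, which is the design property that makes "eventual" consistency workable.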
I have used Cassandra before outside of OpenEHR (we contributed a bit to it as well) and for a few use cases it solved quite a lot of problems with very little work. Mainly when you need to store many terabytes of data and need to have it available for predefined queries all the time. Its main advantage is scalability and availability beyond that of relational databases, but this comes at a cost. You could build an OpenEHR implementation on it, especially since you can usually split the data into the separate EHRs, one per patient, which you can use as part of the key for the column families. However, this also means you can use traditional databases with sharding relatively easily, which means a relational database can scale further than one might think at first sight. I would only choose to use Cassandra if traditional alternatives do not provide what I need in terms of scalability. --- ## Post #49 by @Bert_Verhees Thanks Pieter, for this useful and excellent explanation. Best regards Bert Verhees --- ## Post #50 by @erik.sundvall (Readers more interested in NoSQL openEHR storage solutions than "reading issues" can skip all the way to the bullet list of three papers near the end of this post.) Hi Bert! Did you actually read through our XML-related papers or just look at the graphs? What would you say the papers _conclude_ that you actually do not agree with? If you claim the publication to be a mistake, rather than the problem being your possibly mistaken way of reading it, then you need to first read it and then motivate your claims. Or just re-read it and be happy and satisfied to find that the papers actually do not bash XML databases in general.
Our first XML paper - the one you are likely referring to: * Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf * A reader that does _not_ note that it is ad-hoc epidemiological population-wide querying we primarily explore will likely miss most points of the entire paper. (We follow through in later papers showing solutions for that use case - solutions that simultaneously cater for "normal" clinical one-patient-at-a-time EHR usage.) * From the abstract: "For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanisms for openEHR-based systems in production if population-wide ad hoc querying is needed." (Note the word "if" in the last sentence.) * The paper does _not_ primarily attempt to compare openEHR XML DBs to the non-openEHR RDBMS in general. The paper primarily explores size and speed, comparing different XML approaches. The RDBMS with real patient data is primarily a data source to generate large amounts of realistic/real test data. The measurements from the simpler RDBMS (also without query-specialized indexing) are just a baseline. Quote from the paper: "More information is also added to the openEHR data such as context, auditing, archetype ids an so on, which was not present in the anonymized SISCOLO [RDBMS] database. The size of the three sets of XML documents are respectively 556 MBytes, 2.8 GBytes and 23 GBytes. Therefore it is not a surprise that the sizes of the XML databases are much larger than the corresponding SQL database. However it is interesting to notice that the XML database systems differ greatly in the size of the generated databases with BaseX being the most space saving of all..." That we said "nothing" about indexes is plain wrong.
The paper says: * "No indexes besides those that are already built-in in the XML databases were created, because we were most interested in ad hoc queries for which it is not known in advance which indexes should be used, and which is a very common use case in health care research. Thus, in the XML databases, no assumptions were made about the kinds of query that would be made." * "The way the openEHR archetypes are designed and the nature of data values that are stored in the database make the automatically generated indexes in the databases inefficient. The archetypes usually have many attributes with the same value, for instance almost all archetypes have an archetype node id equal to “at0001” and the database used in this study has mainly coded values with few options to choose from. This makes xml text and attribute indexes point to a huge number of entries in the database, leading to long inspection of documents in order to return the results. How to best handle querying of the relatively deep openEHR tree structures, often with repeated path segment identifiers, is an interesting topic for future research." A proper reading would come to the useful conclusion that BaseX was the most interesting of the tested databases and that it works fine for many one-patient-at-a-time use cases, but is slow for the tricky use case of ad-hoc epidemiology queries.
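The low-selectivity problem the quoted passage describes can be shown with a toy example (the data is made up for illustration): a generic value index in which almost every entry shares the same archetype node id prunes nothing, so the engine still has to inspect nearly every document.

```python
from collections import defaultdict

# Made-up documents mimicking the situation the paper describes:
# nearly every node carries the same archetype node id, and coded
# values have only a few possible options.
docs = {
    i: {"archetype_node_id": "at0001",
        "code": "Y" if i % 2 else "N"}
    for i in range(10_000)
}

# A generic inverted index over (field, value) pairs, as an XML text or
# attribute index would build automatically, with no archetype awareness.
index = defaultdict(set)
for doc_id, fields in docs.items():
    for field, value in fields.items():
        index[(field, value)].add(doc_id)

# The "index lookup" barely narrows the search: every document matches,
# so the index points at the whole database.
hits = index[("archetype_node_id", "at0001")]
print(len(hits) / len(docs))  # -> 1.0: the index prunes nothing
```

This is why the later papers in the list below explore archetype-aware indexing (e.g. path/Dewey encodings) instead of relying on a database's automatic value indexes.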
Later XML- and NoSQL-related papers: * Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150069 (Here a NoSQL approach, Couchbase, including full openEHR data model content, beats an RDBMS in many ad-hoc population-query cases.) * ORBDA: An *open* EHR benchmark dataset for performance assessment of electronic health record servers https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190028 (Here you get a huge dataset in openEHR format to play with, and as a side finding a hint that ElasticSearch is worth exploring further.) * Querying Archetype-Based Electronic Health Records Using Hadoop and Dewey Encoding of openEHR Models http://ebooks.iospress.nl/publication/46372 (How to optimize archetype-aware indexing in systems based on relational algebra, for example RDBMSs and Hadoop.) To avoid mistakes, please _read_ the entire mentioned papers _BEFORE_ possibly commenting or concluding things about them. --- ## Post #51 by @Bert_Verhees "Did you actually read through our XML-related papers or just look at the graphs?" Really, is that the tone you want to have this discussion in? Is me criticizing a report of yours still eating at you after all this time? Are you using this tone to challenge me? Okay, I will give you what you wish. On private email. Have a nice day --- ## Post #52 by @erik.sundvall Hi again Bert! I understand that for example the “…or did you just look at the graphs” and similar things can be provoking - sorry about that and whatever else in my response annoyed you. I apologize. Most of my questions were serious though. You wrote “he concluded things about XML databases had many short comings” and yes, that actually triggered me, since the paper, even though hampered by the page limit, is still fairly careful and specific about what limited things can be concluded from the research.
I think you claimed the paper concludes things it does not; that’s why I asked you to re-read it and to specify if something was wrong in the conclusions. This is how the “conclusion” section and the whole paper ends: “For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanism for openEHR-based systems in production if population-wide ad hoc querying is needed.” If you don’t want to spend more energy on this, it’s fine. Now you at least have my apology and an explanation of a central thing I found unfair in your critique and got triggered by. Happy new year! --- ## Post #53 by @Bert_Verhees Same to you and all other readers --- ## Post #54 by @Bert_Verhees Hi Erik, I responded to your paper many years ago; in the context of that discussion and era my views still stand. But it is an old discussion with no relevance to current technical development in databases. Searching for a solution in SQL only or XML only is an unnecessarily restrictive way to think about a good architecture. So comparing both solutions is not very useful. Nowadays I think a NoSQL database for clinical use, maybe optionally supported by a CQRS structure for datamining, is the best way to go. And I prefer to run it in the cloud, e.g. AWS, because there is native support for this architecture and also many ways to implement security, encryption at rest and in flight, and fine-grained authorization structures. Best regards Bert Verhees --- ## Post #55 by @Colin_Sutton Hi Bert, I like your suggestion of using two databases, one for clinical use and one for querying. It should be possible to keep them in sync, writing to both. The initialisation of one from the other after a failure would be interesting. -- Colin --- ## Post #56 by @birger.haarbrandt Hi Colin, I think it depends on the use-case.
So far, storing and querying data works quite well for us using Postgres only. If we wanted to introduce high-performance cross-patient queries, I think we would have to separate the data anyway (because we don't want to lock the database in production), and then we would have to decide whether we still need AQL for analytics (which I think makes sense in cross-enterprise scenarios with distributed data, but might not be the optimal solution for local needs). Cheers, Birger --- ## Post #57 by @Bert_Verhees Hi Colin, it happens in places where the data by themselves are hard to query. Imagine an environment where it is very important to store data in exactly the same way or order as they arrive - for legal reasons, for example. Healthcare data. But at other times we need databases for population-wide research, which has completely different technical requirements and maybe does not need the semantics which are stored in the archetypes. Do the separate databases need to run in sync? That depends on the use case. Mostly not in the same second, and mostly not even on the same day. The technical solution for syncing databases is to propagate data by messaging to the other databases. --- ## Post #58 by @Prabhat_MOHANTY I sincerely think NODEJS could do better than what you are predicting. I am betting on NODEJS for the high-traffic transactions and on Python (at the moment) for the CPU-intensive work, e.g. medical image/DICOM processing, on the platform I am working on. Kubernetes is good for NODEJS and Python as well. Yes, I am relying on microservices as a broad strategy. Note: mine is a smallish data-science-driven PHR application aimed at individuals, not at organisations/providers. --- ## Post #59 by @Bert_Verhees I agree; regarding performance it doesn't matter anymore which programming ecosystem you use. If it does not perform well, just add or rent some extra hardware.
The choice of a programming ecosystem is a cultural one, and a matter of which tooling and libraries are available. Some people just like JavaScript, some hate it. I hate it, so it will not be my choice. But I like TypeScript more, which gets transpiled to JavaScript. In the end, though, I would write in Java and run it in a serverless cloud architecture like AWS Lambda. --- ## Post #60 by @Prabhat_MOHANTY AWS Lambda supports most programming languages, including Python and JavaScript. And I deliberately stay away from Java (though I am willing, and learning Rust for device programming)! :) --- ## Post #61 by @Bert_Verhees Completely true, that is the good thing about it. There is nothing wrong with JavaScript or Python, they are just not my favourites. --- ## Post #62 by @Prabhat_MOHANTY In my days, back in 1993-94, people copied JavaScript animations for their websites; I hated that and started ignoring JavaScript. Today, I mostly use Python for quick/overnight POC design and my data science work - NLP and computer vision. I realised the JavaScript V8 engine had become very fast and powerful in the years I stayed away from IT. Now I have started using JavaScript for the transaction-heavy work on NODEJS and for interface design. Take a fresh look at JavaScript. The web is here to stay, along with JavaScript. In fact I am using JavaScript outside the web, on devices too (I am still very raw in Rust). --- ## Post #63 by @Bert_Verhees Good advice, thank you --- ## Post #64 by @woodpk @pablo, I struggle with this. I am new to micro-services and am still learning, but my current concept of EHR doesn't seem to fit with the general micro-service model. From my understanding, each micro-service (bounded by a business context that is determined by behavior/function) owns its own data. The openEHR model of having one massive data repository would seem to violate this?
How can one conform to the core tenet of micro-service architecture (services that have independent behavior and control their own data) and also still maintain the outward appearance of a consolidated data store built with openEHR archetype components? What are your thoughts on this? Thanks for your thoughts, @woodpk --- ## Post #65 by @pablo Hey @woodpk, I posted an article related to this topic a while ago https://www.cabolabs.com/blog/article/microservice_architectures_and_open_platforms_for_health_care_information_systems-5cab9ab60d88a.html About your question: one thing is to focus on one business area per micro-service, and another is how "massive" the data repositories are. You could have a lot of data related to one business area; for instance, the clinical data repository is one area and is massive. But, in general, you will have many systems in a hospital architecture, each of which could be focused on one business area; all can exchange the information they manage, and all can have their own repositories. What you won't have is one database with all the data from all the components, if you are following the micro-service approach. On a lower level, you might want to break the clinical repository into different repositories, for example because it could optimize usage (like using shards divided by range of time), or store different information "types" in different repos. At the end of the day, you will find most clinical info is "documents", and as such the clinical repository will be a massive document store, but still focused on one business area. Hope that helps :) --- ## Post #66 by @woodpk @pablo Thanks a lot for such a fast and thought-out reply. I can see how different subsystems could easily have their own openEHR data backend. I have been looking into the Task Planning aspect of openEHR. I see these task planning systems as Sagas.
Udi Dahan talks about (I might be misinterpreting this) Sagas being a better definition for a bounded context than what most people might use (e.g. a Client/Patient Manager or another service defined in terms of a data model). It just seems like the idea of complex Sagas, with the data they -own-, would naturally “break up” or overlap the data sovereignty of the mass openEHR CDR document store. Do you think this issue could be solved by having the back-end store for each “service” be of the openEHR CDR type? Would this even be the case for billing and other such services? I know these are too many questions to discuss adequately in a forum post, but I appreciate your attention and thoughts. --- ## Post #67 by @woodpk Out of curiosity, has anyone ever attempted or thought about an event sourcing / event store strategy for OpenEHR persistence, @ian.mcnicoll? --- ## Post #68 by @Bert_Verhees I worked with event sourcing in a legal context. The concept would also work in a clinical context, because not only the state is important but also how we got there. So it could make the data-version system of OpenEhr superfluous. That would be a relief. On the other hand, it should not occur very often that versions are stored. A lot of queries are about state, and what I saw in event store setups was shadow databases that stored the current state where applicable. I wonder if an event store adds something to OpenEhr. I can imagine one strong point: event stores can handle long-lasting transactions. Sometimes in banking and business there are transactions that can take days, but more often minutes or hours. OpenEhr does not really support that. When the Composition is ready, it will be stored in a microsecond transaction. More ideal would be that different parts of a Composition are prepared for transaction immediately after they occur. Then it becomes possible to store unfinished Compositions, i.e. events (e.g.
Observations), in case something happens which makes it impossible to finish the Composition, even when the client system goes offline. Another use case for long-lasting transactions in a clinical context is that Compositions really can take hours or days before they are completed. For example, labs can take a long time, but semantically (one could argue) they belong to the Composition which holds the clinical context for the lab. Blockchain could also be a great help in such situations. --- ## Post #69 by @woodpk @Bert_Verhees One of the things I was thinking about / wondering, given my understanding of the openEHR system structure, is how the main EHR 'database' operates under significant load from multiple users across regions. I'm sure I'm not seeing this correctly, but isn't the main EHR database a significant single point of failure? I was thinking that using the CQRS/event-sourced type of architecture, not only would you have the openEHR audit system taken care of by default, but you would also be able to partition out your data into actual microservices (services that own their own data). I suppose the thing I am having the hardest time wrapping my head around is that, while I have read @pablo 's article on a microservices implementation, this does not meet the definition of a microservice as I understand it. The main EHR database seems like it might be a big monolith that all other services directly depend on? Is my perception of this wrong? Are there things I am not seeing in the openEHR architecture? I really appreciate the feedback both of you have given thus far. It is helping me understand the overall system better. --- ## Post #70 by @Bert_Verhees A database is always a possible single point of failure; from that perspective OpenEhr is not unique. But that risk is very much mitigated in cloud technology with mirrored databases in several geographical locations, and also a local mirror for when the internet goes down.
Your kernel should also run in the cloud in several geo-locations. The API will send you through to the best mirror for that moment. This is not OpenEhr-specific, and we see that not all hospitals take the necessary precautions. We had a hospital in the Netherlands last week which had big problems in the server room and was almost a full day without data. This can cost human lives. Whether OpenEhr or another system, these kinds of problems are not necessary anymore, and a hospital suffering them without having taken precautions must be held responsible for loss of life if it occurs. As I wrote, an event store would be a good idea for OpenEhr, in fact for all kinds of clinical engines, but you need to solve the CQRS part, in the same way you have to solve it without an event store. That is mainly by having an AQL engine, which is not very hard to arrange on a NoSQL database. Because you do the queries against a database in the cloud, serverless, you don't need to know which mirror you are addressing; for all these kinds of situations there are intelligent solutions in the cloud. --- ## Post #71 by @erik.sundvall Hi! [quote="woodpk, post:64, topic:117"] but my current concept of EHR doesn’t seem to fit with the general micro-service model. From my understanding each micro-service (bounded by business context that is determined by behavior/function) owns its own data [/quote] For many use-cases for a specific healthcare provider (or region) I'd say that having a single _logical_ EHR/CDR database is often a reasonable approach; you could of course split it physically using sharding of your own, or automated distribution/sharding built into general data-backend products. Other things than the EHR database/storage in an openEHR platform could be split into micro services though.
Some suggestions are available in https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-57 ![](upload://k7AlSgyJhtshrPjDGBhQLLMOGvv.jpeg) It would probably not be so great to put _every_ box in that image into its own micro service, but it may give some thoughts regarding possible things to group/split. If you want to go by an "owning data" split, then the Demographics, the "Contribution builder" and some caches/trigger-handlers could be suitable pieces. The paper above also mentions some sharding options. Please note that the paper pre-dates most openEHR REST implementations and specifications, so the URLs and several other things/thoughts do not match the current openEHR REST standard. --- ## Post #72 by @birger.haarbrandt Hi everybody, I think it generally makes sense to do horizontal partitioning on the EHRs. What I think could require some more attention/implementation is "distributed" versioning. This would work similarly to git: there is a master CDR, and the slave CDRs (like the database of a particular application system) could clone from the master. Then commits can be made on the slave systems and merged into the master. In the architecture overview, this is briefly described as follows: > some EHRs in an EHR server in one location are mirrored into one or more other EHR servers (e.g. at care providers where the relevant patients are also treated); the mirroring process requires asynchronous synchronisation between servers to work seamlessly, regardless of the location, time, or author of any data created. Plus the link to the spec: https://specifications.openehr.org/releases/RM/latest/common.html#_distributed_versioning I think it would be great if this could be implemented across openEHR vendors soon, so that we can really start to sync between vendor-independent CDRs and reach the highest scalability.
Best, Birger --- ## Post #73 by @Bert_Verhees It is all arranged by the cloud providers; you don't need to worry about that as a customer. Even if you have local databases, synchronisation can happen as a cloud service, because you can add local machines to the cloud mechanism. Bert --- ## Post #74 by @Bert_Verhees Thanks for this interesting contribution Bert --- ## Post #75 by @erik.sundvall [quote="woodpk, post:69, topic:117"] how the main EHR ‘database’ operates under significant load from multiple users across regions [/quote] @woodpk for some published performance tests see the links at the end of post #50 [Dec '19](https://discourse.openehr.org/t/best-database-for-openehr/117/50?u=erik.sundvall) above. * https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150069 - e.g. Couchbase * https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190028 - e.g. ElasticSearch (+huge dataset) * http://ebooks.iospress.nl/publication/46372 - e.g. Hadoop --- ## Post #76 by @rtweed https://discourse.openehr.org/t/qewd-js-great-video-on-vm-style-dbs/379 --- ## Post #77 by @Bert_Verhees Good framework, abstracting the database layer and making quick development possible. IMHO, the best suggestion in this thread. --- ## Post #78 by @Pourya_Nasimi Hi, what is your opinion of Apache HBase? We are developing an OpenEHR-based system and using HBase for our DB. It would be great to have @thomas.beale's opinion too. Thanks --- ## Post #79 by @thomas.beale I can't comment on the basis of any experience with Apache HBase, but I read that it is conceptually Google's Bigtable idea over Hadoop and HDFS. So the question of whether this DB is any good for openEHR can really only be answered once you determine your schema approach. If you are going to use a 3NF approach, or close to it, I would not guess there is much advantage, and indeed standard Postgres is probably easier (you'll want inheritance support between tables, for example - I don't know if HBase has that).
On the other hand, if you are thinking of some esoteric schema architecture that creates large tables, maybe based on the idea of archetypes as table definitions (i.e. equivalent to classes), then it might indeed be interesting (templates then become views of archetype table projections; one view per template in the system). I think some Chinese openEHR projects have followed an approach like this. I'd consider it a research question rather than a production approach, but you may know better. I'd be very interested to see it actually working! The other thing HBase could be good for is as a target for ETL extracts for 'study' purposes, i.e. implementing SDTM tables for research. Each ETL target then looks more like a 'big table', i.e. a big data resource on which to do some data mining or Bayesian logic or whatever. Certainly interested to know your experience. --- ## Post #80 by @birger.haarbrandt Hi Pourya, I cannot comment on HBase, but I'm interested whether you tried EHRbase with its Postgres approach. I would highly appreciate learning about your findings and what criteria led you to build your own openEHR server (of course, creating your own commercial IP is a sensible reason). --- ## Post #81 by @ian.mcnicoll Hi Pourya, Just to back up what Birger has said. Of course it is always great to see new openEHR CDR datastore implementations, and the specs are there to be used, but it is a pretty challenging engineering task, and I always suggest to people interested in doing that to work for a bit with an existing openEHR CDR, so that they get their head around a somewhat different paradigm of data management. That is particularly true if you have an application waiting on the CDR to be developed, but even if the primary goal is just to build a CDR, a bit of time working with an existing product will give you a much better feel for the challenge and potential solutions.
Of course you may know all that and want to press ahead, in which case feel free to keep asking questions, and good luck!! --- ## Post #82 by @domenic I'd like to suggest considering a Distributed SQL database for this project. If you're not familiar with this database category, it includes databases that are built natively as distributed systems. As "distributed-native" systems, they have that in common with NoSQL databases like Cassandra and MongoDB, mentioned in earlier comments here. But this category of databases also supports "SQL", as the name indicates. Specifically, "SQL" refers to relational algebra and support for ANSI-standard SQL. In the past, we as an industry moved to NoSQL systems with the idea that relational databases could not scale cost-effectively, or at all in some cases. For the RDBMS technology at that time (~2007) that assertion was true. But now Distributed SQL databases have solved that problem and can effectively scale relational joins in a distributed system. Distributed SQL databases are easily adopted partly because they use existing common database protocols, viz. MySQL and PostgreSQL, so they work with existing drivers in any programming language, like Java, and with ORMs like Hibernate, mostly out of the box. Distributed SQL databases are also multi-model these days. That means that they support documents and JSON as first-class citizens just as well as traditional relational tables and rows. Because of these capabilities, Distributed SQL databases are replacing the last era of NoSQL databases like MongoDB and Cassandra, as well as RDBMSs like Oracle, SQL Server and IBM DB2. Some examples of Distributed SQL databases are SingleStoreDB, CockroachDB and YugabyteDB. (Disclosure: I work for SingleStore.) --- ## Post #83 by @sebastian.iancu Thank you Domenic for your thoughts. Do you have any experience, insights or results (that you can share) about implementing openEHR on SingleStoreDB?
--- ## Post #84 by @adebyrne I hope it's ok to bump this one back up. @ian.mcnicoll and I have spoken about it before, and although I'm no expert by anyone's measure, it's an area that has interested me and concerned me about OpenEHR. I've seen the comments on the scalability of so-called XML databases. My info is well out of date, but in the early shared care records we used XML and had to separately index the metadata, or pull it out and re-parse it. None of this was ideal, and it certainly was not scalable. In the end we had a separate analytics database with a lot of indexing, and an operational database with a few searchable fields. Then I spoke to various people about how the thing is database-agnostic, but thought that was something of a sham, as effectively you're just storing big blobs of data and metadata, and probably aren't using many of the tuning features of the database, as it doesn't really understand the data or have a map of it. I remained sceptical. However, I recently had a conversation with someone about vector databases, and although it's yet another thing I know nowt about, I thought: aha, is this the thing we need to make OpenEHR a scalable technology for some of the bigger real-time datasets I envisaged? In the end then, a simple question: has anyone considered using these vector technologies [or others I don't know about] to host an OpenEHR resource? Am I on the right lines or off beam? I am going to the event in Munich in Feb, so am just doing a bit of research prior to that. And apologies if opening an old thread is poor form Ade --- ## Post #85 by @stefanspiska @adebyrne If you want to answer the question **"Which patients are smokers and have systolic pressure ≥ 140 in the last year?"** then you need to: 1. Find the smoking information for each patient and filter on the condition "is a smoker." 2. Find all blood pressure measurements for each patient and filter on the conditions "systolic pressure ≥ 140" and "creation date ≥ current time - 1 year." 3.
Create a join between them on the relationship that both belong to the same patient. 4. Add a "distinct" operator on patients. This process is independent of any database technology and is purely logical. --- **Legacy Approach (Relational Database)** If you use a legacy approach, a database engineer would design your table structure in a relational database. You would probably end up with two tables: 1. **Smoking Status:** * patId (UID) * status (Boolean) 2. **Blood Pressure:** * patId (UID) * systolic (Integer) * creation_date (Timestamp) Finding where the data is stored and performing a join is thus straightforward. The process can be optimized by using table statistics. If performance is still insufficient, a database engineer can analyze the query plan and add indexes where necessary. --- **Document-Based Data Standards** When using document-based standards like **openEHR**, **FHIR**, or **CDA**, the idea is that data modeling is done by medical specialists based on an abstract base model. This can lead to challenges if one uses a naive approach, such as having all data stored in large, unstructured "blobs." As a result, even identifying where smoking data and blood pressure measurements are stored becomes difficult, since documents tend to be quite different. But there is a solution: you can leverage the semantic bindings of these documents, such as: * **Archetypes** * **Bindings to terminologies** For example, EHRbase uses a hybrid approach where documents are not stored as one large blob but are divided into paths, with just the leaves stored as JSON. Specific archetypes can thus be indexed, allowing for quick retrieval and search. --- **Vector Databases** [Vector databases](https://www.elastic.co/what-is/vector-database) are particularly useful for **semantic** or **similarity searches**. They can quickly answer questions like: * "Which documents (or parts of a document) concern smoking?"
* "Which documents (or parts of a document) concern blood pressure?" * "Which documents (or parts of a document) concern **both** smoking and blood pressure?" This is interesting for the medical domain, where much data remains unstructured. However, semantic searches are distinct from analytical queries such as: * "Which patients are smokers and have systolic pressure ≥ 140 in the last year?" --- **Operational vs. Analytical Databases** It’s important to distinguish between **operational** and **analytical** databases, as their requirements differ: **Operational Database:** * High availability * CRUD operations and single-patient queries (focused on the current patient) * Strong consistency **Analytical Database:** * Supports complex queries on historical data * Eventual consistency is acceptable Using separate operational and analytical databases with different technologies is often a good architectural pattern. --- ## Post #86 by @thomas.beale [quote="stefanspiska, post:85, topic:117"] For example, EHRbase uses a hybrid approach where documents are not stored as one large blob but are divided into paths, with just the leaves stored as JSON. Specific archetypes can thus be indexed, allowing for quick retrieval and search. [/quote] Excellent post. BTW I see that EhrBase is now using YugaByte 'wrapping' of Postgres. I had a look at the current DB schema init files - as far as I can see you are using jsonb (JSON binary blob) representation for Composition.content, path-based index fields, and mostly orthodox tables for the other high-level structures. Is this correct? If so... very interesting, it's an approach I proposed 20y ago ;) Performance test results would be very interesting... --- ## Post #87 by @Alex_Travis The hybrid approach mentioned, such as EHRbase using YugaByte with JSONB for path-based indexing, is promising and reflects the evolution of ideas over decades.
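The four logical steps described earlier (filter smokers, filter recent high readings, join on patient, deduplicate) can be made concrete against the two-table relational layout from the same post. This is a sketch only, using SQLite as a stand-in engine; the table and column names follow the post's illustration, not any real CDR schema.

```python
# Sketch of the "smokers with systolic >= 140 in the last year" query
# using the two-table legacy layout (Smoking Status, Blood Pressure).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE smoking_status (pat_id TEXT, status INTEGER);
CREATE TABLE blood_pressure (pat_id TEXT, systolic INTEGER, creation_date TEXT);
""")
conn.execute("INSERT INTO smoking_status VALUES ('p1',1),('p2',1),('p3',0)")
conn.execute("""
INSERT INTO blood_pressure VALUES
  ('p1', 150, date('now','-30 days')),  -- smoker, high, recent: matches
  ('p1', 145, date('now','-60 days')),  -- second match; DISTINCT collapses it
  ('p2', 120, date('now','-30 days')),  -- smoker but normal pressure
  ('p3', 160, date('now','-30 days')),  -- high pressure but non-smoker
  ('p1', 180, date('now','-2 years'))   -- high but outside the one-year window
""")

# Steps 1-4: filter on "smoker", filter on systolic and date,
# join on patient, add DISTINCT on patients.
rows = conn.execute("""
    SELECT DISTINCT s.pat_id
    FROM smoking_status s
    JOIN blood_pressure b ON b.pat_id = s.pat_id
    WHERE s.status = 1
      AND b.systolic >= 140
      AND b.creation_date >= date('now','-1 year')
""").fetchall()
print(rows)
```

The point of the post stands out in the sketch: once the engine knows which table and column hold each fact, the query is trivial and the optimizer can use statistics and indexes. The hard part in a document-based CDR is getting the data into (or indexed as) a shape where those WHERE and JOIN clauses can be expressed at all.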
It’s also intriguing to see how vector databases might fit into this landscape, particularly for semantic searches in unstructured medical data. However, there’s still a clear distinction between such innovative methods and traditional analytical queries that require operational consistency. On another note, the [cost of implementing ehr in hospital](https://www.cleveroad.com/blog/ehr-implementation-cost/) remains a significant barrier, especially for smaller institutions. While these advanced technologies offer exciting possibilities, their adoption will depend heavily on balancing technical benefits with affordability and long-term ROI. Performance test results from implementations like these could help build the case for broader adoption. --- ## Post #88 by @adebyrne Thank you. The examples usually given are "find me a bunch of patients with this and that", where I'm generally not too bothered about performance and could even wait overnight. As you say there are differences between operational and analytical designs, and I'm always wondering whether the OpenEHR system can satisfy this requirement here. I don't necessarily agree that these are single-patient queries. In outpatients for example you might be working with some really complex lists that need to refresh in real time. I've seen well-designed databases floored by the queries that support this. I don't really have a problem with the fact that well-designed archetypes etc can support the data queries. I'm still wondering whether I'm going to need a database analyst to decide how to map data and index/tune a platform for these operational purposes. I've already made clear I'm no expert. Not a DBA and not a professional analyst by any stretch. My Codd and Date is well rusty :slight_smile: I have however been around when operational databases have been flattened by patterns of use and hit walls when being scaled up. It's quite a scary place.
The people I worked with were brilliant, and always seemed to optimise things and plan ahead pretty well. It just comes back to this: I am trying to see whether I can be confident in OpenEHR for these purposes, and was just pondering vector databases as a way to support it, but from a very amateur standpoint. Ade --- **Canonical:** https://discourse.openehr.org/t/best-database-for-openehr/117