# openEHR prototype

**Category:** [Implementers (archive)](https://discourse.openehr.org/c/implementers-archive/158)
**Created:** 2013-04-02 15:16 UTC
**Views:** 4
**Replies:** 58
**URL:** https://discourse.openehr.org/t/openehr-prototype/15233

---

## Post #1 by @Ze_Silva

Hi everyone,

I will develop a prototype in openEHR using openEHR.NET, the C# implementation. Basically I want to build a web service on top of an openEHR repository. I have studied relational, XML and object databases. Recently I discovered NoSQL databases. This type of database seems to match the openEHR objectives. There are four types of NoSQL databases: key/value, column-family, document and graph. I think NoSQL document storage is the best type to use.

Does anyone know anything about this that could help me?

Thanks in advance for your time,
José Silva

---

## Post #2 by @Seref

Hi Jose,

Do you have a list of components you'll need to develop to implement your prototype? Can you list them?

Regards
Seref

---

## Post #3 by @Ze_Silva

Hi Seref

I didn't understand your question very well, but I will answer in two ways:

- I want to create an openEHR database and one web service for this repository. I don't want to build or care about the interface. The objective is that several applications could use this web service. In future it should be possible to convert openEHR into other standards to communicate with other applications that don't "talk" openEHR, but for now this is not important.
- For now I will use only therapeutic and administrative prescriptions.

Regards
José Silva

---

## Post #4 by @Seref

Ok, you're basically talking about implementing an openEHR repository. You asked: "Does anyone know anything about this that could help me?"

If you don't care about the access mechanism to your repository, you are free to use anything. If you have more specific requirements for saving to and reading from your repository, data size, transaction support etc, then you're better off with some options than others.

---

## Post #5 by @Ze_Silva

Right, this question was about the repository. I am thinking of using NoSQL document storage like MongoDB. Am I choosing the right type? My repository must be prepared to handle a large amount of data, and query response speed is a bit more important than storage speed because there will be more queries than inserts/updates.

Regards,
José Silva

---

## Post #6 by @Seref

Jose,

Don't get me wrong, I have no intention to offend you, but I think you're trying to do something without adequately understanding what you're trying to do. EHR persistence in general is probably the most complex component of EHR implementation, both for openEHR and similar standards based data.

If you want to know if you're choosing the right type of persistence technology, you (or anybody else) can only know this if there is a persistence strategy/design in place. You're going to have some representation of openEHR based data, and you're going to build a mechanism to persist and to read it. There are many ways of doing it. You can use something like db4o, mongodb, relational dbs... There is a huge missing block in your thinking and in your plan, and you're asking questions which can be answered only if someone knows what is in that missing block.

My suggestion is, try to implement the simplest possible method of writing data to somewhere (even a file would do) and read it back somehow. See what it is like to have a complete data loop. I started with db4o about 5 years ago. Do it in the fastest way you can, to see the whole picture, then start identifying the components and optimizing/replacing them.

Brutal honesty warning: It takes years of work to be able to answer your questions, and it is very valuable information, unlikely to be provided for free. So start building a small prototype, and ask questions as you encounter issues.

Regards
Seref

---

## Post #7 by @Ze_Silva

No, it's ok Seref. You are right, I'm new in this field and I'm trying to start to do something.
Thank you for your advice.

Regards,
José Silva

---

## Post #8 by @ANASTASIOU_A1

Seref, I certainly wish I had received a response like this one 5-6 years ago! :-D

Jose, in (small) addition to what Seref says, your "problem" is NOT mongoDB... Your "problem" is not 100% technological.

The first thing that you have to do is understand the openEHR data structure and how a tiny little humble number (like a single OBSERVATION for example, just one number) is stored and associated with the rest of the data in a subject's EHR. Do this mentally, based on the specification documents. Don't worry about technology. If you had to do this "by hand" how would you do it? Where do you start? What do you need? What do you need first, second, third, etc. And then try to generalise to how you would do it for Archetypes & Templates of ANY structure.

What you are dealing with here is a Dual Model (it specifies both a Reference Model (RM) and an Archetype Model (AM)). The Reference Model specifies abstract data structures (like a list of numbers for example, or a table, etc) and the Archetype Model specifies how these abstract data structures are pieced together into even larger structures that support a specific use-case. (And Templates do the exact same thing by piecing together different Archetypes.)

The key point here is that the AM contains a huge amount of extremely important data: _Constraints_. And it is not always possible to map an Archetype's (or Template's) constraints to whatever similar mechanism some DBMS is using. It's not always as simple as "NOT NULL"... It's more like "This list of numbers should be between 4 and 8 items long and each entry of this list, being a number, should have constraints (`lowLimit <(<=) x <(<=) highLimit`) that depend on the unit that the user will select!!! (and which units are allowed is also specified BY the Archetype!!!!!)".
In other words, an Archetype may allow you to specify "length" in feet / inches / meters / cm / etc and the constraint "0-1 meters" has a different "physical" representation (0-1, 0-100, 0-3.9, etc) depending on the unit... This "detail" cannot be ignored or simplified... This specificity is the actual objective.

Once this way of describing data is specified, once this huge data structure is in place, you can't just leave it there. You now need a way to query it. This is where the Archetype Query Language (AQL) comes in. This is a project on its own (I am not joking). You have to parse AQL and then plug the parameters into a function that will actually implement the query. The "best case" scenario is one where you can "translate" a query from AQL to whatever a DBMS is using... But that can turn into a bloodshed pretty quickly too, so better keep AQL on the radar from the beginning.

Once you have your data in place, all impeccably organised and queryable... you can then start using it to actually generate some useful information... This is where GDL comes in (http://www.openehr.org/news_events/releases/20130311)... But we are already in 2020 by now (and everything has gone beautifully, ideal and full-time)... so let's come back to what you are trying to do.

In general, even ignoring the GUI part, you will end up implementing an openEHR DBMS to a certain extent. Whether it's totally file based like Seref is proposing or it makes use of facilities provided by some underlying DBMS, you will end up implementing functionality that CRUDs (Create, Read, Update, Delete) this large data structure specified over several PDFs on openEHR.org. The mongoDB business is a tiny little branch of the tree (not insignificant). You have to stand back and appreciate the bigger picture because this will save you a huge amount of design-redesign cycles (and the more you have built, the worse the "tearing down" is).
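To make the unit-dependent constraint from this post concrete, here is a minimal Python sketch. It is not openEHR code and uses no openEHR library: `Interval`, `LENGTH_CONSTRAINT`, `quantity_is_valid` and `list_is_valid` are hypothetical names, and the table only encodes the "0-1 meters" example in two of the units mentioned above. The point is that the allowed range is looked up per unit, which a plain column constraint such as "NOT NULL" cannot express.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Interval:
    """Closed numeric range, low <= x <= high."""
    low: float
    high: float

    def contains(self, x: float) -> bool:
        return self.low <= x <= self.high


# Hypothetical constraint for a "length" element: the set of allowed units
# and, for each unit, the range that corresponds to "0-1 meters".
LENGTH_CONSTRAINT: Dict[str, Interval] = {
    "m": Interval(0.0, 1.0),
    "cm": Interval(0.0, 100.0),
}


def quantity_is_valid(magnitude: float, units: str,
                      constraint: Dict[str, Interval]) -> bool:
    """A value is valid only if its unit is allowed AND it fits that unit's range."""
    interval = constraint.get(units)
    return interval is not None and interval.contains(magnitude)


def list_is_valid(magnitudes: Sequence[float], units: str,
                  constraint: Dict[str, Interval],
                  min_items: int = 4, max_items: int = 8) -> bool:
    """Check the list length (4 to 8 items in the post's example) and every entry."""
    if not min_items <= len(magnitudes) <= max_items:
        return False
    return all(quantity_is_valid(m, units, constraint) for m in magnitudes)
```

With this table, 50 is acceptable in "cm" but not in "m", and any value in a unit the table does not list is rejected regardless of magnitude.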
Of course, you can always scour the XML of an Archetype (AT SOME VERSION!!!!!), grab all the paths and then assign values to those paths in a key/value kind of way and query the graph using the query language of the DBMS...... Solved. (???????) (Maybe this is a way to handle JUST the "last 10 meters" of the persistence)

I am in no way trying to scare you off but as Seref says, you need to understand what is going on and you need to do this before you write even a single line of code; this will bring the right questions to the surface.

I hope this helps.

All the best
Athanasios Anastasiou

---

## Post #9 by @pablo

Hi Jose,

I think trying & learning from current openEHR open source software is the best first step for what you are trying to do.

Ing. Pablo Pazos
www.cabolabs.com

---

## Post #10 by @Seref

Actually, I'd still suggest that he tries it on his own first. It takes about 2 years to learn how to write code properly with a language. It takes 20 to learn how to read code. Due to the massive amount of frameworks and concepts, today's code is a whole different challenge compared to 20 years ago. He'd lose lots of time just setting up stuff, following function calls in source code etc etc. My humble suggestion to Jose is write first, read later.

---

## Post #11 by @Robert_Stark

Hi everyone

I don't know if it's just me, but I'm not sure that Jose's question really got answered. I myself am starting to dive into development of an openEHR system and I found some of the comments and responses to Jose's original question a little puzzling. Specifically the statement "It takes years of work to be able to answer your questions, and it is very valuable information, unlikely to be provided for free". Seref, I have read your blog and seen your postings, and I respect your knowledge of openEHR and contributions. But I don't know why asking about database implementation would take years to answer and not be provided for free.
I think he just wants an opinion from those who have implemented a NoSQL datastore. Maybe I'm not understanding this correctly (so please correct me if I'm wrong), but I think that we need to understand that there are going to be all kinds of developers with different levels of experience. We want to attract as many of these people as possible and make it as easy as possible to start their journey into openEHR. And therefore, it would be great if someone who has used a NoSQL database like mongo could answer the question about the experience with it, whether it worked and what the pitfalls are. This kind of information should be shared as it increases the chances of mainstream adoption of the project as a whole.

openEHR can be quite intimidating even for experienced developers because of things like the dual model approach, AQL, ADL, Archetypes, Templates, etc. These are not ideas that are necessarily mainstream in regular software development. And so putting all of these things together is frankly intimidating since there is kind of a black hole on exactly how to implement all of this together. We want to show that implementation is possible and share ideas of how.

There are a lot of ideas about how to go about doing this, and it would be great if this kind of information was shared freely as our ultimate goal here is the same. My hope is that the openEHR community embraces this way of thinking as well.

Dr. Rob Stark

---

## Post #12 by @system

Hi!

Sorry for joining the discussion a bit late. I hope you don't mind if I split the discussion thread and (soon) start another separate renamed thread-branch regarding modularization. In [1] we discuss a modular persistence approach where NoSQL persistence approaches could also be plugged in.
Regarding NoSQL solutions, Sergio Freire has recently ended a productive post-doc year with us at Linköping University (we miss him already) and together we are now in the middle of experiments using different approaches:

- Some initial XML experiments were reported in [2]. As expected, the investigated XML databases did not scale for epidemiological queries if used in a simple, straightforward, non-optimized way, but some (e.g. BaseX) worked well for non-epidemiological queries where you already know the patient identity.
- Hadoop with an openEHR-path-specialized indexing mechanism and map-reduce (experiment led by Fang Wei-Kleiner)
- Couchbase with openEHR data stored in JSON format (experiment led by Sergio Freire)

All of the solutions above will likely be available as open source later, but as you probably understand they are experimental, incomplete research implementations done in very limited time with limited resources and thus far from ready for production use. The preliminary performance results so far regarding the last two are promising also for epidemiological queries. (Sergio is also exploring an RDBMS-based variant.)

I know that the vendor Marand has been exploring different NoSQL approaches too (including MongoDB) before settling on a well performing RDBMS-hybrid approach using an additional inverted index. This is mentioned briefly in an upcoming survey paper [3] regarding different openEHR persistence implementations used around the world.

Best regards,
Erik Sundvall
[erik.sundvall@liu.se](mailto:erik.sundvall@liu.se) [http://www.imt.liu.se/~erisu/](http://www.imt.liu.se/~erisu/) Tel: +46-13-286733

References:

[1] Sundvall E, Nyström M, Karlsson D, Eneling M, Chen R, Örman H. Applying representational state transfer (REST) architecture to archetype-based electronic health record systems. Accepted to BMC Medical Informatics and Decision Making.
2013; Preprint manuscript available via email request from [erik.sundvall@liu.se](mailto:erik.sundvall@liu.se). Limited parts of the paper are also described in chapter 3.2 of my thesis [http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-87702](http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-87702)

[2] Freire S M, Sundvall E, Karlsson D, Lambrix P. Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs. In: Proceedings Scandinavian Conference on Health Informatics 2012. Scandinavian Conference on Health Informatics 2012, October 2-3, Linköping, Sweden. Linköping: Linköping University Electronic Press; 2012. p. 51-57. Linköping Electronic Conference Proceedings, 70. Available from: [http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf](http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf)

[3] Samuel Frade, Sergio M. Freire, Erik Sundvall, José Hilário Patriarca-Almeida and Ricardo Cruz-Correia. Survey of openEHR storage implementations. Submitted to [CBMS 2013](http://cbms2013.med.up.pt/)

---

## Post #13 by @thomas.beale

Let's try to get a little perspective here. The problem of storing openEHR data in a database is the same as for any other so-called 'complex object data'.

The openEHR data are defined by the openEHR Reference Model. That has around 110 classes, including 28 clinical data types, and 40 descendants of LOCATABLE - this includes the demographic model classes. There are probably around 65 classes that 'matter' to normal developers. Most data are structures consisting of COMPOSITION / SECTION / various ENTRY types / HISTORY / EVENT / CLUSTER / ELEMENT, and most clinical data complexity ends up as just more CLUSTER / ELEMENT structures, i.e. 'free hierarchy'. This class model is pretty tractable for software developers.

In terms of persistence, the main complicating factor is the versioning, which although it is modelled in a specific way in openEHR (and includes the 'Contribution' concept, i.e.
change-sets), is far from an openEHR-unique feature.

So to implement an openEHR persistence solution, you just need to be able to store:

- all updates as change-sets, known as Contributions, where
- each Contribution consists of one or more
    - logical content item additions
    - logical content item deletions
    - logical content item changes
- and where each such content item is either
    - a COMPOSITION or
    - one of the type EHR_ACCESS, EHR_STATUS or FOLDER
    - in the case of demographics, a PARTY or PARTY_RELATIONSHIP.

I wouldn't regard this as a 'big problem' - and it's not specific to openEHR.

All of the 'complexity' of openEHR resides in how the data are understood at the next level, when created in the first place, and when retrieved from persistence. From the point of view of basic store and retrieve by e.g. some indexed UID, this doesn't matter. However it starts to matter when you get into retrieval by query. Then you suddenly start to care 'what the data mean' and about how to implement a query processor based on not just the structure (the reference model classes) but also on the 'soft typing' provided by archetypes / templates.

Seref's comments relate to implementing not just brute-force persistence, but also query-based retrieval, and doing both in a truly scalable and portable way. In other words, the requirements of the 'maximum capability' openEHR solution one could imagine.

Obviously it's reasonable to implement lesser things, and I would encourage this. The need for the two functions (persistence and querying) doesn't go away, but we can certainly make some simplifications if we don't assume the need for e.g. 20 million EHRs with sub-second query response from day 1. So I would encourage developers to focus on a) a working SQL or noSQL persistence solution and b) even a rudimentary AQL query processor.
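The change-set structure listed above can be sketched in a few lines of Python. This is an in-memory illustration only, under stated assumptions: `Version`, `Contribution` and `VersionedStore` are hypothetical classes (not the openEHR Reference Model classes of similar name), the change labels `"add"`, `"change"` and `"delete"` are simplified stand-ins for the additions, changes and deletions in the list, and content items are plain dictionaries.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Version:
    object_id: str          # logical content item, e.g. one COMPOSITION
    number: int             # 1, 2, 3, ... per logical item
    change_type: str        # "add" | "change" | "delete" (simplified labels)
    data: Optional[dict]    # None once the item is logically deleted


@dataclass
class Contribution:
    uid: str
    committer: str
    versions: List[Version]


class VersionedStore:
    """Keeps every version of every item; nothing is overwritten."""

    def __init__(self) -> None:
        self.contributions: List[Contribution] = []
        self.history: Dict[str, List[Version]] = {}

    def commit(self, committer: str,
               changes: List[Tuple[str, str, Optional[dict]]]) -> Contribution:
        """Apply a change-set atomically: validate everything, then store."""
        staged: List[Version] = []
        touched = set()
        for change_type, object_id, data in changes:
            if object_id in touched:
                raise ValueError(f"{object_id} appears twice in one change-set")
            touched.add(object_id)
            versions = self.history.get(object_id, [])
            live = bool(versions) and versions[-1].change_type != "delete"
            if change_type == "add":
                if versions:
                    raise ValueError(f"{object_id} already exists")
            elif change_type in ("change", "delete"):
                if not live:
                    raise ValueError(f"{object_id} has no live version")
                if change_type == "delete":
                    data = None
            else:
                raise ValueError(f"unknown change type {change_type!r}")
            staged.append(Version(object_id, len(versions) + 1, change_type, data))
        # Only reached if every change validated, so a bad item rejects the whole set.
        contribution = Contribution(str(uuid.uuid4()), committer, staged)
        for version in staged:
            self.history.setdefault(version.object_id, []).append(version)
        self.contributions.append(contribution)
        return contribution

    def latest(self, object_id: str) -> Optional[dict]:
        versions = self.history.get(object_id, [])
        if not versions or versions[-1].change_type == "delete":
            return None
        return versions[-1].data
```

A real repository would persist `contributions` and `history` in a database and wrap `commit` in a database transaction; the sketch only shows the shape of the data that has to be stored.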
By the way, a 'no SQL persistence solution' can be achieved with a normal SQL database like MySQL - just use blobbing. Or use one of the XML databases like http://www.exist-db.org/exist/apps/doc/ . I think the performance may be weak at the outset, but Erik Sundvall's and other groups have been looking at these solutions and gathering evidence as to what kind of performance can be expected and realised.

Once you decide to go no-SQL, either blobs, path-based, XML or whatever, the AQL query implementation becomes pretty easy (again, the hard stuff is the optimisation, not the core functionality) and I would encourage developers to try something here.

As a final suggestion, would it help to set up an 'openEHR persistence' wiki page and even mailing list to gather intelligence across the community and share it better?

- thomas beale

---

## Post #14 by @ANASTASIOU_A1

There already are a few good starting points for such a wiki page:

http://www.openehr.org/wiki/pages/viewpage.action?pageId=786487

http://www.openehr.org/wiki/display/resources/Persistence+FAQs

http://www.openehr.org/wiki/display/dev/Design+pattern+for+persistence (to a certain extent)

---

## Post #15 by @system

Hi Robert,

This is a paper that describes an implementation of openEHR and a querying system using Scala and MongoDB:

http://link.springer.com/chapter/10.1007%2F978-3-642-37134-9_15

Shinji

---

## Post #16 by @Ze_Silva

Hi,

I would like to thank you all for your opinions. I'm seeing that this community is very active and I'm learning too from your comments. Rob saw my point and I agree with his opinion.

I'm exploring the C# implementation from CodePlex to create my web service. I think my steps should be:

1. Choose the archetypes I will need;
2. Create one class for each archetype, so that it can be reused across several templates;
3. Create a repository to store this information taking into account what Thomas Beale said (like I said, I was thinking of NoSQL);
4. Store and query the repository, creating the respective modules;
5. Use the query results to build an XML message to send to the client that made the query.

Please correct me if I'm wrong or missing something. By the way, this project is for my master's thesis.

Regards,
José Silva

---

## Post #17 by @Seref

Jose,

I do not understand what you mean by #2.

Have you seen this?: [http://serefarikan.com/2012/11/08/openehr-for-practical-people-cleaned-up/](http://serefarikan.com/2012/11/08/openehr-for-practical-people-cleaned-up/)

Did you read this bit on Codeplex?: [http://openehr.codeplex.com/documentation](http://openehr.codeplex.com/documentation)

---

## Post #18 by @Robert_Stark

Thanks Shinji. Are you using any of these databases with your Ruby implementation?

---

## Post #19 by @Robert_Stark

Jose,

Your steps are a little vague so it's hard to know for sure what you are trying to do on a few of them. But I thought I might put this out there for others to look at what we are going to try and do. And by the way, thanks everyone for the pointers about NoSQL. I still have to read up on all this so I may find the answers I am looking for in those documents. Right now we are still researching.

So here is where we are starting with our Rails project. Please let me know if I am way off base on any of this.

1. We will develop specialized archetypes for the dental domain using the archetype editors. We can then use these in conjunction with openEHR archetypes to develop templates.
2. We are going to use the Ruby library (thank you Shinji!) to convert our ADL documents into object models in Ruby for the system to manipulate (create, edit, delete objects). Once a patient record is created, we will store it as JSON data in either mongo or couchbase. We haven't really decided on which one to go with yet.
3. We will try and use JSON formatted data when communicating with clients (iOS and browser).
So right now, where I am a little foggy is on how we are going to query and what we are going to do with AQL. I am still researching this. Of course I know we can query JSON data from the NoSQL database, but I have more to work on here.

Just wondering, Jose, why you would use XML for communication when you could use JSON, which is supported by many languages as well as mongo?

---

## Post #20 by @Seref

I know I'll sound grumpy again when I say this, but here we go:

Most nosql databases are quite young. They are all emphasizing the ease of scalability, and the relatively easier models of querying data. If anybody here is looking at a nosql db for a system that is going to be used for clinical care, my suggestion is to consider the ACID support and immediate consistency. Most nosql dbs rely on the replication argument to claim that durability is achieved via replication and the data is there, one way or another. However, not all installations can afford multiple servers, and most nosql servers also scale with eventual consistency. During clinical care, you can't rely on eventual consistency. You want immediate consistency.

Relational dbs are extremely good at ACID and immediate consistency, which is the reason they can't easily scale out: you need a global transaction manager to scale out with immediate consistency, and that is really really hard.

Please take a look at this before getting over excited about nosql db concepts: [http://en.wikipedia.org/wiki/CAP_theorem](http://en.wikipedia.org/wiki/CAP_theorem)

So far no one has beaten the CAP theorem. The nosql hype in a way reminds me of the perpetual motion machine claims that never die. Yet, the second law of thermodynamics is out there, unchallenged.

So if you're doing research work, where you are not building an OLTP system for clinical care, nosql may be great. But if you're responsible for systems that will support clinical care during the actual care process, I suggest a bit more thinking.
I'll simply post a link to another message I've written in the past for JSON: [http://lists.openehr.org/pipermail/openehr-technical_lists.openehr.org/2011-December/006406.html](http://lists.openehr.org/pipermail/openehr-technical_lists.openehr.org/2011-December/006406.html) Please check the bits about abstract types and concrete instances etc etc.

Without AQL, you simply can't have portable access to data. If I have to learn access method A to read from openEHR repo A and access method B to read from openEHR repo B, then how on earth are we going to have smart healthcare apps that can run across multiple systems? Adapting systems to each repository is so costly that it'll never take off. Just for the fun of it, google "curly braces problem" and see what that brings.

---

## Post #21 by @system

Hi Robert

Thank you for that input - I like your attitude and approach. People like Seref are deeply into implementation and find it a bit frustrating to start at the beginning. I am not sure that anyone has a working implementation with a non-SQL database, although I know Seref looked into it in detail. So I believe it is a new area.

The principles are available in the openEHR specifications and Erik Sundvall may have some knowledge of who is the go-to person in this area. Extending the openEHR XML schema to fulfill the requirements for a NoSQL approach might be necessary and I am personally very interested in how you decide what the indexes might be. I would suggest that the indexing should be reactive rather than set at design time - so any repeated query can be indexed for the values that are sought.

Anyone had any experience?

Cheers, Sam

---

## Post #22 by @Ze_Silva

Hi

Yes Seref, I read these links; maybe I didn't understand them well, I will review them.

Thank you for your reply Sam.

I was thinking of using XML because I think that I read somewhere that XML or dADL are the standard for communication in openEHR.
And if it is, I think that it is not good to use different languages to communicate between different openEHR systems. Using different languages would force other openEHR systems to understand each language, with different methods for each repository they need to communicate with.

Regards,
José Silva

---

## Post #23 by @Robert_Stark

Seref

Okay, I understand about the ACID support and the pros and cons of nosql databases. And I'm assuming that you favor a relational database. So are you saving the openEHR data as a blob or are you normalizing the data into tables? If storing as a blob (although I don't think this is a good way to go), then what format is the data in (XML, JSON, text, etc)?

To me, the reason why a nosql database is attractive is that the data does not need to be normalized into an RDBMS, and so it is more like true object persistence. This is nice because it decreases the complexity of the app and allows one to focus on the app itself and not on how to store the data.

If you could explain more about how you are using a relational database or point me to a resource, I would appreciate that.

Thanks,
Rob

---

## Post #24 by @Seref

Robert,

I actually do not prefer them, but due to historical and various other domain specific reasons, I kinda got stuck with them. Thomas has actually provided the fundamentals of blob based persistence in the wiki: [http://www.openehr.org/wiki/pages/viewpage.action?pageId=786487](http://www.openehr.org/wiki/pages/viewpage.action?pageId=786487)

In the past, I implemented a persistence model in Opereffa where a key value representation for each data commit was stored in an RDBMS. Opereffa now belongs to Charing Systems, and if you google for it, you can still access the source code that includes the db design. The problem with that was that it would not scale to the large data I need to process for the machine learning and mathematical modelling that I'm doing.
Currently I'm using a different method with RDBMSs, but it is unpublished work at the moment, so I can't share it.

Personally the biggest problem I'm seeing with relational dbs is the difficulty of scaling out for writes. There is a lot one can do for read scaling, but write scaling with a relational db is really hard, and so far nothing I've seen comes close to Oracle RAC, which is not the cheapest piece of software you can buy.

If you're not going to build a system that must support heavy writes with immediate consistency, an RDBMS is not a bad option. Most relational dbs can offer very impressive performance, but you have to invest in them. That is, you've got to know your db layer product really, really well. In the past, I've managed to cut query times from 3.5 minutes to 20 ms, but that takes serious work.

I keep going back to what I was trying to explain in the beginning: the choice of db layer is very much dependent on many key factors. The better one defines the requirements, the clearer the choice of db technology becomes.

Cheers
Seref

---

## Post #25 by @pablo

Hi guys,

About Jose's concerns, I think the best way to understand openEHR persistence complexities is trying things out. I agree with Seref that this takes a long time, especially for generic solutions. Specific solutions, like persisting data defined by a small set of archetypes, could easily be done in a short time.

About relational vs nosql: for a real solution, if scalability is taken into account, a mixed solution of relational for inserts and updates and nosql for querying would be a good solution, because writes to disk should be assured and that is not guaranteed by many nosql solutions. For a small set of users, relational with some level of normalization would work ok. Take a look at the Open EHRGen Framework as an example.

Ing. Pablo Pazos
www.cabolabs.com

---

## Post #26 by @system

Is scalability really an important argument for everyone?
I mean, a doctor only needs access to his direct patients in, say, half a year. He never looks at any other patients in his system. How many patients is that, 5000, 10000? Maybe even 20000.

The largest scale is hospital scale, but even in hospitals there are situations where distributed systems are desirable or even a fact. For National Health Services, you don't need central systems with millions of patients, but you need a good message system, and an index server, which is not a medical information system. For disease control, epidemiological warning systems, or for medical research purposes, you don't need to have access to all patients on a single cluster; it is enough to make smart use of semantic webs and/or eventually, distributed queries.

Even in hospitals: in the Netherlands, hospitals are markets for specialists, who work on their own account or in small business groups for hospitals and also inside the hospital buildings. Often they have their own information systems, and often with bad messaging. That needs improvement. The hospital itself also offers services, and has information systems for that. For example, financial accounting, medication, and beds and nurses. Only academic hospitals in the Netherlands have specialists in the service of the hospital, and they have central information systems, and there are advantages to that, but there are more ways.

Having 10 million patients on one machine cluster is the same as having 10000 patients on 1000 machines. Most of these patients live on more than one system: at the GP, dentist, local hospital, insurance, etc. So, those 10 million patients live on perhaps 5000 machines, machines with systems with different requirements.

This is the situation in most countries, and this will not change in most countries. There is a lot of opposition from several groups against central machines: not only privacy concerns, but also how will the software companies make money if the government hijacks their market?
In the liberal market situation most governments will not choose central systems. At least not in coming decades. Unless North Korea wins the war, of course.

Health related arguments are thus not the only arguments which decide how the health information landscape will look. Those 1000 machines are cheap and do not depend heavily on the internet; software can be available on competitive markets, with competition in features and price, and the government can guard quality rules.

Most of us are not building a single National Health Service cluster with all patients stored centrally. Realize that and life becomes simpler; you can concentrate on other, more important things.

If you cannot let go of the dream of being the one who delivers the National Health Service cluster, make your storage layer transparent, so that it can be exchanged without much effort. You should do that anyway; you learn that at every system designer school. Keep in mind that the other software layers in your system do not need to know which kind of database they run on.

All of this is only about scalability for databases/systems. We do need scalability on other subjects, like in the discussion between Tom and Tim on identification in archetyping, and identification of patients, etc. So even when you are designing a small system, you need to remember that it must be possible to safely share information with thousands of systems. That is not scalability in the database, but in logical design.

Have a good day,
Bert Verhees

---

## Post #27 by @pablo

For networks of hospitals with centralized systems (like public hospitals here in Uruguay) and insurance+healthcare companies that are buying clinics and hospitals every day (as happens in the US and many other countries), scalability is a must. For the mentioned cases 1. there are more users every day, 2. new data services are needed every day, and when performance limits are reached, scaling is the only solution.

Ing.
Pablo Pazos www\.cabolabs\.com --- ## Post #28 by @system Hi Pablo, do you have some examples of processes that require the immediate availability at the terminal of employee A of data entered/obtained at the terminal of employee B in these largish organisations? --- ## Post #29 by @system It is a solution, another solution is distributing\. Distributing is how the upcoming National Health Service network in the Netherlands is designed\. Only a centralized index\-service which knows where information about a patient is, and a message system, which retrieves the information when needed\. This solution was chosen to avoid a central point of failure, but even more to comfort the software companies, which would lose their markets if a centralized system were introduced\. A centralized system is also innovation killing, because the market then ends up in the hands of one commercial party\. But most of all, no\-one working in health needs direct access to hundreds of thousands or millions of patient\-records\. Bert --- ## Post #30 by @pablo In the centralized environment I mentioned, distribution is not an option. The concept here is: there still exists a lot of centralization, we cannot change that, so we need to provide solutions under that scenario too. In Latin America centralization is the rule, not the exception. My message is to give Jose some input to help him on his project, I think we need to discuss that here and create other threads to discuss solutions in real environments, considering different scenarios. Please consider that Jose's project is a PROTOTYPE, and he needs to try things out. --- ## Post #31 by @pablo Hi Roger, I don't fully understand your question, what kind of data are we talking about? Are terminals A and B in the same organization? What type of organization are we talking about? A clinic, a hospital, a government agency?
There are different sources of data, different kinds of data and different accessing requirements, and different organization structures (federated, associated, acquisitions). Also, one thing is what exists now and another is what we should do to improve that (our vision/ideal, here I'm talking about that ideal, not reality. Our reality will make your bones shake...). --- ## Post #32 by @system I understand, in that case you need mega\-databases\. Good luck with it\. I am happy to live where the market is divided, which gives newcomers a way to enter\. Bert --- ## Post #33 by @thomas.beale Bob, this is where it is useful to 'know something about the data'. In openEHR land (and it's the same for 13606, CDA, CCR, anything similar), your base lump of committed information is something like a 'document'. In openEHR we don't think of them like that, but the granularity is the same as for those document standards. In openEHR, the container is the COMPOSITION. So you know you are going to commit one or more COMPOSITIONs at a time; you know also that the contents of those COMPOSITIONs are archetyped. So that enables you to store them as blobs with a smart index based on archetype ids and paths. Then you have two practical choices: Store the blobs in technology-specific binary form, e.g. java objects, binary data infosets, whatever. This will be optimised for your implementation. As long as you can convert this to an interoperable format (like the published openEHR XSD; JSON in the future) this is a good way to go. OR.. the second choice is you actually create the blobs directly from an interoperable form like XML or JSON. This won't be optimised for your internal system computation, but it will make your internal software a bit easier to write, and it will make it easier to export the data in an acceptable standard format. For high capacity systems, the first choice is the most likely. 
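The "blobs with a smart index based on archetype ids and paths" idea can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not how any existing openEHR product does it: SQLite stands in for the blob manager, the COMPOSITION is serialised as JSON, the table and function names (`composition`, `path_index`, `commit_composition`, `find_by_path`) are made up for the example, and the paths are simplified JSON paths rather than real archetype paths.

```python
import json
import sqlite3


def leaf_paths(node, prefix=""):
    """Yield (path, value) for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from leaf_paths(child, f"{prefix}[{i}]")
    else:
        yield prefix, node


def open_store():
    """Create an in-memory store: one blob table plus one path index table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE composition (
            id INTEGER PRIMARY KEY, ehr_id TEXT, archetype_id TEXT, body TEXT);
        CREATE TABLE path_index (
            composition_id INTEGER, archetype_id TEXT, path TEXT, value TEXT);
        CREATE INDEX idx_path ON path_index (archetype_id, path, value);
    """)
    return db


def commit_composition(db, ehr_id, archetype_id, content):
    """Store the whole composition as one blob and index its leaf values."""
    cur = db.execute(
        "INSERT INTO composition (ehr_id, archetype_id, body) VALUES (?, ?, ?)",
        (ehr_id, archetype_id, json.dumps(content)),
    )
    comp_id = cur.lastrowid
    db.executemany(
        "INSERT INTO path_index VALUES (?, ?, ?, ?)",
        [(comp_id, archetype_id, p, str(v)) for p, v in leaf_paths(content)],
    )
    db.commit()
    return comp_id


def find_by_path(db, archetype_id, path, value):
    """Use the index to locate blobs, then deserialise only the matches."""
    rows = db.execute(
        "SELECT c.ehr_id, c.body FROM path_index i "
        "JOIN composition c ON c.id = i.composition_id "
        "WHERE i.archetype_id = ? AND i.path = ? AND i.value = ?",
        (archetype_id, path, str(value)),
    )
    return [(ehr_id, json.loads(body)) for ehr_id, body in rows]
```

The point of the design is that queries never parse blobs to find candidates; they only touch the narrow index table, and the blob is deserialised once a match is known.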
this is why the only realistic way to go with RDBMS is blob storage; you can make the RDBMS use its brute force as a blob manager. Even better to make them variable size blobs e.g. as . - thomas beale --- ## Post #34 by @pablo Exactly, that is the case here, but we need more than good luck ;P Ing\. Pablo Pazos www\.cabolabs\.com --- ## Post #35 by @system > Exactly, that is the case here, but we need more than good luck ;P OK, much wisdom too\. regards Bert The Tao gave birth to machine language\. Machine language gave birth to the assembler\. The assembler gave birth to the compiler\. Now there are ten thousand languages\. Each language has its purpose, however humble\. Each language expresses the Yin and Yang of software\. Each language has its place within the Tao\. But do not program in COBOL if you can avoid it\. http://www.canonical.org/~kragen/tao-of-programming.html#book1 --- ## Post #36 by @system Hi Bert, Scalability had been a great concern in the last decades, but cloud computing has succeeded to disguise it\. We can purchase computer resources by reasonable cost on demand\. Even an EHR system for small clinics needs to discuss to use such cloud system for sustainability, because severe disaster can easily break intra\-hospital/clinic system\. \(It was just proved by the earthquake and tsunami, 2 years ago\) However, this is a very good for us, openEHR developers\. We can build from 10 to billions patient system by same logical information model on cloud system as you mentioned\. My suggestion to build a prototype is to consider to use such cloud system\. Shinji --- ## Post #37 by @system > My suggestion to build a prototype is to consider to use such cloud system I agree with you, Shinji, cloud computing is the way you can scale up systems\. It is in fact what I suggested, but the term "cloud" did not come to mind\. 
When you have a distributed OpenEHR system, deployed by an organized group, they can easily exchange information in an OpenEHR way, and even in a disaster \(when you need your system more than ever\) or a less catastrophic network failure, all the islands in the cloud still have their own databases\. Last week in the Netherlands we had three major network breakdowns in banking systems\. Two concerned only one bank, but the largest bank, ING, and one concerned iDeal, which is a Dutch service for all banks for Internet money transfer\. People could not get to their money, and in the third breakdown webshops lost millions, there was no money transfer at all\. Two breakdowns were because of software problems, the third was because of the largest DDOS attack in Dutch history\. And because this was in the news, they showed us that this happens many times, worldwide\. In the USA, there was a network outage concerning the Bank of America, also last week\. This illustrates how important it is not to be dependent on networks\. Distribution of systems is the way to scale up systems\. And another advantage of distribution is that it is not necessary for whole regions to switch to OpenEHR\. The organizations which would rather wait can be served by an acceptable message system\. I think this is the way I can also agree with the participants in that other parallel discussion\. Thanks for your suggestions\. Concerning prototyping distributed/cloud OpenEHR, it will be very interesting to work on that\. Bert --- ## Post #38 by @system Maybe it is the iPad, but I receive my messages twice, if you also have this, please excuse me\. Bert --- ## Post #39 by @thomas.beale I also happen to agree that a realistic view of the world would be as Bert has described below\. Some countries have a different background and want centralised systems\.
But the reality is that the vast majority of health transactions occur locally, and will never be relevant outside that situation / location\. Even care in the community, which requires localised communication of a care team and some local clinic, is essentially still 'local'\. Central storage of patient data in a large country at least is not a _needed_ approach \- it just involves absurd expense and fails to deliver the main functions \(e\.g\. UK\)\. For small countries \(sub 5m people\) it's more attractive, but still might not match the reality on the ground\. However, there are places that do want e\-health computing hubs with 10,000,000 patients and more, for whatever reason\. Plus the need to do secondary studies on millions of patients\. So the picture gets muddy, and the notion of 'scalability' starts to extend in ways that a rational analysis wouldn't expect\. \- thomas --- ## Post #40 by @Seref Just food for thought: is scalability only a function of data size? What happens if your solution is performing adequately for all reads/writes you need up until a point and you need to access your repository with a completely different query? Maybe robustness is a better word for it in this case, but scaling across data volume is not the same as scaling across data access patterns even if the data size is kept constant. --- ## Post #41 by @pablo For me scalability has many dimensions, data size is only one of them\. Users, subsystems & services are other dimensions we have to consider for scaling\. Ing\. Pablo Pazos www\.cabolabs\.com --- ## Post #42 by @system Yes, but it is their problem, I think they should reform their ICT, it is too dangerous how they deploy it\. When we support situations like these, we also help to keep them from necessary reforms\. Like Shinji and I discussed this morning \(and many others I guess\), clouds should be the answer to scalability\.
Bert --- ## Post #43 by @system As I explained before, fifteen years ago I worked as an SQL\-engineer for a big market\-research company, and they had really big data: 700,000 people were in their database, being researched on all kinds of things, unpredictable research also\. We prepared such research by defining indexes, caching data, temporary tables, that kind of thing\. We prepared SQL queries the day before, letting them run overnight, and next morning, they could load the resultsets in their favorite tooling, SPSS it was mostly\. You cannot expect a production system to be fast on uncommon queries\. Especially on a cloud, it can even take longer because of all the network transport\. There is also always the argument for an epidemiological warning system\. For that purpose, there is a "new" technique, which has been used by secret services for quite some time now to predict war or terrorism\. That is a semantic approach to network\-data research\. Often it fails, but also sometimes it works\. People are not born terrorists, they become one, and often, this is visible on newsgroups, forums\. After Breivik had done his bad thing, all his Internet\-publications were found\. It is a missed chance that the government of Norway did not invest much in that kind of research\. It can even get better when network\-protocols are optimized for that \(there are\), and then we have the semantic web, which should connect to health\-systems\. I realize that this is a new way of thinking, but, starting on obvious subjects like cholera or TBC, it could give some fast results, and, very important, in a second\. But this is beside OpenEHR, it is something that should be initiated by governments\. Meaning, OpenEHR does not need to solve all problems, sometimes there is a better way of doing it\.
Bert --- ## Post #44 by @ian.mcnicoll Hi folks, A few reflections on the thread so far and a pointer to a different discussion on the same sort of topic at http://www.linkedin.com/groups/Choice-OpenEHR-persistence-layer-144276.S.208531138 Getting back to Jose's original question, I think there are a few clear answers, with 1\. There are no 'out\-of\-the\-box' persistence solutions\. Whatever choice you make you are going to have to do a fair bit of work to develop a viable openEHR CDR \- by that I mean something which can read and write serialised openEHR data, handle the audit trail capacity, and most importantly, support AQL\. Most successful systems also support templates and template\-derived technical artefacts such as code libraries and XSDs, 2\. We know that some solutions do not really work \(at least beyond very small scale / academic prototypes\) \- RDBMS solutions which involve normalisation or object\-relational tools like Hibernate\. 3\. We know that RDBMS solutions based on blobs \+ path\-based indexing can work extremely well for operational data \(scale and speed\)\. see http://en.ibs.ru/content/eng/703/7039-article.asp Other vendors have had similar experience with this approach\. 4\. On the face of it an XML database is best aligned to the path\-based querying re --- ## Post #45 by @Seref For me this is still the state of the art for healthcare data persistence implementation, and unlike you, I think this can be improved\. I'm not talking about a magic bullet, but there is so much to improve in persistence approaches in general\. I think this is all I'd like to say in this thread\. Jose must be regretting asking the original question by now :\) --- ## Post #46 by @system You are right, I agree, there is always something to improve\. I will see your contributions in persistence discussions from that point of view from now on\.
Bert --- ## Post #47 by @ian.mcnicoll Hi folks, Sorry \- sent the first response prematurely A few reflections on the thread so far and a pointer to a different discussion on the same sort of topic at http://www.linkedin.com/groups/Choice-OpenEHR-persistence-layer-144276.S.208531138 The conversation on scalability just confirms my view that there is room for many different persistence approaches, particularly when you separate operational from analysis use cases\. Big, small, centralised, distributed, the beauty of openEHR is that it is essentially agnostic to all of these design decisions\. Let many flowers bloom\. In the UK I see a market from app/device based storage based perhaps on a simple XML blob, through a cloud solution which can handle large numbers of small, independent applications \(think hospital departmental systems\), through to shared repositories capable of handling millions of patients, and on to research 'big data' analytics systems\. All valid, technically, clinically and socio\-politically\. Getting back to Jose's original question, I think there are a few clear pointers \.\.\. 1\. There are no 'out\-of\-the\-box' persistence solutions for openEHR \(unless you buy one in\!\)\. Whatever choice you make you are going to have to do a fair bit of work to develop a viable openEHR CDR \- by that I mean something which can read and write serialised openEHR data, handle the audit trail capacity, and most importantly, support AQL\. Most successful systems also support templates and template\-derived technical artefacts such as code libraries and XSDs, I think this was behind Seref's original comments\. 2\. We know that some solutions do not really work \(at least beyond very small scale / academic prototypes\) \- RDBMS solutions which involve substantial normalisation or object\-relational tools like Hibernate\. 3\. We know that RDBMS solutions based on blobs \+ path\-based indexing can work extremely well for operational data \(scale and speed\)\.
see http://en.ibs.ru/content/eng/703/7039-article.asp Other vendors have had similar experience with this approach\. 4\. On the face of it an XML database is best aligned to the path\-based querying required for AQL but we also know through Erik's work that XML databases do not work terribly well 'out\-of\-the\-box'\. I think Bert favours this approach and I guess is getting some decent results once appropriate 'tweaking'has been applied but I am not aware of any publicly available metrics for such systems in operation\. 5\. There is a lot of interest in NoSQL and Mumps database persistence solutions, particularly for analytics and serious processing but this is still at the research level and unknown in real\-world environments\. I think the real breakthrough will come when someone packages up an openEHR CDR service as part of a cloud\-based pay\-as\-you\-go solution, so that someone like Jose can get started and become familiar with openEHR\-based development, without the immediate overhead of having to develop their own\. Ian --- ## Post #48 by @system > On the face of it an XML database is best aligned to the path\-based > querying required for AQL but we also know through Erik's work that > XML databases do not work terribly well 'out\-of\-the\-box'\. I think Bert > favours this approach and I guess is getting some decent results once > appropriate 'tweaking'has been applied but I am not aware of any > publicly available metrics for such systems in operation\. First on Erik's work, I don't understand why everyone takes that for granted, I only have read that part about relational\-db/xmldb comparison\. I hope we are talking about the same paper\. I am talking about a specific paper, and it will be clear in the following below which paper I am talking about\. 
He is doing an unusual query, he is not going into detail about which indexes there are, etc, he says nothing about normal production work, but everyone I hear is jumping to conclusions, it almost seems that it is wanted that an XML\-db fails\. I hear no\-one complain about the missing details in his comparison\. But there is no way you can conclude from Erik's paper that an XML\-db is a bad choice for OpenEHR\. Several experienced ICT\-technicians believe it is a good choice, and they were also surprised that Erik does a datamining query to test production\-use of a database system\. And he does only one query, or a few, but not one normal production query\. That paper I described, and that part about this comparison, was not very impressive\. I am sorry to say, but it is my honest opinion\. How fast can you retrieve a patient and ALL his compositions? It depends on how smartly you develop your XML structures, but it is possible in the unmeasurable time under one second\. That is the same whether you have 5,000 or 25,000 patients\. I am talking about normal production use\. See how a relational\-db performs on this\. And that is also a problem with Erik's paper, he compares with a relational\-db, as I remember \(I am telling this without having the paper here\)\. As we all know, a relational\-db is the worst choice possible for OpenEHR, you write that too, and then Erik writes it is better than a native XML database for OpenEHR? Needing to jump into 20/40 or 60 tables, consulting many indexes is faster than retrieving a number of documents on their database\-XML\-id? Really, we should not refer to this anymore\. XML is very fast, especially if you do not need to scale up to 100,000 or more patients directly accessible, and maybe then, also, it will outperform a relational\-db, but, I never tested that\. This afternoon I explained, and Shinji did also explain that scaling up is dangerous and unnecessary, because cloud computing is the future, and for good reason\.
So why else should an XML\-db be a good idea? You are right, I favour it, and believe it or not, it was Erik's paper which helped me in this conviction\. I can read between the lines, it was also saying things between the lines\. There are a lot of similarities between OpenEHR structures and XML\. Both use paths to get to a leaf node which contains a value\. Validation of datasets is ready in my kernel, works very thoroughly and is lightning fast\. Every construct in ADL 1\.4 can be transformed to XSD\. That was the hard part, but it is ready\. This transforming takes some milliseconds\. Of course, the XSD will be reused\. Then AQL: an important part of XQuery 1\.0 is XPath 2\.0\. It seems not too difficult to transform from AQL to XQuery\. But if there are problems, there will always remain XQuery, which is a very rich query language\. I will report on that later\. I was just starting to write this part\. So, it does not work out of the box, there has to be some code knitted around it, but that holds for every database\-system\. However, how much code complexity do you have to create to get the thing running? This is a valid question\. It is about quality\. Do you need to write an AQL engine? Query\-engines are difficult to write, that is where a lot of the development investment of companies such as Oracle goes\. That is where good database\-companies distinguish themselves from bad companies\. When using an XML\-db, a path\-oriented query engine is in the box\. It is developed by a team, it is maintained and optimized by a team, for years, often very dedicated people\. Maybe XML is not the best solution, especially for large scale use, maybe cache is better when having 50,000 patients, I don't know\. We have to find out\. But it is much better than a relational database\. As you say, Erik describes it as worse than a relational db, and you describe a relational\-db as the worst choice\. How damaging can one be? And why should one want to be damaging? So, I needed to say this\.
With your statement you were harming my work, and I needed to give my opinion on this, so people can make their own choice what to believe\. Most of the people will not get any further than believing, we both know that\. I wish you a nice day, and if it is evening at your place, I wish you a nice evening\. I read my email over, should I send it, do I insult someone? I hope not\. It was not my purpose\. Sorry Erik, you are a good guy, but I need to give my honest opinion\. Best regards Bert Verhees --- ## Post #49 by @ian.mcnicoll Hi Bert, I am sorry if I have upset you \- there was certainly no intention to 'harm your work'\. I am quite sure a properly configured and optimised XML\-db can handle many openEHR implementations perfectly well, and on the face of it, should be a pretty good fit for openEHR\. I was not trying to be anti\-XMLdb in any way, just trying to summarise current experience, as I understand it\. I share your views on Erik's paper but I think it does tell us that XML\-db's do not work well if not optimised for openEHR\. In this respect they are no different from \*any\* other openEHR persistence methodology and it would have been a huge surprise if they had performed very well 'out\-of\-the\-box'\. I did indeed say that complex normalisation with an RDBMS seems to be a bad choice but storing openEHR data as dumb blobs \(with clever indexing and optimisation\) in an RDBMS can work well, and there are a number of real\-world implementations that demonstrate this\. We do not have as many examples of real\-world openEHR XML\-db implementations with metrics\. If they exist I am happy to stand corrected and look forward to hearing more positive feedback from your XML\-db experiences\. BTW I do think that increasingly AQL and a standard service layer will be pre\-requisites for operational openEHR CDRs\.
Again a lot depends on whether your product is designed to play in a wider openEHR ecosystem or if you are just using openEHR 'inside' as a great way to persist clinical data in a standalone system\. The point I was trying to make for Jose's benefit is that there are some persistence approaches that are known to work, and at moderate scale, others like NoSQL / Mumps that seem promising but are essentially new ground, while others like yourself are well down the road of implementing XML\-db solutions, and are very positive so far, but that implementation experience is so far limited\. If you do not feel that is a fair summary of the 'state of the art', please correct me\. The more diversity of approach and technology the better as far as I am concerned :\-\) Again sorry if I came over as bashing XMLdb approach or your work \- I look forward to seeing the end results\. Regards, Ian --- ## Post #50 by @system Excuse accepted ;\-\) Maybe I was overreacting, but I was reading that my work was worse than the worst possible work\. Now I understand that this was a possible interpretation, but not a necessary interpretation of your words\. So sorry if I was overreacting\. Bert --- ## Post #51 by @system I think you are right, there must come definitions\. Shinji brought up the idea of a cloud OpenEHR in discussion, this in the context of the earthquake in Japan\. I agree, we do not only need to protect ourselves against disasters, but we need to take care that a simple DDOS\-attack does not create a disaster\. The best protection against this is scaling up via clouds, where systems can remain using their local database, and maybe what is available on a local network\. But for having clouds, we need to define a standard service layer and a minimum\-set of AQL which must be supported, as well as an export data\-format, XML or dadl\.
Maybe a service to test OpenEHR on this would be nice, testing the service interface, the minimum required AQL, the formats, so that every developer can test if his work is doing good\. Bert --- ## Post #52 by @system Hi! Interesting to see a paper being discussed without being properly read. Please read it through and try to see the context before expressing strong feelings about it: [http://www.ep.liu.se/ecp_article/index.en.aspx?issue=070;article=009](http://www.ep.liu.se/ecp_article/index.en.aspx?issue=070;article=009) Now some pointers to those that are too lazy to read properly but don't mind spending long texts on commenting what they have not read properly. 1. The paper is part of a bigger research context where we are looking at how openEHR storage approaches work for two different use cases, let's call them SINGLE and MULTI. - SINGLE: normal clinical day-to-day patient browsing use where you know the ID of the patient - MULTI: epidemiological queries sometimes spanning millions of records If you at least read the _title_ "Performance of XML Databases for **Epidemiological** Queries in Archetype-Based EHRs" you might see that this paper is focused primarily on the MULTI use case. 2. Of course out-of-the-box configurations of XML databases won't be a perfect fit for the MULTI use case, and of course we guessed that before starting, but we did not know _how_ good or bad the performance would be, did you? We wanted some benchmark to measure against before starting the experiments with Hadoop, Couchbase etc. (If we had skipped this step, then somebody would probably instead be complaining about that we use overkill distributed MapReduce solutions and had skipped the simple obvious XML-databases that likely would have worked well enough...) Since very few seem to publish any openEHR storage performance measures at all we thought that even these initial benchmark numbers could be useful to some people, thus we published them. 
I did not expect them to be so misinterpreted and used out of context, so we should probably have spent even more text on explaining on what the tests were _not_ about (but I am not sure that would have been read either). 3. Even though we focus on the MULTI use case we also mention performance for the SINGLE use case, and if you actually read what the paper says you will find that many XML databases perform well for this. (It is in the text, not in a graph or table though so you have to read the text not just look at the pictures.) Even the abstract reports this by saying "For individual focused clinical queries where patient ID was specified the response times were acceptable" The paper later states "The average response times for the clinical queries were between 10 and 200 ms for the BaseX, eXist and Berkely DB XML" We thus share the opinion of Bert and others that XML databases can be used perfectly well for the SINGLE use case. So this paper (if properly read) could be used to _support_ that use. From the measurements you may also conclude that for example BaseX (even just using the built in out-of-the-box-indexing) could actually work for some MULTI use-cases if the datasets are small enough. 4. The paper is not anywhere suggesting that a relational database would be ideal for openEHR storage. We are not using a RDBMS for openEHR formatted storage anywhere in the paper. Some may have been confused by the fact that the original epidemiological data came from an RDBMS optimized for epidemiological queries. The openEHR formatted data is derived and enriched from the original and is instead tree-formed and contains _a lot_ more of openEHR metadata, hierarchies etc. We don't expect the original RDB to be comparable to the openEHR-fomatted data since it would be like comparing apples and oranges. 
In the bigger research picture we however want to know how much it costs (in response time) to use a generic storage solution as opposed to the RDBMS tables that have been manually designed to fit the epidemiological use case. This should be compared to the time and hassle to do all that manual design and maintenance - such comparisons/discussions are easier if there are published measurement values... (Initial tests from some MapReduce approaches are actually indicating that generic solutions may actually work well enough also for the MULTI use case. We'll publish that too when finished, but will try to fool-proof the context explanations better.) 5. The epidemiological queries used are realistic since they are translated from real epidemiological research queries used in the original database. 6. It should not be called "Erik's paper": Sergio is the first author for a reason, he did a _lot_ of hard work that was used in the here mentioned paper and that is then reused for current research of distributed openEHR storage solutions. 7. I do think that it is important to discuss the use-cases SINGLE and MULTI separately since they have very different characteristics and requirements. in another (accepted to BMC Medical Informatics and Decision Making but not published) paper: "Applying representational state transfer (REST) architecture to archetype-based electronic health record systems". You find abbreviated notes about that also in section 3.2.2 of my thesis: [http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-87702](http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-87702) Solutions for the SINGLE use case are not too hard to scale if proper sharding is used. I discuss sharding-approaches in the paper/thesis mentioned above. Scaling by sharding this way could be applicable to RDBMS as well as XML- or other NoSQL-based approaches.
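As a rough illustration of why the SINGLE use case shards easily, here is a minimal sketch that routes every read and write by a hash of the EHR id. It is not taken from the paper or thesis mentioned above; the names `shard_for` and `ShardedStore` are hypothetical, and plain in-memory dictionaries stand in for whatever backend (RDBMS, XML database or other NoSQL store) each shard would really use.

```python
import hashlib


def shard_for(ehr_id: str, shard_count: int) -> int:
    """Map an EHR id to a shard number in a stable, evenly spread way."""
    digest = hashlib.sha256(ehr_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Keeps all compositions of one EHR together on a single shard."""

    def __init__(self, shard_count: int):
        # Each dict stands in for an independent database node.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard(self, ehr_id: str) -> dict:
        return self.shards[shard_for(ehr_id, len(self.shards))]

    def commit(self, ehr_id: str, composition: dict) -> None:
        self._shard(ehr_id).setdefault(ehr_id, []).append(composition)

    def compositions_for(self, ehr_id: str) -> list:
        # A patient-centred read touches exactly one shard.
        return list(self._shard(ehr_id).get(ehr_id, []))
```

Because the patient id is known in the SINGLE case, each request goes to one node only. A MULTI (epidemiological) query would instead have to fan out to every shard and merge the results, which is one reason the two use cases are worth discussing separately.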
Best regards, Erik Sundvall [erik.sundvall@liu.se](mailto:erik.sundvall@liu.se) [http://www.imt.liu.se/~erisu/](http://www.imt.liu.se/~erisu/) Tel: +46-13-286733 --- ## Post #53 by @system The conclusions from your paper are often used outside the narrow context of your paper, this was also the case in this discussion. You are not to blame for that, but the one who does this is. But still I have some remarks on that paper, see below That is true, and it is stating that XML performs worse, and that statement is often repeated by others out of context and referring to your paper. But again, you are not to blame for this. But that statement is based on the result of one single inefficient XQuery-statement and one SQL-statement of which no details are known. On which I have a few things to say: First, but not most important, but still worth explaining: -------------- One should never do an OLAP query ad hoc. I agree that the performance of that statement is bad, but I think it can improve a lot if it is prepared properly, like a professional database-engineer would do. One can add temporary indexes, data-extractions, it is a normal practice when working on OLAP queries. What you do is an OLAP query. But one doesn't want OLAP-indexes during production (OLTP) time, because it costs processor time maintaining them. Wasted processor time, because nobody is interested in continuously updating this index. One only needs an update of such an index just before the query is executed. So after doing an OLAP query, one removes those indexes again. A production environment should be as fast as possible for production use. All this is common sense. Second, more important ------------------ And then, for OLTP use, some indexes can be added because an XML database can never guess what belongs together. For example, in your XQuery, I see possibly 4 XML-structures being queried on internal path-attributes and then being linked together.
I am not sure about this, because the paper does not explain this. But if this is the case, one could create indexes on those values which are searched to be key in connection to other XML documents. I guess your query time will reduce to 1% of the query time measured in this paper, of course, depending. There is not enough information to jump to this kind of conclusion. Regarding the preparation which has been done, and the wish to test an unoptimized environment, one can expect the worst configuration thinkable. 100 times performance improvement is possible with rather small measures.

About not enough information, there is also my third remark ------------------

I did not find in the document the structure of the MySQL database shown. You can normalize OpenEHR to 10 tables, but also to 50 tables. Depending on the queries you want to run, this can be quite a difference on performance. I worked with Codd-normalized relational databases too in OpenEHR, some years ago. It is where thinking started for me, at that time, when programming a database application. First I had table sets, split up completely to all leaf nodes; later I started combining field values and storing as blobs the things which are hardly queried. For example, constraints are hardly queried, only used at production time when validating data. That makes a relational database fast, but not very reliable/efficient, because you cannot index or even query every detail.

Last but not least ------------------

Writing an AQL engine for a relational database is almost not doable for a small company. The investment will be maybe two or three annual incomes for developers. For an XML-database, because of the similarities between XML and OpenEHR structures, this can be done in a few months by a single experienced developer. So this will be maybe up to 10 times cheaper and safer. Add this up with the 100 times performance gain from your test by simple measures, and a native XML database becomes a valid solution.

Sorry, I did not realize that. I lost the PDF, but just read it again from your link. Thanks for your remarks, Erik, it gave me the opportunity to respond.

Best Regards Bert Verhees

---

## Post #54 by @ian.mcnicoll

Hi Erik,

Firstly, apologies to you and Sergio for the mis-attribution of the paper. I agree that it is an important piece of work, given the general lack of good performance metrics in anything like realistic systems.

You are, of course, quite correct that careful reading and analysis gives a somewhat more nuanced impression than the headline. Many thanks for those clarifications, in particular for pointing out that the unoptimised XML-db performed pretty well in 'single patient' query mode. I must confess to having missed this when I read the paper.

It is useful to draw the distinction between single and multi patient queries but I think it is also worth stating that in anything other than a small tightly focussed application, some sort of multi-patient querying capacity is essential, e.g. for quality and audit purposes, to identify cohorts for call/recall provision etc.

I am encouraged to hear Bert's assessment that developing AQL against an XML-db was far simpler than his equivalent experience with RDBMS.

Ian

---

## Post #55 by @thomas.beale

we need to be very careful with what we are comparing here when mentioning relational databases. There are entirely different ways of using it. We can think of it as follows: so we should say here: a RDBMS in 'classical' mode... well, let's just talk about apples and oranges properly here, and I think no-one will be maligning any technology unfairly, nor anyone's excellent professional work (personally I can't wait to see what Bert's product looks like in the end).

\- thomas

---

## Post #56 by @system

Me too ;-)

Bert

---

## Post #57 by @system

Hi Seref and others!

> If anybody here is looking at a nosql db for a system that is going to be used for clinical care, my suggestion is to consider the ACID support and immediate consistency.

I think it is useful to consider carefully exactly what resources need to be included and locked in specific kinds of transactions. For many clinical use cases it is enough to lock a single patient record during writes to that record. For those use cases, sharding in a way that always sends all writes for a particular record to the same shard (DB on a specific network node/cluster) will likely solve a lot of contention/performance issues.

> [...] not all installations can afford multiple servers,

And some big installations will not be able to scale up using a single huge server node - in those cases multiple servers will be more affordable.

> and most nosql servers also scale with eventual consistency. During clinical care, you can't rely on eventual consistency. You want immediate consistency.

See above. Immediate consistency for each individual record might be enough for many (not all) use cases; immediate consistency for the entire distributed EHR system might not be needed for use cases of type "SINGLE".

> Relational dbs are extremely good at ACID and immediate consistency, which is the reason they can't easily scale out. Because you need a global transaction manager to scale out with immediate consistency, and that is really really hard. Please take a look at this before getting over excited about nosql db concepts: [http://en.wikipedia.org/wiki/CAP_theorem](http://en.wikipedia.org/wiki/CAP_theorem) So far no one has beaten the cap theorem.

This is why I prefer to split the SINGLE and MULTI use cases when possible. If the MULTI use cases can accept running in read-only mode on data that is a bit delayed (so that all possible partitions (P) have had time to be caught up) then you can query consistent data (C) that is timestamped in past time before the last partition event.
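
The record-level sharding idea above (all writes for one patient record always go to the same shard) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how any particular openEHR product does it: `shard_for_record` and `ShardedEhrStore` are hypothetical names, the shards are plain in-memory dictionaries standing in for databases on separate nodes, and placement uses a simple hash-modulo rule.

```python
import hashlib


def shard_for_record(ehr_id: str, shard_count: int) -> int:
    """Return the index of the shard that owns every write for ehr_id."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # A stable hash (unlike Python's salted built-in hash()) keeps the
    # mapping identical across processes and restarts.
    digest = hashlib.sha256(ehr_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedEhrStore:
    """Toy in-memory stand-in: one dict per shard, keyed by EHR id."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        self.shards = [{} for _ in range(shard_count)]

    def append_version(self, ehr_id: str, composition: str) -> int:
        """Append a composition to one record; return the shard touched."""
        index = shard_for_record(ehr_id, len(self.shards))
        self.shards[index].setdefault(ehr_id, []).append(composition)
        return index

    def read_record(self, ehr_id: str) -> list:
        """Read one record; only the owning shard is consulted."""
        index = shard_for_record(ehr_id, len(self.shards))
        return list(self.shards[index].get(ehr_id, []))


store = ShardedEhrStore(shard_count=4)
first = store.append_version("ehr-001", "composition v1")
second = store.append_version("ehr-001", "composition v2")
print(first == second)               # both writes landed on the same shard
print(store.read_record("ehr-001"))  # versions come back in write order
```

Because a record never spans shards, a lock or transaction for a SINGLE-type write only has to involve one node. A plain modulo rule does move most records when the number of shards changes, so a real deployment would probably want a lookup table or consistent hashing instead.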
Neither RDB nor NoSQL solutions can get around the CAP theorem. Some NoSQL solutions allow you to configure which restriction you want to relax: C, A, or P.

> So if you're doing research work, where you are not building an OLTP system for clinical care, nosql may be great. But if you're responsible for systems that will support clinical care during the actual care process, I suggest a bit more thinking.

Well, XML databases are for example sometimes considered NoSQL, and I think Bert and many other implementors are using them for clinical OLTP systems, so it is not a question of NoSQL vs RDBMS. I think it is more a question of how to do sharding etc if you really need to scale out - and match that to your use cases while still respecting the limits of the CAP theorem. NoSQL is no magic bullet. RDBMS is no magic bullet.

> Without AQL, you simply can't have portable access to data. if I have to learn access method A to read from openEHR repo A and access method B to read from openEHR repo B, then how on earth we're going to have smart healthcare apps that can run across multiple systems? Adapting systems to each repository is so costly that it'll never take off. Just for the fun of it, google "curly braces problem" and see what that brings.

Agree. Shared paths and queries (and thus query languages like AQL) are keys to semantic sustainability etc. This does however not rule out NoSQL, JSON, XML etc in any way - but perhaps that was not what you meant either.

Best regards, Erik Sundvall [erik.sundvall@liu.se](mailto:erik.sundvall@liu.se) [http://www.imt.liu.se/~erisu/](http://www.imt.liu.se/~erisu/)

---

## Post #58 by @system

Hi!

> The conclusions from your paper are often used outside the narrow context of your paper, this was also in this discussion.
> It is not for you to blame, but for the one who does this.

Or perhaps rather misinterpretations of the conclusions.

> > The paper is not anywhere suggesting that a relational database would be ideal for openEHR storage.
>
> That is true, and it is stating that XML performs worse, and that statement is often repeated by others out of the context and referring to your paper.

I repeat: there is no openEHR data in a relational database being compared to XML in that paper, so how can it be worse or better? There is a non-openEHR epidemiological database containing original non-openEHR data in a manually designed table structure optimized for certain epidemiological queries. The query use-cases come from previous usage of SQL-queries in that database. I repeat: the paper does NOT explore performance of openEHR data in relational databases. You could possibly say that tree-formed openEHR data based on a generic RM with a lot of extra metadata and an entire EHR structure (then serialized as XML) performs worse than an epidemiology-optimized non-openEHR relational database (that is not even an EHR). But who is surprised by that finding? The interesting comparison of the paper is between the different XML-databases, especially if ad-hoc querying is desired.

> But that statement is based on the result of one single inefficient XQuery-statement and one SQL-statement of which no details are known.

No, the tests used in the paper contain several different queries that are in turn run repeatedly, as explained in the paper. However, a conference paper has limited allowed page length and we could only exemplify a few things. If you need more info about something in a research paper, then I'd suggest...

1. reading the paper thoroughly, and
2. contacting the corresponding author of the paper for more details

...before rambling too much about it.

> One should never do an OLAP query ad hoc.

Phrases like "One should never" easily fall into the same category as "the truth about..." or "XML/SML-schema is bad" or "object orientation is bad".
They may be entertaining or sometimes useful as simple rules of thumb in specific contexts, but in proper research and engineering you will often find phrases like "it depends" or "this is a trade-off between A and B" more useful. Epidemiologists doing exploratory research often want to make ad hoc queries in an iterative process. This is a valid family of use cases. Admittedly sometimes a tricky one - sometimes making either performance or the complexity of DB (index) maintenance suffer. But hard problems that are fun and interesting to research.

> I agree that the performance of that statement is bad, but I think it can improve a lot if it will be prepared properly, like a professional database-engineer would do.

Of course. Now to an interesting question: is it always worth the resources and time to call for a professional database-engineer when you come up with new (ad hoc) epidemiology queries that then generate new ideas and follow-up queries? Or when you start exploring new archetype structures? I think a good answer starts with "it depends". Some queries may be one-off exploratory probings that will not be repeated - other questions are things you want to repeat every day. It depends on your available computing resources, staff resources, data size, query complexity etc. Instead of saying "One should never" it is more helpful to get some numbers, scaling factors and reasoning indicating probable cost (man-hours, processing time, index disk size, response time etc) of different options. I think we should be welcoming publication of such research, instead of saying it should never be done since it is of less importance in our own specific favourite family of use-cases.

> One can temporarily add indexes, data-extractions, it is a normal practice when working on OLAP queries. What you do is an OLAP query.
> But one doesn't want OLAP-indexes during production (OLTP) time, because it costs processor time maintaining them.

Yes of course one wants to consider this and in many cases thus use separate systems. Did the paper state anywhere that OLTP and OLAP should always be done in the same system? Was the paper focused on OLTP use cases? I think not.

> And then, for OLTP use, some indexes can be added because an XML database can never guess what belongs together.
>
> For example, in your XQuery, I see possibly 4 XML-structures being queried on internal paths-attributes and then being linked together.
> I am not sure about this, because the paper does not explain this.
> But if this is the case, one could create indexes on those values which are searched to be key in connection to other XML documents.

Well, some of the databases (for example BaseX) do have built-in indexes of attribute values etc as default out of the box, so they have thus been used. One problem when you have a lot of archetyped data is that you get very big indexes containing a lot of the same thing (the attribute value "at0001" for example). Such index content affects performance and is discussed in the paper. This can be used as a hint to people that want to create extra (for example more path-oriented) indexes (that I guess you are suggesting instead).

> 100 times performance improvement is possible with rather small measures.

Easy to say, requires work to prove. If you disallow ad-hoc queries and know all queries beforehand then it is of course a lot easier to create optimized indexes, but then you miss a part of the research context - we are in this particular research thread not researching only the easiest use cases - we want to support also epidemiological ad-hoc querying of openEHR data. A family of very valid use cases that would likely benefit from some published numbers and scaling factors when weighing different factors against each other (including the option to call for professional database-engineers to tailor solutions to a particular query).
Certainly commonly recurring things like EVENT_CONTEXT.start_time and some data values could benefit from range-based indexes rather than alphabetical indexes. I think Sergio made some experiments with that and other indexes on some databases but performance certainly did not improve as dramatically as 100 times.

> I did not find in the document the structure of the MySQL database shown.
> You can normalize OpenEHR to 10 tables, but also to 50 tables.
> Depending on the queries you want to run, this can be quite a difference on performance.

Again, there was never any openEHR-formatted data in the MySQL database. I am sure Sergio can send the non-openEHR DB-schema if you want to look at it, start by asking him.

> Writing an AQL engine for a relational database is almost not doable for a small company. The investment will be maybe two or three annual incomes for developers. For an XML-database, because of the similarities between XML and OpenEHR structures, this can be done in a few months by a single experienced developer.

I agree, and we made a quick (and somewhat dirty) AQL to XQuery translator in fairly limited time using JavaCC. (It will be released with the accepted REST-paper.) If somebody would like to share an AQL parser framework implemented in ANTLR I think it would be a nicer community contribution from a maintenance perspective though :-) An ANTLR AQL grammar is available from [http://www.openehr.org/wiki/display/spec/AQL-+Archetype+Query+Language](http://www.openehr.org/wiki/display/spec/AQL-+Archetype+Query+Language)

> I lost the PDF, but just read it again from your link.

Googling or sending an email message may help you the next time you lose a paper that you want to discuss/debate details of.

Best regards, Erik Sundvall [erik.sundvall@liu.se](mailto:erik.sundvall@liu.se) [http://www.imt.liu.se/~erisu/](http://www.imt.liu.se/~erisu/)

---

## Post #59 by @system

That is why I write "suggesting".
So, the data model of the MySQL database isn't even described, but it was compared with an OpenEHR database in XML? What was the point of that? You did state that the MySQL database had an index on EHR-id. But if you do not explain the layout of the database, the term "EHR" in a database is meaningless. So this made the suggestion that the MySQL database served an OpenEHR purpose stronger. I have been confronted quite a few times with people reading this document, people with an academic background. All of them missed things you call obvious.

That is true, for that query. But why didn't the differences show up in the conclusion, if it was that interesting? Almost the complete text is about the XML-databases in general, and only a few paragraphs are about the different performance between the three tested XML databases. I quote from the conclusion of the document: ------------------ ------------------

The text is sometimes colored with a negative bias against XML-databases, for example these sentences, I quote:

The use of the word "even" twice to make a negative statement stronger in 25 words, and mentioning that investigations were omitted because of slow response time, makes a very negative impression. I would say that the paper makes it hard for most decision-makers in industry to choose XML-databases. They need someone like me to debunk the suggestions made with facts and common knowledge. ------------------

You also state that XML databases are not suitable for clinical use, but maybe useful for students (saying, nice to play with for students), I quote: ------------------

Again, all this is based on a few (OK, not one) XML-queries on non-optimized databases, and the comparison is against an optimized production database which isn't even an OpenEHR-database. Reading the document carefully even makes it worse from my point of view.

I am not sure what you are trying to say here.
Should one do ad hoc OLAP queries, or should one try to optimize the database before doing that? There will not be many physicians writing XQueries; research on database contents should always be done with knowledge of how to do it optimally.

Did you ever see database-analyses? How many visits does an engine need to go to reach a specific end goal? Consider following:

Running through a BTree index through 32 steps makes it possible to do the same as a table-scan of 4 billion records. That is about 140 million times faster!!!!

Running through a BTree index through 16 steps makes it possible to do the same as a table-scan of 60,000 records. That is about 4,000 times faster!!!!

If the searched value is marked as unique, these figures are multiplied.

I have written an SQL-engine, ten years ago, for a specific project, so I am very well aware of the difference between:

- finding a dataset over an index (directly jumping to it and looking to the next index-entry, see if it fits and then stop)
- finding a dataset over an unique index (directly jumping to it and then stop)
- finding a dataset over a tablescan (walking over all the records to the very end of the table).

In XML-databases this works about the same way: they walk against the structure index and check every XML leaf node which has the path in it as defined in the query. The paths you were looking for were in all XML-documents, so effectively, you forced the engine to do complete table-scans. Depending on how large the table is, an index on specific items and interconnecting documents, and designating some path-values as unique, can speed up a search many times. It is not very scientific to jump to conclusions about databases, and color your text with negative qualifications, when doing tablescans.

Yes of course, if you investigate architectures, you should investigate them being used in an optimal way. What else will your research be good for? Showing how a database performs badly when used in a bad way?
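
As an aside on the index arithmetic earlier in this post (32 steps versus a scan of about 4 billion records, 16 steps versus about 60,000), the figures can be checked with a short sketch. It assumes the simplest possible model, in which each index step halves the remaining candidates, and it compares step counts only, not wall-clock time; a real B-tree has a much wider fan-out and needs fewer, more expensive steps. The function names `halving_steps` and `scan_to_index_ratio` are made up for this illustration.

```python
def halving_steps(record_count: int) -> int:
    """Number of times record_count candidates must be halved to leave one."""
    if record_count < 1:
        raise ValueError("record_count must be at least 1")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (record_count - 1).bit_length()


def scan_to_index_ratio(record_count: int) -> int:
    """Rows visited by a worst-case full scan per halving step of a lookup."""
    steps = max(1, halving_steps(record_count))
    return record_count // steps


for exponent in (16, 32):
    records = 2 ** exponent
    print(
        f"{records:,} records: {halving_steps(records)} steps, "
        f"scan/index ratio about {scan_to_index_ratio(records):,}"
    )
```

Under this model 2^16 = 65,536 records take 16 steps (ratio 4,096) and 2^32, roughly 4.29 billion records, take 32 steps (ratio about 134 million), which is in line with the rounded numbers quoted in the post.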
Because that is what you show us. The negative qualifications in the paper do not say much about a real-life use of that database.

I see that you are not very experienced in database engineering. Indexes can be arranged at the moment they are needed; they can be enabled and disabled, removed, installed, index definitions saved in text files, many possibilities. A practical database engineer does not reinvent the same wheel every week. One should not keep indexes updated which are not in use at a specific moment; like I said, it is a waste of performance and time.

Most of the large health institutions, like hospitals, have professional database engineers in service. And even if they don't, there are really good books on the market about how to arrange OLAP research on OLTP databases. Reading one in a week's time, or sending someone to education, makes a big change. The knowledge is very generic and it is easy to find out how to do it when changing database vendor or employer.

All professional XML databases have built-in indexes. But, as I explained before, they do not know the connection between XML-documents, and they do not consider a value as unique unless you tell them to. See above for further explanation.

Investigating database performance is work, that is correct. So is doing research on database contents. There ain't no such thing as a free lunch.

Yes, the text got me there by the ....., now I know you were evaluating a non-OpenEHR MySQL database, with no description at all, with unknown queries, against three XML databases with no optimization. I don't need to study the MySQL configuration; as you say, it is unimportant, and I have seen many EHR database configurations in my life. I don't need to study one more.

I am not sure if a grammar is the way to create an AQL to XQuery translator. But maybe I am wrong in this. I keep the option open.

Thanks for the links and the discussion. If it is the same to you, I would like to end this discussion at this point.
I have made my point, and I guess you have too. Feel free to disagree, but I am not sure I will reply. I have to say that I am a busy man, and this kind of discussion is too time consuming.

Have a nice day

Best regards Bert Verhees

---

**Canonical:** https://discourse.openehr.org/t/openehr-prototype/15233 **Original content:** https://discourse.openehr.org/t/openehr-prototype/15233