Archetype relational mapping - a practical openEHR persistence solution

Interesting discussion :-).

Concerns for our current architecture:

  • Should conform completely to the openEHR IM

  • Accessible to external (front-end) developers through an API.

  • KISS; hide all openEHR complexity from GUI developers. So no path/value output, just simple GET and PUT.

  • Scalable; should be able to handle large queries and large datasets without any (huge) challenges.

  • Providing REST access (JSON documents)

  • Documented with Javadocs

Some options we compared:

  • eXist-db is not ACID (https://en.wikipedia.org/wiki/ACID), so in fact not usable at all. Its performance is also poor (we tested).

  • Oracle RDBMS is indeed ACID. You have to convert the archetype (ADL or XML) to XSD, which is then used to store the XML in a relational DB (for better performance).

  • As Bert already stated, MarkLogic is somewhat pricey ;-), but could provide better performance. We never tested it.

So we came back to good old Postgres and store the archetype object model in a relational DB in a straightforward way. None of the compared systems came even close in performance and scalability. In the end we could provide XML access alongside the current JSON, but no GUI developer has asked for it :-).

Other papers that show the same:

  • Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs. By Sergio Miranda Freire, Erik Sundvall, Daniel Karlsson and Patrick Lambrix

  • An openEHR Repository Based on a Native XML Database. By Linda Velte, Tiago Pedrosa, Carlos Costa and José Luís Oliveira.

Thanks for all input, Jan-Marc

One small correction, Jan-Marc: this is not possible. Not all constraints expressible in an archetype can be expressed in XSD; you run into this very quickly.

That is not needed either. The RM can be represented in XSD; that is enough, and it is used as meta-information by Oracle. From the archetypes, Schematron can be generated, and that can handle all constraints which can be defined in an archetype.

But when you use path/values, of course none of this is needed.

Bert

This is also what I would expect. Path-based storage does rely on very smart ways to match terms in a query to paths, of course. There are some tricks to use here. For example, the path to the systolic BP DV_QUANTITY node from the archetype is /data[id2|history|]/events[id7|any event|]/data[id4]/items[id5|Systolic|]/value. In the whole of CKM there are probably about 7,000 ‘interesting’ leaf paths (if you assume that you crunch DATA_VALUE subtypes into little blobs). That’s a tiny number. Assume that when they’ve modelled everything in medicine (outside of genomics and proteomics) we have 50,000 such ‘paths of interest’. That’s still a very small number.

These paths can be mapped in smart ways to a 64-bit number space so that finding out whether a specific query term is in some EHR is very quick. When you include a coded list of archetype ids in the mix, I think querying can be made extremely quick. The devil is in the details. Various large DBs used path-based approaches in the past; Informix was one.
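To make the numbers concrete, here is a minimal sketch of such a path dictionary in Postgres; all table and column names are illustrative, not anyone’s actual schema.

```sql
-- Every 'interesting' leaf path gets a compact integer id, comfortably
-- within a 64-bit number space.
CREATE TABLE path_dict (
    path_id   bigserial PRIMARY KEY,
    archetype text NOT NULL,  -- e.g. 'openEHR-EHR-OBSERVATION.blood_pressure.v1'
    path      text NOT NULL,  -- e.g. '/data[id2]/events[id7]/data[id4]/items[id5]/value'
    UNIQUE (archetype, path)
);

-- Leaf values reference the path id instead of repeating the path string,
-- so "does this EHR contain a systolic BP?" becomes an integer index probe.
CREATE TABLE node_value (
    id             bigserial PRIMARY KEY,
    ehr_id         uuid   NOT NULL,
    composition_id uuid   NOT NULL,
    path_id        bigint NOT NULL REFERENCES path_dict,
    value_text     text,     -- serialised DATA_VALUE 'blob'
    magnitude      numeric   -- decompressed DV_QUANTITY magnitude, if any
);
CREATE INDEX ON node_value (path_id, ehr_id);
```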

Hi Seref,

We tested with lots of queries, but not with large datasets (tens of thousands of records) within the system itself. Given the nature of the “normal” relational DB schema and queries, though, this should be no problem at all. It’s good old Postgres :-).

Jan-Marc

Hi Thomas,

This is also what I would expect. Path-based storage does rely on very smart ways to match terms in a query to paths, of course.

Did you test, or is this theoretical?

Jan-Marc

I am impressed already :wink:

It seems very well possible. The only thing is interfacing the beast to the human minds of developers, but that is a one-time job.

I'll keep this in mind.

Bert

Thanks,
My feelings for PostgreSQL are well known; it is a great piece of code. Final question: do you support AQL with your current design?

I have not performance-tested my own implementation (although I did actually build one over a decade ago). But the Informix path system was real and performant (they used a hierarchical number node code approach, e.g. 1.2.3, 1.2.3.1.2, etc.). There are others I have read about as well, but I’d have to spend a bit of time to track them down, not having looked at this for a few years.

The reason why it can be fast is that you don’t have many columns and essentially no joins, and you can do extremely efficient matching on number spaces to determine the ‘type’ of something (i.e. whether it is a systolic BP or whatever) before you decompress the value (a DV_QUANTITY). Then you do the value comparison (e.g. BP > 160 or whatever) as a second step.

Before doing any of this, you apply basic AQL logic that cuts out whole parts of the DB because their values are not inside (AQL CONTAINS; XML ‘//’) the required kind of Composition.
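As a rough illustration of this two-step evaluation, reusing the illustrative path_dict/node_value tables sketched earlier in the thread (again an assumption, not a production schema):

```sql
-- Step 1: match on the integer 'type', i.e. the path id of the systolic BP
-- value node; the archetype filter plays the role of AQL CONTAINS pruning.
-- Step 2: compare the decompressed magnitude.
SELECT DISTINCT nv.ehr_id
FROM   node_value nv
JOIN   path_dict  pd ON pd.path_id = nv.path_id
WHERE  pd.archetype = 'openEHR-EHR-OBSERVATION.blood_pressure.v1'
  AND  pd.path = '/data[id2]/events[id7]/data[id4]/items[id5]/value'
  AND  nv.magnitude > 160;
```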

If you have run an actual study of path-based performance and found it not to work well, it would be very interesting to know the details. It may be that with some adjustments you could significantly change the performance characteristics.

thomas

Dear Seref,

At the moment we do not support AQL, mainly because none of our current partners/developers are asking for it. We have discussed using AQL internally several (lots of!) times, and also with, for instance, Ian.

If this changes, i.e. if there is a need for AQL, we could easily support it. It’s just a matter of development. In that case our architecture would be to convert AQL to SQL and go from there.
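Purely as an illustration of the AQL-to-SQL idea; the tables and columns below are hypothetical, not our actual schema.

```sql
-- AQL source (systolic BP over 160):
--   SELECT o FROM EHR e CONTAINS COMPOSITION c
--     CONTAINS OBSERVATION o[openEHR-EHR-OBSERVATION.blood_pressure.v1]
--   WHERE o/data[id2]/events[id7]/data[id4]/items[id5|Systolic|]/value/magnitude > 160
--
-- One possible SQL target, assuming tables composition(id, ...) and
-- observation(composition_id, archetype_id, systolic_magnitude, ...):
SELECT o.*
FROM   observation o
JOIN   composition c ON c.id = o.composition_id
WHERE  o.archetype_id = 'openEHR-EHR-OBSERVATION.blood_pressure.v1'
  AND  o.systolic_magnitude > 160;
```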

Regards, Jan-Marc

Thanks for the answers Jan-Marc.

Hi Thomas,

Your solution sounds interesting (it looks like Postgres hstore). But we are (very) happy with our current stack, so we will not test it.

Jan-Marc

Interesting to research. I think it is the best way to do it. The things that need to be added are an interface to create/read input/output in a developer-convenient format which reflects the RM structure (JSON, for example, or XML, or others; only a matter of extra interfaces), and an optimized AQL parser for this purpose, which is a bit difficult, but not very difficult. Validation can happen with Schematron on XML constructed (in memory) from the path/values; a sketch of the read-back step is in the PS below. The whole idea is database-vendor independent, and the infrastructure software can be written in any language/environment. Maybe a nice project to build after I finish the AOM in a few weeks.

Thanks for sharing this,
Bert
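PS: a minimal sketch of the read-back step, assuming the illustrative path_dict/node_value tables sketched earlier in this thread (all names are assumptions, not a real schema): fetch the flat path/value pairs of one composition in path order, and let the interface layer fold them into nested JSON or XML.

```sql
-- All leaf path/value pairs of one composition, in path order; the
-- composition id is a placeholder value.
SELECT pd.path, nv.value_text
FROM   node_value nv
JOIN   path_dict  pd ON pd.path_id = nv.path_id
WHERE  nv.composition_id = '00000000-0000-0000-0000-000000000042'
ORDER  BY pd.path;
```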

Thanks! It would be nice to hear about the progress of this effort.

Not sure if you are asking me, Birger, but I will announce the GitHub link here and will build a working prototype in open source. Maybe in parallel I will do something else; I am not sure about that.

Bert

Hi Bert,

great to hear that you are working on this! I had in mind that someone from Linköping University was working on a prototype; Erik Sundvall mentioned it once or twice, if I remember correctly. My working group will surely try to find a good solution, too. However, it would not make sense to invent the exact same wheel twice.

Best,

Birger

Down the road it would be nice to write a paper about a persistence model for openEHR, or at least an ITS for specific DBMSs. My focus would be an ITS for relational DBMSs in general; a second step would be for specific brands, considering specific features of Postgres, MySQL, Oracle, SQL Server, etc.

Maybe a good paper topic for the next MEDINFO?

Hi Birger and all!

Our colleagues at the department of computer science here at Linköping University started exploring storage of archetyped data using Hadoop and an index variant related to Dewey encoding. Dewey-style encoding is reminiscent of what Thomas mentioned regarding Informix earlier in this discussion thread. (See, for example, the beginning of http://www.inf.unibz.it/dis/teaching/ATA/ata5-handout-2x2.pdf for a bit of Dewey-style encoding.)
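For readers who do not want to dig into the handout: a tiny sketch of the Dewey idea in SQL terms (table, labels and names are illustrative only).

```sql
-- Each node's label encodes its position in the tree; all descendants of a
-- node share its label as a prefix, so subtree (CONTAINS-style) queries
-- become simple prefix scans.
CREATE TABLE dewey_node (
    label text PRIMARY KEY,  -- e.g. '1', '1.2', '1.2.3', '1.2.3.1'
    name  text,
    value text
);

-- All descendants of node '1.2'; in Postgres, a text_pattern_ops index on
-- label makes this prefix scan indexable.
SELECT * FROM dewey_node WHERE label LIKE '1.2.%';
```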

The researcher (Fang Wei-Kleiner) moved to the USA before anything was published, but since there seems to be renewed interest in the topic, we’re playing with the idea of submitting something about the initial experiments (for MIE now, if we have time). Would that be of interest to read, do you think?

Also, there is an openEHR-storage-related paper by Sergio Freire et al. that has been accepted and will soon be published: “Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data”. I think Sergio will announce it here when final publication is finished. A hint from that paper: Couchbase is an interesting approach to explore further if you are looking for openEHR performance…

Another thing: regarding XML databases, do read our old paper (http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf) and include BaseX in your tests if you are experimenting with open-source native XML DBs for openEHR. It performed well for “clinical” patient-specific querying, even though none of the XML databases we tested were suitable for ad hoc epidemiological population queries (without query-specific indexing).

A very interesting paper. I have some first opinions on it, but first I need to explain what I think about the matter. I have not prepared the story below, so there may be things I write too fast. See it as a provisional view, not as a hard opinion.

There are relational database configurations for OLAP and for OLTP. The combination is hard to find, and there are reasons: this is a classic problem. You need specific indexes for data mining (OLAP), and for every extra data-mining query you need more indexes, especially if you don’t have time to wait a night for the result. Those extra indexes stand in the way of transactional processing (OLTP) because they need to be updated, and that is an unnecessary burden for the OLTP processes; the longer the database exists, the heavier the burden becomes. That is why OLAP and OLTP are not often combined in one configuration, and why many professional databases have extra features for OLAP. Years ago I worked with Oracle for a big Dutch television company, and my main job was to create indexes for marketing purposes. We ran those unusual queries during the night and stored the results in special tables; Oracle calls them “materialized views”. The day after, those views were processed in analysis software like SPSS, and after that thrown away. It was a database with 900,000 persons in it, and every person had a lot of history: web history, personal interests, etc. “How much interest does a person have in opera, and is he able to pay for opera? Is it worth calling him with a ticket offer? We cannot call 900,000 persons.” These were complex queries based on things the customer bought, television programs he was interested in, web activities. That was the kind of thing they did with the database. So it can well be compared with a medical database in terms of size and complexity.

The same difficulties hold for XML databases; that is why XML databases also have features for creating extra indexes. Oracle, by the way, if it knows the structure of the XML (via XSD), shreds the XML under the hood into relational data and stores it in a relational database. It also converts XQuery to SQL. In this way it has the weaknesses and advantages of a relational database, and it needs the extra indexes for unusual queries, but from the developer’s view it is XML.

Comparing XML and relational for openEHR, I favor XML, because it can easily reflect the structures which need to be stored. It makes the data-processing layer less complex. There is a lot of tooling around XML: XML Schema to make the database structure known to Oracle, Schematron to validate against archetypes. This is very mature software, so the complexity was solved years ago and is well tested. It is hidden complexity, and mature hidden complexity is no problem.

And if you want to do data mining, like epidemiological research, and you have the time to plan the research, then the classical database, XML or RDB, is OK. In my opinion there is rarely a real need for ad hoc data-mining (epidemiological research) queries with results in a few minutes. They are always planned, and creating the indexes and storing the results in “materialized views” are part of the work one has to do for data-mining research. So I don’t think there is a real need for this.

Regarding XML databases, Oracle has a solution which can perform well if it is professionally maintained. This is often a sticking point, because professional Oracle maintenance for advanced use is very expensive. Another company is MarkLogic. It is said that MarkLogic is better, but I don’t know that from my own experience. Both are free to use for developers. You must think of figures up to 35,000 euro a year for licenses, which is not very much for a big hospital, but very much for a small health service. The open-source XML database eXist-db is not very good for data mining; that is my personal experience. So we must ask ourselves: are we solving a problem that no one experiences?

There are a few advantages to openEHR. Data are immutable: never changed, never deleted. This makes a few difficult steps unnecessary. The Dewey concepts look very attractive, although they were also created with deleting and changing data in mind. That is very important for normal company use, but, as said, we don’t have that in openEHR. Medical data always grow; it is always new data. An event that has passed will never change. The only things that change (in openEHR they are versioned, but from the user perspective they change) are demographic data. One can live with that: create extra provisions for the demographic database section, which is only a small part of the complete database. Often the demographic database is external anyway.

So, my thoughts: maybe Dewey is too good. Path/value (to leaf-node) storage is enough. The paths, combined with IDs, are keys, and are much like XPath, so it is easy to store XML in a path/value database. And querying is also easy, because all queries are path based. I think for openEHR this is the fastest solution. But maybe I overlook something; maybe I say something stupid. I am not offended if you say so (maybe in other words ;-). We all want and need to learn.

What we still need to do to build this solution is handle the XQuery grammar and let it run on a path-based database. This is not very easy, but also not very hard; maybe the algorithms can already be found. Like the Dewey algorithm, this can run on any database, including free open-source databases. I think you would get excellent performance on Postgres. A path/value database is easy to index; it only needs a few indexes, and inserting will always stay very fast.

Let’s do some calculations, for fun. How many path/value combinations do you have to store for an average composition? Maybe 30. How many compositions does an average patient have? 10,000? So every patient needs 300,000 path/values, and you can store 10,000 patients in 3 billion records. This is not much for Postgres with the few simple indexes needed. When you need to store 900,000 patients, you need 90 separate tables (see the sketch in the PS below). Very cheap, very fast, also for ad hoc queries, and easy to accomplish, I think.

I am very interested in opinions on this.

Thanks,
Bert
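PS: as a sketch of the 90-table layout in current Postgres terms, hash partitioning is one way to get the separate tables; everything below is illustrative, not a tested design.

```sql
-- Roughly 300,000 path/values per patient, ~10,000 patients per partition,
-- so ~3 billion rows per partition and 90 partitions for 900,000 patients.
CREATE TABLE path_value (
    patient_id     bigint NOT NULL,
    composition_id uuid   NOT NULL,
    path           text   NOT NULL,  -- XPath-like key to a leaf node
    value          text
) PARTITION BY HASH (patient_id);

-- One of the 90 partitions; the other 89 follow the same pattern.
CREATE TABLE path_value_p00 PARTITION OF path_value
    FOR VALUES WITH (MODULUS 90, REMAINDER 0);

-- Only a few simple indexes are needed; this one cascades to the partitions.
CREATE INDEX ON path_value (patient_id, path);
```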

No comments; on the other hand, it is Saturday.

I had left out some necessary technical details.
I will possibly build it, and then have possibly the fastest two-level-modeling engine in the world, which will, of course, also support openEHR.
So it is not really bad that this happened.

When it is ready, I will make an announcement.

Best regards and enjoy the weekend.
Bert Verhees

Hi Bert,

I will post some more thoughts on these things after the weekend :slight_smile: To give a quick answer: IMO it’s important to have a flexible data format like the one i2b2 uses (roughly speaking, EAV) to mix openEHR data with non-openEHR data. Doing analysis on XML documents/databases might prevent integration of other sources (and even where it is possible, as in SQL Server or Oracle, queries that mix XQuery and SQL can become pretty ugly, and I see no good way to provide a drag & drop query interface to researchers/physicians).

Cheers,

Birger