Archetype relational mapping - a practical openEHR persistence solution

Interesting paper from China

http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-015-0212-0

Ian

I talked about this approach with a colleague from China during MEDINFO. The problem is that your schema grows with your archetypes. Also, storing data from many templates that don't use all the fields in the archetype will generate sparse tables (lots of null columns). I told him it was easier to do an ORM from the IM, because that schema doesn't change and it allows you to store data from any archetype/template. But they already have a system working this way.
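To make the sparse-table concern concrete, here is a minimal, hypothetical sketch (table and column names are invented, not taken from the paper) of a per-archetype JPA entity in which most columns stay NULL for templates that use only a subset of the archetype:

```java
import javax.persistence.*;

// Hypothetical per-archetype table of the kind an archetype-to-schema mapping
// might generate for a blood pressure archetype. A template that records only
// systolic/diastolic leaves every other column NULL, which is the sparseness
// described above.
@Entity
@Table(name = "obs_blood_pressure")
public class ObsBloodPressure {

    @Id
    @GeneratedValue
    private Long id;

    private String compositionUid;

    private Double systolic;              // mm[Hg]
    private Double diastolic;             // mm[Hg]
    private Double meanArterialPressure;  // often NULL
    private Double pulsePressure;         // often NULL
    private String cuffSize;              // often NULL
    private String position;              // often NULL

    // getters and setters omitted
}
```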

Based on a quick look, my reaction is the same, unless they have some very interesting Archetype => Schema transformation.

Another problem is that you have to convert your object-oriented model (which the RM is) to a relational model, which becomes complex when converting templates/AQL to SQL. I have been down that road; I left it more than five years ago. It is doable, but with difficulty, if you want a full-featured openEHR kernel. I would never recommend going this way, unless someone has a really smart idea.

It can work for a lightweight, openEHR-derived application model with a reduced feature set.

Best regards
Bert

Alas, the hyperlink seems to be broken.

Actually, we use such mappings to promote some “important” elements to relational tables to get a sort of index on the data. Otherwise, I don't think we would be able to do efficient ad-hoc cross-patient queries directly on the database. Exporting data to i2b2 or SSAS would sometimes be inconvenient, and SQL still has some advantages.
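As an illustration only (the table and column names below are hypothetical, not Birger's actual schema), such a promoted "index" table might look like this as a JPA entity:

```java
import javax.persistence.*;
import java.time.LocalDateTime;

// Hypothetical projection table: selected "important" elements are copied here
// when a composition is committed, so ad-hoc cross-patient queries can use
// ordinary SQL indices while the full data stays in the generic openEHR store.
@Entity
@Table(name = "idx_lab_result",
       indexes = @Index(columnList = "patientId, analyteCode, sampledAt"))
public class LabResultIndex {

    @Id
    @GeneratedValue
    private Long id;

    private String patientId;       // EHR/subject identifier
    private String compositionUid;  // back-reference to the source composition
    private String analyteCode;     // e.g. a LOINC code
    private Double value;
    private String units;
    private LocalDateTime sampledAt;
}
```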

BTW, is there an AQL implementation that is optimised for such “epidemiological querying”? I think Erik Sundvall mentioned a Hadoop-based research project a while ago.

Best,

Birger

Hi Birger,

It might be this paper you are thinking of.

Freire S, Sundvall E, Karlsson D, Lambrix P. Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs. Scandinavian Conference on Health Informatics 2012; October 2-3; Linköping; Sweden. P 51-57. Linköping Electronic Conference Proceedings.

http://www.ep.liu.se/ecp/070/009/ecp1270009.pdf

http://www.ep.liu.se/ecp_article/index.en.aspx?issue=070;article=009

Regards

Mikael

ORM is not a problem with current tools. In fact frameworks like Hibernate and Grails make Object-Relational Mapping something enjoyable to work with. I think the problem with the described approach is the growth of the relational schema when your knowledge base grows.

But there are design challenges; ORM doesn't solve all the problems by itself. IMHO, the object model that should be mapped to relational, if relational is chosen as the DBMS, is not the raw openEHR IM. Simplifications over the IM are needed in order to prevent excessive JOINs and huge hierarchies. In fact, I teach this in one of my courses, and it was part of the tutorial we did at MEDINFO. For example, the OBJECT_REFs can be designed as simple relationships, because they play the role of an FK in the object model. There are many simplifications that can be done to reach an object model that is compatible with the openEHR model but more “relational friendly”.
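As an illustration only (not Pablo's actual course material; class and column names are invented), a "relational friendly" mapping in which an OBJECT_REF collapses into a plain foreign key might look roughly like this in JPA:

```java
import javax.persistence.*;

// Hypothetical simplification of the kind described above: instead of persisting
// a full OBJECT_REF object (id, namespace, type) as its own structure, the
// reference collapses into an ordinary foreign key to the target row.
@Entity
public class CompositionRow {

    @Id
    private String uid;

    // In the raw RM this would be an OBJECT_REF pointing at the EHR; mapped
    // relationally it is just a FK column on the composition table.
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "ehr_id")
    private EhrRow ehr;

    private String archetypeNodeId;  // e.g. "openEHR-EHR-COMPOSITION.encounter.v1"
}

@Entity
class EhrRow {
    @Id
    private String ehrId;
}
```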

Hi Ian and all,

We at openEHR Japan had an unconference session with Dr Lu, and he gave us a presentation about this research.

I will ask him whether the slides can be shared.

Shinji KOBAYASHI

This is correct. The usual way I do this with an object model is to create a set of P_XXX classes, where ‘P_’ means ‘persistent form’. The P_ classes are a transform of the main IM (whatever it is) that does things like

  • stringifying a lot of low-level fields
  • ignoring derived fields
  • occasionally using a blob transform where it makes sense.

Then one can start to consider tools like Hibernate on the P_ model.
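A minimal, hypothetical sketch of such a P_ class (the field choices are invented for illustration and are not the actual openEHR P_ model):

```java
// A hypothetical P_ ("persistent form") class along the lines described above:
// low-level RM attributes are stringified, derived attributes are dropped, and
// the structure is kept flat enough for a tool like Hibernate to map directly.
public class P_DvQuantity {

    public String magnitude;  // stringified numeric value, e.g. "120.0"
    public String units;      // e.g. "mm[Hg]"

    // Where fine-grained querying is not needed, a blob transform could instead
    // serialise the whole value into a single text/JSON column.
    public String asBlob;
}
```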

Both the openEHR BMM files and the JSON/XML/ODIN save formats use this approach.

- thomas

Dear all,

I just found there are lots of comments on our recent paper. Thanks for all these comments, I will read them and reply later.

@Shinji, Of course the slides could be shared.

Regards
Xudong

I have some opinions on this:

My experience years ago was also with ORM environments; I tried several. But then I looked at the generated queries...
This has nothing to do with scalability or good performance. If you want a system with several hundred simultaneous connections working on it, there must be better ways than executing hundred-line SQL statements.

But it is not a problem exclusive to openEHR. I went to an HL7 v3 meeting, and they sometimes work with a kind of relational database too.
A software product was proudly demonstrated; it generated SQL statements from object-oriented statements, and those statements ran to 250+ lines.
By the way, they did not know that until I asked them.

Good luck debugging when the output is not what you expected. And also, how can you ever trust output coming from such an SQL statement? We are talking about life and death.

In fact, some of the big database vendors work this way too, but they do it in a professional way, with hundreds of very well-trained and highly qualified developers. Then it is another story.

There are many simplifications that can be done to reach an object model that is compatible with the openEHR model but more "relational friendly".

Correct, that is a possibility. For example, each DV_TEXT has two code phrases, one for the language and one for the charset. That means at least two extra recursive lookups in an SQL statement (only to find out that they contained a null value ;-).
This slows down the queries. As long as the structure is straight and simple, there is no problem.

You can ignore those code phrases and say that the language will always be Chinese or Spanish, and the same for the character set. You can also optimise by concatenating the information into a single field (see the sketch below), but then you will never be able to query on those values except as free text.
Or you can say that we only work with these archetypes and optimise the database for them, but then it is no longer two-level modelling (or user-centred modelling, as I like to call it).
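As an illustration of the concatenation option above (the entity and column names are hypothetical), a flattened DV_TEXT mapping might look like this:

```java
import javax.persistence.*;

// Hypothetical flattened mapping of DV_TEXT: instead of joining to two separate
// CODE_PHRASE structures, language and charset are concatenated into one column.
// Cheap to read and write, but the individual codes can then only be searched
// as free text, which is the trade-off described above.
@Entity
public class FlatDvText {

    @Id
    @GeneratedValue
    private Long id;

    private String value;            // the text itself
    private String languageCharset;  // e.g. "ISO_639-1::zh|IANA_character-sets::UTF-8"
}
```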

That is part of what I meant by the light version of openEHR. Nothing wrong with that; it can serve a purpose. But it is not openEHR, it only looks like it.
If I were a hospital manager, I would rather take either a full-blown openEHR system or a classic monolithic hospital system, which has done a fine job for decades. It is not as if people die faster when a hospital uses Epic. I saw a classic monolithic system with 7,000 tables, and the number is growing every day. It has been running for over 20 years; it started with a few hundred tables, so it has grown by about 350 tables every year.
In my opinion it can still run for 20 more years, but digging through those 7,000 tables is impossible (by then 15,000): no one knows what is in them, many haven't been touched for ten years, no screens are connected to them anymore, no one knows what they do, no one dares to optimise. I have seen many systems like that; it is common practice in many companies. These are not data stores but data dumping sites.

Let it grow, let it grow, sings Eric Clapton :wink:

So I believe that an openEHR system is cheaper, more maintainable and better prepared for the future than a classic monolithic system, and that holds even when the openEHR system uses the most advanced technologies: it will still be cheaper and more maintainable.

I have experience with XML kernels, so let's focus on those. openEHR is very easy to process as XML, and AQL is easy to convert to XQuery; in fact, XML is more or less object-oriented. But XML is only one approach; there are other ways, and I am studying them now.

But regarding XML, I want to say the following:

Oracle 12c, for example, creates a relational schema from an XSD under the hood, stores the XML data in that schema, and translates XQuery (again under the hood) into SQL against that schema. (I believe the Chinese ARM approach is similar, but developed in-house.)

I did not experience the Oracle performance as lightning fast, but maybe it can be made faster by adding more machines to a cluster and using tricks you do not find in books but can learn from Oracle, provided you pay for the licence (which is not cheap) and hire a specialist (also not cheap). I am not a trained Oracle engineer, and I know you need to be one to get the best out of Oracle databases.
(By the way, Oracle is free for developers, but then you work without bug fixes and without the advanced knowledge you really need.)

Another expensive product, advertised as being faster than Oracle, is MarkLogic. I have never tried it, although it is also free for developers. I believe a licence is 32,000 euros a year, but I don't know whether that covers clusters, etc. I think, however, that this price is small beer for a big hospital.
I don't know how they handle XML under the hood, but they do it well, as I understand it. I should try it, but I am busy with other things.

My opinion is, pay what is needed, and go for the best.

Bert

Thanks, pazospablo.

Yes, you are right; it may generate sparse tables. We found this problem when using the approach in our implementation. We have since updated our method to use a “storage template” to specialise the archetypes. The “storage template” is generated automatically by combining multiple templates defined by the domain experts.

Regards
Xudong

We haven't converted the RM to a relational model, since that would cause lots of JOIN operations and poor performance. Instead, we use a “storage template”, which is a specialisation of an archetype for storage purposes and is generated automatically from the existing templates (if more templates are added, the “storage template” will be updated, but in practice this does not happen very often).

Best Regards
Xudong

Hi Pablo,
I agree with you re the indirect use of IM. I don’t necessarily see a big problem with this paper’s approach because every approach has its downsides. In fact, some really successful projects and products are built on designs that are shunned by the “collective wisdom” of academia and engineering communities, but they work, they do the job, and most important of all, the cost of dealing with design related issues is acceptable, given the expectations from the system.
This is the key: what is the expectation? What defines success in an openEHR implementation? It is very context specific, with human, institutional and other factors having their effect on technical factors. I'm not saying that you're suggesting anything to the contrary, just trying to make the point that this is an implementation that made it to a functioning state, and many factors in that process are unknown to us. The authors may not even be able to share those factors.
If my memory is correct, when I read this paper some time ago, it mentioned an earlier study, probably by the same group, but I could not access it. That’d be interesting to read as well.

So, if I understand correctly, it is an on-the-fly optimisation. That is to say, if something is not in an archetype, then no storage is needed for it and no query needs to handle it (because it is not there). So you store every archetype, with its own optimisation, in its own structure? You optimise the RM for a specific archetype. If I understand it well, this is a very original approach. Excuse me for reading your document too quickly; I will now read it more carefully. Thank you very much for sharing.
Bert

I believe your paper is about the performance of an openEHR-based system with a relational database; it does not argue whether this is the right approach. And specifically on performance, we could not agree more... :-).

In the past year we worked on three different versions, all completely openEHR compliant. We compared them, so I believe we are entitled to discuss. Let’s see:

  1. Our first version was Java-based with a Postgres DB, everything stored as path/values.
    Every query took about a second; we did not even try complex queries... :-). Also, the GUI side did not know what to do with the path/values.

  2. The second version was completely XML-based, with lots of Java and an eXist DB. This version can be found on GitHub.
    Results: a single query over 900 records took 100 ms. The scary part was that the time grew linearly, so 90,000 records would take about 10 seconds, even for the simplest (already indexed and optimised) query.

  3. Our current stack converts archetypes to an object model and persists this model to Postgres. Indeed this needs some fancy tricks (@Thomas), but it's doable... :-).
    Performance is comparable with the findings in the paper, even with huge queries, and is steady at about 1 to 3 ms per query. No optimisation has been done yet; it is not yet needed, but it could/will make things even faster.
    Cheers, Jan-Marc

Hi Jan-Marc,

There were some problems handling the path/values; most of them came down to giving semantic meaning to the paths. Storing a path and its corresponding value is very, very quick (a minimal sketch of such a table follows below this message). I asked database specialists, and they say this is the best way to go up to billions of records. It is also easy to migrate to another database, for clustering or other reasons.

But there are some problems to solve, which were harder to solve five years ago. One is the GUI builders: they are looking at a database approach that is difficult to understand, easy to make mistakes in, and hard to debug; they need JSON to write their datasets in. The other problem is querying. As long as the queries are predefined, you can do anything, but then you are no different from an old monolithic system, and writing new templates relies heavily on on-the-fly query building.

There has, however, been some technological progress, also in the open source domain. Path/value storage could come back to life with the help of ANTLR, which can be used to interpret AQL for this purpose. I even think this is promising. Let engineers read The Definitive ANTLR 4 Reference by Terence Parr, with path/values in the back of their minds. Both the GUI problem and the query problem can be solved; it should be worth the time spent and the price of the book :wink:

Best regards and good luck,
Bert Verhees
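For concreteness, the path/value storage discussed above is essentially a table of (composition, archetype path, value) rows; a minimal, hypothetical JPA sketch (names invented for illustration):

```java
import javax.persistence.*;

// Hypothetical path/value row: every leaf value of a committed composition is
// stored against its archetype path. Writes are simple and fast; the hard parts
// are giving the paths semantic meaning and translating AQL into queries over
// this structure.
@Entity
@Table(name = "node_value",
       indexes = @Index(columnList = "compositionUid, archetypePath"))
public class NodeValue {

    @Id
    @GeneratedValue
    private Long id;

    private String compositionUid;

    // e.g. "/content[openEHR-EHR-OBSERVATION.blood_pressure.v1]/data/events[at0006]/data/items[at0004]/value/magnitude"
    @Column(length = 1024)
    private String archetypePath;

    private String value;  // leaf value serialised as text
}
```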

Hi Jan-Marc,

To clarify: when you say huge, do you mean that the result set is huge, or that the amount of existing data is huge? Are you able to comment on how query performance changes or stays the same as the result set size grows?