openEHR - Persistence of Data

Hello guys,

i’m starting a research about the persistence model of Archetype data, that stores the information entered by the user of the system.

I would like to know if there is a indication of the openEHR standard for what kind of model schema should be used in DataBase, and if there are researchs in this area.

Thanks in advance,

Márcio Costa
B.Sc. in Computer Science @ Cin/UFPE
M.Sc. Candidate in Computer Science @ CIn/UFPE
MSN: mdckoury@gmail.com

I have good experience, fast and lean with a path/value-database, only a few tables is needed, no relations between, not many indexes, and can be compatible with new big-data databases, as you can find in the emerging new technologies. Check the book: big data glossary from Pete Warden, O’Reilly.

But you have to figure it out yourself how to do it exactly.

Good luck,
Bert Verhees

Hi Márcio,

There is no standard persistence model, the persistence mechanism is not in the standard scope.

There are many ways of storing openEHR RM instances (archetyped data), the only thing to take into account is that the information to store will be highly hierarchical.

Said that, in EHRGen [1] we use a relational model with an Object-Relational Mapping [2] tool (GORM from Grails Framework[3]). The advantage of that is that you have a complete and validated RM instance persisted on the DB, and you can query for complete objects or single data ELEMENTS. I’ve written ORM tools myself [4] and the main problem is the amount of joins you need to load a complete structure, but in my experience you never load a complete structure for a real time interaction with the user, and you alway can caché some data.

This approach is straight forward, because all you need are the classes of the RM, and you delegate DB stuff to the ORM tool.

Other models are viable too, like K/V [5] or EAV [6] approaches (mentioned by Bert). This approaches are fast for saving and loading data, the problem is that you need to have some complex logic above that for constructing a complete RM instance on memory, because K/V is a flat representation of a higly hierarchical tree structure.

Other models I didn’t try yet are Object Oriented DBs and Document Oriented DBs (XML, JSON, …) [6]. I think DODBs are a good option, fast for store highly hierarchical structures, but you need to write some ugly queries if you want your data back :smiley:

Hope that helps.

[1] http://code.google.com/p/open-ehr-gen-framework/
[2] http://grails.org/
[3] http://en.wikipedia.org/wiki/Object-relational_mapping
[4] http://code.google.com/p/yupp/
[5] http://en.wikipedia.org/wiki/NoSQL
[6] http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

Other models I didn’t try yet are Object Oriented DBs and Document Oriented DBs (XML, JSON, …) [6]. I think DODBs are a good option, fast for store highly hierarchical structures, but you need to write some ugly queries if you want your data back :smiley:

Aren’t several major OpenEHR systems using XML or stored blobs on top of a rather minimal DB schema?

The problem with ORM systems would ultimately be a rather bloated schema and hard-wired classes to accommodate that schema. Yes, ORM lets you automate the creation of DB data structures and classes, but, once created, they become part of the schema and your DLLs, which is fine until you have many hundreds of them (tables and classes defined in autogenerated code, etc.).

Randy Neall
Veriquant, L.L.C.

Other models I didn’t try yet are Object Oriented DBs and Document Oriented DBs (XML, JSON, …) [6]. I think DODBs are a good option, fast for store highly hierarchical structures, but you need to write some ugly queries if you want your data back :smiley:

Aren’t several major OpenEHR systems using XML or stored blobs on top of a rather minimal DB schema?

The problem with ORM systems would ultimately be a rather bloated schema and hard-wired classes to accommodate that schema. Yes, ORM lets you automate the creation of DB data structures and classes, but, once created, they become part of the schema and your DLLs, which is fine until you have many hundreds of them (tables and classes defined in autogenerated code, etc.).

That is my experience too, that is why I prefer a simple schema, for example a key(path)/value schema. I heard about good experiences with storing XML-pathes and values.

The bad thing is, as Pablo said, you have to write software-logic to translate the paths/values to RM-instances and back, but the good thing is, you only need to do it once, because the kernel/RM specs are very stable.

Also you need to be prepared for AQL queries and templates which use paths to retrieve information.

regards
Bert

Hi!

Other models I didn't try yet are Object Oriented DBs and
Document Oriented DBs (XML, JSON, ...) [6]. I think DODBs
are a good option, fast for store highly hierarchical structures,
but you need to write some ugly queries if you want your data back :smiley:

Not necessarily that ugly... we curently auto-convert AQL to XQuery
and execute towards an XML database. Those queries are very readable.

Then the question is what kind of client system you are aiming at. For
some use cases you don't really need to map things back to
openEHR-RM-objects, in web browser based GUIs for example you can keep
treating the data as documents, document fragments, fragment lists
etc. and use DOM manipulations, jQuery or similar approaches for most
data manipulation needs.

Good luck with your work Márcio and please keep us informed!

Best regards,
Erik Sundvall
erik.sundvall@liu.se http://www.imt.liu.se/~erisu/ Tel: +46-13-286733

Correct - you would not use XML blobs in a system of any serious size
and transaction requirements, binary is needed. But blobbing with very
smart adaptive indexing works very nicely on relational DBs, where the
RDB becomes a smart blob manager.

- thomas

Hi Erik, the information in this reply you give is very limited.
So excuse me if I miss the point.

How do you validate incoming data against the archetypes?
That is the reason I create an RM-object, for validation-purpose. I
don't need it for other purposes.

And another question
It looks to me that GUI's have a complicated constellation when they
need to work with hierarchical and complex-structured XML.
Maybe you disagree.

I am interested in your solutions.

Bert

Hi Randolph,

The problem with ORM systems would ultimately be a rather bloated schema and hard-wired classes to accommodate that schema.

Yes and no, it depends on your configuration. One configuration could end up using only on talbe with many columns, other configurations could break up this table to normalize the schema.

I don’t understand the “hard-wired classes to accommodate that schema”, the openEHR RM is an Object Oriented model, a programmer should implement the model on the ORM tool and the schema should be generated from those classes, in fact is the schema what accommodates to the classes. So, for new versions of the RM, a new schema could be generated, an mappings between those schemas could be generated to (for data migration, if needed).

Yes, ORM lets you automate the creation of DB data structures and classes, but, once created, they become part of the schema and your DLLs, which is fine until you have many hundreds of them (tables and classes defined in autogenerated code, etc.).

Yes, you’ll end up with a schema, a fixed one, that depends on the RM version you use, if the RM version change, the schema should change too, but the RM version you’ll use will be very stable, nd I’m sure that only one version of the RM will be in use at a time.

ORM generates the schema for the classes (source code), not the classes for the schema, so we don’t have “autogenerated code”. This is my experience and the way I think this should be done, because the openEHR Reference Model is Object Oriented, so a programmer could easily program those classes and user an ORM tool to generate the database structure.

DLLs?

Kind regards,
Pablo.

Randy Neall
Veriquant, L.L.C.

Hi Erik, you are right, the uglyness depends on 1. the queries you want to execute and 2. the programmer background.

For 1. the “common” queries like get all records for this patient in this time window, are not that ugly, but more complex queries could be.
For 2. for a XML guy, writing xPath based queries is ok, but for a SQL is a pain in the a55.

:smiley:

I’m hoping to see that paper on AQL->xQuery soon!

I totally agree that inside the system maybe you don’t need a complete RM structure to handle data instances, but for the service layer (sharing information with other systems) this is a must.

i would like to thank everyone for the information and attention.

i’m trying to do a review about this subject to start my research, but i will do something to analyse the best way to model and persist this kind of data.

Best Regards,

Márcio Costa
B.Sc. in Computer Science @ Cin/UFPE
M.Sc. Candidate in Computer Science @ CIn/UFPE
MSN: mdckoury@gmail.com

2012/2/17 pablo pazos <pazospablo@hotmail.com>

Do Anyone knows about some papers of persistent storing?

att,

Márcio Costa
B.Sc. in Computer Science @ Cin/UFPE
M.Sc. Candidate in Computer Science @ CIn/UFPE
MSN: mdckoury@gmail.com

Op 17-02-2012 20:49, pablo pazos schreef:

the openEHR RM is an Object Oriented model, a programmer should
implement the model on the ORM tool and the schema should be generated
from those classes, in fact is the schema what accommodates to the
classes.

Starting with OpenEHR often means, For every class a table, maybe write
your own access-layer, specially optimized, maybe better then hibernate,
because customized.
But still, to store a Locatable, with item-tree's, clusters, datavalues,
it used about 40 insert-statements, (automatically generated), or more.
To retrieve it, about the same number of Select-statements.

Then start optimizing, and create f.e. a wide table Datavalue, which
could contain a complete datavalue, so you do not any more needed to
split up the datavalues.
It reduces the number of SQL-statements/indexes to, maybe 25 instead of 40.
Also maybe try to create a very wide table Locatable, hoping that you
can do some more saving, for example, the name-attribute which is a
DvText, the uid which is (mostly) a hierobjectid, you can flatten that.

It helps a bit. You reduced again the number of tables/SQL/indexes, say,
to 17 for a decent Locatable.

I don't think that is a good way to solve this problem.

There are other solutions.

XML is one, but I am thinking about another way.

Does someone know how an Object Database works?
You offer an object, it never has seen, and it stores it.
And can be queried on attribute-values.

I don't know how it works, but it could be like this:
I think it could work on attribute-paths.
If it does, it looks like OpenEHR, which works on archetype-paths which
are representations of attribute paths.

Every leaf-value has a path pointing to it.
Every locatable has a UID.
These both make every single leaf value unique.

Imagine:
When splitting the data over tables, your query engine needs to use more
then one SQL statement, or join several statements, needs to open
indexes, needs to read key-fields and jump to them in indexes, and so on
and on.
Every table/class involves costs. The more complicated an
Locatable-object is, the more expensive it is to store.

It is also possible to flatten the whole business to one table. Only one
simple query retrieves a complete locatable. Only one index used.
Also, implementing AQL is not very hard, the necessary information is
indexed available.

That must be very very fast. I don't think, faster is possible. Only 1
SQL statement to retrieve all values of a Locatable.

And there is no problem with the large number of records that table has,
and there is no scaling problem.
Every record is, in fact, one leaf value.
The rumor says Postgres is fastest (I heard, Oracle has a problem
admitting this) until 100 million records, after that one should start
thinking about NOSQL-solutions.
100 million records mean about 25.000 average patients. 4000 datavalues
for one average patient.
The older versions of Locatables, they of course go to a separate table.
Some NOSQL-databases have a versioning system themself.

But there are some disadvantages. you must create RM-instances from
path-notations, you must keep pointers to parents, the software logic is
more complicated then dumping it to hibernate or XML. But this to solve,
not really rocket science, and because the kernel is stable the code
last a long time. And even when the RM-specs change, without migrating,
It can support more versions of the RM-model in one table, simultaneous.

Think about it.

I did think about that, the kernel-speed improved with a factor 50.

I saw, some month ago, a person, technician who wrote an SQL engine for
a HL7 RIM database. The engine created SQL statements to retrieve
values. I cannot remember the exact numbers, but I thought, 200 tables
were involved, and the SQL statement (automatically generated) was about
100 lines at least.
It is easy for programmers, it develops quick and without errors, but is
it good?

I think speed and simplicity is necessary for success.
Speed because of the simple table design and short SQL statements,
simplicity because of the archetypes, not software, defining the job.

regards
Bert Verhees

Op 18-02-2012 2:13, Bert Verhees schreef:

It is also possible to flatten the whole business to one table. Only one
simple query retrieves a complete locatable.

Flatten is not the right term. It suggest a very wide table.

I meant a very thin table, only three main fields. The UID, the path and
the value.
And some more fields for meta-information, that is all it needs.

The concept is known as key/value-pairs

Bert Verhees

I remember a Honours or Master’s thesis on openEHR persistence…I think Heath was involved. Heath is that publicly available?

Cheers,

-koray

Hi Koray,
Yes there was a honours thesis done on using an object database to store and query openEHR data. It was intended to compare our indexed XML blob approach but from memory it ended up comparing two commercial object databases.
I will have to ask Chunlan if the paper is publicly available.

Heath.

Heath,

it would be so important to me if i get this publicly.

Thanks in advance,

Márcio Costa
B.Sc. in Computer Science @ Cin/UFPE
M.Sc. Candidate in Computer Science @ CIn/UFPE
MSN: mdckoury@gmail.com

2012/2/20 Heath Frankel <heath.frankel@oceaninformatics.com>

Hi Marcio,
I do not know the scope and focus of your research, so I can’t give your more to the point advice, but the path + value approach that has been mentioned in previous messages has been implemented in Opereffa. You can see how it works if you have time to dig into the source and database layout. http://opereffa.chime.ucl.ac.uk/introduction.jsf
I’ve done that almost two years ago, and I’ve got a feeling that I can improve the performance significantly with the next release, but it should still be helpful to see what this type of approach looks like in real life.

Kind regards
Seref

2012/2/21 Márcio Costa <mdckoury@gmail.com>

Hi Pablo,

I’m sorry for being so slow responding to your questions. I may not be understanding you fully, nor have I made myself totally clear to you. First, a DLL is a file system file known as a Dynamic Link Library, a unit of compiled machine-executable code, typically invoked from a computer code file with a .EXE extension or from another DLL file, with a .DLL file name extension. These naming conventions are used by Microsoft Windows programs, but may not be used by other platforms, which which I am unfamiliar.

The ORM approach, as you describe it (correct me if I’m wrong), involves the creation of specific classes, expressed as compileable source code, and which therefore end up baked into the executable code files (EXE, DLL, or whatever the equivalent is called on your chosen platform). I am not sure how automated this process actually is in your OpenEHR context. Are you, for instance, able to download an archetype from the OpenEHR web site, press one button in your ORM, and thereby generate a class in your source code, which is then compiled into machine code (in something like a DLL)? And then, after that, with another push of a button, does a schema magically materialize, matching your auto-generated classes? If so, that’s wonderful.

But I have a concern that has nothing to do with automation, and which could actually be aggravated by automation. However automated the class or schema generation is or isn’t, and no matter which process comes first (generating the classes or generating the schema), and no matter which process is dependent on the other, you still end up with both a schema and compiled code that will expand with each new class that you create. That’s what I mean by “hard-wired.” You can do a lot of hard-wired stuff very fast via ORM code or schema generation automation. Your DLLs (or whatever your equivalent is) will expand in size and number. Your schema will grow in size and complexity in direct proportion to the number of classes it is trying to persist. You don’t feel the pain, however, because the computer did it all (or a lot of it) for you. But you’re still left with an end product (consisting of schema and compiled code) that will bloat with each new thing it is designed to express, manage, present and store. That process can go on for a very long time, yes, but it can’t go on forever. And the human body, with all the things that can go wrong with that body, ultimately requires thousands, maybe tens of thousands, of classes to describe just what can go wrong with the nervous system, to say nothing of the rest.

It seems to me that the better solution would be to develop a metadata-based system capable of describing all that must be expressed, allowing both schema and program code to remain unchanged while presenting to the user information of which the compiled code and schema are both essentially ignorant. In other words, neither the program code nor your schema has any awareness of particular structures of medical information. All of that is instead in the metadata, not schema, in the metadata, not classes.

My mistake in all this may be that I am mentally associating “classes” with source code that ends up as compiled code, but maybe you mean something else by your OOP terminology, something more akin to metadata that never gets baked in to the machine code. But I don’t think I’ve misunderstood your concept of a schema.

I had all along thought that OpenEHR was intended to operate in precisely the way I have described, as a basis for expressing complex information with simple schema and source code, neither of which embody, in hard-wired form, particular medical information structures. Your description of ORM seems to me to mean something else. If I’m wrong, I certainly apologize.

Thanks,

Randy Neall

Hi Randolph,

I’ve commented between your lines.