Thanks,
I always forget it is the weekend; as an independent developer, there ain't no such thing as a day off.
I agree that mixing query languages is ugly, and I don't see the necessity.
But maybe you will explain that next week.
Bert
Even as a research associate or independent developer, it's sometimes good to take a break from work.
I hope I didn't miss a point in your e-mail and hope that I understood correctly: when you store openEHR data using XML technologies, you will have to use XML technologies to query that data. To my understanding, Oracle will also need XPath/XQuery when you store XML data in an XML-typed column. If you want to combine this data with other data that isn't based on XML, you will have to convert the external data to XML or mix query languages in the databases mentioned.
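For illustration only (a minimal sketch: the table name compositions, the column xml_content and the paths are invented here, not taken from any system in this thread), querying an XML-typed column in Oracle already means embedding XPath/XQuery inside the SQL, for example via XMLTABLE:

    -- Hypothetical names; the point is that the row pattern and the column paths are XQuery/XPath.
    SELECT bp.systolic, bp.recorded_at
    FROM   compositions c,
           XMLTABLE('/composition//events'
                    PASSING c.xml_content
                    COLUMNS systolic    NUMBER       PATH 'data/items[1]/value/magnitude',
                            recorded_at VARCHAR2(30) PATH 'time/value') bp;

Combining that with non-XML tables then means mixing the two query styles in one statement, which is exactly the ugliness Bert mentions.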
In short: it makes sense not to use the openEHR CDR for such queries but to export the data to something more suited to that purpose. This does not need to be a classical data warehouse (that has only limited use for the questions clinicians have), but XML would certainly not be my first choice for the analytics layer.
Birger
I don't understand the point you make. The leaf nodes in an archetype are always constraints on primitives, and XML can represent all types of primitives. Everything that is not a leaf node is a structure; the structure is object oriented and easy to convert to XML, where the tag names are the attribute names and the type names are the type names from the RM XSD. The structure represented in XML is also easy to bring back to an object-oriented structure. It is just a notation, nothing else. There are even standard Java classes (JAXB, I believe) which support this kind of thing. AQL is easy to convert to XQuery, though maybe I would prefer XQuery, and for a non-technical person it is possible to use a drag-and-drop interface which represents the archetype paths. Either I am missing your point or, better said, we are not on the same track of understanding.
Best regards
Bert
Hi Bert,
I'm not disputing that you can represent most data in XML. I'm just concerned that mangling high-volume or specialized data, for example sensor data, genome data and geo-spatial data, into a document format might not work too well. Also, when the ER diagram of non-openEHR data is fairly complex, producing a meaningful XSD and XML documents might not be that quick and easy (at least I don't know of an industry-strength tool that can help with this task; however, I may be wrong about this and I'd be happy to learn).
Regarding performance, we did some tests on SQL Server 2012 last year. As I only have experience with this particular database, it might well be that my critique does not apply to Oracle or MarkLogic!
Just a minute ago I compared a simple SQL query with an XQuery on our data repository. I simply wanted to get all validated blood pressure values and their corresponding datetimes from a pediatric ICU. Using the plain relational representation of the data (we automatically map data from compositions to tables), it takes under 1 second to get all 329,273 rows. With a full index on the blood pressure fragment of the composition (this is needed to get the internal tabular representation of the data) and a secondary index on the paths, querying the same rows still takes 30 seconds (without them, it would be 2 minutes; no surprise). Additionally, the size of the data increases from 10 MB to 270 MB.
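For readers who want to see the shape of the two queries being compared, a rough sketch (table and column names, paths and the validation flag are assumptions made for the example, not the actual schema of the repository described above); the first runs against the automatically generated relational tables, the second against the xml column using SQL Server's XQuery methods:

    -- Plain relational representation (hypothetical table/columns):
    SELECT bp.systolic, bp.diastolic, bp.recorded_at
    FROM   blood_pressure bp
    WHERE  bp.ward = 'PICU' AND bp.validated = 1;

    -- The same values pulled out of the xml column:
    SELECT n.value('(data/items[@name="systolic"]/value/magnitude)[1]',  'float')    AS systolic,
           n.value('(data/items[@name="diastolic"]/value/magnitude)[1]', 'float')    AS diastolic,
           n.value('(time/value)[1]',                                    'datetime') AS recorded_at
    FROM   compositions c
    CROSS APPLY c.content.nodes('//observation/data/events') AS t(n)
    WHERE  c.ward = 'PICU';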
This is the reality we face in our system; therefore, I consider XQuery and XML not to be an option for us for doing analysis in this database layer. As said, this might not apply to better XML implementations by other vendors, but I'd love to see some real-world numbers.
Just some thoughts and experiences. I'm not a dedicated database expert, so I would not be sad to be proven wrong.
Cheers,
Birger
I agree, long series of data are not well represented in XML; it has too much overhead (although there are other solutions for that which are easy to integrate with XML, but that aside). So treat XML as an intermediate representation, good for software to handle: it can represent objects very well, so it fits the object-oriented paradigm, and OpenEHR also works along this paradigm. XML has good support for validation, it is widely understood, and almost every development environment has standard support for it.

There are two kinds of related, mature industry support I am looking for: a good, well-defined query language and, as an extension of this, a validation environment. XQuery and Schematron are excellent technologies which fit the two-level modelling (OpenEHR) paradigm very well, because they are path-based. JSON is also very good, and it is leaner; especially if sender and receiver have deep knowledge about the data (which is the case in OpenEHR), JSON is better. But industry support for JSON is, as far as I know, not as good as it is for XML. On the other hand, it is easy to migrate from XML to JSON and vice versa, even without structural data loss, see for example

I don't believe that XML databases actually store XML. Oracle, for example, breaks it up into a relational structure, though I don't know the internals of others well. The worst solution for storing XML, however, would be really storing XML. In the solution I presented in my email, it is not XML in which I want to store data, but path-value combinations (in fact, the details differ somewhat; this is the base idea, and the elaborated idea is ten times as efficient). Regarding storage, there are other criteria than for validating and communicating data: speed and efficiency are very important, and also a very good and fast implementation of AQL (or XQuery). And when data are retrieved, they can be represented in JSON or XML, or whatever one likes; even support for native American smoke signals is possible. These are again just representations.

I am not very impressed by these database tests; there are so many side factors which are not taken into account: the JDBC drivers, for example, the communication protocols used, the indexes, the code of the supporting software layers, the quality of the query engine, the operating system, the file system, the network card driver, etc., etc. You are testing completely different stacks of technologies. It is like testing a chain and then concluding that the last link is no good because the chain breaks somewhere in the middle.

But there is indeed a problem with the old database technologies, and that is that they are built for data manipulation. There are good reasons for that: a bank does not want to process your complete history every day, but wants to know your current savings and mortgage position, so it modifies your current data constantly. The Codd normalization is also designed for efficiency and integrity in the context of data manipulation. When you use a database out of the box, you will see features which are needed for constant manipulation. But you don't need them, because medical data are immutable. This is very important. I can assure you that my database storage requires only a few, very fast indexes, because data are immutable. The disadvantage of my solution is that it is not out of the box.
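A minimal sketch of the base idea as described above (generic SQL; the table layout, column names and the example path are my own illustration, not the elaborated design Bert refers to):

    -- One row per leaf node; rows are written once and never updated (data are immutable).
    CREATE TABLE node_value (
        composition_id  BIGINT        NOT NULL,
        archetype_path  VARCHAR(1024) NOT NULL,  -- AQL/XQuery-style path to the leaf
        value_text      VARCHAR(255),
        value_number    NUMERIC,
        value_datetime  TIMESTAMP
    );

    -- Because nothing is ever updated, a few compact indexes suffice.
    CREATE INDEX node_value_path_idx ON node_value (archetype_path, value_number);

    -- A path-based AQL/XQuery query then becomes a plain indexed lookup, e.g. systolic pressure:
    SELECT composition_id, value_number
    FROM   node_value
    WHERE  archetype_path =
           '/content[openEHR-EHR-OBSERVATION.blood_pressure.v1]/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude';

Whatever representation is wanted on retrieval (XML, JSON, something else) can then be generated from these rows.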
The most important job to do is to make the query engine work with the data storage, but there are now new ways to work with grammars, and I don't think this is very difficult; the W3C has a lot of information on the XQuery grammar. When this is done, a database configuration designed for speed, on any RDB engine, can be used to implement this data-processing method. But I see that we are indeed approaching the problem on different tracks. You test out-of-the-box solutions; many people do. And I think that out of the box, nothing is good enough, because the vendors were not thinking of OpenEHR but of a million other customer requirements when designing their database. However good, well designed, professional and well maintained they are, they will not remove those characteristics which stand in your way. Embrace the good news.
Bert
For what it is worth, here is what PostgreSQL does by default:
http://www.postgresql.org/docs/current/static/datatype-xml.html
Karsten
One never really knows what software does; one must distinguish what software seems to do from what it really does.
Storing XML really as XML means storing a lot of redundant information. I don't know, but I cannot believe Postgres really stores the full tag names, even when they occur a thousand times. I would be really disappointed if it did.
Bert
storing a text serialisation of an object structure in a database never makes sense when you think about it. Storing a /binary/ serialisation is a normal 'blob' approach, and if you are not doing blobs or partial blobs, then you are doing transparent representation, in which case things like XML or JSON don't come into it - they only make sense as one kind of generated view on data.
You might implement blobs using a zipped text format, but in a serious scalable implementation that would surely have to be the least efficient of viable approaches.
I can't think of any circumstance where a DB would actually store its information as XML instance text (other than where some column value happened to be an XML string, i.e. the XML is just data).
- thomas
The document-oriented view of the domain has no problem with storing XML text in the DB, because the implementations are built against that view. The clinical-care-focused use cases require developers to focus on reading & writing documents, so there is rarely a requirement to read across documents, because clinicians would not be able to consume all that information at once (except for averages or other aggregate values they're interested in).
So developers mostly deal with put-document/get-document types of requirements, and simply having an XML field (or JSON, for that matter) makes things easy for them. Hence, it makes sense to them. Database vendors have their own reasons for supporting direct storage of XML, and they deal with the obvious disadvantages behind the scenes (google: postgres xml toast). The price of storage keeps falling faster than that of all the other components, which also helps justify this approach.
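As an illustration of that put-document/get-document pattern (a sketch with invented table names and document content; the namespace is the openEHR XML namespace):

    -- Whole-document storage: the composition goes in and comes out as one value.
    CREATE TABLE composition_doc (
        uid     VARCHAR(64) PRIMARY KEY,
        ehr_id  VARCHAR(64) NOT NULL,
        content XML         NOT NULL
    );

    INSERT INTO composition_doc (uid, ehr_id, content)
    VALUES ('cmp-0001', 'ehr-0042',
            '<composition xmlns="http://schemas.openehr.org/v1"><name><value>Vital signs</value></name></composition>');

    -- "Get document" is a single lookup; the database never has to look inside the XML.
    SELECT content FROM composition_doc WHERE uid = 'cmp-0001';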
I suggest, as a coding exercise, representing XML documents in a relational database with writes and reads, to see what a joy it is to rebuild an XML document in comparison to just reading it directly from a column.
Some DBs store XML using tables behind the scenes to save developers from this pain, but they usually rely on schemas to do that (cough *rcl), and it becomes another joyful activity to use that feature.
So all I'm saying is: there is a price for everything, and in some cases having XML sitting in the DB is not as bad as it sounds.
Hi Seref,
“so there is rarely a requirement to read across documents, because clinicians would not be able to consume all that information at once (except averages or other aggregate values they’re interested in)”
I don't think that is true in anything but the most trivial implementations. The ability to query and pull granular data across documents, for single and multiple patients, is a critical requirement for any EPR/EHR.
or perhaps I misunderstood?
Ian
You’re right, it does not read in the way I wanted it to. Let me try to clarify:
Clinical care is strongly single-patient centric, and in most cases this puts a limit on the amount of data, so it does not push the limits of computational power too much. Developers can afford to fetch multiple documents and go across them if needed, without requiring the DB to perform the same operation and hence having to process XML (or JSON). So you can get away with treating the relational database as a document store, and performance problems do not easily reveal themselves.
There will be cases where this won't work (even for a single patient's care), but only when the requirement to produce time series, aggregates etc. becomes too expensive to satisfy outside of the DB. Even then, most implementations would pre-define these and populate them in advance during data entry, and still get away with not having to dissect XML at the DB level.
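A sketch of that pre-define-and-populate-at-data-entry approach (PostgreSQL upsert syntax; the summary table and column names are invented for the example):

    -- Summary maintained at write time, so reads never have to dissect XML.
    CREATE TABLE bp_daily_summary (
        ehr_id        VARCHAR(64) NOT NULL,
        obs_day       DATE        NOT NULL,
        sum_systolic  NUMERIC     NOT NULL,
        n_readings    INTEGER     NOT NULL,
        PRIMARY KEY (ehr_id, obs_day)
    );

    -- Executed by the application whenever a new blood-pressure composition is stored.
    INSERT INTO bp_daily_summary (ehr_id, obs_day, sum_systolic, n_readings)
    VALUES ('ehr-0042', CURRENT_DATE, 120, 1)
    ON CONFLICT (ehr_id, obs_day) DO UPDATE
    SET sum_systolic = bp_daily_summary.sum_systolic + EXCLUDED.sum_systolic,
        n_readings   = bp_daily_summary.n_readings + 1;

    -- The daily average is then a cheap relational read:
    SELECT obs_day, sum_systolic / n_readings AS avg_systolic
    FROM   bp_daily_summary
    WHERE  ehr_id = 'ehr-0042';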
So my argument is more about being able to avoid processing XML in the relational database than about clinical data never spanning multiple documents; it can be summarized as "developers are lazy".
Does it make sense now?
right - but that’s treating a document as an opaque data item, i.e. just like a String, that happens to be the value of some field in the overall schema. on the difference between blobs and documents… - thomas
If a database has a field type XML, then I expect it does something special with that field that justifies the field-type name.
I expect that especially from Postgres, because they mostly do good things.
Bert
If a database has a field type XML, then I expect it does something special with
that field that justifies the field-type name.
Oh, it does, it offers validation of the XML, AFAICT.
Karsten Hilbert
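For what that validation amounts to in practice: PostgreSQL's xml type checks the value for well-formedness on input, and querying it still goes through XPath. A minimal sketch (document content invented for the example):

    CREATE TABLE xml_demo (doc xml);

    -- Accepted: the value parses as well-formed XML.
    INSERT INTO xml_demo VALUES ('<items><magnitude>120</magnitude></items>');

    -- Rejected: the closing tag does not match, so the conversion to xml fails.
    -- INSERT INTO xml_demo VALUES ('<items><magnitude>120</items>');

    -- Extracting values uses XPath, e.g. PostgreSQL's xpath() function:
    SELECT xpath('/items/magnitude/text()', doc) FROM xml_demo;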