openEHR - ETL - sharing is caring

We, DIPS AS, have been working on the need to make openEHR data available for data analytics. The structured clinical data must be aligned with other (not openEHR) data sources. Thus there is a need to do ETL transformations on openEHR data.

AQL is a great query language for openEHR data. We use AQL as a the key building block for our ETL process. To support the target system (relational database) we annotate the AQL.

Here I will share one example to illustrate how it works. The example use the traditional “Hello world!” for openEHR: Blood pressure.

You will se how we define attributes on the AQL paths and also how we define demographic metadata to be added at runtime.

--@etl.definition (schema_version=1, id=no.dips.arena.blood_pressure, version=1.0.0, namespace=at_blood_pressure, authors="bna@dips.no", copyright="DIPS AS", tags="blood_pressure, blood_pressure_v2")
--@etl.query (id=blood_pressure, target_name=blood_pressure, name="blood_pressure", description="The local measurement of arterial blood pressure which is a surrogate for arterial pressure in the systemic circulation.")
--@etl.query.metadata (type=episodeofcareid, target_name=episodeofcareid)
--@etl.query.metadata (type=dim_patientid, target_name=dim_patientid)
--@etl.query.metadata (type=dim_orgunitid, target_name=dim_orgunitid)
--@etl.query.metadata (type=dim_wardid, target_name=dim_wardid)
SELECT
   --@etl.query.column (target_name=origin, type=ISO8601_DATE_TIME, index=true, description="History Structural node.")
   o/data[at0001]/origin/value AS 'origin',
   --@etl.query.column (target_name=time, type=ISO8601_DATE_TIME, index=true, description="Standard, uspesifisert tidspunkt eller tidsintervall som kan defineres mer eksplisitt i et templat eller i en applikasjon.")
   pe/time/value AS 'time',
   --@etl.query.column (target_name=systolic, type=Double, description="Peak systemic arterial blood pressure - measured in systolic or contraction phase of the heart cycle.")
   pe/data[at0003]/items[at0004]/value/magnitude AS 'systolic',
   --@etl.query.column (target_name=systolic_units, type=String, max_length=255, description="Units for systolic")
   pe/data[at0003]/items[at0004]/value/units AS 'systolic units',
   --@etl.query.column (target_name=diastolic, type=Double, description="Minimum systemic arterial blood pressure - measured in the diastolic or relaxation phase of the heart cycle.")
   pe/data[at0003]/items[at0005]/value/magnitude AS 'diastolic',
   --@etl.query.column (target_name=diastolic_units, type=String, max_length=255, description="Units for diastolic")
   pe/data[at0003]/items[at0005]/value/units AS 'diastolic (units)',
   --@etl.query.column (target_name=mean_arterial_pressure, type=Double, description="The average arterial pressure that occurs over the entire course of the heart contraction and relaxation cycle.")
   pe/data[at0003]/items[at1006]/value/magnitude AS 'mean_arterial_pressure',
   --@etl.query.column (target_name=mean_arterial_pressure_units, type=String, max_length=255, description="Unis or the average arterial pressure")
   pe/data[at0003]/items[at1006]/value/units AS 'mean_arterial_pressure (units)',
   --@etl.query.column (target_name=position, type=String, max_length=255, description="Individets posisjon ved tidspunktet for målingen.")
   pe/state[at0007]/items[at0008]/value/value AS 'position',
   --@etl.query.column (target_name=position_code_string, type=String, max_length=255, description="Individets posisjon ved tidspunktet for målingen.")
   pe/state[at0007]/items[at0008]/value/defining_code/code_string AS 'position (code_string)',
   --@etl.query.column (target_name=position_terminology_id, type=String, max_length=255, description="Individets posisjon ved tidspunktet for målingen.")
   pe/state[at0007]/items[at0008]/value/defining_code/terminology_id/value AS 'position (terminology_id)'
FROM
   COMPOSITION c[openEHR-EHR-COMPOSITION.encounter.v1]
      CONTAINS OBSERVATION o[openEHR-EHR-OBSERVATION.blood_pressure.v2]
         CONTAINS POINT_EVENT pe
ORDER BY o/data/origin/value desc

8 Likes

@bna is this creating some kind of table schema and loading that table with the extracted data?

Yes. The annotated AQL is a primary building block for DIPS openEHR ETL service. One such aql will be installed/deployed on the system. Multiple tables will be created, additional fields added, grants given, etc.

In addition the new definition will be registered in the asynchronous messaging bus for both full reload and continously adding new data.

2 Likes

Very interesting, thanks for sharing.
How do you handle repeating clusters in an archetype? When transforming to a fixed table schema.

Great question. That’s one of the key features we’ve implemented. There are different strategies. One is the capability to use nested queries which creates join tables. This is great for repeating structures.

2 Likes

A very simple example below.

The structure in the template is:

  • One problem/diagnosis defining the problem (i.e. Prostate cancer)
  • A repeated DV_CODED_TEXT for referral reasons/indications

We want to have one line in the table for each problem. Each of the reason should be added to a join table to make it possible to aggregate/count/etc. different indications.

Note how the CLUSTER defined as CL is selected with no annotations. This tells the engine to pass it down to the next sub-query.

--@etl.definition (schema_version=1, id=no.dips.arena.test.01, version=1.1.0, namespace=at_test_01, name="Test parent-child", description="An example on how to extract repeating structures in a separate join table", authors="bna@dips.no", copyright="DIPS AS", tags="example, join, parent_child")
--@etl.query (id=parent, target_name=problem, name="problem")
SELECT
   --@etl.query.column (target_name=cuid, type=String, max_length=255, description="CUID")
   c/uid/value as 'cuid',
   --@etl.query.column (target_name=problem, type=String, max_length=255, description="The problem defined with some terminology")
   e/data[at0001]/items[at0002]/value as 'Problem',
   cl
FROM
   COMPOSITION c[openEHR-EHR-COMPOSITION.encounter.v1]
      CONTAINS EVALUATION e[openEHR-EHR-EVALUATION.problem_diagnosis.v1]
         CONTAINS CLUSTER cl[openEHR-EHR-CLUSTER.problem_diagnosis_dips.v1]


--@etl.query (id=parent.child, target_name=reasons, name="Reasons")
SELECT
   --@etl.query.column (target_name=reason, type=String, max_length=255, description="0..1 reasons for each problem")
   cl/items[at0001]/value AS 'Reason'
FROM
   CLUSTER cl

Based on this you might to a simple join and count query like:

select
    p.problem as problem,
    count(r.reason) as reason_count
from
    at_test_01.problem p
    left outer join at_test_01.reasons r on r.sourceid = p.sourceid
group by
    p.problem
2 Likes

Could you please share an example table of extracted data with repeating clusters? I’m really curious to understand your work.

1 Like

Echoing Joost’s comment, we’re also EXTREMELY interested in this work. When will we have this in production? :innocent:

1 Like

You will have it in production with the next upgrade. Good times ahead of us :tada:

2 Likes

Is it possible to share something by now?

1 Like

@bna maybe if you have a corporate blog you can post a full guide on how you are tackling ETL, it would be nice to have a full example, and if this is something others could use, why not, incubate a new spec for ETL. This could be interesting for the SEC @thomas.beale @sebastian.iancu

3 Likes

This is an interesting topic. One approach to this is to just export composition JSON to a database like Postgres with JSONB and query on that directly. The JSON capability of databases like Postgres is incredible!

Some related work from our brothers on the FHIR side:
Projection of FHIR resources to plain ANSI SQL databases: sql-on-fhir/sql-on-fhir.md at master · FHIR/sql-on-fhir · GitHub.

Google recently seems to have taken this up more seriously and done some significant work open-sourced here - GitHub - google/fhir-data-pipes: A collection of tools for extracting FHIR resources and analytics services on top of that data..

Later on, the SQL on FHIR team seem to have ditched the structured table approach and just resorted to querying on JSONB directly due to limitations of representing certain fields (nested repeatable for instance) in the SQL schema properly: Database schema · FHIR/sql-on-fhir-v2 · Discussion #47 · GitHub.

@bna @pablo questions to you, since you already seem to be on the AQL → SQL bandwagon: What advantages do you see in having specific ETL definitions per use case? I only see disadvantages:

  • Every time the ETL definition changes, new data needs to be reindexed from the beginning
  • Every time the ETL definition changes, new schema migrations in the SQL database may need to be performed.

Won’t it be better to have a definitive COMPOSITION → SQL definition for analytics that will work with any ANSI SQL database? Something similar to OMOP – the data is super spread out into tables, and multiple joins are needed before the data makes any sense, but it’s still workable on SQL databases.

One approach I’ve been exploring is directly using EHRbase’s database representation of a COMPOSITION (called RAW) which kind of optimizes for archetype node lookups etc, and just use JSONPath with SQL in Postgres directly with that.

2 Likes

Thank you @Sidharth_Ramesh for this post and all the links!

Few months ago I researched the possibility of using support for JSON in modern SQL databases as a basis for an openEHR CDR. I also looked at “Archetype Relational Mapping” approaches.

There was a lot of work put into the existing openEHR CDRs. Congratulations to everybody involved.

However my question would be – would they do it differently if they started today ?

My second question – should we commit some time to evaluating how the advances in SQL databases could help building an openEHR CDR IF starting from scratch today?

p.s.
Wouldn’t it be nice to have these sites but for openEHR?
(with sites like these it is no wonder why developing on FHIR is more popular than developing on openEHR :cry:)

Just a quick comment about Health Samurai’s fhirbase - it’s not properly maintained anymore.

They do provide a nice script in Go that can populate a Postgres database with FHIR projections last I checked. And they have a bunch of helper functions that are useful for querying on versioned data.

Some things to consider with using JSON in databases: Postgres has the most advanced support. Almost all other databases are lacking behind. There is also no standard ANSI SQL version of what Postgres has currently implemented.

1 Like