Create Composition Performance Benchmarks in EHRBase - For batch insertions

anupama · 25 January 2022 05:44

Hi,

We have developed data pipelines that insert retrospective clinical data from hospital systems into EHRbase CDR (We have installed using Docker)

The data volume we are piloting with is in the tune of 800K + . Although the insertions work fine, we see that there is a huge performance deterioration after a few 100 insertions. It almost takes 1 second per composition on an average. We also noticed that for every few inserts, there is a delay of 5-6 seconds.

Before we start further investigations, I just wanted to know if there are any performance benchmarks available for the EHRBase composition Creation? Is it designed to handle batch inserts the way we are trying?

Thanks,
Anupama (anupama.ananthasairam@karkinos.in)

birger.haarbrandt · 25 January 2022 16:32

Hi @anupama,

this appears strange to me as we have made benchmarks and performance testing that show some stable performance for inserts. We tested with a Postgres Cluster with 5 nodes and a dedicated write instance and have insert time of around 200ms with 50 concurrent threads firing on EHRbase.

Even with a single instance this should work better than what you experienced, hence there might be an issue with your Postgres configuration.

pablo · 25 January 2022 17:48

Hi Anupama,

I have done some tests almost an year and a half ago, check the CSV reports here GitHub - ppazos/testehr: Load tester for the openEHR API

Also tried to use JMeter GitHub - ehrbase/load-tests-jmeter: EHRBASE Load Testing

Remember the delays might depend on the amount and complexity of the data, so the number of total compositions is not the only indicator to look at. You need to consider the size of the OPTs and the compos.

Seref · 26 January 2022 10:01

There are so many variables in that setup that it’s hard to know where to begin speculating

Docker first: are you mounting a folder in the host for the postgres image? Did you consider setting up postgres in a VM? or even better on an actual machine? Are we talking SSDs here or magnetic disks?

Those delays sound too large for Java’s garbage collector to be the culprit, especially the 5-6 secs, unless you’re running your performance tests on a Nintendo switch… So my first attempt would be to give postgres some room to breathe and try it that way.

Paulmiller · 2 February 2022 10:41

HI @anupama, I’d be interested in any follow up. Did you manage to fix your problem?

anupama · 8 February 2022 05:48

@all Sorry for the delayed Follow up status.

We started the insertions in a new environment. Single Node Cloud Cluster. Dedicated Write Instance. Single Threaded.

We inserted 844K compositions and 31K EHRs(Patients) at the rate of 200 ms per composition

Infrastructure Used:

As the next step, we are planning to use multiple nodes and multithreading. Will post our findings.

pablo · 10 February 2022 00:46

@anupama that seems a reasonable “insert time”. I’m not sure if you are considering the full request time or just the DB transaction time.

Do insertions run on the same machine that runs EHRBASE? Because for a remote (internet) test, just the request round trip will have a delay of 200+ms.

Sidharth_Ramesh · 30 September 2022 17:42

I’m wondering if there’s a way to Bulk insert multiple entries as transactions in EHRbase (and openEHR in general) - both for compositions and EHRs in one HTTP request rather than multiple. Can contributions somehow be used here?

Using the EHRbase SDK to connect to the database directly and execute these as one transaction would also be a decent solution.

Our use case is similar to @anupama, and there’s also the concern about the ACID properties of a transaction along with performance. We are streaming change data events from live legacy systems and processing these in batches.

thomas.beale · 30 September 2022 19:43

Hi Sidharth,
in openEHR, you can certainly add changes (i.e. new versions) to multiple Compositions in one Contribution (indeed, logical deletes and completely new Compositions can appear in the one Contribution as well). We don’t consider this a ‘bulk’ or ‘batch’ operation - it’s just a normal change set, like in any version control system.

Doing Contributions across multiple EHRs would be a batch operation that is not currently defined by the REST API, but should be, since it’s a very normal thing to want.

Since this kind of ‘commit’ is usually a B2B kind of operation, it is more suited to batching up all the changes and sending that. The EHR Extract gets close to the kind of bundle concept you want, but we’d probably need some adjustments to make sure it can handle anything allowed in a Contribution.

We’d also need a service interface, but conceptually it’s simple, since the batching is done in the Extract; all that is needed in the API is (conceptually) something like merge_extract(x: EHR_EXTRACT, ...) with some extra parameters indicating how to handle failures.