Pagination with total count (specifically in EHRBase)

pskillen-nes · 30 September 2024 10:36

We have a system which builds AQL queries and runs them against an EHRBase 2.2 server. Users can specify a set of search criteria (time ranges, participant identifiers, fields from various archetypes, etc), and the code builds an AQL query for a known template ID.

Not long before going to prod we realised we’d need to implement pagination, because without it we could often cause queries which would hang our Postgres server. The implementation of this is pretty easy (fetch and offset params are passed alongside the AQL). However, we no longer have a mechanism to get the total number of results which match the search criteria - we only get the number returned in the current page.

We’ve considered a few approaches to fix this, such as running a second query, wrapping the initial query in a COUNT(…) function, etc, but all of these have their issues.

I’m interested to see if others have solved this issue, or can we just not present users with a “total count” field?

vidi42 · 30 September 2024 11:25

@pskillen-nes tl;dr there is not total count metadata returned with the EHRbase AQL query response, so you would need to use a count query upfront to obtain that.

f I remember correctly with EHRbase 2.2 we’ve introduced some custom AQL Metadata info to help with pagination, and in every response you should get also these 3 values:

    "fetch": 2,
    "offset": 8,
    "resultsize": 2

fetch represents to the fetch request param or AQL LIMIT keywork used for the request
offset represents to the offset request param or AQL keyword
resultsize represents the number of results that matched your current request (including fetch and offset params).

However, to do a total count you would have to obtain it trough a count query, that you already mentioned. But I’m curious what other approaches have you or others considered.

pskillen-nes · 30 September 2024 12:03

Hi Alex. Yes the meta block has been very useful. I’ve only just realised that these values aren’t part of the official openEHR spec, so thanks for adding them.

I know it’s likely that I’ll need to run two queries - one for results, and one for count. My main problem is the AQL is dynamically generated, so it’s hard to infer the count query from the select query.

My first approach was simply trying various permutations along the lines of SELECT count(*) FROM <original query>, as you can sometimes do with postgres, however I couldn’t find something that EHRBase would accept as a valid query.

My last approach was to use string manipulation to remove the SELECT .... FROM from the front of the query, and replace it with count:

const countQuery = `SELECT count(c/uid/value) as counter FROM ${fromClauseAndBeyond}`;

This approach was far too naive. For a start it didn’t even preserve the distinct. But also, somehow, it ran with much worse performance than the actual select query.

Query snippets, for reference

-- Dynamically generated AQL
SELECT DISTINCT
            e/ehr_id/value as ehrId,
            e/ehr_status/subject/external_ref/id/value as chi,
            c/uid/value as compositionUid,
            c as fullComposition,
            c/context/start_time/value as eventDateTime
          FROM
            EHR e
            CONTAINS COMPOSITION c[$archetype_id]
            CONTAINS (
              ACTION op[openEHR-EHR-ACTION.service.v1]
              AND ACTION p[openEHR-EHR-ACTION.procedure.v1]
                CONTAINS CLUSTER dev[openEHR-EHR-CLUSTER.device.v1]
            )
          WHERE
            c/context/participations/performer/external_ref/id/value = $qp0    AND c/name/value = $qp1
          ORDER BY
            c/context/start_time/value ASC

// JSON wrapper
{
  "q": "<AQL goes here>",
  "query_parameters": {
    "qp0": "VGPOCMR1798",
    "qp1": "NES_TS Medical Devices Data Hub",
    "archetype_id": "openEHR-EHR-COMPOSITION.report-procedure.v1"
  },
  "fetch": 100,
  "offset": 0
}

HolgerReiseVSys · 30 September 2024 15:07

@pskillen-nes For the given example the following count query should be sufficient:

SELECT COUNT (DISTINCT c/uid/value)
FROM COMPOSITION c[$archetype_id]
	CONTAINS (
	  ACTION op[openEHR-EHR-ACTION.service.v1]
	  AND ACTION p[openEHR-EHR-ACTION.procedure.v1]
		CONTAINS CLUSTER dev[openEHR-EHR-CLUSTER.device.v1]
)
WHERE c/context/participations/performer/external_ref/id/value = $qp0
	AND c/name/value = $qp1

ORDER BY, LIMIT, and OFFSET are not needed, of cause.

As EHRbase currently is not optimized for this kind of population query,
depending on the size of the database, counting the results will not be fast.

There are also some areas where the original query could be improved:

Ordering by c/context/start_time/value is usually not intended, because this orders by the full date string.
Instead, it should be ordered by its magnitude: c/context/start_time
In the not-too-distant future, the performance of this feature will be significantly improved.
When paging, care has to be taken that a total ordering is specified.
As the database is free to optimize the query, results may be appear on several pages, ore on none at all.
Additionally sorting by c/uid/value will stabilize the order.

Also, please be aware that in general, performance tends to degrade for higher page numbers.
The database will have to perform the query up to that point in order to retrieve the relevant data.

pskillen-nes · 3 October 2024 14:31

Thanks for the replies @HolgerReiseVSys and @vidi42.

In the end we used Alex’s idea of two queries, and Holger’s updated query (my fault, I almost had the right AQL at the start but couldn’t quite get the DISTINCT to work).

This separate count query adds about 900ms on top of our response time. Though the actual select query takes 11-15sec, so the count isn’t a massive overhead. This is all on a tiny AWS RDS instance with a modest dataset (~8000 records), so I’m interested to see how this scales.

And thanks for the tips on ordering, I’ve updated our ordering AQL snippets accordingly.