SynPuf: synthetic data (into openEHR)

I’m thinking about building a way to load existing SynPuf data into an openEHR data repository (probably via the openEHR REST API). I have several questions for the community, to make the result more useful and more easily available. First, a short summary:
DE-SynPuf is a realistic-looking synthetic data set with information on persons, claims, and prescriptions. It is organized so that you can load whatever percentage of the total patients you want (each batch contains a set of patients with all their inpatient, outpatient, and carrier claims, and drug prescription information). SynPuf is the go-to dataset in most standards communities for quickly benchmarking or querying a given system. More info here

Now the questions:

  • Do we have a way of bulk-loading openEHR data into any of the currently available openEHR CDRs?
  • Can we create patients using the simplified JSON format?
  • Demographic data storage is probably a big discussion, so for the moment a poll:
    How should demographic information be stored in this kind of use case?
      • Attach the available information in EHR_STATUS.other_details
      • Store it in specific compositions in the EHR domain
      • Store it in DEMOGRAPHIC archetypes
      • Generate FHIR instances for demographics

While the third option is probably the most correct, I’m not sure how well the DEMOGRAPHIC model is currently supported in data repositories. The second should be easy but is not really that “normalized”. The first will probably add quite a bit of data to the EHR_STATUS. The last can probably be reused from other available SynPuf → FHIR generation work.
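For the loading question, here is a minimal sketch of what a per-patient load over the openEHR REST API might look like. It only builds the requests (it does not send anything), using `PUT /ehr/{ehr_id}` to create an EHR with a client-chosen id and `POST /ehr/{ehr_id}/composition` to commit compositions. The base URL, the UUID namespace, the helper names, and the shape of the `batch` argument are all assumptions made for illustration, and the placeholder compositions in the usage note are not valid openEHR content.

```python
import json
import uuid
import urllib.request

# Assumption: base URL of some CDR's openEHR REST endpoint; adjust to your server.
BASE_URL = "http://localhost:8080/openehr/v1"

# Hypothetical namespace so the same SynPuf patient id always maps to the same EHR id,
# which makes a load repeatable.
SYNPUF_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/synpuf")


def ehr_id_for(patient_id: str) -> str:
    """Derive a deterministic EHR id from a source patient id."""
    return str(uuid.uuid5(SYNPUF_NAMESPACE, patient_id))


def create_ehr_request(ehr_id: str) -> urllib.request.Request:
    """Build a PUT /ehr/{ehr_id} request (create an EHR with a known id)."""
    return urllib.request.Request(f"{BASE_URL}/ehr/{ehr_id}", method="PUT")


def composition_request(ehr_id: str, composition: dict) -> urllib.request.Request:
    """Build a POST /ehr/{ehr_id}/composition request with a JSON body."""
    body = json.dumps(composition).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ehr/{ehr_id}/composition",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def plan_batch(batch: dict) -> list:
    """Return the ordered requests for one batch: {patient_id: [composition, ...]}."""
    requests = []
    for patient_id, compositions in batch.items():
        ehr_id = ehr_id_for(patient_id)
        requests.append(create_ehr_request(ehr_id))
        requests.extend(composition_request(ehr_id, c) for c in compositions)
    return requests
```

Each request returned by `plan_batch` could then be sent with `urllib.request.urlopen(request)`. This is still one call per composition, so it does not answer whether any CDR offers a true bulk-load path.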

That’s my vote there :slight_smile: I’d say in this instance, just go with mimicking demographics using compositions. It’s the most pragmatic middle ground for something like this.
What you have in mind is a great idea; having a significant data set in openEHR form, especially if it’s in EHRbase, would be fantastic. Other approaches to demographics would a) complicate implementation and b) make querying demographic data problematic or less convenient, and the most interesting population-level queries end up touching demographics one way or another.
There’s no reason not to work on a V2 of the demographic mapping, based on one of the better options above, once the most pragmatic one is in place.

I didn’t know about SynPuf, may I be lazy and ask if the claims data is … interesting? I can see there’s prescription/medication data but is there any diagnosis/observation type of data in that set? There’s also a Brazilian hospital data set in openEHR form I think. ORBA? O… something? It was a few years back when it was made available.

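I believe you have ORBDA in mind: Geração de uma Base Pública para Avaliação de Mecanismos de Persistência de Sistemas de Registros Eletrônicos de Saúde Baseados nas Especificações da Fundação openEHR – L@MPADA / UERJ (in English: “Generation of a Public Dataset for Evaluating Persistence Mechanisms of Electronic Health Record Systems Based on the openEHR Foundation Specifications”)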

I also have a link to Synthea in my notes: “Synthea data contains a complete medical history, including medications, allergies, medical encounters, and social determinants of health.”
It is available as FHIR, C-CDA and CSV.

Edit: fixed the name for ORBDA

Thanks, it was ORBDA indeed :slight_smile:

Yes, Synthea was also an option. I think @bna commented on it in the past.

Mine as well. A trick you can use if you have no dedicated openEHR demographic service is to do the mimicking using Compositions + AdminEntry + Cluster archetypes (the ‘fake’ real ones :wink:) and then store the public demographic entities (= professionals + places) in a special EHR in the service, e.g. call it the 0-EHR, with a known EHR id (I don’t know off-hand if a GUID made of all 0s will work, but it would be perfect if it does).

Then you end up with an EHR service with all the usual patient EHRs, containing refs that point to demographic entities, but those demographic entities are cunningly hidden in another EHR, whose job is to act as a local demographic registry, or you could think of it as a cache.

Detailed patient demographics could be stored the same way, in another special EHR, since in SynPuf data all patients are fake anyway.
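To make the 0-EHR idea concrete, here is a minimal in-memory sketch, assuming the registry EHR uses the all-zero (nil) UUID as its id. Whether a given CDR accepts that id is exactly the open question above, so treat it as an assumption. `DemographicRegistry`, `registry_ref`, and the reference fields are hypothetical names for illustration, not openEHR reference model classes.

```python
import uuid

# Assumption: the registry EHR is identified by the nil UUID (all zeros).
REGISTRY_EHR_ID = str(uuid.UUID(int=0))


def registry_ref(entity_uid: str, entity_type: str) -> dict:
    """Hypothetical reference that a patient EHR would store to point at an entity."""
    return {
        "ehr_id": REGISTRY_EHR_ID,
        "composition_uid": entity_uid,
        "type": entity_type,
    }


class DemographicRegistry:
    """In-memory stand-in for the special EHR that holds demographic entities."""

    def __init__(self):
        self._entities = {}

    def register(self, entity_type: str, details: dict) -> dict:
        """Store an entity (professional, place, ...) and return a ref to it."""
        uid = str(uuid.uuid4())
        self._entities[uid] = {"type": entity_type, **details}
        return registry_ref(uid, entity_type)

    def resolve(self, ref: dict) -> dict:
        """Follow a ref back to the stored entity, as a lookup in the cache."""
        if ref["ehr_id"] != REGISTRY_EHR_ID:
            raise ValueError("reference does not point at the registry EHR")
        return self._entities[ref["composition_uid"]]
```

In a real service, `register` would commit a composition to the registry EHR and `resolve` would fetch it. The point of the sketch is only that patient EHRs carry small refs while the entities live in one well-known place.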

BTW Ricardo Correia’s group at Porto has someone working on an EHR data synthesiser - I don’t remember the details.

They have diagnoses in ICD (ICD9 if I recall correctly) and procedures in HCPCS.
They are provided in the typical wide database layout, as numbered columns: “diagnosis1, diagnosis2…procedure1, procedure2…”
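A minimal sketch of turning that wide layout into plain lists of codes, which is probably the first step before mapping each code to an openEHR entry. The column names `diagnosis1`, `procedure1`, and so on follow the shorthand used in this post and are assumptions; the real SynPuf file headers should be checked before reusing this. `collect_codes` is a hypothetical helper.

```python
import re


def collect_codes(row: dict, prefix: str) -> list:
    """Gather the non-empty values of columns named <prefix><number>, in numeric order."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbered = []
    for column, value in row.items():
        match = pattern.match(column)
        # Skip columns with another prefix and slots that were left empty.
        if match and value not in (None, ""):
            numbered.append((int(match.group(1)), value))
    # Sort by the column number so that diagnosis10 comes after diagnosis2.
    return [value for _, value in sorted(numbered)]
```

A row read with `csv.DictReader` can be passed in directly, for example `collect_codes(row, "diagnosis")` and `collect_codes(row, "procedure")`.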

This could also be a good question. Is AdminEntry fully supported in all available EHR repositories?

I’d be surprised if it was not supported. It’s an ENTRY subtype, hence archetypeable. There are already some archetypes in CKM, and some in production in Ocean’s archetypes/templates, in case it helps.



It is also possible to put the demographic clusters in other_context, though I think using ADMIN_ENTRY is preferable.

SynPuf is obviously worth doing but I think Synthea is much richer (for operational data)

Yes, I think both are useful. I was thinking that this could be used to validate an openEHR to OMOP transformation, as we already know what SynPuf data looks like in the OMOP CDM.

do we? how? where? :slight_smile:

Here is the ETL

Here is the direct download to the dump

I’m beginning to suspect you guys have a Jira card open in Veratech titled “Distract Seref” :smiley:
Thanks a lot Diego.
