New Python openEHR Synthetic Data Generator

Howdy, my weekend hack turned into something I thought might be helpful to others chasing test data, and heaps of it!

The project started as a fork of Berlin-Institute-of-Health / Genkidata (https://github.com/Berlin-Institute-of-Health/Genkidata).
It has significant additions:

  • Instead of just duplicating existing Compositions, it uses:

    • An NLP library to change DV_TEXT values to synonyms. While not perfect from a clinical semantics point of view, it’s much better than lorem ipsum stuff!

    • Random changes to quantity values of between -15 and +15 percent, so the results are likely to stay clinically plausible.

  • In addition, a new feature creates canonical Compositions from Webtemplates. It’s a biggie! It possibly still has errors, but it passed all tests against the ehrbase SDK test Webtemplates using Pablo’s validation tool.
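
To give a feel for the quantity jitter described above, here is a minimal sketch. It is not the code from the repo: the function names (jitter_magnitude, jitter_quantities) and the rounding to two decimals are my assumptions for illustration. It only assumes that canonical JSON marks quantities with "_type": "DV_QUANTITY" and a numeric "magnitude".

```python
import random


def jitter_magnitude(value, pct=0.15, rng=random):
    """Scale value by a random factor in [1 - pct, 1 + pct]."""
    return value * (1.0 + rng.uniform(-pct, pct))


def jitter_quantities(node, pct=0.15, rng=random):
    """Return a copy of a canonical JSON tree with every DV_QUANTITY jittered.

    Hypothetical helper: walks dicts and lists recursively and leaves
    everything that is not a DV_QUANTITY magnitude untouched.
    """
    if isinstance(node, dict):
        out = {key: jitter_quantities(val, pct, rng) for key, val in node.items()}
        magnitude = node.get("magnitude")
        if node.get("_type") == "DV_QUANTITY" and isinstance(magnitude, (int, float)):
            # Rounding to 2 decimals is an assumption, not something the tool promises.
            out["magnitude"] = round(jitter_magnitude(magnitude, pct, rng), 2)
        return out
    if isinstance(node, list):
        return [jitter_quantities(item, pct, rng) for item in node]
    return node


# Example fragment of a canonical Composition (made up for illustration).
systolic = {
    "_type": "ELEMENT",
    "name": {"_type": "DV_TEXT", "value": "Systolic"},
    "value": {"_type": "DV_QUANTITY", "magnitude": 120.0, "units": "mm[Hg]"},
}
jittered = jitter_quantities(systolic, rng=random.Random(42))
```

With a magnitude of 120.0 the jittered value lands somewhere between 102 and 138, and the units and the original dict are left alone.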

When you run the app, it prompts you with three options:

  1. API Upload (into ehrbase or other CDR)

  2. Jitter Existing Compositions (\source_models\compositions): rather than just duplicating them as the original app did, it creates new values

  3. Stored (Source Webtemplates)
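
For anyone curious what option 1 boils down to, below is a rough sketch of posting one canonical Composition through the openEHR REST API. It is not taken from the app. The helper name build_composition_request, the base URL and the EHR id are placeholders I made up, and your CDR will most likely need authentication headers on top.

```python
import json
import urllib.request


def build_composition_request(base_url, ehr_id, composition):
    """Build (but do not send) a POST for one canonical JSON Composition.

    Hypothetical helper: the path follows the openEHR REST API pattern
    {base}/ehr/{ehr_id}/composition.
    """
    url = f"{base_url.rstrip('/')}/ehr/{ehr_id}/composition"
    body = json.dumps(composition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder values, adjust to your own CDR.
BASE_URL = "http://localhost:8080/ehrbase/rest/openehr/v1"
EHR_ID = "00000000-0000-0000-0000-000000000000"

request = build_composition_request(BASE_URL, EHR_ID, {"_type": "COMPOSITION"})
# To actually upload: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything touches the server.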

Existing Compositions are taken from the test data in https://github.com/ehrbase/openEHR_SDK so they pretty much cover all possible variations.

The number of Compositions and EHRs is defined by user input.

Resulting canonical Compositions are saved into:

/dist/compositions

You can put your own Compositions (to be duplicated, but with new values) and Webtemplates into:

/source_models/compositions

/source_models/webtemplates

Enjoy! And comments / tickets / pull requests welcome.
