Modelling of Genomics data in VCF format

We have a requirement from a PHR application developer to model genomics data in a person’s EHR. From our research so far, we believe that a person’s VCF file contains all the genetic information relevant to healthcare and so can be treated as part of their EHR. On investigating further, we realized that a typical person’s VCF file may contain thousands of variant records, so modelling the VCF file completely could add a very large amount of data.

We have a few questions:

  1. Is our approach correct? If not, what is the recommended best practice for managing genetic information in openEHR?
  2. Are there any archetypes in the CKM that have been designed for this purpose?
  3. If we model every row in the VCF file as a cluster instance, the composition can have a very large number of occurrences of that cluster. Is there any best practice on the number of occurrences in one composition?
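To put a number on question 3, here is a minimal Python sketch that counts the data rows in a VCF file, i.e. how many cluster occurrences a row-per-cluster approach would produce. The function name `count_variant_rows` and the sample rows are made up for illustration; they are not taken from the shared sheet.

```python
import io

def count_variant_rows(lines):
    """Count VCF data rows, skipping blank lines and header lines starting with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))

# Made-up sample content, for illustration only.
sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\n"
    "1\t67890\t.\tC\tT\t99\tPASS\tDP=42;AF=0.5\n"
)
print(count_variant_rows(sample))  # prints 2
```

For a real file, the same function can be called on an open file handle, e.g. `count_variant_rows(open(path))`, where `path` is whatever location the VCF file is stored at.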

I have imported a sample of a VCF file into a Google Sheet here.

If anybody has attempted this before, your advice would be appreciated.


Hi Dileep! The openEHR genomics project is led by @ce.mascia at CRS4 and supported by domain experts from HiGHmed in Germany and Oslo University Hospital, as well as the openEHR Clinical Modelling Program.

The main archetype is ‘Genomic variant result’, which is intended to be used to report findings and annotations related to one variant found in the genome by a sequencing test.

This archetype is supported by a number of models for specific variant types, for example ‘Genomic conversion variant’ or ‘Genomic repeated sequence variant’. All the archetypes can be found in the CKM project ‘Genomics’.

The variant-type archetypes are specifically intended to represent genomic data in the same way a VCF file does, so I think they should fit your use case pretty well.

Thanks siljelb.
We did look at this project and the archetypes there, but could not see a direct fit between the VCF data and the models. The models seem to contain a lot more data than the VCF sample that I have, so we are not sure how to map the data.

Dear @ce.mascia,
Could we have a call one of these days so that you can explain the current models to us a bit more?



This is a useful finding! I think it’d be a good idea for the project to publish a couple of mapping examples from VCF files or other sources to the archetypes.

Hi @Dileep_V_S,

First of all, thank you for your feedback. It’s very important to collect opinions about the models, in particular from use cases different from those that led to their creation.

As Silje rightly explained, the models within the openEHR Genomics project are intended to represent the result of a sequencing test performed to find genomic mutations. The models’ development originally started from the data contained in a VCF file, in particular the mandatory fields, and the archetypes were then expanded to accommodate additional information from the INFO column too. Perhaps this is why the archetypes may seem to contain a lot of extra information compared with the sample you shared. Our idea is that the more data you can capture the better, especially if it is obtained “for free” automatically from the sequencing pipeline. This is true both in a purely clinical context and, even more so, in research contexts where reusing the data for further analysis could be of interest. In any case, only a subset of the data is mandatory, and further fitting to the needs of a specific use case can be done by creating templates.
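To make the distinction between the mandatory fields and the INFO column concrete, here is a minimal Python sketch that splits one VCF data line into the eight fixed columns defined by the VCF specification and breaks the INFO column into key/value pairs. The function name `parse_vcf_line` and the example line are hypothetical, and the sketch does not map anything onto archetype elements; it only shows the raw material a mapping would start from.

```python
# The eight fixed, tab-separated columns of a VCF data line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and a dict of INFO entries."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, parts[:8]))
    record["POS"] = int(record["POS"])        # position is an integer
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are comma-separated
    info = {}
    if record["INFO"] != ".":                 # "." means no INFO entries
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # keys without "=" are flags
    record["INFO"] = info
    return record

# Made-up example line, for illustration only.
example = parse_vcf_line("1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;DB\n")
print(example["ALT"], example["INFO"])
```

Values are left as strings apart from POS; a real mapping would still need to decide which INFO keys are relevant and how they are typed, based on the header of the file in question.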

We will be happy to have a call with you to have a deeper look into the models and maybe to test them with your sample, to see if and how they can fit.

I can prepare a Doodle poll and share it with you and the rest of the team to organise the meeting.

Best regards


That is a good idea and will help a lot of implementers.

