Representation of genetic positions

Hi everyone!

We’re working on the Genetic variant archetypes, and have run into an issue with how to represent the more complex positions within a gene. On the genomic level this is easy: a position is a simple positive integer, counted from the left end of the genome towards the right end. For the coding sequence (a single gene) it gets more complicated. Positions in the exons are numbered with positive integers, but positions in the introns are written as the position of the closest exon base, followed by a plus sign if the exon to the left is closer or a minus sign if the exon to the right is closer, and then a second number giving the distance from that exon base (the first number). For example: ‘87+1’ or ‘88-3’.
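To make the intron numbering concrete, here is a minimal Python sketch that labels every base of one intron. The function name `intron_labels` and its parameters are made up for illustration, and it assumes the HGVS convention that the middle base of an odd-length intron is still counted with a plus sign.

```python
from typing import List


def intron_labels(last_exon_pos: int, intron_length: int) -> List[str]:
    """Label each intron base relative to the nearest flanking exon base.

    last_exon_pos is the coding position of the last base of the exon on
    the left; the exon on the right therefore starts at last_exon_pos + 1.
    """
    next_exon_pos = last_exon_pos + 1
    # Assumption: for an odd-length intron the middle base is the last one
    # written with '+', so the '+' half is rounded up.
    plus_count = (intron_length + 1) // 2
    labels = []
    for i in range(1, intron_length + 1):
        if i <= plus_count:
            # Closer to the left exon: count forward from it.
            labels.append(f"{last_exon_pos}+{i}")
        else:
            # Closer to the right exon: count backward from it.
            distance = intron_length - i + 1
            labels.append(f"{next_exon_pos}-{distance}")
    return labels


print(intron_labels(87, 6))
# ['87+1', '87+2', '87+3', '88-3', '88-2', '88-1']
```

Both examples above, ‘87+1’ and ‘88-3’, show up as the first and fourth base of a six-base intron between exon positions 87 and 88.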

How should we represent this in archetypes? The simple use case (genomic level) is well represented by a single DV_COUNT, constrained to 1…*. We’d prefer that the same structures be expandable to be usable for the coding sequence examples too.

EDIT: See this background material for more info and examples: https://varnomen.hgvs.org/bg-material/numbering/

Also see this archetype: https://ckm.openehr.org/ckm/archetypes/1013.1.4393. The Start position and End position elements are the ones that are affected by this issue.

If combinations are finite (are they?), maybe a mini terminology can be defined with all combinations. That would probably ease translation and make it more readable.

Hi Diego! Combinations aren’t finite, as far as I understand. Also, we’d like to be able to use the data to identify the actual numeric position in the gene or genome, and I think a terminology would be less useful for that.

Several years ago, in the conclusions of my Master’s thesis, I reasoned about the need to develop specific data types for genetic information. This, combined with some low-level archetypes (CLUSTERs), would be an interesting research topic.

7 posts were split to a new topic: Data types for variation in pregnancy?

At least for the genetic variation, why not just handle it as parsable text? I see parallels with the ISO Duration types - sometimes a wee parsable string is the easiest way to go. I’m not sure I see the value in splitting it up in an archetype. This is a way of encoding the position, but in a precise way (TBH I don’t completely understand!!).

[part of original post moved to Representation of variation in pregnancy]

Hi all,

regarding the genetic use case, the way in which the position of a mutation is expressed depends on the reference sequence we are using to describe the variation.

The easy case is when we’re using a genomic (g.) sequence and a positive integer is enough to locate the mutation in the genome (g.123). If we’re using a coding (c.) sequence instead, things get complicated, as we may need an expression like c.-85 or c.-88-3, but also c.*1 or c.*37-3, as previously pointed out by Silje and Heather. To generalise, we can say that the position needs four elements to be properly represented:

  • a modifier (+, -, *) — optional
  • an integer — mandatory
  • a modifier (+, -) — optional
  • an integer — optional
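As a rough illustration of those four elements, here is a Python sketch that splits a position string into them. The class `CodingPosition`, the function `parse_position` and the regular expression are all hypothetical names and choices for this example, not part of any openEHR or HGVS tooling, and the pattern only covers the position itself (without the ‘c.’ prefix).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: optional first modifier, mandatory integer,
# then an optional (+/-, integer) pair.
POSITION = re.compile(r"^([+\-*])?([1-9][0-9]*)(?:([+\-])([1-9][0-9]*))?$")


@dataclass(frozen=True)
class CodingPosition:
    base_modifier: Optional[str]    # '+', '-' or '*', optional
    base: int                       # mandatory
    offset_modifier: Optional[str]  # '+' or '-', optional
    offset: Optional[int]           # optional


def parse_position(text: str) -> CodingPosition:
    """Split a position such as '*37-3' into its four elements."""
    match = POSITION.match(text)
    if match is None:
        raise ValueError(f"not a recognised position: {text!r}")
    mod1, base, mod2, offset = match.groups()
    return CodingPosition(mod1, int(base), mod2, int(offset) if offset else None)


print(parse_position("*37-3"))
print(parse_position("123"))
```

In this sketch the second modifier and the second integer always travel together, which is one way of reading the list above; whether that is strict enough is for the domain experts to say.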

So, if I understand your suggestion, suppose we use a DV_PARSABLE datatype: what is the best way to express the regular expression that the value must match to be considered valid? Should the regular expression be part of the model (maybe through the “Regex_any_pattern” RM attribute of the DV_PARSABLE class), or should we leave the element open and let the implementer check the data validity?

It can be part of the constraint on the type. We can be as strict as we want. Just generalizing from this last case, e.g.

[a-z]\.[\+\-\*]?[1-9][0-9]*([\+\-\*][1-9][0-9]*)?

where

[a-z]\. is the sequence prefix: g. | c. | etc. (can it be nailed down to exact letters?)
[\+\-\*]? is the optional modifier
[1-9][0-9]* is an integer

This probably needs more test cases to nail the regex down better.

If we have more knowledge of each type of sequence, even things like this

(g\.[1-9][0-9]*)|(c\.[\+\-\*]?[1-9][0-9]*([\+\-\*][1-9][0-9]*)?)

can be specified.
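A quick way to collect the test cases mentioned above is to run the per-sequence pattern against the examples from this thread. This is only a sketch in Python: the helper `is_valid` is a made-up name, and the integer part is written as `[1-9][0-9]*` on the assumption that single-digit positions such as c.*1 must also be accepted.

```python
import re

# Per-sequence pattern from the post; integer part assumed to be
# [1-9][0-9]* so that one-digit positions match too.
PATTERN = re.compile(
    r"(g\.[1-9][0-9]*)|(c\.[\+\-\*]?[1-9][0-9]*([\+\-\*][1-9][0-9]*)?)"
)


def is_valid(expression: str) -> bool:
    """True if the whole expression matches the pattern."""
    return PATTERN.fullmatch(expression) is not None


cases = ["g.123", "c.-85", "c.-88-3", "c.*1", "c.*37-3", "c.87+1",
         "g.-5", "c.88-", "c.0"]
for case in cases:
    print(f"{case:10} {is_valid(case)}")
```

The first six cases are accepted and the last three are rejected. Note that the pattern still lets ‘*’ appear as the second modifier, so it is looser than the four-element description given earlier.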

What Cecilia says!
She is coordinating the SMEs from the content point of view and is the source of truth for the model content. I am absolutely no genomics expert :blush:
Silje & I are merely trying to understand enough of the content in order to suggest the best way to represent the data…
@ce.mascia BTW Silje & I were discussing and thinking the ‘*’ might be a different modifier that we may need to represent differently.

All help/advice gratefully received, but I suspect we’ve identified a valid new use case that is challenging the modelling tools/specs.

Cheers

Heather

I don’t completely understand this either, although the examples and the figure in Sequence Variant Nomenclature help.

I agree a parsable could be used to represent the complex examples. But should we make the element a choice of DV_PARSABLE (for the complex coding sequence use cases) and DV_COUNT |1…*| (for the simple full genome use cases). Or should it be DV_PARSABLE only?

And as @ce.mascia asks, is it possible to set a regex to constrain the syntax of the DV_PARSABLE type?

Legal in ADL, not sure of tooling support if that’s the question

Yes, DV_PARSABLE allows you to specify the ‘formalism’ as well as the string of text.

So something like

{
formalism: "HGVS-Variant-nomenclature"
value: "c.*37-3"
}

It might be possible to do some validation using regex, but it is arguable whether there is added value.

@Silje - I’d keep everything in the same format, i.e. no DV_COUNT. Leave it to downstream processors/handlers to unpick the different variations in the variances!! At least for now.

The minute you give a choice, someone has to figure out whether this is actually a simple example or a complex one, and the easy ones are easy to parse.
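To show what “easy to parse” could mean for a downstream handler, here is a small Python sketch. It assumes the parsable arrives as a plain dict with `formalism` and `value` keys; the function name `simple_position` and the formalism string (the name floated in this thread, not an official identifier) are assumptions for illustration.

```python
import re
from typing import Optional

# Name proposed in this thread; not an official identifier.
FORMALISM = "HGVS-Variant-nomenclature"


def simple_position(parsable: dict) -> Optional[int]:
    """Return the integer for a plain genomic position (g.N), else None."""
    if parsable.get("formalism") != FORMALISM:
        raise ValueError("unexpected formalism")
    match = re.fullmatch(r"g\.([1-9][0-9]*)", parsable["value"].strip())
    # Anything that is not a bare g.N is left for a fuller parser.
    return int(match.group(1)) if match else None


print(simple_position({"formalism": FORMALISM, "value": "g.123"}))    # 123
print(simple_position({"formalism": FORMALISM, "value": "c.*37-3"}))  # None
```

The handler never has to be told which kind of position it received; it finds out by trying the simple form first.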

This makes sense. It also allows us to always be explicit about whether we’re talking about a coding sequence (‘c.’) or genomic sequence (‘g.’, ‘o.’ or ‘m.’) position.

To be honest, I’m not a genomics expert either… but I’m passing on the advice of my bioinformatician colleague, Paolo Uva, who has the real knowledge of the subject :slight_smile: For my part, I’m learning a lot from this discussion and from all your insights.
So, the solution seems to be to model the position using a DV_PARSABLE. Thank you all for your help. Of course, if there are other suggestions, just let us know :slight_smile:

Exactly - quite an elegant ‘code’ has been worked out here. I don’t think I can see the benefit of trying to ‘decode it’ inside openEHR, though at some point I could see it become a formal datatype.

So it seems like our solution is to use DV_PARSABLE. This means that CLUSTER.inversion_variant has to be republished as v2, but this isn’t a big problem. One thing that we do need to decide is how to name the syntax for this parsable string.

@ian.mcnicoll suggests “HGVS-Variant-nomenclature”. Is this the official way of doing this, or something you made up, Ian? :wink: If the latter, is there an official way of referencing the syntax? @ce.mascia

Edit: I’ve uploaded a new revision of the Genetic variant - Inversion archetype, with these proposed changes: Cluster Archetype: Genetic variant - Inversion [openEHR Clinical Knowledge Manager]

Would something like NLMS be the ‘custodian’ of this kind of thing??

After today’s meeting, we’ve decided to reverse this decision. The original intent of the specific variant clusters was to be able to work with the atomic data on the genomic sequence level without having to unpack the HGVS syntax in queries, similar to the data found in VCF files.

To be able to represent this properly, we’ll keep the original DV_COUNT representation of positions, and constrain the specific variant archetypes to human genomic DNA. In other words we’re excluding coding sequence DNA, all RNA, protein sequences, and circular DNA. If later requirements turn up for unpacking this information into atomic data points, specific variant archetypes will need to be made for them. However, it will still be possible to record them using HGVS expressions in several elements in the Genetic variant archetype.
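As a sketch of why atomic integer positions help with querying, the Python below checks whether a variant overlaps a region without touching any HGVS string. The class `GenomicInversion` and the function `overlaps` are hypothetical names; the lower bound of 1 mirrors the DV_COUNT constraint, while the rule that the end must not precede the start is an assumption added for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInversion:
    start: int  # genomic start position, 1-based
    end: int    # genomic end position, 1-based

    def __post_init__(self) -> None:
        # Mirrors the DV_COUNT constraint of 1..* on both positions.
        if self.start < 1 or self.end < 1:
            raise ValueError("positions must be >= 1")
        # Assumption for this sketch, not stated in the archetype.
        if self.end < self.start:
            raise ValueError("end must not precede start")


def overlaps(variant: GenomicInversion, region_start: int, region_end: int) -> bool:
    """True if the variant shares at least one base with the closed region."""
    return variant.start <= region_end and region_start <= variant.end


inversion = GenomicInversion(start=100, end=200)
print(overlaps(inversion, 150, 300))  # True
print(overlaps(inversion, 201, 300))  # False
```

With a parsable string the same check would first need the expression to be unpacked, which is the step this decision avoids for the genomic case.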

Update: New revision of the Genomic inversion variant archetype, based on this decision: Cluster Archetype: Genomic inversion variant [openEHR Clinical Knowledge Manager]