Representation of genetic positions

siljelb · 4 March 2020 09:12

Hi everyone!

We’re working on the Genetic variant archetypes, and have run into an issue with how to represent the more complex positions within a gene. On the genomic level this is easy: a simple positive number, starting from the left side of the genome and ending on the right side. For the coding sequence (a single gene), this gets more complicated. Positions in the exons are numbered with positive numbers, but positions in the introns are numbered with a combination of the closest position in the exon, followed by a plus or minus sign depending on whether the left or right exon is closer, and then another number giving the distance from that closest exon position (the first number). For example: ‘87+1’ or ‘88-3’.

How should we represent this in archetypes? The simple use case (genomic level) is well represented by a single DV_COUNT, constrained to 1..*. We’d prefer that the same structures be expandable to be usable for the coding sequence examples too.

EDIT: See this background material for more info and examples: https://varnomen.hgvs.org/bg-material/numbering/

Also see this archetype: https://ckm.openehr.org/ckm/archetypes/1013.1.4393. The Start position and End position elements are the ones that are affected by this issue.

yampeku · 4 March 2020 10:13

If combinations are finite (are they?), maybe a mini terminology can be defined with all combinations. That would probably also ease translation and keep it readable.

siljelb · 4 March 2020 10:17

Hi Diego! Combinations aren’t finite, as far as I understand. Also, we’d like to be able to use the data to identify the actual numeric position in the gene or genome, and I think a terminology would be less useful for that.

damoca · 4 March 2020 11:02

Several years ago, in the conclusions of my Master’s thesis, I reasoned about the need to develop specific data types for genetic information. This, combined with some low-level archetypes (CLUSTERs), would be an interesting research topic.

siljelb · 10 March 2020 09:51

7 posts were split to a new topic: Data types for variation in pregnancy?

ian.mcnicoll · 5 March 2020 09:50

At least for the genetic variation, why not just handle it as parsable text? I see parallels with the ISO Duration types: sometimes a wee parsable string is the easiest way to go. I’m not sure I see the value in splitting it up in an archetype. This is a way of encoding the position, but in a precise way (TBH I don’t completely understand!!).

[part of original post moved to Representation of variation in pregnancy]

ce.mascia · 5 March 2020 11:11

Hi all,

regarding the genetic use case, the way in which the position of a mutation is expressed depends on the reference sequence we are using to describe the variation.

The easy case is when we’re using a genomic (g.) sequence and a positive integer is enough to locate the mutation in the genome (g.123). If we’re using a coding (c.) sequence instead, things get complicated, as we may need an expression like c.-85, c.-88-3 but also c.*1 or c.*37-3, as previously pointed out by Silje and Heather. To generalise, we can say that the position needs four elements to be properly represented:

a modifier (+, -, *) — optional
an integer — mandatory
a modifier (+, -) — optional
an integer — optional
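
The four elements listed above can be sketched as a small data structure plus a parser. This is only an illustration, assuming Python: the names `Position` and `parse_position` are hypothetical, and the pattern covers just the four-part shape described here, not the whole HGVS grammar.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Position:
    """Hypothetical container for the four elements of a position."""
    prefix: str            # "", "+", "-" or "*" (optional modifier)
    base: int              # mandatory integer
    offset_sign: str       # "", "+" or "-" (optional modifier)
    offset: Optional[int]  # optional integer, only set with offset_sign


# modifier? integer (modifier integer)?
_POSITION = re.compile(r"([+\-*]?)([0-9]+)(?:([+\-])([0-9]+))?")


def parse_position(text: str) -> Position:
    """Split a position such as '-88-3' or '*37-3' into its elements."""
    match = _POSITION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised position: {text!r}")
    prefix, base, sign, offset = match.groups()
    return Position(prefix, int(base), sign or "",
                    int(offset) if offset else None)


print(parse_position("*37-3"))
```

Keeping the two integers as separate fields is what would let a query compare positions numerically, which is the concern raised earlier in the thread about using a terminology.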

So, if I understand your suggestion, suppose we use a DV_PARSABLE datatype: what is the best way to express the regular expression that the value must match to be considered valid? Should the regular expression be part of the model (maybe through the “Regex_any_pattern” RM attribute of the DV_PARSABLE class) or should we leave the element open and let the implementer check the data validity?

yampeku · 5 March 2020 13:01

It can be part of the constraint on the type. We can be as strict as we want. Just generalizing for this last case, e.g.

[a-z]\.[\+\-\*]?[1-9][0-9]*([\+\-\*][1-9][0-9]*)?

where

[a-z]\. is = g.|c.|etc. (can be nailed to exact letters?)
[\+\-\*]? modifier
[1-9][0-9]* an integer

probably needs more test cases to nail this regex better

If we have more knowledge on each type of sequence, even things like this
(g\.[1-9][0-9]*)|(c\.[\+\-\*]?[1-9][0-9]*([\+\-\*][1-9][0-9]*)?)

can be specified
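
As a quick way to collect the test cases mentioned above, the sequence-specific pattern can be exercised in Python. This is a sketch, not a validated HGVS grammar: the pattern is written with `[0-9]*` after the leading digit so that one-digit positions such as c.*1 are accepted, and the names `HGVS_POSITION` and `is_valid_position` are made up for the example.

```python
import re

# Sequence-specific pattern: plain integers for g., and the
# modifier/integer/modifier/integer shape for c.
HGVS_POSITION = re.compile(
    r"(g\.[1-9][0-9]*)"
    r"|(c\.[+\-*]?[1-9][0-9]*([+\-*][1-9][0-9]*)?)"
)


def is_valid_position(value: str) -> bool:
    """True only if the whole string matches the pattern."""
    return HGVS_POSITION.fullmatch(value) is not None


# Examples taken from this thread, plus a few malformed strings.
accepted = ["g.123", "c.-85", "c.-88-3", "c.*1", "c.*37-3", "c.87+1"]
rejected = ["g.0", "g.-5", "c.87+", "x.12", " c.*37-3"]

for value in accepted + rejected:
    print(repr(value), is_valid_position(value))
```

Using `fullmatch` rather than `search` matters here, because a partial match would let strings with leading or trailing junk through.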

heather.leslie · 6 March 2020 01:18

What Cecilia says!
She is coordinating the SMEs from the content point of view and is the source of truth for the model content. I am absolutely no genomics expert.
Silje & I are merely trying to understand enough of the content in order to suggest the best way to represent the data…
@ce.mascia BTW Silje & I were discussing this and think the ‘*’ might be a different kind of modifier that we may need to represent differently.

All help/advice gratefully received, but I suspect we’ve identified a valid new use case that is challenging the modelling tools/specs.

Cheers

Heather

siljelb · 6 March 2020 07:53

I don’t completely understand this either, although the examples and the figure in the background material linked above help.

I agree a parsable could be used to represent the complex examples. But should we make the element a choice of DV_PARSABLE (for the complex coding sequence use cases) and DV_COUNT |1..*| (for the simple full genome use cases), or should it be DV_PARSABLE only?

And as @ce.mascia asks, is it possible to set a regex to constrain the syntax of the DV_PARSABLE type?

yampeku · 6 March 2020 10:53

Legal in ADL, not sure of tooling support if that’s the question

ian.mcnicoll · 6 March 2020 11:52

Yes, DV_PARSABLE allows you to specify the ‘formalism’ as well as the string of text.

So something like

{
  formalism: "HGVS-Variant-nomenclature",
  value: "c.*37-3"
}

It might be possible to do some validation using regex, but it is arguable whether that adds value.

@Silje - I’d keep everything in the same format, i.e. no DV_COUNT. Leave it to downstream processors/handlers to unpick the different variations in the variances!! At least for now.

The minute you give a choice, someone has to figure out whether this is actually a simple example or a complex one, and the easy ones are easy to parse.
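
To illustrate how a downstream handler might pick out the easy cases, here is a minimal Python sketch. It assumes the stored value is available as a plain dict with `formalism` and `value` keys. The function name `simple_genomic_position` is hypothetical, and the formalism string is only the tentative name floated in this thread.

```python
import re
from typing import Optional

# A plain genomic position: "g." followed by a positive integer.
_SIMPLE_GENOMIC = re.compile(r"g\.([1-9][0-9]*)")


def simple_genomic_position(parsable: dict) -> Optional[int]:
    """Return the integer for a plain 'g.<n>' value, otherwise None."""
    match = _SIMPLE_GENOMIC.fullmatch(parsable["value"].strip())
    return int(match.group(1)) if match else None


easy = {"formalism": "HGVS-Variant-nomenclature", "value": "g.123"}
hard = {"formalism": "HGVS-Variant-nomenclature", "value": "c.*37-3"}

print(simple_genomic_position(easy))
print(simple_genomic_position(hard))
```

Anything that returns None would be handed to a fuller HGVS parser, so the archetype itself never has to distinguish the simple case from the complex one.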

siljelb · 6 March 2020 11:58

This makes sense. It also allows us to always be explicit about whether we’re talking about a coding sequence (‘c.’) or genomic sequence (‘g.’, ‘o.’ or ‘m.’) position.

ce.mascia · 6 March 2020 12:10

To be honest, I’m not a genomics expert either… but I’m reporting the advice of my bioinformatician colleague, Paolo Uva, who has the real knowledge about the subject. From my side, I’m learning a lot from this discussion and from all your insights.
So, the solution seems to be to model the position using a DV_PARSABLE. Thank you all for your help. Of course, if there are other suggestions just let us know.

ian.mcnicoll · 6 March 2020 15:54

Exactly - quite an elegant ‘code’ has been worked out here. I don’t think I can see the benefit of trying to ‘decode it’ inside openEHR, though at some point I could see it become a formal datatype.

siljelb · 10 March 2020 09:27

So it seems like our solution is to use DV_PARSABLE. This means that CLUSTER.inversion_variant has to be republished as v2, but this isn’t a big problem. One thing that we do need to decide is how to name the syntax for this parsable string.

@ian.mcnicoll suggests “HGVS-Variant-nomenclature”. Is this the official way of doing this, or something you made up, Ian? If the latter, is there an official way of referencing the syntax? @ce.mascia

Edit: I’ve uploaded a new revision of the Genetic variant - Inversion archetype, with these proposed changes: https://ckm.openehr.org/ckm/archetypes/1013.1.4422

ian.mcnicoll · 10 March 2020 13:50

Something NLMS would be the ‘custodian’ of this kind of thing ??

siljelb · 11 March 2020 11:51

After today’s meeting, we’ve decided to reverse this decision. The original intent of the specific variant clusters was to be able to work with the atomic data on the genomic sequence level without having to unpack the HGVS syntax in queries, similar to the data found in VCF files.

To be able to represent this properly, we’ll keep the original DV_COUNT representation of positions, and constrain the specific variant archetypes to human genomic DNA. In other words we’re excluding coding sequence DNA, all RNA, protein sequences, and circular DNA. If later requirements turn up for unpacking this information into atomic data points, specific variant archetypes will need to be made for them. However, it will still be possible to record them using HGVS expressions in several elements in the Genetic variant archetype.

Update: New revision of the Genomic inversion variant archetype, based on this decision: Cluster Archetype: Genomic inversion variant [openEHR Clinical Knowledge Manager]
