Z-scores and percentiles

siljelb · 19 November 2021 08:35

Z-scores, or standard scores, and percentiles are used for a whole range of clinical measurements. We haven’t been modelling these into archetypes as of yet, and I’m not sure this would be a good way to represent them, since they’re closely bound to each measurement. So this made me think; could Z-scores and percentiles be represented using an additional RM element of the DV_QUANTITY (edit: or maybe DV_AMOUNT?) data type? Or do we need to model them into every archetype where they’re potentially relevant?

heather.leslie · 20 November 2021 06:55

There was a discussion around this some time ago related to OBSERVATION.child_growth. It was modelled as an inline attribute at the time.
Also, we need similar for EDD in pregnancy - eg 34 weeks 4 days +/- 7 days.
I would prefer it as part of the RM too - for quantity and duration.

siljelb · 22 November 2021 08:48

Yep, it’s needed for spirometry too.

Specs folks, what do you think? @pieterbos @thomas.beale

Isn’t this uncertainty rather than a percentile/Z-score?

thomas.beale · 22 November 2021 10:04

Is it that you want to record just the z-score number, but have it indicated that it is a z-score value, or do you want the z-score number as well as the raw number? If the latter, currently it needs another data element.

The EDD example is an accuracy thing.

siljelb · 22 November 2021 12:04

In most cases we want the base measurement (for example height 150 cm) plus the Z-score (for example height for age 0.91) and/or percentile (for example height for age 81.9).
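
To make the relation between the three numbers concrete, here is a minimal Python sketch. It assumes the reference population is normally distributed, which real growth references do not necessarily do; the function names are illustrative and not part of any openEHR specification.

```python
import math


def z_score(value: float, mean: float, sd: float) -> float:
    """Distance of a raw measurement from the reference mean, in standard deviations."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


def z_to_percentile(z: float) -> float:
    """Percentile (0-100) for a z-score, ASSUMING a standard normal distribution."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


# The height-for-age example above: a z-score of 0.91 maps to roughly the 81.9th percentile.
print(round(z_to_percentile(0.91), 1))
```

Under that assumption the percentile can be derived from the Z-score, but both still depend on which reference population was used, so that reference has to be recorded somewhere as well.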

thomas.beale · 22 November 2021 17:32

I would potentially treat this trio (or more) as a pattern, and possibly create a CLUSTER containing three ELEMENTs, named raw, percentile, z-score or whatever names you want meaning that. Then in some other archetype where you want a ‘tri-value’ of this kind (i.e. the triple), you just use that CLUSTER. In ADL2, you can just plug the archetype directly into the parent archetype - no need for any slot.

If this need is a) fairly common and b) the shape of the data is always the same, i.e. this same 3 items (or 4, or 5 or whatever), then a pattern would make sense. You would see paths like:

.../items[id14|tri-value|]/items[id309|z-score|]/value

or in readable form:

.../tri-value/z-score/value
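
As a rough illustration of the CLUSTER pattern and the readable path, the following Python sketch models the triple as plain objects. The `Element`, `Cluster` and `resolve` names are hypothetical stand-ins, not openEHR RM classes or a real path engine, and the values are the height example from earlier in the thread.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    name: str
    value: float


@dataclass
class Cluster:
    name: str
    items: List[Union["Cluster", Element]] = field(default_factory=list)

    def item(self, name: str) -> Union["Cluster", Element]:
        """Return the direct child with the given name."""
        for child in self.items:
            if child.name == name:
                return child
        raise KeyError(name)


def resolve(root: Cluster, path: str):
    """Walk a readable path such as 'tri-value/z-score/value' starting at root."""
    node = root
    for part in path.strip("/").split("/"):
        if part == "value":
            return node.value
        node = node.item(part)
    return node


# The reusable triple, plugged into a hypothetical parent cluster.
tri_value = Cluster("tri-value", [
    Element("raw", 150.0),
    Element("percentile", 81.9),
    Element("z-score", 0.91),
])
parent = Cluster("height", [tri_value])
```

With this structure, `resolve(parent, "tri-value/z-score/value")` returns the Z-score, and the three values stay visibly grouped under one node.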

Another solution would be a sort of ‘template group’ of ELEMENTs, which would just be the set of 3 ELEMENTs, no CLUSTER; if the tools supported such groups, they would be dragged and dropped into the parent CLUSTER, but it would not necessarily be so obvious that the 3 were together.

Finally, if we regarded this triple (again, I’m assuming triple) as really common and useful - like an Ordinal or a Quantity - we might actually create a new descendant type under DV_QUANTIFIED or DV_ORDERED in the RM, called … STATISTICAL_POPULATION_VALUE or hopefully something shorter.

heather.leslie · 23 November 2021 00:08

Standard deviation is essentially a reflection of the amount of variability within a given data set.
EDD +/- n days is documenting the variability in estimating the EDD at a given gestation. We can argue the semantics but the issues around representation are the same.

pieterbos · 23 November 2021 09:58

It seems like we have three separate but very much related things in this discussion:

accuracy of a single measurement or estimate
a measure of variability and distribution of the measurement in a population
a way of indicating how a single measurement relates to the distribution, so the z-score or percentile.

Accuracy, so the ± x days example, seems well specified by the accuracy and accuracy_is_percent attributes of DV_AMOUNT and its subclasses, as specified in the Data Types Information Model. Is that a good way to solve that particular issue of ‘this many weeks, ± 7 days’?

I think standard deviation is different in the sense that it is not an accuracy - the measurement could have any kind of accuracy plus a mean value and standard deviation. This relates to a distribution of measurements, and also the mean value will need to be stored besides the standard deviation. There are reference ranges in the RM, in which μ ± σ, so the mean ± standard deviation, could be stored, with the meaning indicating what it is, preferably with some (snomed?) coding. I would prefer to indicate mean and standard deviation, instead of μ ± σ, but I guess this could work within the current specification?

Then the Z-score or percentile: maybe a different DV_QUANTITY, with the z-score, perhaps in a cluster with the measurement value? In these kinds of scores, is the distribution also needed in the data? In case of a normal distribution, that is sort of possible in reference range in the same way as above, but if it’s not a normal distribution (possible with percentile score, probably not with z-score?), that will add more complexity.
Do these indications with a distribution occur often? Often enough to necessitate an addition to the specification?

thomas.beale · 23 November 2021 10:34

Yes - this represents single measurements or estimates, with accuracy, i.e. error.

Right - it’s an artefact of the statistical analysis of a population.

Well, theoretically, but then we’re overriding the normal meaning of that part of the model, and people then have to write special code to look for special reference range names to see that it’s a SD band instead.

I think this is likely in the general case, so if we were thinking of adding a type to the RM, I’d probably want to model it more fully, including a field for name of distribution. Also, the purpose of a type called something like STATISTICAL_VALUE is likely to make code and data much safer. Also, such a type would be useful in secondary / aggregated applications.

pieterbos · 23 November 2021 10:55

The concept of a reference range does not seem to be well defined enough in the specification to say it is something else. If you take the definition in Reference range - Wikipedia, then it would be overriding the normal meaning, but the RM specification does not seem to limit it to that.
However, often when the standard deviation is important, it is actually to define a reference range. That will in most cases not be mean +- standard deviation, but more likely to be mean +- n*standard deviation. So my question to the modellers is: is this to indicate something that fits in a definition of a reference range, or something else?
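
For illustration only, a band of mean ± n standard deviations could be derived as in the Python sketch below. The function names and the default of n = 2 are assumptions made for the example, not something the RM or this thread prescribes.

```python
from typing import Tuple


def sd_band(mean: float, sd: float, n: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds of mean ± n * sd."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return (mean - n * sd, mean + n * sd)


def within_band(value: float, band: Tuple[float, float]) -> bool:
    """True if value lies inside the band, bounds included."""
    lower, upper = band
    return lower <= value <= upper


# Hypothetical numbers, only to show the arithmetic.
band = sd_band(mean=100.0, sd=10.0, n=2.0)
print(band, within_band(115.0, band))
```

Such a band could be stored as a lower and upper bound, but the mean, the standard deviation and n are not recoverable from the two bounds alone, which is part of the question being asked here.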

Could be solved with RM changes, or just by modelling this with a couple of ELEMENTS or a standardised archetyped CLUSTER.
If RM changes, I am not sure if it should be a new data type. The measurement or quantity is still a regular number, often with a unit, so a subclass of DV_AMOUNT, and not a ‘statistical value’. What is desired here is a bit of extra information on how this number is to be interpreted, which in these cases happens to be information about the relation between the quantity and the distribution of this quantity in the population. Could just be some changes to DV_AMOUNT or DV_QUANTITY to add this information?

siljelb · 23 November 2021 10:58

Agree, thanks for putting it this clearly.

I’m not sure this is usable for the ‘± 7 days’ example, as accuracy in DV_AMOUNT is Real and not a type with units. Another example could be ‘3h7m ± 2 minutes’, for which we need to specify which unit we’re talking about.

I think this needs to be closely tied to the measurement in question, and I don’t see how we could closely associate a CLUSTER archetype with a specific data element in each archetype. I think an RM parameter of DV_AMOUNT would be better. For that I think we need another class, say STATISTICAL_VALUE as suggested by @thomas.beale, consisting of:

value (Real, for example ‘65’)
type (DV_TEXT, for example ‘Percentile’)
distribution (DV_TEXT, for example ‘2000 CDC Growth Charts for the United States’)

Edit: This class would have to be able to be repeated in a List, similar to mappings and other_reference_ranges.
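
A minimal Python sketch of that proposal might look as follows. The class and attribute names mirror the list above but are hypothetical, the DV_TEXT fields are simplified to plain strings, and `statistical_values` is an invented attribute name for the repeated list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StatisticalValue:
    value: float        # e.g. 65
    type: str           # e.g. "Percentile" (DV_TEXT in the proposal)
    distribution: str   # e.g. the name of the reference growth chart


@dataclass
class Quantity:
    """Simplified stand-in for a DV_QUANTITY carrying the raw measurement."""
    magnitude: float
    units: str
    # Repeatable, similar in spirit to mappings and other_reference_ranges.
    statistical_values: List[StatisticalValue] = field(default_factory=list)


chart = "2000 CDC Growth Charts for the United States"
height = Quantity(150.0, "cm", [
    StatisticalValue(81.9, "Percentile", chart),
    StatisticalValue(0.91, "Z-score", chart),
])
```

This keeps the scores attached to the measurement they were derived from, at the cost of adding an optional attribute to the quantity type.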

ian.mcnicoll · 23 November 2021 11:09

Does this ever exist in the absence of an actual ‘magnitude’, i.e. is it ever only the STATISTICAL_VALUE that is needed?

siljelb · 23 November 2021 11:11

I can’t say, but I would be surprised if it was ever separated from the value it’s derived from.

thomas.beale · 23 November 2021 11:22

It might not be documented well enough in the spec, but the design intention was clear from the start: represent reference ranges in lab results, vital signs and any other observable for which such a range is commonly used in medicine.

That can be true, but today reference ranges could just as easily be derived by data mining (i.e. comparing input variables to outcomes). Practically speaking, the reference ranges in openEHR are not trying to do anything other than represent the ranges used by labs or other sources for that kind of patient (say, pregnant woman) for the analyte in question.

Well then we’re adding optional fields that will be void on 95% of all data, but will probably confuse developers reading the documentation. I’d prefer to see another data type, or a wrapping data type, e.g. it could be done in the form STATISTICAL_VALUE<DV_QUANTIFIED> where the outer class is adding the extra bits and pieces, and the DV_QUANTIFIED (usually a DV_QUANTITY) carries the original raw value (potentially with its own reference ranges).
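
One possible reading of the wrapping idea, sketched in Python with generics. This is only an interpretation under assumptions: the class names, the `raw`, `score`, `score_type` and `distribution` fields and the example values are hypothetical, and nothing here is an existing RM type.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")  # the wrapped quantified type, usually a quantity


@dataclass
class DvQuantity:
    """Simplified stand-in for DV_QUANTITY."""
    magnitude: float
    units: str


@dataclass
class StatisticalValue(Generic[T]):
    """Outer wrapper: adds statistical context around an untouched raw value."""
    raw: T              # the original value, which could keep its own reference ranges
    score: float        # the z-score or percentile number
    score_type: str     # e.g. "z-score" or "percentile"
    distribution: str   # name of the reference distribution


height: StatisticalValue[DvQuantity] = StatisticalValue(
    raw=DvQuantity(150.0, "cm"),
    score=0.91,
    score_type="z-score",
    distribution="hypothetical height-for-age reference",
)
```

The wrapped type stays unchanged, so data that never carries a score does not gain any optional fields.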

thomas.beale · 23 November 2021 11:24

Accuracy in DV_DURATION is a DV_DURATION. See here.

siljelb · 23 November 2021 11:31

The problem is, Z-scores and percentiles could potentially be used on any clinical measurement, from IQ tests via lab results and head circumferences, to spirometry. I would think it would be used across a larger number of different concepts than reference ranges, which are mainly used for lab results.

thomas.beale · 23 November 2021 11:43

This points even more toward modelling it as something like STATISTICAL_VALUE<DV_QUANTIFIED> because then you can just have a statistical version of any other kind of value.

But we probably need a comprehensive statement of the problem first - maybe a wiki page? I wouldn’t like to half solve this…

siljelb · 23 November 2021 11:48

How would that work in practice?

pieterbos · 23 November 2021 11:57

If a DV_DURATION, I guess that would be represented either in seconds or as a percentage? It’s a bit technical, but very possible to build a user interface that allows input in days/weeks/etc and stores it in seconds or as a percentage.
If DV_QUANTITY with unit weeks, it can be represented as a fraction of weeks, since it is a real number. A bit ugly to represent currently, but it would be correct.
I guess a change to make DV_DURATION.accuracy an iso_duration itself might be better, but it would be a breaking change.
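
To show the arithmetic behind those three options, here is a small Python sketch using the ‘34 weeks 4 days ± 7 days’ example from earlier in the thread. The function names are made up for the illustration, and how the resulting number would actually be stored in the accuracy attribute is left open.

```python
from datetime import timedelta


def accuracy_in_seconds(accuracy: timedelta) -> float:
    """Accuracy expressed as a plain number of seconds."""
    return accuracy.total_seconds()


def accuracy_as_percent(estimate: timedelta, accuracy: timedelta) -> float:
    """Accuracy expressed as a percentage of the estimate."""
    return 100.0 * (accuracy / estimate)


def accuracy_in_weeks(accuracy: timedelta) -> float:
    """Accuracy as a fractional number of weeks, for a quantity with unit weeks."""
    return accuracy / timedelta(weeks=1)


estimate = timedelta(weeks=34, days=4)   # 242 days
accuracy = timedelta(days=7)

print(accuracy_in_seconds(accuracy))               # 604800.0
print(round(accuracy_as_percent(estimate, accuracy), 2))
print(accuracy_in_weeks(accuracy))                 # 1.0
```

The percentage form depends on the estimate itself, so the same ± 7 days becomes a different number at every gestational age, whereas seconds or fractional weeks stay constant.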

siljelb · 23 November 2021 12:38

Isn’t that for DV_TEMPORAL, which DV_DURATION afaics doesn’t inherit?