Data Types

thomas.beale · 1 July 2002 07:05

[original post from Prof John Roddick, Flinders University South Australia, which failed to get through]

thomas.beale · 2 July 2002 02:25

Parsons, S., 1996. Current approaches to handling imperfect information in data and knowledge bases. IEEE Transactions on Knowledge and Data Engineering 8 (3): 353-372.

in which he identifies five types of imperfection in data. Namely:

1. Incomplete. (eg. test results not known or qualified as in "interim results only")

I think this is an aspect of the real-world situation, and just means that the information currently captured is only a "snapshot" along some timeline; later, the final information will (presumably) be available. In openEHR, this would be indicated in the clinical info itself, e.g. pathology results might say "preliminary results". We don't need to do anything special in this case.

In cases like an unconscious person coming to A&E, where the admission form on the screen requires all sorts of things which cannot be answered for now, traditional computer systems do completely the wrong thing, and either prevent the form from being committed with what is known (physical description=xxxx, presenting complaint=partially severed left hand....) or create dummy (but wrong) values for the fields that could not be filled in.

For this kind of situation, we have taken a lead from SCADA control systems (where I learned about software) and HL7's "flavours of null" approach. In control systems, all values have an associated "data quality" marker; if it indicates that the value is "old" or that serial communication from the field has stopped, you ignore the actual value (which might otherwise look like a completely legitimate transformer voltage or whatever). In HL7, all their data types include the notion of Null values in every possible field, and they include a "flavour of null" - the reason why the value is not available - e.g. "unknown", "unavailable", "not asked", "asked but refused", "not applicable" etc (that's from memory so the values might be a bit off).

The approach we have taken in openEHR is similar to the control system approach, and uses HL7's flavours of null. Thus, the class ELEMENT has attributes:
value: DATA_VALUE
null_flavour: DV_CODED_TEXT {value from HL7 null flavours domain}

This approach also works for database systems - there is no need to mix fake null/0 values into the value domain of a value field - the null flavour is a separate field, but always associated with the value field. So even if Oracle forces you to have a real date in the date-of-birth field (e.g. "1-1-1800"), the null_flavour sitting next to it has the value "UNK", meaning "unknown - ignore what is in the value field".
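As a rough illustration of the value-plus-null-flavour pairing, here is a minimal Python sketch. It is not the openEHR ELEMENT class: the class name `Element`, the method names, and the use of a plain string code instead of a DV_CODED_TEXT are simplifying assumptions, and the code table is only an illustrative subset of the HL7 null flavour vocabulary.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Illustrative subset of HL7 null flavour codes (assumption: not the full list).
NULL_FLAVOURS = {
    "UNK": "unknown",
    "NA": "not applicable",
    "NASK": "not asked",
    "ASKU": "asked but unknown",
}


@dataclass
class Element:
    """Hypothetical stand-in for ELEMENT: a value with a separate null marker."""
    value: Any = None
    null_flavour: Optional[str] = None  # None means "the value is real"

    def __post_init__(self) -> None:
        if self.null_flavour is not None and self.null_flavour not in NULL_FLAVOURS:
            raise ValueError(f"unknown null flavour: {self.null_flavour}")

    def is_null(self) -> bool:
        return self.null_flavour is not None

    def usable_value(self) -> Any:
        """Return the value, or None when the null flavour says to ignore it."""
        return None if self.is_null() else self.value


# The database forced a dummy date into the field, but the marker overrides it.
date_of_birth = Element(value="1800-01-01", null_flavour="UNK")
print(date_of_birth.usable_value())  # None
```

The point of the design is that a consumer never has to guess whether "1800-01-01" is a real date: it checks the marker first.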

2. Imprecise. (eg. age "between 25 and 30" etc.). This arises from a lack of granularity.

we definitely have to deal with this. The possible ways include:
- DV_INTERVAL<T> type for ranges
- partial dates & times
- using narrative text

do we need more?
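To make the interval option concrete, here is a minimal Python sketch of a generic range type for an imprecise value such as age "between 25 and 30". The class `Interval` and its `contains` method are hypothetical names for illustration and do not reproduce the DV_INTERVAL<T> definition.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Interval(Generic[T]):
    """Hypothetical range type; a bound of None means 'unbounded on that side'."""
    lower: Optional[T] = None
    upper: Optional[T] = None

    def contains(self, x: T) -> bool:
        if self.lower is not None and x < self.lower:
            return False
        if self.upper is not None and x > self.upper:
            return False
        return True


age = Interval(25, 30)          # "between 25 and 30"
print(age.contains(27))         # True
print(age.contains(31))         # False
over_65 = Interval(lower=65)    # "65 or older", no upper bound
print(over_65.contains(90))     # True
```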

3. Vague. (eg. blood pressure "high", smokes "a lot", pain "acute", etc.) This arises from the use of fuzzy terms.

we also have to deal with this, and the typical clinical version found in pathology and other areas where you get values from sets like {trace, +, ++, +++, ...}.

Currently we have avoided a complex fuzzy data type, and provided the DV_ORDINAL data type, which allows ordinal numbers to be associated with symbols (or words). So for smoking, if you really want to avoid characterising quantitatively, you could use a DV_ORDINAL, which comes from a "Lilliputian DOH tobacco consumption" domain/set: {1=none; 2=occasional; 3=regular/light; 4=heavy; 5=going to die real soon now}. From the medical perspective I imagine that this particular example would be a spectacularly bad way to record this particular datum..... but the model will certainly let you do it, and it will also allow comparison (use of the '<' operator) by virtue of the ordinal numbers associated with the symbols. For recording pain, or the Apgar characteristics, or urinalysis values, this approach seems fairly common among clinicians.

Our idea with DV_ORDINAL was primarily not to prevent doctors from using "+", "++", "+++" type values, and to add a little bit of rigour (ensuring comparability).
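A minimal Python sketch of the ordinal idea follows: each symbol carries an integer, and '<' compares the integers within one domain. The class `Ordinal`, its field names and the "urinalysis" domain label are assumptions for illustration, not the DV_ORDINAL specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ordinal:
    """Hypothetical ordinal value: a symbol with a rank inside a named domain."""
    domain: str
    value: int
    symbol: str

    def __lt__(self, other: "Ordinal") -> bool:
        if not isinstance(other, Ordinal):
            return NotImplemented
        if other.domain != self.domain:
            # Ranks from different symbol sets mean nothing relative to each other.
            raise ValueError("ordinals from different domains are not comparable")
        return self.value < other.value


symbols = ["trace", "+", "++", "+++"]
trace, one_plus, two_plus, three_plus = [
    Ordinal("urinalysis", rank, sym) for rank, sym in enumerate(symbols, start=1)
]

print(one_plus < three_plus)  # True
print([o.symbol for o in sorted([three_plus, trace, two_plus])])
# ['trace', '++', '+++']
```

The clinician still sees and enters only the symbols; the integers exist so that software can sort and compare.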

What we are not doing is implementing a mathematical fuzzy model where each symbol is associated with a sub-section of a numerical range. For those of you into fuzzy maths, you know that to characterise these mappings requires a fair bit of extra information. However, this kind of information can be stored in archetypes, and is not needed in the data (the mappings should not change with respect to the patient), so we should probably consider this when designing the archetype version of the DV_ORDINAL class (and maybe other quantitative classes as well).

4. Uncertain. (eg. a 95% chance of accuracy). Arises from a lack of knowledge or subjective assessment.

for this we include a "confidence: REAL" attribute in the ENTRY class.

5. Inconsistent. (ie. contradictory information).

I'm not sure what should be done about this, but I think it is in the clinical domain; the level of or reason for inconsistency should be characterised in the data by its authors; I don't think it needs anything special in the reference model. (Anyone disagree?)

to that you can add a sixth

6. Out-of-date. (ie. correct when stored but unlikely to be true now).

this is a tricky one, and an example is "smoking status"=smoker which might have been true up until two years ago, but changed then. Also, the converse - the EHR shows that the patient was recorded as a smoker 15 years ago, but there is no new information regarding smoking at all. Is s/he still a smoker? In general the time-based transaction concept of GEHR gives systems the basic tool for recording updates to things.

Sam has been contemplating ways of representing the idea of "confirming" previous information whose value does not change, but for which we want a more recent update on the situation (and medico-legally, the practitioner wants to show in the record that they did indeed review various things on such-and-such a date). This might require a special marker which does not change the value of something, but says that it was verified to be the same. I don't think we have an answer yet for this in the architecture.

These can, of course, be combined!

Incompleteness has traditionally been handled in databases with the null value. In my opinion this has been totally inadequate but that doesn't stop it being the only option available in most systems. Imprecision and uncertainty are often handled through coercion to the nearest value, with all the problems that might cause, and vagueness and inconsistency are often not handled at all. Out-of-date-ness is handled by assuming it doesn't happen.

John's long experience with the horrors of inadequate data handling certainly rings true with me.

For the purposes of GEHR, I would suggest that No. 5, inconsistent data, is a fact of life, and since this is somewhat different (it requires two pieces of information, for example) we should leave this category to constraint handling and expert interpretation.

Agree.

However, I would suggest we need to find a way of handling the other 5. It's not initially clear how though. Perhaps a qualifying field for each critical value?

how do you feel about the current ways of dealing with the problems, detailed above? We would value your expert opinion.

- thomas beale

Douglas_Carnall · 2 July 2002 18:17

In cases like an unconscious person coming to A&E, and the admission
form on the screen requires all sorts of things which cannot be answered
for now, traditional computer systems do completely the wrong thing, and
either prevent the form from being committed with what is known
(physical description=xxxx, presenting complaint=partially severed left
hand....) or create dummy (but wrong) values for the fields that could
not be filled in.

Yes. This is a VITAL thing to recognise: that any data entry method that
forces clinicians to commit when we do not wish to commit will lead either
to:
(1) nonsensical data (garbage in... )
(2) user anomie ("it forced me to lie")

In control systems, all values have an associated "data
quality" marker, which, if it indicates that the value is "old" or that
serial communication from the field has stopped, you ignore the actual
value (which might otherwise look like a completely legitimate
transformer voltage or whatever). In HL7, all their data types include
the notion of Null values in every possible field, and they include a
"flavour of null" - reason for why the value is not available - e.g.
"unknown", "unavailable", "not asked", "asked but refused", "not
applicable" etc (that's from memory so the values might be a bit off).

The need for a language of uncertainty... interesting. Will think more about
this.

>> 2. Imprecise. (eg. age "between 25 and 30" etc.). This arises from
>> a lack of granularity.

we definitely have to deal with this. The possible ways include:
- DV_INTERVAL<T> type for ranges
- partial dates & times
- using narrative text

do we need more?

I often find, when I'm coding a consultation, that I am happy to use a precise
Clinical Term as a starting point for a statement of a diagnosis, but want to
qualify it in some way. Most commonly, I want to add "?" or "??" or "???", or
all three in a list of differential diagnoses of descending probability, or
utility.

e.g.

chest pain ?ischaemic ??muscular ???emotional

One often wants to rule out important, but less likely diagnoses. In the
example above, my hunch might be that the pain has an emotional cause, but I
want to catch the ECG technician before she goes home, then complete my
examination, before bringing the patient back to a longer appointment to
discuss her recent bereavement.

I think the answer to this problem as a system designer is to recognise that
qualifiers are likely to be needed in many situations, and leave a space for
their collection and definition. The space should be as unassuming as
possible about the kind of data that may be found there. Once the system has
been used, one important use for the data gathered in the uncertainty space
will be to enable a retrospective qualitative examination of the kinds of
qualifiers that have been felt useful by clinicians, and attempt some taxonomy
for Version 2.

>> 3. Vague. (eg. blood pressure "high", smokes "a lot", pain "acute",
>> etc.) This arises from the use of fuzzy terms.

we also have to deal with this, and the typical clinical version found
in pathology and other areas where you get values from sets like {trace,
+, ++, +++, ...}.

I'm not sure that I can be certain about the difference between "vague" and
"imprecise"

All of the examples of vague data given above are of course amenable to
further quantification, should someone feel it is important to take the time
to do so. The individual clinician probably knows what he or she means when
using such a term. I carry a number of definitions in my head of what I mean
by "high" blood pressure for example, but they may vary (hopefully not too
wildly) from those of my colleague down the corridor, or those that the
Professor of Cardiovascular Medicine uses.

If the system could record a set of "imprecision preferences" for each
individual user, it could enable:
1) subsequent users of a vague value to get a feel for how the individual who
recorded it has used vague values in other instances;
2) individuals to compare their own use of vague values with others, and
migrate towards a mean.

Currently we have avoided a complex fuzzy data type, and provided the
DV_ORDINAL data type, which allows ordinal numbers to be associated with
symbols (or words). So for smoking, if you really want to avoid
characterising quantitatively, you could use a DV_ORDINAL, which comes
from a "Lilliputian DOH tobacco consumption" domain/set: {1=none;
2=occasional; 3= regular/light; 4=heavy; 5=going to die real soon now}.
From the medical perspective I imagine that this particular example
would be a spectacularly bad way to record this particular datum.....

The most useful quantitative predictor for harm caused by tobacco consumption
is the pack year (=packs/day * years_of_consumption) i.e. if I smoke 10/day
for 15 years this would be 7.5 pack years. But if I was a tobacco cessation
therapist, I might be interested in recording that someone who had previously
smoked 18/day had cut down to 17/day over the last week (a negligible change
in terms of calculating health risk).
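The pack-year arithmetic above can be written out as a short Python sketch. The function name `pack_years` is hypothetical, and the pack size of 20 cigarettes is an assumption implied by the worked figure (10/day for 15 years = 7.5 pack years).

```python
CIGARETTES_PER_PACK = 20  # assumption: implied by the 10/day -> 7.5 example


def pack_years(cigarettes_per_day: float, years: float) -> float:
    """packs/day * years_of_consumption."""
    return (cigarettes_per_day / CIGARETTES_PER_PACK) * years


print(pack_years(10, 15))  # 7.5
```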

Although you're right that your ordinal set is not the way I'd choose to
record smoking data myself, if another clinician chose to use that framework,
I could still draw useful inferences from it subsequently.

There must be lots of data gathering designs out there on this topic; I think

never?/ever?/now? are the main top heads for tobacco consumption;

if never record null value

if now=no but ever=yes
then offer opportunity to record start date, cessation date and pack years

if now=yes
then offer opportunity to record start date and daily consumption

(note, people don't always smoke the same amount each day)

is the way I like it (and think about it) but others might find
implementations based on this irritating. So why not allow clinicians to have
a box in which they can either:

1) just write text about tobacco consumption
2) set up their own structure that is meaningful for them (and share those
structures on the internet, like complicated config files are shared, for
example those of mutt or bash).
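The never/ever/now branching described above could be sketched in Python as follows. The function `tobacco_fields` and the field names are hypothetical; the sketch only decides which follow-up fields a form would offer, and "record null value" for a never-smoker is represented here as offering no further fields.

```python
from typing import List


def tobacco_fields(ever: bool, now: bool) -> List[str]:
    """Return the optional follow-up fields to offer for a tobacco history."""
    if now:
        # Current smoker (smoking now implies having ever smoked).
        return ["start_date", "daily_consumption"]
    if ever:
        # Ex-smoker: now=no but ever=yes.
        return ["start_date", "cessation_date", "pack_years"]
    # Never smoked: nothing further to record.
    return []


print(tobacco_fields(ever=True, now=False))
# ['start_date', 'cessation_date', 'pack_years']
```

Every field is offered, not required, which is consistent with the earlier point about not forcing clinicians to commit to values they do not have.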

but the model will certainly let you do it, and it will also allow
comparison (use of the '<' operator) by virtue of the ordinal numbers
associated with the symbols. For recording pain, or the Apgar
characteristics, or urinalysis values, this approach seems fairly common
among clinicians.

Our idea with DV_ORDINAL was primarily not to prevent doctors from
using "+", "++", "+++" type values, and to add a little bit of rigour
(ensuring comparability).

Let's not inflict rigour on people. Let's offer clinicians delightful
opportunities to express what they thought, and what they mean. Comparison
should be a secondary function.

What we are not doing is implementing a mathematical fuzzy model where
each symbol is associated with a sub-section of a numerical range. For
those of you into fuzzy maths, you know that to characterise these
mappings requires a fair bit of extra information. However, this kind of
information can be stored in archetypes, and is not needed in the data
(the mappings should not change with respect to the patient), so we
should probably consider this when designing the archetype version of
the DV_ORDINAL class (and maybe other quantitative classes as well).

>> 4. Uncertain. (eg. a 95% chance of accuracy). Arises from a lack
>> of knowledge or subjective assessment.

for this we include a "confidence: REAL" attribute in the ENTRY class.

Err... I'm quite happy telling a patient that I'm 90% confident that it's just
a virus, but I'll see them next week if it's not settling down. I'm not so
sure that it'll be meaningful to make comparisons with my colleagues'
statements that they were 85% and 95% confident that the patients they saw
had viruses too.

Just because you've got two numbers doesn't mean that you can perform
arithmetic with them.

>> 5. Inconsistent. (ie. contradictory information).

I'm not sure what should be done about this, but I think it is in the
clinical domain; the level of or reason for inconsistency should be
characterised in the data by its authors; I don't think it needs anything
special in the reference model. (Anyone disagree?)

See my suggestion for a meta/qualifier/uncertainty space above.

>> to that you can add a sixth
>>
>> 6. Out-of-date. (ie. correct when stored by unlikely to be true now).

this is a tricky one, and an example is "smoking status"=smoker which
might have been true up until two years ago, but changed then.

I'm much more relaxed about this. Clinicians are experts at interpreting old
data from clinical records. As long as I know that, at date X, clinician Y
thought Z was true, I can form a judgement about what that means, and what
action if any I need to take to refresh the data now.

Sam has been contemplating ways of representing the idea of
"confirming" previous information whose value does not change, but for
which we want a more recent update on the situation (and medico-legally, the
practitioner wants to show in the record that they did indeed review
various things on such-and-such a date). This might require a special
marker which does not change the value of something, but says that it
was verified to be the same. I don't think we have an answer yet for
this in the architecture.

Sort of like the Unix "touch" command?
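To show what a "touch"-style confirmation might look like, here is a minimal Python sketch. It is purely hypothetical (the thread says there is no answer for this in the architecture yet): the class `RecordedItem`, its fields and its methods are invented for illustration. The value is never modified; only a list of review dates grows.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class RecordedItem:
    """Hypothetical record entry with 'verified unchanged' markers."""
    name: str
    value: str
    recorded_on: date
    recorded_by: str
    confirmations: List[Tuple[date, str]] = field(default_factory=list)

    def confirm(self, on: date, by: str) -> None:
        """Note that `by` reviewed the item on `on` and found it unchanged."""
        self.confirmations.append((on, by))

    def last_verified(self) -> date:
        """Most recent date on which the value was recorded or confirmed."""
        return max([self.recorded_on] + [d for d, _ in self.confirmations])


status = RecordedItem("smoking status", "smoker", date(1987, 3, 1), "Dr Y")
status.confirm(date(2002, 7, 2), "Dr D")
print(status.value, status.last_verified())  # smoker 2002-07-02
```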

>> These can, of course, be combined!

Ha ha! That's the world I live in for sure.

D.

thomas.beale · 4 July 2002 23:38

Douglas Carnall wrote:

In control systems, all values have an associated "data
quality" marker, which, if it indicates that the value is "old" or that
serial communication from the field has stopped, you ignore the actual
value (which might otherwise look like a completely legitimate
transformer voltage or whatever). In HL7, all their data types include
the notion of Null values in every possible field, and they include a
"flavour of null" - reason for why the value is not available - e.g.
"unknown", "unavailable", "not asked", "asked but refused", "not
applicable" etc (that's from memory so the values might be a bit off).

The need for a language of uncertainty... interesting. Will think more about this.

the HL7 ballot has the full explanation if you are interested. But there is a big difference between the way we apply it and the way they do. They specify that not only can a whole datum be Null (with its flavour of null stated), but any attribute thereof can as well. This means that you can get partially populated data items (e.g. a quantity with no units, an interval with missing limits etc) which I have argued is more complex to process, and more likely to result in software errors (given the quality of real world software). Theoretically, there is nothing wrong with their approach (in fact it's quite an interesting idea), but for the moment, we are going to go in a simpler, more expected direction. Time and experience will tell which approach is more appropriate.

2. Imprecise. (eg. age "between 25 and 30" etc.). This arises from
a lack of granularity.

I often find, when I'm coding a consultation, that I am happy to use a precise Clinical Term as a starting point for a statement of a diagnosis, but want to qualify it in some way. Most commonly, I want to add "?" or "??" or "???", or all three in a list of differential diagnoses of descending probability, or utility.

e.g.

chest pain ?ischaemic ??muscular ???emotional

ok - at the moment, we would say that a differential diagnosis would be defined by archetypes, which in your case would be associations of terms and confidence factors expressed as what we call DV_ORDINALs, giving you the ability to just use "?", "??", "???". This means your software could be written to accept exactly what you have put in above.
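As a sketch of software accepting exactly that notation, the following Python function splits such a line into the presenting term and a ranked list of candidate diagnoses. The function `parse_differential` and the rank table are assumptions for illustration; in openEHR terms the ranks would come from an archetype-defined ordinal set rather than a hard-coded dictionary.

```python
import re
from typing import List, Tuple

# Hypothetical ordinal set: fewer question marks means a more likely candidate.
QUALIFIER_RANK = {"?": 1, "??": 2, "???": 3}


def parse_differential(text: str) -> Tuple[str, List[Tuple[int, str, str]]]:
    """Split 'term ?a ??b ???c' into the term and (rank, symbol, diagnosis) items."""
    first = text.find("?")
    complaint = (text if first < 0 else text[:first]).strip()
    items = [
        (QUALIFIER_RANK[marks], marks, term.strip())
        for marks, term in re.findall(r"(\?{1,3})\s*([^?]+)", text)
    ]
    return complaint, sorted(items)


complaint, candidates = parse_differential(
    "chest pain ?ischaemic ??muscular ???emotional"
)
print(complaint)   # chest pain
print(candidates)
# [(1, '?', 'ischaemic'), (2, '??', 'muscular'), (3, '???', 'emotional')]
```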

Once the system has been used, one important use for the data gathered in the uncertainty space will be to enable a retrospective qualitative examination of the kinds of qualifiers that have been felt useful by clinicians, and attempt some taxonomy for Version 2.

agree - we need some more in-use experience before further theorising...

3. Vague. (eg. blood pressure "high", smokes "a lot", pain "acute",
etc.) This arises from the use of fuzzy terms.

If the system could record a set of "imprecision preferences" for each individual user, it could enable:
1) subsequent users of a vague value to get a feel for how the individual who recorded it has used vague values in other instances;
2) individuals to compare their own use of vague values with others, and migrate towards a mean.

what this means is actually using fuzzy quantitative mappings for imprecise terms. The fuzzy (numeric) data has to be carried with the symbolic datum each time, so it can be compared to others' data, and the comparison will work, even if your "high" is someone else's "critical". We haven't yet got this facility, but I think it is important enough to start designing into the archetype model.
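A very small Python sketch of carrying the numeric mapping with the symbol is given below. It uses crisp intervals rather than true fuzzy membership functions, the class `MappedTerm` is hypothetical, and the blood pressure figures are made-up illustrative numbers, not clinical thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedTerm:
    """Hypothetical symbolic value that carries its author's numeric range."""
    symbol: str
    low: float
    high: float

    def overlaps(self, other: "MappedTerm") -> bool:
        """True when the two numeric ranges share at least one value."""
        return self.low <= other.high and other.low <= self.high


# Illustrative numbers only (systolic mmHg); each clinician defines their own.
my_high = MappedTerm("high", 160, 200)
their_critical = MappedTerm("critical", 180, 250)
their_high = MappedTerm("high", 140, 159)

print(my_high.overlaps(their_critical))  # True: different words, same region
print(my_high.overlaps(their_high))      # False: same word, different region
```

Comparing the ranges rather than the words is what lets one person's "high" line up with another's "critical".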

Although you're right that your ordinal set is not the way I'd choose to record smoking data myself, if another clinician chose to use that framework, I could still draw useful inferences from it subsequently.

right.

There must be lots of data gathering designs out there on this topic; I think

never?/ever?/now? are the main top heads for tobacco consumption;

if never record null value

if now=no but ever=yes
then offer opportunity to record start date, cessation date and pack years

if now=yes
then offer opportunity to record start date and daily consumption

(note, people don't always smoke the same amount each day)

is the way I like it (and think about it) but others might find implementations based on this irritating. So why not allow clinicians to have a box in which they can either:

1) just write text about tobacco consumption
2) set up their own structure that is meaningful for them (and share those structures on the internet, like complicated config files are shared, for example those of mutt or bash).

well, this is what archetypes are about. But we can go further with them, since we can allow 2 or 3 alternative smoking archetypes, and computationally convert between them, by comparing their interfaces. This won't necessarily be easy, and in some cases will be very challenging (e.g. comparing a numeric number of packets to "heavy smoker") but the principle is there....

Our idea with DV_ORDINAL was primarily not to prevent doctors from
using "+", "++", "+++" type values, and to add a little bit of rigour
(ensuring comparability).

Let's not inflict rigour on people. Let's offer clinicians delightful opportunities to express what they thought, and what they mean. Comparison should be a secondary function.

it's hidden in the model anyway - they won't see it. But it's useful for pseudo-standardised sets of symbols for e.g. urinalysis

4. Uncertain. (eg. a 95% chance of accuracy). Arises from a lack
of knowledge or subjective assessment.

for this we include a "confidence: REAL" attribute in the ENTRY class.

Err... I'm quite happy telling a patient that I'm 90% confident that it's just a virus, but I'll see them next week if it's not settling down. I'm not so sure that it'll be meaningful to make comparisons with my colleagues' statements that they were 85% and 95% confident that the patients they saw had viruses too.

Just because you've got two numbers doesn't mean that you can perform arithmetic with them.

this is true, but I'm not sure what the alternative is, since at least a % is more neutral than "low", "med", "high". Is there any research in this area, I wonder?

6. Out-of-date. (ie. correct when stored but unlikely to be true now).

this is a tricky one, and an example is "smoking status"=smoker which
might have been true up until two years ago, but changed then.

I'm much more relaxed about this. Clinicians are experts at interpreting old data from clinical records. As long as I know that, at date X, clinician Y thought Z was true, I can form a judgement about what that means, and what action if any I need to take to refresh the data now.

right. I am right behind the idea that the EHR is to help clinicians do their job (which is a lot of the time: thinking, evaluating, deciding...) not try to take it over. We'll leave it to Bill Gates and his paper clip to do that.

Sam has been contemplating ways of representing the idea of
"confirming" previous information whose value does not change, but for
which we want a more recent update on the situation (and medico-legally, the
practitioner wants to show in the record that they did indeed review
various things on such-and-such a date). This might require a special
marker which does not change the value of something, but says that it
was verified to be the same. I don't think we have an answer yet for
this in the architecture.

Sort of like the Unix "touch" command?

that's it.

thanks for the input. I think the two items for us to consider from this are:

a) need for fuzzy quantification in archetypes to correspond to ordinal symbols (+, ++, +++, ?, ??, ??? etc)
b) possible re-evaluation of % as a way of expressing subjective certainty.

- thomas beale
