character sets and languages in openEHR

thomas.beale · 6 March 2004 22:57

A couple of technical questions prior to declaring the 0.9 baseline in openEHR:

One of the major openEHR implementors here in Australia has suggested moving the attributes 'language' and 'charset' in the class DV_TEXT to some higher level class - e.g. COMPOSITION, since almost all the time they are the same on all DV_TEXT items in a given EHR. We don't think it should be that high, since language cannot be guaranteed the same throughout a COMPOSITION (in their scheme, you would set the attribute on COMPOSITION and then override it on lower nodes if they were different; however, I am very wary of this sort of logic - HL7 uses it a lot and it really complicates things for developers; at the moment we prefer to avoid it completely). One possibility is to move the language attribute to the ENTRY class, on the basis that an ENTRY is the minimum indivisible unit of information in openEHR (this is true, even for 'large' Entries like a microbiology test result). It was initially on DV_TEXT for safety reasons - you would always know what language a text fragment is in (this is important for words which have the same appearance but different meanings in different languages); however, ENTRY is probably just as safe from this point of view.

Q: can anyone think of a scenario where there could be multiple languages inside an ENTRY?

Character set is more difficult to work out. So far, we have specified that Unicode should be used in all strings. This means that in theory there is no need to record the character set name (e.g. iso-latin-1, iso-greek, etc). However, there is still a need to choose between UTF-8, UTF-16 and so on in Unicode. And in any case, I am unsure if all implementation technologies implement Unicode in strings; is there a legacy reason to store non-Unicode character set names anyway?

- thomas beale

Eric_Browne1 · 17 March 2004 10:13

Tom,

I have pondered the same issue before. I think it unlikely that language
would change inside an entry, but I did think of the possibility of
medicines, e.g. Chinese medicines, or parts thereof, being described by
specifically foreign names.

cheers,
eric
[ btw, you may wish to check your computer's date/time. I know Queensland
lags in some respects, but 3 days would make the cows very sore! :-)]

system · 17 March 2004 13:51

Hi,

Anamnesis in psychiatry:
And then the disturbed patient said: "Merdre". [Translation: shit]

Family history:
My father was diagnosed as suffering from: "Engelse ziekte" [Translation: rickets]

Coding systems:
ICPC-1 Dutch version.
Code: R05.
Displayed text: Hoest
Added translation: Cough

Gerard
-- <private> --
Gerard Freriks, arts
Huigsloterdijk 378
2158 LR Buitenkaag
The Netherlands

+31 252 544896
+31 654 792800

b.cohen · 17 March 2004 17:54

The single most common cause of absenteeism in France is a complaint recorded as
"crise de foie", a diagnosis that is translatable into English only as
"hangover".


Tim_Churches · 17 March 2004 19:50

Yes, I thought of examples which were similar to these. And it is not
just a matter of the recording health professional not knowing what
"Engelse ziekte" means, and thus having to record it verbatim and
untranslated - many diagnoses have no equivalent in other
languages/cultures, and are thus untranslatable (at least not without
some information loss). Given that the "foreign" language text may
require accented characters, or even a completely different character
set, then the Unicode encoding used for the entry will need to be
captured as well as the language, unless openEHR will be restricted
purely to one Unicode encoding, such as UTF-8. Remember the golden rule
with Unicode: "If you don't know the encoding, you don't know nuffin'."

The only problem with "UTF-8 everywhere" is that it is Roman alphabet
chauvinistic, in that the basic Roman characters are all represented
with one byte, but everything else needs two or more bytes. That dooms all
Russian openEHR records to using twice as much storage as the equivalent
English openEHR records. In these days of massive cheap disc storage and
high speed networks, that fact probably doesn't matter, but it just
seems unfair, although I can't think of a better alternative. As an
English speaker, I would not be keen if openEHR mandated the use of
UTF-16, thus forcing me to use two bytes for every letter. Yet that's
what UTF-8 forces Russians, and Greeks, and Thais and Vietnamese and
just about every other non-Roman alphabetic language speaker to do. Of
course, ideographic languages like Chinese are doomed to use more than
one byte per character, but then the language itself encodes a lot more
information in each character, so it probably works out about the same
in the end.
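
As a rough illustration of the storage point above (an editorial sketch, not part of the original post), the snippet below counts bytes per word under UTF-8 and UTF-16. The sample words and the helper name `encoded_sizes` are arbitrary choices made for this example. Cyrillic letters take two bytes each in UTF-8, while Thai and common Chinese characters take three; in UTF-16 all of these take two bytes each.

```python
# Illustrative only: the sample words are arbitrary, chosen to cover several scripts.
samples = {
    "English": "cough",
    "Dutch": "hoest",
    "Russian": "кашель",
    "Thai": "ไอ",
    "Chinese": "咳嗽",
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for a string."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, word in samples.items():
    chars, utf8, utf16 = encoded_sizes(word)
    print(f"{language:8} {chars} code points, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

Which encoding is smaller therefore depends on the script, which is one reason a single "fair" choice is hard to find.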

David1 · 17 March 2004 20:51

Some work is being done with upper ontologies that enable the conversion of concepts etc from one language system to another (ABC model?). This could be between different languages or between say health and welfare sectors of one country. I was involved recently in a meeting with the CEO of DSTC, a government/education/private sector cooperative. I have only had a fleeting glimpse of what they are doing but it may have some bearing on this sort of thing.

DSTC is involved with one of the proof projects for Australia’s HealthConnect initiative and are working with archetypes. Sam H, Thomas B and Peter S are probably already aware of at least some of their activities. Not sure if DSTC are party to this discussion group.

CEO of DSTC is Mark Gibson, mark.gibson@dstc.edu.au

David Neilsen
AIHW


thomas.beale · 17 March 2004 22:46

Tim Churches wrote:

> Yes, I thought of examples which were similar to these. And it is not
> just a matter of the recording health professional not knowing what
> "Engelse ziekte" means, and thus having to record it verbatim and
> untranslated - many diagnoses have no equivalent in other
> languages/cultures, and are thus untranslatable (at least not without
> some information loss).

Actually, these kinds of expressions are not the problem - they can happily be recorded inside a DV_TEXT object which has the language set to English or Dutch or whatever it may be; an inline occurrence of a 'foreign' term that is routinely used by speakers of a different language (the way we use 'gesundheit' or 'triage' in English) can be assumed to be understood and is probably even in the dictionary of the language of narration.

The problem is when there are text fragments recorded where the words are viable in more than one language, and do not usually have the same meaning in each. Words in Danish & Norwegian should be almost the same, but I assume there are by now some small differences; there are certainly words in most of the European languages which occur in another language, and are completely unrelated. So in theory a language marker is needed to ensure that a later reader knows what language the words were in (maybe even to allow them to know what kind of translator to call). So the question remains - do we need the ability to have multiple languages inside a single entry? For Gerard's examples - would it really be necessary to indicate what the other languages were or not, given that they are probably obvious to most users who will use them?

The real reason for the question is that having to record language everywhere all the time means wasting a certain amount of data storage on every text fragment stored in the record; the alternative seems to be to record it on Entry; if we decide that it has to be possible to have text fragments within an Entry for which the name of a different language is actually recorded, we can use an optional language attribute on DV_TEXT which is understood as overriding the value elsewhere. In general I am against this kind of overriding of values in lower objects in a composition - it is not OO, and it is often misunderstood by programmers given the specifications; in general it is dangerous. However, maybe this is an exception which justifies its use....

As for Unicode, obviously we cannot do much about the standard; but I guess someone had to have the 8-bit part of the code space.

system · 17 March 2004 23:39

Hi,

The examples I provided were those that I could think of.

The real question to be asked is:
Why would we want to record the 'language' of a text fragment?
The only correct answer will be:
Because of computational reasons.

In the light of this there is no real use case for this attribute in question other than to indicate in what language the author is documenting its provision of healthcare.
Coding systems will have to be used to indicate in an 'absolute' sense the meaning of things in a computational and language independent way.

If and when this assumption is true then the level of Composition (somewhere high) will be appropriate to record this optional attribute.

Gerard

Puvanendran_SenthilR · 18 March 2004 02:56

Hi all,
First of all, I am new to the openEHR project. I got this web link
from a professor at the University of Queensland, because I have a
research interest in this area.

I am from Sri Lanka and work for clients in Singapore and Taiwan,
mainly in the healthcare sector. I have good experience with Medical
Laboratory Information Systems and with the HL7 and ASTM standards. All
of those projects are multilingual systems (English and Traditional
Chinese).

Based on that experience, I would recommend using UTF-16. Patient
records contain a lot of free-text results, especially in
microbiology, and these can be in other languages, depending on the
place and the HIS. I may have misunderstood your project concepts; if
so, I apologise. I would like to participate in this email group, to
share my knowledge and to hear your ideas in this area.

Thanks.

Regards,
P. Senthilruban

thomas.beale · 18 March 2004 12:32

gfrer wrote:

> Hi,
>
> The examples I provided were those that I could think of.
>
> The real question to be asked is:
> Why would we want to record the 'language' of a text fragment?
> The only correct answer will be:
> Because of computational reasons.
>
> In the light of this there is no real use case for this attribute in question other than to indicate in what language the author is documenting its provision of healthcare.
> Coding systems will have to be used to indicate in an 'absolute' sense the meaning of things in a computational and language independent way.

I agree about the use of codes; but when we have narrative text which is not coded, the meaning could be ambiguous for human readers, and also natural language processors, if not more mundane computing functions. I can imagine that this might be more important in psychiatry or other disciplines where a lot of narrative is generated.

- thomas

Gavin_Brelstaff · 18 March 2004 16:48

Thomas Beale wrote:

> Tim Churches wrote:
>
>> Yes, I thought of examples which were similar to these. And it is not
>> just a matter of the recording health professional not knowing what
>> "Engelse ziekte" means, and thus having to record it verbatim and
>> untranslated - many diagnoses have no equivalent in other
>> languages/cultures, and are thus untranslatable (at least not without
>> some information loss).
>
> Actually, these kinds of expressions are not the problem - they can happily be recorded inside a DV_TEXT object which has the language set to English or Dutch or whatever it may be; an inline occurrence of a 'foreign' term that is routinely used by speakers of a different language (the way we use 'gesundheit' or 'triage' in English) can be assumed to be understood and is probably even in the dictionary of the language of narration.
>
> The problem is when there are text fragments recorded where the words are viable in more than one language, and do not usually have the same meaning in each. Words in Danish & Norwegian should be almost the same, but I assume there are by now some small differences; there are certainly words in most of the European languages which occur in another language, and are completely unrelated. So in theory a language marker is needed to ensure that a later reader knows what language the words were in (maybe even to allow them to know what kind of translator to call). So the question remains - do we need the ability to have multiple languages inside a single entry? For Gerard's examples - would it really be necessary to indicate what the other languages were or not, given that they are probably obvious to most users who will use them?

You might also consider the false-friend problem - when learning another
language we optimistically expect words to mean the same as they do in
our native tongue. For example I was surprised that "water" in Italian
means W.C. not acqua!

Dipak_Kalra · 18 March 2004 17:47

Dear All,

The proposal being discussed necessitates that THERE BE NO REQUIREMENT FOR a Composition (or maybe an Entry) to internally formally specify the language used for a textual expression, to differentiate it from the language applying to the Composition/Entry as a whole. This ought not to be determined by storage concerns, but on the basis of the requirement. If we cannot be confident that there is no need, we ought not to remove the present functionality which permits it.

When considering this issue the scenarios that come most to my mind are genuinely multi-lingual cultures or contexts. In some countries more than one language prevails, and health care professionals might be proficient in more than one of these. In such situations, the health care agent might have a principal language for record-keeping and an ability and a wish to capture some aspects of the consultation in a patient's own (different) language, such as a symptom description. The HCA might also wish to compose some patient instructions in the patient's own language knowing, perhaps, that the patient can go home and view this EHR on their own computer. Although London is not officially a multi-lingual city, I was working in east London with Bengali-speaking health workers to pilot such schemes (using multi-lingual paper based records) several years ago.

Since we do permit pre-existing EHR information (e.g. an Entry) to be included as a referenced copy in a new Composition, there might also arise a situation when the information so included was recorded in another language.

One question raised in discussions so far is if we need to formally specify the language, or if it is obvious and can be inferred. For words that have been absorbed from other languages (e.g., in English, laissez faire) this might be true, but there is a risk that the same word in two languages has different meanings. Torbjorn Nystadnes has told me of one European example:

" Rolig "

Norwegian = quiet (calm, peaceful)
Swedish = funny

in Norway: He died 'rolig' = he died peacefully
in Sweden: you can imagine!!

Whilst natural language translation facilities are still limited, you might feel it is OK to rely upon human "common sense" in such situations. But when records really are travelling across the globe, and such translation software is mature, will we have prevented a valuable aid to safe health care?

A second concern that comes to mind, maybe erroneously, relates to applications and EHR systems that are used across national boundaries. For example, an English-language application being used in France. Consider perhaps that the form labels have been translated into French but not the underlying code. Might some attribute values such as the Name be committed to the EHR in English (determined by the form field being used, and as encoded by the app developer, and not a Name value chosen by a French user), alongside the textual value of that box (written by the user in French)? If the language attribute is removed from the DV_TEXT class can we still represent an Element whose Name (or some other attribute value) is in English but whose textual Data Value is in French?

A third concern is interoperability. Since both CEN and HL7 presently carry language as part of textual data types, is it going to be unhelpful for us to do this differently? i.e. will we ALWAYS be able to map safely between the proposed modification to openEHR and CEN/HL7? I am no great champion of gratuitously identical behaviour, but we do also need to help bring interoperable EHRs into existence despite all of the business drivers to the contrary!

So, are we confident that we can remove this function from lower levels of the model?

With best wishes,

Dipak

Tim_Cook1 · 19 March 2004 03:20

Getting in late on comments but.........

> some higher level class - e.g. COMPOSITION, since almost all the time it
> is the same on DV_TEXT items in a given EHR. We don't think it should be
> that high, since language cannot be guaranteed the same throughout a
> COMPOSITION

I wholly agree with your analysis.

The key trigger phrase above is "almost all the time". Anytime there is
vagueness then a solution should be taken into account. This really is
the real reason for this specification and model anyway isn't it? To
get away from all those "it hardly ever happens", "we'll use the notes
field for that" or "five is enough addresses" ... instances in other
models.

The scenarios given have been excellent and I especially appreciate
Dipak's comment; "But when records really are travelling (sic) across
the globe, and such translation software is mature, will we have
prevented a valuable aid to safe health care?" That kind of vision
shared by all those that have worked so hard for so long on this is what
makes it the prime solution that it is going to be.

Sorry....broke into a little cheerleading there.....<g>

Ciao,
Tim

thomas.beale · 19 March 2004 14:36

Agree with Tim's comments - health information is not something we can be sloppy with. Also, Dipak's Norwegian example was exactly the kind of example I was thinking of - where thinking you know what language the words are in is probably dangerous.

A practical solution which satisfies the need for non-ambiguity of language, but doesn't cause too much data excess in EHRs in very mono-lingual contexts (e.g. most anglo countries, but also a surprising number of other countries, e.g. Brazil, Russia, France...), is:

ENTRY class has
- a mandatory language attribute
- a mandatory character encoding attribute (says which flavour of Unicode). This forces the whole ENTRY to be encoded the same way no matter what, but also allows distinct ENTRYs to be encoded in e.g. UTF-8, UTF-16.

DV_TEXT class has
- an optional language attribute, which is understood to override the one from its enclosing ENTRY.
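
The proposal above can be sketched in a few lines (an editorial illustration, not openEHR reference code). The class names `Entry` and `DvText`, the method `effective_language`, and the use of two-letter language codes are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DvText:
    value: str
    # Optional: when None, the language of the enclosing Entry applies.
    language: Optional[str] = None


@dataclass
class Entry:
    language: str  # mandatory; two-letter codes are assumed here
    encoding: str  # mandatory, e.g. "UTF-8" or "UTF-16"
    items: List[DvText] = field(default_factory=list)

    def effective_language(self, text: DvText) -> str:
        """Language of a text item: its own override if set, else the Entry's."""
        return text.language if text.language is not None else self.language


entry = Entry(language="en", encoding="UTF-8")
entry.items.append(DvText("Father was diagnosed with"))
entry.items.append(DvText("Engelse ziekte", language="nl"))

for item in entry.items:
    print(entry.effective_language(item), "-", item.value)
```

In this sketch a mono-lingual record stores one language value per Entry, and only the exceptional fragment carries its own.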

further thoughts from the group?

- thomas beale


system · 20 March 2004 20:21

I agree.

GF

Hoylen_Sue · 22 March 2004 23:24

It is not necessary for openEHR to specify the encoding
format (UTF-8, UTF-16, etc).

Since openEHR does not dictate an implementation or
transport format, it does not need to -- and should not --
specify the character encoding format.

Just saying text will be using the Unicode character set
(and maybe indicating which particular version of Unicode is
being used, version 4.0 is currently the latest) is
sufficient.

For example, if you are encoding openEHR records using XML,
the XML format already has its own mechanism for identifying
the character encoding of the document (the XML declaration,
BOM, etc). Having the character encoding in the Entry would
be meaningless and a potential source of conflict.
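
To illustrate the XML point (an editorial sketch using Python's standard `xml.etree.ElementTree`; the element name `text` and the helper `as_xml_bytes` are made up for the example): the parser is handed raw bytes and recovers the same text from either encoding, relying only on the byte order mark and the XML declaration.

```python
import xml.etree.ElementTree as ET

text = "crise de foie / Hoest / кашель"


def as_xml_bytes(value, encoding):
    """Serialise a one-element document as bytes in the given encoding."""
    document = f'<?xml version="1.0" encoding="{encoding}"?><text>{value}</text>'
    # Python's "utf-16" codec prepends a byte order mark.
    return document.encode(encoding)


for encoding in ("utf-8", "utf-16"):
    raw = as_xml_bytes(text, encoding)
    # No encoding is passed to the parser; it is taken from the document itself.
    parsed = ET.fromstring(raw)
    print(encoding, len(raw), "bytes ->", parsed.text)
```

A separate encoding field inside the record could then disagree with what the document itself declares, which is the conflict described above.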

Hoylen

thomas.beale · 6 April 2004 02:33

Hoylen Sue wrote:

> It is not necessary for openEHR to specify the encoding
> format (UTF-8, UTF-16, etc).
>
> Since openEHR does not dictate an implementation or
> transport format, it does not need to -- and should not --
> specify the character encoding format.
>
> Just saying text will be using the Unicode character set
> (and maybe indicating which particular version of Unicode is
> being used, version 4.0 is currently the latest) is
> sufficient.

I wonder if this is true for people using openEHR-based components via an API rather than communicating via data messages. I assume that the Unicode implementation used in the String type in most of today's languages makes it easy to determine what width Unicode characters you have in the data?

I agree that being able to commit to as little as possible and still get the effect of standardisation is completely desirable.

- thomas beale

Puvanendran_SenthilR · 6 April 2004 02:46

This is the second time I am sending my comments. I too
agree with Thomas Beale, but I think the UTF-16 format is a really
good solution for this problem.
If capacity is the concern, then we
should have some standardization for this.

Regards,
P.Senthilruban

Hoylen_Sue · 6 April 2004 03:37

Thomas Beale <thomas@deepthought.com.au> writes:

> I wonder if this is true for people using openEHR-based components via
> an API rather than communicating via data messages. I assume that the
> Unicode implementation used in the String type in most of today's
> languages makes it easy to determine what width Unicode characters you
> have in the data?

In all the cases I know of, once the data has been read into
the native string type, discovering its "width" is no longer
an issue. This is because the native string type is defined
to support only a single encoding. Data is converted into
that encoding when it is read in.

The string type in a language is not the same between
different languages. For example, in C# the string type
contains UTF-16, whereas in some Unix string libraries they
are in UTF-8, and some C++ libraries use UTF-32.

APIs could be another reason not to specify the encoding
format in the standard. Remember, character set is an
independent issue to encoding.
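
A small illustration of this point, using Python as the example language (an editorial sketch; the sample text is arbitrary): bytes stored in three different encodings all decode to the same native string, and the encoding only matters at the boundary where bytes are read or written.

```python
text = "Engelse ziekte: рахит"

# The same text as it might arrive from three different stores.
stored = {
    "utf-8": text.encode("utf-8"),
    "utf-16": text.encode("utf-16"),  # includes a byte order mark
    "utf-32": text.encode("utf-32"),  # includes a byte order mark
}

# Decoding converts each byte sequence into Python's single native str type.
decoded = {name: raw.decode(name) for name, raw in stored.items()}

for name, raw in stored.items():
    print(f"{name}: {len(raw)} bytes stored -> {len(decoded[name])} characters in memory")

# Decoding with the wrong encoding silently yields different text, which is
# why the encoding has to be known where the bytes are read.
misread = stored["utf-8"].decode("latin-1")
print(misread == text)  # False
```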

Hoylen

P.S. As an aside, Java 1.1 and later uses Unicode 2.0
character set. So if Unicode 3.0 or Unicode 4.0 is the
target character set, implementations may be forced to
implement their own string class rather than using the
native java.lang.String. Something to consider when picking
which Unicode version to use as the standard character set.

thomas.beale · 6 April 2004 04:14

Hoylen Sue wrote:

> Thomas Beale <thomas@deepthought.com.au> writes:
>
>> I wonder if this is true for people using openEHR-based components via
>> an API rather than communicating via data messages. I assume that the
>> Unicode implementation used in the String type in most of today's
>> languages makes it easy to determine what width Unicode characters you
>> have in the data?
>
> In all the cases I know of, once the data has been read into
> the native string type, discovering its "width" is no longer
> an issue. This is because the native string type is defined
> to support only a single encoding. Data is converted into
> that encoding when it is read in.

So if it is stored in UTF-16, say, the library in a Unix application will detect the relevant byte pattern and dispatch the appropriate conversion routine to do utf-16 -> utf-8, or utf-32 -> utf-8? If I remember correctly, this is possible because one can tell from the actual binary data which width encoding is being used. This matters because there is no guarantee that the data is in XML form, with its encoding stated.
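
As an editorial aside, here is a minimal sketch of the kind of detection being asked about. It only works when the data starts with a byte order mark (BOM); without one the encoding cannot be reliably read off the bytes, so the sketch falls back to an explicitly declared encoding. The function names `sniff_encoding` and `to_utf8` are hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 marks are checked first because the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(raw):
    """Return (encoding, payload) if raw starts with a BOM, else (None, raw)."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return None, raw


def to_utf8(raw, declared=None):
    """Convert raw bytes to UTF-8, using a BOM if present, else a declared encoding."""
    encoding, payload = sniff_encoding(raw)
    if encoding is None:
        if declared is None:
            raise ValueError("no byte order mark and no declared encoding")
        encoding = declared
    return payload.decode(encoding).encode("utf-8")


print(to_utf8(codecs.BOM_UTF16_LE + "Hoest".encode("utf-16-le")))
```

One known weakness of this simple ordering: UTF-16 LE data whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE.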

If the above is true, we still presumably need to mark the data as being Unicode v 2, 3 or 4 etc? This would only be necessary if the earlier versions were not pure subsets of the later ones. Can you clarify this, Hoylen?

> P.S. As an aside, Java 1.1 and later uses Unicode 2.0
> character set. So if Unicode 3.0 or Unicode 4.0 is the
> target character set, implementations may be forced to
> implement their own string class rather than using the
> native java.lang.String. Something to consider when picking
> which Unicode version to use as the standard character set.

I would suggest that the requirements with respect to string representation are:

    * systems get to store their data in whatever is most convenient locally (e.g. whatever the dbms wants to use)
    * it must be possible for a 3rd party application which is openEHR compliant to read the data in a system, even if it is not in its own "preferred" form
       (vertical interoperability)
    * it must be possible for the data to be exported in a way that it can be universally read or transformed into a readable form for use in another system
       (horizontal interoperability)
    * the specifications commit implementors to as little as possible, while allowing the above requirements to be met.

Based on Hoylen's more recent info, a draft modelling solution seems to be:

1. openEHR states that all strings are in Unicode in its abstract specifications.

Question: if there are no "simple strings" at all in the data and everything is a unicoded string, is this safe?

2. that the following abstract model is used (improved from version a few weeks ago):
    ENTRY class has
        - a mandatory language attribute
        - a mandatory character encoding attribute (says which VERSION and which flavour of Unicode).
            This forces the whole ENTRY to be encoded the same way no matter what,
            but also allows distinct ENTRYs to be encoded in e.g. Unicode 3.0/UTF-8, Unicode 4.0/UTF-16.

    DV_TEXT class has
        - an optional language attribute, which is understood to override the one from its enclosing ENTRY.

3. Implementation specifications like XML-schemas and software APIs are required to make the character
encoding and Unicode version attributes visible, so that clients can process / convert the data properly.
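
As a rough editorial sketch of point 3 for the XML case: the standard `xml:lang` attribute already follows an "inherit unless overridden" rule, and the XML declaration can carry the encoding. The element names `entry` and `text` and the function `export_entry` are invented for this example and are not taken from any openEHR schema; the Unicode version has no slot in the XML declaration and would need its own attribute. The `xml_declaration` argument to `ET.tostring` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def export_entry(language, texts, encoding="utf-8"):
    """Serialise an entry to XML bytes.

    texts is a list of (value, language_or_None) pairs; a language is only
    written on a text element when it overrides the entry's language.
    """
    entry = ET.Element("entry")  # hypothetical element name
    entry.set(XML_LANG, language)
    for value, override in texts:
        item = ET.SubElement(entry, "text")  # hypothetical element name
        item.text = value
        if override is not None:
            item.set(XML_LANG, override)
    # xml_declaration=True makes the character encoding visible in the output.
    return ET.tostring(entry, encoding=encoding, xml_declaration=True)


exported = export_entry("en", [("He died", None), ("rolig", "no")])
print(exported.decode("utf-8"))
```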

further comments to clean this up will be much appreciated.

- thomas
