Invalid language codes in languages codeset

Hi,

I noticed there are invalid ISO 639-1 language codes used in

https://github.com/openEHR/terminology/blob/master/openEHR_RM/RM/Release-1.0.2/external_terminologies.xml#L263

For instance, "ar-sa", "en-us" or "nl-be" are all invalid ISO 639-1
codes. Only codes consisting of two letters are allowed according to ISO
639-1.

Is this on purpose?

If it is on purpose, then what standard is used to define the language
code? And shouldn't the documentation and external_id of the codeset
reflect this standard?

Regards,

Ralph van Etten
MEDvision360

BTW: "ar-sa", "en-us" and "nl-be" all look like RFC 5646 [1] codes but
they are not. RFC 5646 codes are defined as an ISO 639-1 code followed
by an ISO 3166-1 code, and ISO 3166-1 codes are all uppercase letters. So
the correct language codes according to RFC 5646 would be "ar-SA",
"en-US" and "nl-BE".

[1]: http://tools.ietf.org/html/rfc5646

Ralph,

well spotted. I suggest in the short term we do the following:

create a new code_set called something like languages_regional (that's just an openEHR id, so it's not really important), so we end up with two code sets:

<codeset issuer="ISO" openehr_id="languages" external_id="ISO_639-1">

<codeset issuer="ISO" openehr_id="languages_regional" external_id="RFC_5646">

Then we split out the codes like ar-SA etc. into the second group, so that we have two correct code-sets. Both code sets then need the missing codes to be added in. I'm not that clear what tools are using this file, but it will clearly be better if it is made regular rather than left in its current form, which follows no standard properly.

thoughts?

- thomas

Hi Thomas,

well spotted. I suggest in the short term we do the following:

create a new code_set called something like languages_regional (that's
just an openEHR id, so it's not really important),

Does this imply that if the specification currently says it should
contain a 'languages' codeset, it can also contain a 'languages_regional'
codeset?

This would be the preferred solution for us, but I am not sure if this is
going to work for others.
I can imagine some people have implemented the 'languages' codeset
according to spec by only allowing ISO 639-1 codes. If the spec changes
by also allowing RFC 5646 language codes, it might break interoperability.

so we end up with two
code sets:

        <codeset issuer="ISO" openehr_id="languages" external_id="ISO_639-1">

        <codeset issuer="ISO" openehr_id="languages_regional" external_id="RFC_5646">

The issuer of an RFC would be IETF.

Anyway, I think it is important to use RFC 5646 and include the region
when specifying a language. After all, there are words with a completely
different meaning depending on the region you live in.

Then we split out the codes like ar-SA etc into the second group, so
that we have two correct code-sets. Both code sets then need the missing
codes to be added in. I'm not that clear what tools are using this file,

I know it is used by the mini termservice of the Java reference
implementation:

https://github.com/openEHR/java-libs/blob/master/mini-termserv/src/main/resources/external_terminologies_en.xml#L263

We use the mini termservice in MEDrecord while processing and validating
openEHR data.

Regards.

Ralph van Etten
MEDvision360

Hi,

I looked into it some more and it seems there is some inconsistency
between various classes regarding language and territory.

For instance, a COMPOSITION has a language and territory field. So
storing a language like 'en-UK' is already possible in a COMPOSITION.

However an ENTRY only has a language field but not a territory field so
it is not possible to store a language like 'en-UK'.

Is this deliberate? Or does the territory field in COMPOSITION have a
different purpose?

To be consistent it would be better if everything uses either RFC 5646
encoded languages or has separate fields for language and territory.

Regards,

Ralph van Etten
MEDvision360

Hi Thomas,

It looks strange that this was not found before by others (and if it was, how did they solve it?). Would it be wise to just think of some short-cut solution, as in fact I think the standard (and reference implementation) should be rewritten?

What do you think?

Regards, Jan-Marc

The original idea (now 12 years old!) was to use just the ISO-639 code in both COMPOSITION and ENTRY. The COMPOSITION also had territory on the basis that information being committed to a system always happens in some real place, i.e. a country / state.

I would think that a more modern idea of this would be that we could treat those language fields as RFC 5646 coded fields instead, since it allows more specific languages where the territory really affects things e.g. pt-BR and pt-PT can be quite different for many words. If we started doing that right now, it could break some software, but we can easily find out from vendors and implementers if their system would break or not. We should do this and if all implementers are ok, we upgrade the spec to say that these two fields are RFC-5646 compliant. Now, one thing we need to know is if “en” is legal in 5646. If it is, it means that 5646 is a superset of 639, and even just single language codes can be allowed.

I don’t have time to check all the details right now but if someone could check on this, then I suggest you post on the implementers list, and ask the question about what the impact of upgrading the language field in both classes is on everyone’s implementation. This would be a good thing to fix in the current specs. Could you also please raise this as an issue on the spec issue tracker.

seem reasonable?

- thomas

Hi Ralph, archetypes also have those invalid codes. What I did in Open EHRGen was to create a mapping from those codes to the right ones, so internally I have valid ISO codes but EHRGen consumes ADL with the invalid ones.

It would be nice to fix all the ADLs and the terminology to simplify development and get rid of this kind of horrible hack.
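The mapping hack Pablo describes can be sketched roughly as below. This is an illustrative sketch, not EHRGen's actual code; the function names are invented for the example, and only the simple `<language>-<region>` case is handled.

```python
# Sketch of normalising the lowercase codes found in archetypes and the
# terminology file (e.g. "zh-cn") either to conventional RFC 5646 case
# ("zh-CN") or, when only ISO 639-1 is wanted, to the bare two-letter
# language subtag ("zh"). Illustrative only, not EHRGen's API.

def to_rfc5646(code: str) -> str:
    """Canonicalise a <language>-<region> tag: lowercase language
    subtag, uppercase region subtag (the RFC 5646 convention)."""
    parts = code.split("-")
    if len(parts) == 2 and len(parts[1]) == 2:
        return parts[0].lower() + "-" + parts[1].upper()
    return code.lower()

def to_iso639_1(code: str) -> str:
    """Strip the region, keeping only the ISO 639-1 language subtag.
    Note this maps "pt-PT" and "pt-BR" to the same code, "pt"."""
    return code.split("-")[0].lower()
```

So `to_rfc5646("zh-cn")` gives `"zh-CN"`, while `to_iso639_1("pt-BR")` collapses to `"pt"`, which is exactly the information-loss question raised below.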

You are right. It is also in the ADL files.

For example the blood pressure archetype uses:

   language = <[ISO_639-1::zh-cn]>

But you translate them to valid ISO codes internally? Does that mean
"pt-PT" and "pt-BR" are both mapped to the same language ("pt") ?

Regards,

Ralph van Etten
MEDvision360

Hi Thomas,

I would think that a more modern idea of this would be that we could
treat those language fields as RFC 5646 coded fields instead, since it
allows more specific languages where the territory really affects things
e.g. pt-BR and pt-PT can be quite different for many words. If we
started doing that right now, it could break some software, but we can
easily find out from vendors and implementers if their system would
break or not.

In various places they are already using the <language>-<country>
format. But it is a good idea to check.

However, since RFC-5646 allows many more formats besides
<language>-<country>, supporting the full RFC-5646 might lead to some
problems for some of the implementations. Maybe it would be better to say
that at least the <language> and <language>-<country> formats of RFC-5646
must be supported by implementations, and all other formats are optional?
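The restricted subset proposed here is easy to check mechanically. The sketch below is one possible reading of that proposal, not anything from the spec: it accepts only a two-letter ISO 639-1 code, optionally followed by a two-letter uppercase ISO 3166-1 region code, and rejects everything else RFC 5646 would allow (scripts, variants, extensions).

```python
import re

# Accept only the <language> and <language>-<country> forms, in
# conventional RFC 5646 case: lowercase language, uppercase region.
# Full RFC 5646 allows far more; this deliberately rejects it.
SUBSET_RE = re.compile(r"[a-z]{2}(-[A-Z]{2})?")

def is_supported_tag(tag: str) -> bool:
    return SUBSET_RE.fullmatch(tag) is not None
```

With this rule, "en" and "en-US" pass, while "ar-sa" (wrong case) and "zh-Hant-TW" (script subtag, outside the subset) are rejected.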

We should do this and if all implementers are ok, we
upgrade the spec to say that these two fields are RFC-5646 compliant.

It is not just those two fields; there are many more fields using
language codes, including in the ADL files.

Now, one thing we need to know is if "en" is legal in 5646. If it is, it
means that 5646 is a superset of 639, and even just single language
codes can be allowed.

Yes, a single language code is also allowed in RFC 5646.

I don't have time to check all the details right now but if someone
could check on this, then I suggest you post on the implementers list,
and ask the question about what the impact of upgrading the language
field in both classes is on everyone's implementation. This would be a
good thing to fix in the current specs. Could you also please raise this
as an issue on the spec issue tracker
<http://www.openehr.org/issues/browse/SPECPR>.

I created an issue: http://www.openehr.org/issues/browse/SPECPR-95

Regards,

Ralph van Etten
MEDvision360

It is unbelievable. How can ISO publish a language-code system in which it is impossible to distinguish Portuguese and Brazilian Portuguese? Were the Brazilians sleeping? Didn't they protest?

I don't know much about Portuguese, so I cannot indicate how bad this is.

But I know about French; they also speak a kind of French in Belgium.
It is almost within bicycle distance from here that they say "septante, huitante, nonante".

If you cycle an hour further, you reach the French border, and suddenly your blood pressure is no longer septante, septante-et-un, septante-deux, huitante or nonante, nonante-deux,
but soixante-dix, soixante-et-onze, soixante-douze, quatre-vingts or quatre-vingt-dix, quatre-vingt-douze, which is maybe low after an extra hour through the hills by bike.

The thing is, many French people don't know that. This is because, when the Belgians go to France, they say it the French way, and the French, why should they ever leave their country?

Incroyable, ISO did not notice that. Not only the Brazilians were sleeping, but the Belgians too.

The openEHR community discovered it, and also created a solution for it. Very clever: it found a shortcoming in an ISO standard which the whole world did not discover when ISO made it, and had the courage to repair it.

But how was it possible to call that solution ISO 639-1, and how did that mistake survive for so many years? That is another mystery in this matter.

I think someone makes the mistake and the rest of us have blind faith.

A truly educational experience, une expérience vraiment éducative.

Very clever of Ralph to discover that. Always, somewhere, somebody is awake.

Best regards
Bert Verhees

Hi Ralph, in EHRGen we need the archetype language to be the language selected by the user, because we use archetypes to generate the UI, so all labels that appear in the UI are taken from archetypes. So if the user configures "pt" as the language, the terminology resolver checks whether that is defined in the archetype; if it isn't, it checks whether there is some locale code whose language part equals "pt", and if it finds "pt-PT" first, those terms are used.

Also, if the configured locale is "pt-PT", EHRGen looks for "pt-pt" in the archetype; if that is not found, it tries just "pt".

So we check several combinations, trying to find the best match.
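The fallback matching Pablo describes can be sketched as below. This is a minimal sketch under the assumption that the archetype exposes its defined language tags as a set of strings; the function name and data shapes are invented for the example, not EHRGen's actual API.

```python
# Best-match lookup for a configured locale against the language tags an
# archetype defines: try the exact tag first (case-insensitively), then
# the bare language, then any regional variant of the same language.

def best_language_match(wanted, available):
    wanted = wanted.lower()
    avail = {a.lower(): a for a in available}
    # 1. exact match, e.g. configured "pt-PT" and "pt-pt" defined
    if wanted in avail:
        return avail[wanted]
    lang = wanted.split("-")[0]
    # 2. bare language, e.g. configured "pt-PT" but only "pt" defined
    if lang in avail:
        return avail[lang]
    # 3. any regional variant, e.g. configured "pt" but only "pt-PT"
    #    defined (sorted for a deterministic choice)
    for low, orig in sorted(avail.items()):
        if low.split("-")[0] == lang:
            return orig
    return None
```

For example, `best_language_match("pt-PT", ["en", "pt"])` falls back to `"pt"`, and `best_language_match("pt", ["en", "pt-PT"])` picks up the regional variant.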

I'm travelling at the moment. When I am back home I'll try to provide an analysis of what the specs say, and what they probably should say.

In the meantime, could implementers here have a look and state what their preferred solution for the future is, taking into account that there is a fair bit of data already with the existing ISO-639 codes.

I think it's only archetypes where there is a mixture of RFC 5646 and 639 codes, and in the ADL Workbench I added some code a long time ago to gracefully deal with either.

- thomas

Hi Thomas,

IMO the Java Locale doc gives a nice solution for language, country, variant, etc., mentioning the standard used for each part:

http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html

I think variants would be useful in OPTs to localize a composition definition for a specific region or even one healthcare center.

Also see: http://tools.ietf.org/search/bcp47

On 22-01-14 13:25, pablo pazos wrote:

I think it is the best solution: instead of providing a list of countries/languages, it is better to provide a way to construct tags.
In case you find a small community somewhere which has a different leading language, you don't get stuck with a list which hadn't thought about that.

I remember there are some old isolated German speaking communities in the USA.

Bert

On 22-01-14 13:25, pablo pazos wrote:

On the ISO 639 website we can read the following (http://www.loc.gov/standards/iso639-2/faq.html):

How does one indicate the language variation spoken in a particular country?
The ISO 639 standards and RFC 4646 allow for combining the language code with a country code from ISO 3166 to denote the area in which a term, phrase, or language is used. For instance, using RFC 4646, English as spoken in the United States may be indicated with the following: en-US

So it seems the standard allows that kind of combination. Moreover, I have found another interesting answer:

Are separate language codes defined for dialects of languages?
A dialect of a language is usually represented by the same language code as that used for the language. If the language is assigned to a collective language code, the dialect is assigned to the same collective language code. Generally, dialects are not given different codes, but determining the difference between dialects and languages will be decided on a case-by-case basis. In the future ISO 639-6, currently under development, may be used to identify language variants and dialects.

I have searched for it, and while ISO 639-1 has around 150 languages, ISO 639-6 grows to around 25,000! If you want to be precise, things get complex.

I think this works in exactly the same way in openEHR and ISO EN 13606.
The territory is separated into a specific field because it represents
the country under whose laws the Composition is created. This is
important because it affects things like privacy and access policies.

This cannot be merged into language or language+dialect. There will be
clinicians working in Spain who are from Latin America, or in the UK who
are from the US or Australia.

Good to know.

It would be nice if this kind of information could be included in the
documentation.
Is that possible?

Regards,

Ralph van Etten
MEDvision360

Just today we had another interesting discussion on a related topic
about languages, translations, and slot solving.
The problem comes when you have an archetype whose original language
is different from that of the archetype you are solving the slot with.
There are several alternatives, but it seems there is no 'perfect' one.

There is always the possibility of taking the original language of the
solved slot archetype and just adding it to the original archetype as a
translation, marking the strings in the other languages in some way.

This is related to the language codes, as we could assume that a slot
filler with an 'en-gb' language can be safely placed in an 'en' archetype
and reuse all the texts and descriptions. The problem comes the other way
around (can we assume that an 'en' slot filler can be safely placed in an
'en-gb' archetype?).

Even if you have the same language in both archetypes, we have to
consider whether a translation from a slot has the same validity when
included in the original-language descriptions of a given archetype.
(In theory, all translations should be made from the original language,
and if the original language was a different one, can we assure that the
meaning is the same as the original?)

Has anyone on the list been dealing with this problem? Which solutions
have you adopted for your tools/systems?

I have not been following closely here, but I think the general approach should be that you perform a design-time validation pass that reports things like language incompatibility, i.e. never let there be ambiguity close to runtime.

The question then is: how does the validation of this particular thing work? The first thing to note is that the possible slot fillers of a given slot in an archetype are only those that are found in the current working set of archetypes, not some theoretical maximum set (e.g. all of CKM, or all of the Spanish MOH, etc.). So, within a chosen working set, validation of language compatibility can probably only occur at the point of operational template (OPT) generation, i.e. where the user specifies which actual languages and terminologies (for terminology bindings) should be used; then a tool could run a relatively simple test to see that all archetypes in the working set do have translations in the chosen language(s).
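That "relatively simple test" at OPT generation time could look something like the sketch below. The data shape is an assumption made for illustration (each archetype in the working set exposes the set of language tags it is translated into); this is not a real openEHR tool API.

```python
# Design-time check: verify that every archetype in the chosen working
# set carries translations for all languages requested for the OPT, and
# report exactly which languages are missing from which archetype.

def missing_translations(working_set, required_languages):
    """working_set: dict mapping archetype id -> set of language tags
    defined in that archetype. Returns dict of archetype id -> set of
    missing language tags (empty dict means the check passes)."""
    problems = {}
    for archetype_id, languages in working_set.items():
        missing = set(required_languages) - set(languages)
        if missing:
            problems[archetype_id] = missing
    return problems
```

A tool would run this over the working set when the user picks the OPT's languages, and refuse (or warn) if the returned dict is non-empty.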

One could imagine more complex validations to do with figuring out whether all slot-filling archetypes have any language in common with slot-defining archetypes, but I don't think this is useful.

I have no check like this in the ADL Workbench yet, so I am interested to know what others think it should be.

We don't really have a proper definition of 'working set' or other possible 'sets' of archetypes, but we probably need them. Getting a common definition means everyone agreeing on a standard workflow for archetype development, and possible ideas like defining a 'deployment set' from a larger 'working set', or maybe a publisher's 'release set'.

- thomas
