Invalid language codes in languages codeset

Hi,

I noticed there are invalid ISO 639-1 language codes used in

https://github.com/openEHR/terminology/blob/master/openEHR_RM/RM/Release-1.0.2/external_terminologies.xml#L263

For instance, "ar-sa", "en-us" or "nl-be" are all invalid ISO 639-1
codes. Only codes consisting of two letters are allowed according to ISO
639-1.

Is this on purpose?

If it is on purpose, then what standard is used to define the language
code? And shouldn't the documentation and external_id of the codeset
reflect this standard?

Regards,

Ralph van Etten
MEDvision360

BTW: "ar-sa", "en-us" and "nl-be" all look like RFC 5646 [1] codes but
they are not. RFC 5646 codes are defined as an ISO 639-1 code followed
by an ISO 3166-1 code, and ISO 3166-1 codes are all uppercase letters. So
the correct language codes according to RFC 5646 would be "ar-SA",
"en-US" and "nl-BE".

[1]: http://tools.ietf.org/html/rfc5646

Ralph,

well spotted. I suggest in the short term we do the following:

create a new code_set called something like languages_regional (that's just an openEHR id, so it's not really important), so we end up with two code sets:

<codeset issuer="ISO" openehr_id="languages" external_id="ISO_639-1">

<codeset issuer="ISO" openehr_id="languages_regional" external_id="RFC_5646">

Then we split out the codes like ar-SA etc. into the second group, so that we have two correct code-sets. Both code sets then need the missing codes to be added in. I'm not that clear what tools are using this file, but it will clearly be better if it is made regular rather than left in its current form, which follows no standard properly.

thoughts?

- thomas

Hi Thomas,

well spotted. I suggest in the short term we do the following:

create a new code_set called something like languages_regional (that's
just an openEHR id, so it's not really important),

Does this imply that if the specification currently says it should
contain a 'languages' codeset, it can also contain a 'languages_regional'
codeset?

This would be the preferred solution for us, but I am not sure if this is
going to work for others.
I can imagine some people have implemented the 'languages' codeset
according to spec by only allowing ISO 639-1 codes. If the spec changes
by also allowing RFC 5646 language codes, it might break interoperability.

so we end up with two
code sets:

        <codeset issuer="ISO" openehr_id="languages" external_id="ISO_639-1">

        <codeset issuer="ISO" openehr_id="languages_regional" external_id="RFC_5646">

The issuer of an RFC would be IETF.

Anyway, I think it is important to use RFC 5646 and include the region
when specifying a language. After all, there are words with a completely
different meaning depending on the region you live in.

Then we split out the codes like ar-SA etc into the second group, so
that we have two correct code-sets. Both code sets then need the missing
codes to be added in. I'm not that clear what tools are using this file,

I know it is used by the mini termservice of the Java reference
implementation:

https://github.com/openEHR/java-libs/blob/master/mini-termserv/src/main/resources/external_terminologies_en.xml#L263

We use the mini termservice in MEDrecord while processing and validating
openEHR data.

Regards.

Ralph van Etten
MEDvision360

Hi,

I looked into it some more and it seems there is some inconsistency
between various classes regarding language and territory.

For instance, a COMPOSITION has a language and territory field. So
storing a language like 'en-UK' is already possible in a COMPOSITION.

However an ENTRY only has a language field but not a territory field so
it is not possible to store a language like 'en-UK'.

Is this deliberate? Or does the territory field in COMPOSITION have a
different purpose?

To be consistent it would be better if everything uses either RFC 5646
encoded languages or has separate fields for language and territory.

Regards,

Ralph van Etten
MEDvision360

Hi Thomas,

It looks strange that this was not found before by others (and if it was, how did they solve it?). Would it be wise to just think of some short-cut solution, as in fact I think the standard (and reference implementation) should be rewritten?

What do you think?

Regards, Jan-Marc

The original idea (now 12 years old!) was to use just the ISO-639 code in both COMPOSITION and ENTRY. The COMPOSITION also had territory on the basis that information being committed to a system always happens in some real place, i.e. a country / state.

I would think that a more modern idea of this would be that we could treat those language fields as RFC 5646 coded fields instead, since it allows more specific languages where the territory really affects things e.g. pt-BR and pt-PT can be quite different for many words. If we started doing that right now, it could break some software, but we can easily find out from vendors and implementers if their system would break or not. We should do this and if all implementers are ok, we upgrade the spec to say that these two fields are RFC-5646 compliant. Now, one thing we need to know is if “en” is legal in 5646. If it is, it means that 5646 is a superset of 639, and even just single language codes can be allowed.

I don’t have time to check all the details right now but if someone could check on this, then I suggest you post on the implementers list, and ask the question about what the impact of upgrading the language field in both classes is on everyone’s implementation. This would be a good thing to fix in the current specs. Could you also please raise this as an issue on the spec issue tracker.

seem reasonable?

- thomas

Hi Ralph, archetypes also have those invalid codes. What I did in Open EHRGen was to create a mapping from those codes to the right ones, so internally I have valid ISO codes but EHRGen consumes ADL with the invalid ones.

It would be nice to fix all the ADLs and the terminology to simplify development and get rid of this kind of horrible hack.
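The mapping hack Pablo describes can be sketched roughly as below. This is an illustrative sketch, not EHRGen's actual code; the function names are invented for the example, and only the simple `<language>-<region>` case is handled.

```python
# Sketch of normalising the lowercase codes found in archetypes and the
# terminology file (e.g. "zh-cn") either to conventional RFC 5646 case
# ("zh-CN") or, when only ISO 639-1 is wanted, to the bare two-letter
# language subtag ("zh"). Illustrative only, not EHRGen's API.

def to_rfc5646(code: str) -> str:
    """Canonicalise a <language>-<region> tag: lowercase language
    subtag, uppercase region subtag (the RFC 5646 convention)."""
    parts = code.split("-")
    if len(parts) == 2 and len(parts[1]) == 2:
        return parts[0].lower() + "-" + parts[1].upper()
    return code.lower()

def to_iso639_1(code: str) -> str:
    """Strip the region, keeping only the ISO 639-1 language subtag.
    Note this maps "pt-PT" and "pt-BR" to the same code, "pt"."""
    return code.split("-")[0].lower()
```

So `to_rfc5646("zh-cn")` gives `"zh-CN"`, while `to_iso639_1("pt-BR")` collapses to `"pt"`, which is exactly the information-loss question raised below.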

You are right. It is also in the ADL files.

For example the blood pressure archetype uses:

   language = <[ISO_639-1::zh-cn]>

But you translate them to valid ISO codes internally? Does that mean
"pt-PT" and "pt-BR" are both mapped to the same language ("pt") ?

Regards,

Ralph van Etten
MEDvision360

Hi Thomas,

I would think that a more modern idea of this would be that we could
treat those language fields as RFC 5646 coded fields instead, since it
allows more specific languages where the territory really affects things
e.g. pt-BR and pt-PT can be quite different for many words. If we
started doing that right now, it could break some software, but we can
easily find out from vendors and implementers if their system would
break or not.

In various places they are already using the <language>-<country>
format. But it is a good idea to check.

However, since RFC-5646 allows many more formats besides
<language>-<country>, supporting the full RFC-5646 might lead to some
problems for some of the implementations. Maybe it would be better to say
that at least the <language> and <language>-<country> formats of RFC-5646
must be supported by implementations, and all other formats are optional?
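The restricted subset proposed here is easy to check mechanically. The sketch below is one possible reading of that proposal, not anything from the spec: it accepts only a two-letter ISO 639-1 code, optionally followed by a two-letter uppercase ISO 3166-1 region code, and rejects everything else RFC 5646 would allow (scripts, variants, extensions).

```python
import re

# Accept only the <language> and <language>-<country> forms, in
# conventional RFC 5646 case: lowercase language, uppercase region.
# Full RFC 5646 allows far more; this deliberately rejects it.
SUBSET_RE = re.compile(r"[a-z]{2}(-[A-Z]{2})?")

def is_supported_tag(tag: str) -> bool:
    return SUBSET_RE.fullmatch(tag) is not None
```

With this rule, "en" and "en-US" pass, while "ar-sa" (wrong case) and "zh-Hant-TW" (script subtag, outside the subset) are rejected.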

We should do this and if all implementers are ok, we
upgrade the spec to say that these two fields are RFC-5646 compliant.

It is not just those two fields; there are many more fields using
language codes, including in the ADL files.

Now, one thing we need to know is if "en" is legal in 5646. If it is, it
means that 5646 is a superset of 639, and even just single language
codes can be allowed.

Yes, a single language code is also allowed in RFC 5646.

I don't have time to check all the details right now but if someone
could check on this, then I suggest you post on the implementers list,
and ask the question about what the impact of upgrading the language
field in both classes is on everyone's implementation. This would be a
good thing to fix in the current specs. Could you also please raise this
as an issue on the spec issue tracker
<http://www.openehr.org/issues/browse/SPECPR>.

I created an issue: http://www.openehr.org/issues/browse/SPECPR-95

Regards,

Ralph van Etten
MEDvision360

It is unbelievable. How can ISO publish a language-code system in which it is impossible to distinguish Portuguese and Brazilian Portuguese? Were the Brazilians sleeping? Didn't they protest?

I don't know much about Portuguese, so I cannot indicate how bad this is.

But I know about French; they also speak a kind of French in Belgium.
It is almost within bicycle distance from here that they say "septante, huitante, nonante".

If you cycle an hour further, you reach the French border, and suddenly your blood pressure is no longer septante, septante-et-un, septante-deux, huitante or nonante, nonante-deux,
but soixante-dix, soixante-et-onze, soixante-douze, quatre-vingts or quatre-vingt-dix, quatre-vingt-douze, which is maybe low after an extra hour through the hills by bike.

The thing is, many French people don't know that. This is because, when the Belgians go to France, they say it the French way, and the French, why should they ever leave their country?

Incroyable, ISO did not notice that. Not only the Brazilians were sleeping, but the Belgians too.

The openEHR community discovered it, and also created a solution for it. Very clever: it found a shortcoming in an ISO standard which the whole world did not discover when ISO made it, and had the courage to repair it.

But how was it possible to call that solution ISO 639-1, and how did that mistake survive for so many years? That is another mystery in this matter.

I think someone makes the mistake and the rest of us have blind faith.

A truly educational experience, une expérience vraiment éducative.

Very clever of Ralph to discover that. Always, somewhere, somebody is awake.

Best regards
Bert Verhees

Hi Ralph, in EHRGen we need the archetype language to be the language selected by the user, because we use archetypes to generate the UI, so all labels that appear in the UI are taken from archetypes. So if the user configures "pt" as the language, the terminology resolver checks whether that is defined in the archetype; if it isn't, it checks whether there is some locale code whose language part equals "pt", and if it finds "pt-PT" first, those terms are used.

Also, if the configured locale is "pt-PT", EHRGen looks for "pt-pt" in the archetype; if that is not found, it tries just "pt".

So we check several combinations, trying to find the best match.
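The fallback matching Pablo describes can be sketched as below. This is a minimal sketch under the assumption that the archetype exposes its defined language tags as a set of strings; the function name and data shapes are invented for the example, not EHRGen's actual API.

```python
# Best-match lookup for a configured locale against the language tags an
# archetype defines: try the exact tag first (case-insensitively), then
# the bare language, then any regional variant of the same language.

def best_language_match(wanted, available):
    wanted = wanted.lower()
    avail = {a.lower(): a for a in available}
    # 1. exact match, e.g. configured "pt-PT" and "pt-pt" defined
    if wanted in avail:
        return avail[wanted]
    lang = wanted.split("-")[0]
    # 2. bare language, e.g. configured "pt-PT" but only "pt" defined
    if lang in avail:
        return avail[lang]
    # 3. any regional variant, e.g. configured "pt" but only "pt-PT"
    #    defined (sorted for a deterministic choice)
    for low, orig in sorted(avail.items()):
        if low.split("-")[0] == lang:
            return orig
    return None
```

For example, `best_language_match("pt-PT", ["en", "pt"])` falls back to `"pt"`, and `best_language_match("pt", ["en", "pt-PT"])` picks up the regional variant.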

I'm travelling at the moment. When I am back home I'll try to provide an analysis of what the specs say, and what they probably should say.

In the meantime, could implementers here have a look and state what their preferred solution for the future is, taking into account that there is a fair bit of data already with the existing ISO-639 codes.

I think it's only archetypes where there is a mixture of RFC 5646 and 639 codes, and in the ADL Workbench I added some code a long time ago to gracefully deal with either.

- thomas

Hi Thomas,

IMO the Java Locale doc gives a nice solution for language, country, variant, etc., mentioning the standard used for each part:

http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html

I think variants would be useful in OPTs to localize a composition definition for a specific region or even one healthcare center.

Also see: http://tools.ietf.org/search/bcp47

On 22-01-14 13:25, pablo pazos wrote:

I think it is the best solution: instead of providing a list of countries/languages, it is better to provide a way to construct tags.
In case you find a small community somewhere which has a different leading language, you don't get stuck with a list which hadn't thought about that.

I remember there are some old isolated German speaking communities in the USA.

Bert

On 22-01-14 13:25, pablo pazos wrote:

On the ISO 639 website we can read the following (http://www.loc.gov/standards/iso639-2/faq.html):

How does one indicate the language variation spoken in a particular country?
The ISO 639 standards and RFC 4646 allow for combining the language code with a country code from ISO 3166 to denote the area in which a term, phrase, or language is used. For instance, using RFC 4646, English as spoken in the United States may be indicated with the following: en-US

So it seems the standard allows that kind of combination. Moreover, I have found another interesting answer:

Are separate language codes defined for dialects of languages?
A dialect of a language is usually represented by the same language code as that used for the language. If the language is assigned to a collective language code, the dialect is assigned to the same collective language code. Generally, dialects are not given different codes, but determining the difference between dialects and languages will be decided on a case-by-case basis. In the future ISO 639-6, currently under development, may be used to identify language variants and dialects.

I have searched for it, and while ISO 639-1 has around 150 languages, ISO 639-6 grows to around 25,000! If you want to be precise, things get complex.

I think this works in exactly the same way in openEHR and ISO EN 13606.
The territory is separated into a specific field because it represents
the country under whose laws the Composition is created. This is
important because it affects things like privacy and access policies.

This cannot be merged into language or language+dialect. There will be
clinicians working in Spain who are from Latin America, or in the UK who
are from the US or Australia.

Good to know.

It would be nice if this kind of information could be included in the
documentation.
Is that possible?

Regards,

Ralph van Etten
MEDvision360

Just today we had another interesting discussion on a related topic
about languages, translations, and slot solving.
The problem comes when you have an archetype whose original language
is different from that of the archetype you are solving the slot with.
There are several alternatives, but it seems there is no 'perfect' one.

There is always the possibility of taking the original language of the
solved slot archetype and just adding it to the original archetype as a
translation, marking the strings in the other languages in some way.

This is related to the language codes, as we could assume that a slot
filler with an 'en-gb' language can be safely placed in an 'en' archetype
and reuse all the texts and descriptions. The problem comes the other way
around (can we assume that an 'en' slot filler can be safely placed in an
'en-gb' archetype?).

Even if you have the same language in both archetypes, we have to
consider whether a translation from a slot has the same validity when
included in the original-language descriptions of a given archetype.
(In theory, all translations should be made from the original language,
and if the original language was a different one, can we assure that the
meaning is the same as the original?)

Has anyone on the list been dealing with this problem? Which solutions
have you adopted for your tools/systems?

I have not been following closely here, but I think the general approach should be that you perform a design-time validation pass that reports things like language incompatibility, i.e. never let there be ambiguity close to runtime.

The question then is: how does the validation of this particular thing work? The first thing to note is that the possible slot fillers of a given slot in an archetype are only those that are found in the current working set of archetypes, not some theoretical maximum set (e.g. all of CKM, or all of the Spanish MOH, etc.). So, within a chosen working set, validation of language compatibility can probably only occur at the point of operational template (OPT) generation, i.e. where the user specifies which actual languages and terminologies (for terminology bindings) should be used; then a tool could run a relatively simple test to see that all archetypes in the working set do have translations in the chosen language(s).
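That "relatively simple test" at OPT generation time could look something like the sketch below. The data shape is an assumption made for illustration (each archetype in the working set exposes the set of language tags it is translated into); this is not a real openEHR tool API.

```python
# Design-time check: verify that every archetype in the chosen working
# set carries translations for all languages requested for the OPT, and
# report exactly which languages are missing from which archetype.

def missing_translations(working_set, required_languages):
    """working_set: dict mapping archetype id -> set of language tags
    defined in that archetype. Returns dict of archetype id -> set of
    missing language tags (empty dict means the check passes)."""
    problems = {}
    for archetype_id, languages in working_set.items():
        missing = set(required_languages) - set(languages)
        if missing:
            problems[archetype_id] = missing
    return problems
```

A tool would run this over the working set when the user picks the OPT's languages, and refuse (or warn) if the returned dict is non-empty.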

One could imagine more complex validations to do with figuring out whether all slot-filling archetypes have any language in common with slot-defining archetypes, but I don't think this is useful.

I have no check like this in the ADL Workbench yet, so I am interested to know what others think it should be.

We don't really have a proper definition of 'working set' or other possible 'sets' of archetypes, but we probably need them. Getting a common definition means everyone agreeing on a standard workflow for archetype development, and possible ideas like defining a 'deployment set' from a larger 'working set', or maybe a publisher's 'release set'.

- thomas
