ISO 639 language codes

siljelb · 1 October 2021 13:15

Currently we use ISO 639-1 codes to specify languages in archetypes. This works fine for most major languages. We may at some point in the future need to be able to tell the difference between several different Sámi languages of which one exists in ISO-639-1 (se/sme), and two only in ISO 639-2 and -3 (sma, smj).

Is it valid to use ISO-639-2 or -3 in archetypes today? If not, are there any plans to extend the specs to include them?

yampeku · 3 October 2021 11:34

Maybe this is related to openEHR JIRA

(thinking a little more about it maybe not, as I assume they share country)
I think your use case would be solved if we update descs, as I’m assuming no tool does a hard check over the lenght or existence of these languages

sebastian.garde · 4 October 2021 07:14

I agree with the need to specify more languages in archetypes as raised by Silje.

Howwever, from my point of view this involves more than just updating the descriptions in the specs:
I am pretty sure this is checked in one way or another in various tooling, either by using the openEHR external terminology file (as the Java Ref Impl does if I remember correctly), or in a more lenient form if you use for example .NET’s CultureInfo in some way, etc.
This is for a) enforcing the specs, b) retrieving and displaying the name of the language, etc.
In CKM, we use the same set of languages for users and archetypes (as specified in the external terminology file) and essentially tag archetypes as well as users with these languages. A different user interface is required, especially for ISO639-3 with its 6900 languages.

The Jira proposal Diego mentions referring to RFC-5646 is useful and would probably really just be a description update because as it is said there, this is what is done anyway (ISO 639-1 language code optionally followed by an ISO 3166-1 alpha-2 country). Would this actually work the same way with ISO 639-3?

Also a careful consideration if we want to allow all the parts of ISO639 or only one - especially in the common cases where a languge has (different) codes in more than one part of ISO639. It seems important that we don’t use various codes for the same language. There seems to be a canonical form in RFC-5646 which may be be usable for this purpose (not sure) - creating and verifying this canonical form seems rather hard though in its general form.

linforest · 2 September 2024 07:46

Current combination of Language Tag for Chinese (zh) and territory code (CN) is not granular enough to accurately represent Simplified Chinese used in mainland China:

  <language>
    <terminology_id>
      <value>ISO_639-1</value>
    </terminology_id>
    <code_string>zh</code_string>
  </language>
  <territory>
    <terminology_id>
      <value>ISO_3166-1</value>
    </terminology_id>
    <code_string>CN</code_string>
  </territory>

Desirable Language Tag for Simplified Chinese used in mainland China:

zh-Hans-CN (Chinese written using the Simplified script as used in
      mainland China)

See IETF 4646 Tags for Identifying Languages for details:
https://www.ietf.org/rfc/rfc4646.txt

So, could the current language tag be changed to the following encoding?

  <language>
    <terminology_id>
      <value>IETF_4646</value>
    </terminology_id>
    <code_string>zh-Hans</code_string>
  </language>
  <territory>
    <terminology_id>
      <value>ISO_3166-1</value>
    </terminology_id>
    <code_string>CN</code_string>
  </territory>

sebastian.iancu · 2 September 2024 20:32

@siljelb, I think that (besides issues mentioned by @sebastian.garde ) there is another aspect: using other sets (terminology-id) in RM instance should be fine from RM perspective, but the question is what would be the impact or the expected behavior when data will be read at later stage; should the client-app have support to validate/understand all these sets and codes? So far, we ‘help’ only providing a copy of ISO 639-1 in openEHR terminology specifications, other sets are not yet supported, neither we had plans (as far as I know).

But other then all these, I think having other more granular languages should technically be ok.

linforest · 2 September 2024 20:50

Thanks, @sebastian.iancu.