[[JIRA] Created: (SPEC-302) Translations embedded in the ADL are not efficient and should instead use 'gettext' catalogs.]

Hi All,

I raised the below CR and I wanted to open up discussion on this issue.
Actually I brought it up a few years ago, but I don't have a record of
where or when now.

I know that this would have a major impact on implementers, but I think
the current way we handle translations in ADL is a monster that is only
going to get worse.

Thoughts?

--Tim

Hi Tim, an important issue for me too!

The archetypes I am working with (the Endoscopy ones) are already quite big, and there are 11 (yes, eleven) languages! I completely translated one of them (MST Colon) and it is 6359 lines long. I also have quite a lot of experience with localisation of software in the past, where strategies similar to gettext were quite practical and successful.

BUT: I think all aspects of the “knowledge” - including the translations - should stay within the archetypes. Because:

  1. If the purpose is ‘reuse’, then having multiple files around can be quite problematic.
  2. The same goes for specialisations - we would have to maintain a complex versioning process.
  3. For archetypes whose primary language is not English, this will make it impossible for others (well, 99% of the rest of the World, I guess) to understand without using tools (i.e. with just a text editor).

However, with the current scheme we must also decide how to modify translations. I don’t know if this has been discussed before, but translation changes are frequently needed and, as far as I can see, this does not necessitate specialisation. Perhaps it could be done with versioning of archetypes - but then this might create many versions out there and make things complicated. In this case your proposal seems more appropriate.

One straightforward solution might be to adopt a proprietary file type for representing archetypes - i.e. a multipart file. Still human readable, but able to accommodate more than a single file and perhaps support other file types as well… However, I fear this is how things usually get very complicated, and it would add to the criticism that openEHR is too complex and technical!

So, as a result, I am inclined towards keeping it all in archetypes - unless we find a sound and sensible approach :wink:

-koray

Tim Cook wrote:

I agree Tim that this is an issue. One of the possible ways forward is allowing a user to select which translations they want in the archetype before downloading it from the repository. This would be something that would be relatively easy to do in CKM - this means you only get the languages that you need.

regards Hugh

Hi Koray,

I will let others respond about translations etc., but I did want to pick up on your point about a multi-part file. This was an option recently considered when we were looking at a mechanism to record an MD5 hash of the archetype. There was a desire to provide this hash external to the ADL itself whilst making it available to the archetype consumer locally, so it was not necessary to query some external notary service to do the integrity check. Using a multi-part file would allow the usual PGP message and signature parts to be used. It was thought to be quite a disruptive change, but if there are other reasons to do this…

Heath

Hi Hugh,

I agree Tim that this is an issue. One of the possible ways forward
is allowing a user to select which translations they want in the
archetype before downloading it from the repository. This would be
something that would be relatively easy to do in CKM - this means you
only get the languages that you need.

This would be a good option at this point. A potential user raised with
me the concern of getting several languages that they would never need,
and the potential impact on applications having to manage all of those.

But my real argument is that we should adopt industry standards in this
area, as we have in other areas. It is not only efficient but it makes
people feel more comfortable. If the CKM has (or can have) that export
capability, I wonder how difficult it would be for you (???) to add a
collaborative translation ability like Launchpad has?

See: https://translations.launchpad.net/oship/trunk/+pots/oship
You can upload new templates as needed and download .po (or already
compiled .mo) files that work with all of the standard tools and
application framework mechanisms.

As my CR is a major change, I would like to ask the Jira admins to open
up a target ADL version 2.0. There is already an ADL 2.0 draft
specification. I envision that this is a long-term project, especially
since ADL is now an ISO spec. I would like to target this for 2.0 (not
even on the roadmap yet).

Also, based on the openEHR mantra of
"implementation, implementation, implementation"... I volunteer the OSHIP
project to experiment with not only using this approach but also building
a tool that can extract the languages and create the .po files.

Thoughts?

--Tim
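[Editorial note: the extraction tool Tim proposes could start very small. Below is a hedged Python sketch of turning an archetype's term definitions into a PO-format catalog; the data shape, at-codes, and terms are illustrative only, not OSHIP's or any real tool's model.]

```python
def terms_to_po(terms, target):
    """Render a PO-format catalog for one target language.

    terms  -- {at_code: {lang: text}}; 'en' assumed primary here
    target -- language code to translate into, e.g. 'pt'
    """
    lines = ['msgid ""', 'msgstr ""',
             '"Content-Type: text/plain; charset=UTF-8\\n"', '']
    for code, texts in sorted(terms.items()):
        lines.append('#. at-code: %s' % code)   # keep traceability to ADL
        lines.append('msgid "%s"' % texts['en'])
        lines.append('msgstr "%s"' % texts.get(target, ''))
        lines.append('')
    return '\n'.join(lines)

# illustrative archetype terms
terms = {
    'at0000': {'en': 'Blood pressure', 'pt': 'Pressão arterial'},
    'at0001': {'en': 'Systolic', 'pt': 'Sistólica'},
}
print(terms_to_po(terms, 'pt'))
```

The resulting text can be fed straight into the standard gettext toolchain (msgfmt, PO editors, translation platforms such as Launchpad).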

The MD5 for an archetype is now available in the CKM. IIRC, this was to
verify the validity of the actual ADL source as having come from the
CKM.

But, since archetype exports are being generated as .zip files there is
no reason (that I can think of) why this couldn't be applied on the fly
if a user selects only a few languages (as Hugh suggested) or if a
bundle is created that includes the original language plus specific .po
translation files.

As I said in my response to Hugh, this is certainly a long term issue
but I think it should be addressed sooner rather than later.

Cheers,
Tim

It is clearly true that with a number of translations the archetype will grow
bigger, and initially (some years ago) I thought separate files might be
better as well. But I really wonder if it makes any difference in the end -
since, in generating the 'operational' (aka 'flat') form of an archetype that
is for end use, the languages required (which might still be more than one)
can be retained, and the others filtered out. I don't think gettext would deal
with this properly - the idea that an artefact can have more than one language
active.

The other good thing about the current format (which will eventually migrate
to pure dADL+cADL) is that it is a direct object serialisation, and can be
deserialised straight into in-memory objects (Hash tables in the case of the
translations).

Anyway, I think that we need to carefully look at the requirements on this
one, before leaping to a solution...

- thomas
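[Editorial note: the filtering Thomas describes is simple to express. A hedged sketch with assumed data shapes, not any real tool's API, of retaining only the requested languages when generating the flat form:]

```python
def filter_languages(translations, keep):
    """translations: {lang: {at_code: text}}; keep: languages to retain."""
    keep = set(keep)
    return {lang: terms for lang, terms in translations.items()
            if lang in keep}

# illustrative translation sections of one archetype
translations = {
    'en': {'at0000': 'Blood pressure'},
    'es': {'at0000': 'Presión arterial'},
    'de': {'at0000': 'Blutdruck'},
}
# more than one language can stay active in the flat form
flat = filter_languages(translations, ['en', 'es'])
print(sorted(flat))  # ['en', 'es']
```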

Thomas Beale wrote:

It is clearly true that with a number of translations the archetype will grow
bigger, and initially (some years ago) I thought separate files might be
better as well. But I really wonder if it makes any difference in the end -
since, in generating the 'operational' (aka 'flat') form of an archetype that
is for end use, the languages required (which might still be more than one)
can be retained, and the others filtered out. I don't think gettext would deal
with this properly - the idea that an artefact can have more than one language
active.

The other good thing about the current format (which will eventually migrate
to pure dADL+cADL) is that it is a direct object serialisation, and can be
deserialised straight into in-memory objects (Hash tables in the case of the
translations).

Anyway, I think that we need to carefully look at the requirements on this
one, before leaping to a solution...

- thomas

One problem with the current approach is that regional translations (e.g.
en-us) are not treated any differently from language-only translations
(e.g. en).
Essentially, I believe 'en-us' should only carry the changes from 'en',
falling back to 'en' if a code is not defined in 'en-us'.

This doesn't mean that I am supporting separate files for this reason
(it should be possible to do this with the current one-file approach);
it's just another issue to consider when looking at the requirements.

Sebastian
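[Editorial note: Sebastian's fallback rule is easy to prototype. A minimal Python sketch; the catalog layout is assumed for illustration only.]

```python
def lookup(catalogs, lang, code):
    """Resolve a term, letting 'en-us' carry only its differences."""
    regional = catalogs.get(lang, {})
    if code in regional:
        return regional[code]
    base = lang.split('-')[0]           # 'en-us' -> 'en'
    return catalogs.get(base, {}).get(code)

catalogs = {
    'en': {'at0000': 'Grey scale', 'at0001': 'Colour'},
    'en-us': {'at0000': 'Gray scale'},  # only the regional differences
}
print(lookup(catalogs, 'en-us', 'at0000'))  # regional override
print(lookup(catalogs, 'en-us', 'at0001'))  # falls back to 'en'
```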

It is clearly true that with a number of translations the archetype will grow
bigger, and initially (some years ago) I thought separate files might be
better as well. But I really wonder if it makes any difference in the end -
since, in generating the 'operational' (aka 'flat') form of an archetype that
is for end use, the languages required (which might still be more than one)
can be retained, and the others filtered out. I don't think gettext would deal
with this properly - the idea that an artefact can have more than one language
active.

I can only refer you to the "bazillions" of applications that use
gettext. Browsers and GUI widgets everywhere are designed expecting
gettext catalogs. Not using gettext means that every implementation has
to develop its own filtering mechanisms instead of reusing proven
existing technology. Or you could choose to develop an openEHR
filtering specification, and then develop browser interfaces and widget
interfaces to match.

The other good thing about the current format (which will eventually migrate
to pure dADL+cADL) is that it is a direct object serialisation, and can be
deserialised straight into in-memory objects (Hash tables in the case of the
translations).

Hmmmm, sorry, I don't get the point here. It seems to me you are saying
that you pull all translations into memory, instead of letting the
application decide which one it wants.

Anyway, I think that we need to carefully look at the requirements on this
one, before leaping to a solution...

Of course. That is why I suggested targeting the 2.0 version. There is
a good chance that there will be knock-on effects (good or bad) on the
RM (AuthoredResource, et al.) as well.

I'd like to go back to a very basic question I have. What is the use of
having the original language as (a specific) part of the archetype if it
isn't meant to be the validation language? Seems to me that it is "the"
expression of the original author for the construction of the
archetype. Translations are a convenience for everyone else.

--Tim

Sebastian Garde wrote:

... it's just another issue to consider when looking at the
requirements.
  
Another grey area -- that's "grey" (en), translated as "gray" (en-us)
:wink: -- is that sometimes a translator might want two terms to express
one concept in the primary language.

A case I've encountered is that given a list of English personal forms
of address --Mr, Mrs, Miss, Ms, etc. -- a Spanish translator wanted two
translations of "Mr", namely "Señor" and "Don". He actually added an
extra internal code, with its own at-code.

Now this wish to capture nuances that don't exist in the primary
language strikes me as perfectly reasonable, but it's certainly stepping
outside the bounds of translation. How do we handle this?

- Peter

Local specializations (en-us) :wink: of parent archetypes.

--Tim

Tim,
As I mentioned, the requirement was to have a hash that can be referenced at
runtime without the need to reference an online service. For example a
recent version of the Ocean Template Designer includes integrity checking of
archetypes used in a template to ensure that the archetype is the same as
the one used when the template was last saved. If this process needed to
perform these hash lookups using an online service there would be
significant impact on the user experience. The latest version of the Ocean
Archetype Editor provides this hash in the description other_details so that
it can be easily ignored by systems that don't utilise it.

The other point you make about the hash being provided by CKM is also worth
commenting on. A simple hash on the ADL was not considered useful for the
process above. For one, the hash of the archetype would be different for
ADL and XML files. Secondly, any insignificant change (comments,
whitespace, description meta-data) to the ADL file will change the hash even
though the content (Archetype) model has not. For this reason we developed
a canonical archetype serialization algorithm ensuring that the hash would
be constant as long as the content model was the same. Unfortunately, this
algorithm did include the ontology and its translations but this was deemed
to be a change to the content model, hence a new hash value was necessary.

I will contribute this canonical archetype serialization and hashing
specification to the openEHR wiki as soon as I can get Thomas to create me a
page.

Regards

Heath
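[Editorial note: the idea behind canonical serialization can be illustrated independently of Ocean's algorithm, which is not reproduced in this thread. A hedged Python sketch, using JSON with sorted keys as a stand-in canonical form and MD5 as in the discussion; the model structure is illustrative only.]

```python
import hashlib, json

def content_hash(model):
    """Hash a canonical rendering of the content model, so that
    whitespace, comments and key order in the source file don't matter."""
    canonical = json.dumps(model, sort_keys=True,
                           separators=(',', ':'), ensure_ascii=False)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

a = {'concept': 'at0000', 'definition': {'at0000': 'ELEMENT'}}
b = {'definition': {'at0000': 'ELEMENT'}, 'concept': 'at0000'}  # reordered
print(content_hash(a) == content_hash(b))  # True: same content model
```

The same model serialized in different orders, or with different surrounding whitespace, yields the same hash; only a genuine content change (including, per Heath, a translation change) produces a new value.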

Koray Atalag wrote:

Hi Tim, an important issue for me too!

The archetypes I am working with (the Endoscopy ones) are already quite big, and there are 11 (yes, eleven) languages! I completely translated one of them (MST Colon) and it is 6359 lines long. I also have quite a lot of experience with localisation of software in the past, where strategies similar to gettext were quite practical and successful.

BUT: I think all aspects of the “knowledge” - including the translations - should stay within the archetypes. Because:

  1. If the purpose is ‘reuse’, then having multiple files around can be quite problematic.
  2. The same goes for specialisations - we would have to maintain a complex versioning process.
  3. For archetypes whose primary language is not English, this will make it impossible for others (well, 99% of the rest of the World, I guess) to understand without using tools (i.e. with just a text editor).

However, with the current scheme we must also decide how to modify translations. I don’t know if this has been discussed before, but translation changes are frequently needed and, as far as I can see, this does not necessitate specialisation. Perhaps it could be done with versioning of archetypes - but then this might create many versions out there and make things complicated. In this case your proposal seems more appropriate.

This is already handled in the archetype formalism - with ‘revisions’, which are finally starting to be supported by the tools, including CKM.

I am not a priori for or against any particular solution, but I think
we should remember that the source form of archetypes is not the
format for which efficiency matters; the operational template form is
the important one.

I see your point. Since an author apparently cannot change the status of
their own entry in Jira, I have asked the administrators to
close/delete this one as appropriate.

Thanks for the discussion.

--Tim

Hi Tim

I am keen, as some others have said, that CKM manages this, with users being
able to get as many translations as they need when they download the
archetype. The advantage of this approach is that you do not need the
source language to translate using the archetype, so you do not need to
keep languages that are not used locally.

It does rely on CKM, but it allows full-language archetypes for
repositories whenever they are sought.

Sebastian has already got to a point in CKM where languages are stored
separately and languages can be added, reviewed and approved independently.

Cheers, Sam

Tim Cook wrote:


It is clearly true that with a number of translations the archetype will grow
bigger, and initially (some years ago) I thought separate files might be
better as well. But I really wonder if it makes any difference in the end -
since, in generating the 'operational' (aka 'flat') form of an archetype that
is for end use, the languages required (which might still be more than one)
can be retained, and the others filtered out. I don't think gettext would deal
with this properly - the idea that an artefact can have more than one language
active.


I can only refer you to the "bazillions" of applications that use
gettext. Browsers and GUI widgets everywhere are designed expecting
gettext catalogs. Not using gettext means that every implementation has
to develop its own filtering mechanisms instead of reusing proven
existing technology. Or you could choose to develop an openEHR
filtering specification, and then develop browser interfaces and widget
interfaces to match.

but my question was: if we want an archetype to retain two languages, e.g. English and Spanish, out of the (say) dozen available translations, can gettext be made to do that?
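[Editorial note: on Thomas's question, stock gettext implementations can keep several catalogs active at once in a single process - each language is simply a separate translation object. A self-contained Python sketch, building tiny .mo catalogs in memory per the GNU MO file format; the strings are illustrative.]

```python
import gettext, io, struct

def make_mo(catalog):
    """Build a minimal GNU .mo file in memory from {msgid: msgstr}."""
    keys = sorted(catalog)              # '' metadata entry sorts first
    koffs, voffs, blob = [], [], b''
    for k in keys:
        b = k.encode('utf-8')
        koffs.append((len(b), len(blob)))
        blob += b + b'\x00'
    for k in keys:
        b = catalog[k].encode('utf-8')
        voffs.append((len(b), len(blob)))
        blob += b + b'\x00'
    n = len(keys)
    o_tab, t_tab = 28, 28 + 8 * n       # tables follow the 7-word header
    base = t_tab + 8 * n                # string data follows the tables
    mo = struct.pack('<7I', 0x950412de, 0, n, o_tab, t_tab, 0, 0)
    for ln, off in koffs + voffs:
        mo += struct.pack('<2I', ln, base + off)
    return mo + blob

META = {'': 'Content-Type: text/plain; charset=UTF-8\n'}

# two languages active simultaneously, each its own catalog object
es = gettext.GNUTranslations(io.BytesIO(
    make_mo({**META, 'Blood pressure': 'Presión arterial'})))
pt = gettext.GNUTranslations(io.BytesIO(
    make_mo({**META, 'Blood pressure': 'Pressão arterial'})))

print(es.gettext('Blood pressure'), '/', pt.gettext('Blood pressure'))
```

So retaining two languages at runtime is a matter of loading two catalogs; what gettext does not model is a single artefact that carries all of its languages together, which is the part the archetype source form provides.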


The other good thing about the current format (which will eventually migrate
to pure dADL+cADL) is that it is a direct object serialisation, and can be
deserialised straight into in-memory objects (Hash tables in the case of the
translations).


Hmmmm, sorry, I don't get the point here. It seems to me you are saying
that you pull all translations into memory, instead of letting the
application decide which one it wants.

Well, that is the default; but depending on what ‘application’ we are talking about, this is quite likely what is wanted - e.g. if it is an archetype design tool that also manages translations. But I take your point - we probably should make it so that dADL can ignore some parts of an input file.


Anyway, I think that we need to carefully look at the requirements on this
one, before leaping to a solution...


Of course. That is why I suggested targeting the 2.0 version. There is
a good chance that there will be knock-on effects (good or bad) on the
RM (AuthoredResource, et al.) as well.

I'd like to go back to a very basic question I have.  What is the use of
having the original language as (a specific) part of the archetype if it
isn't meant to be the validation language?  Seems to me that it is "the"
expression of the original author for the  construction of the
archetype.  Translations are a convenience for everyone else.

Not sure I understand the question, Tim - do you mean: is the original language used in validation? There are very few things that are linguistically dependent in the validation operation - only where regular expression constraints are used… I can't think of any others off hand. The linguistic elements of the ontology section get used in the UI of course, and in documents, but that is for humans, not computing.

  • thomas

Sebastian Garde wrote:

One problem with the current approach is that regional translations (e.g.
en-us) are not treated any differently from language-only translations
(e.g. en).
Essentially, I believe 'en-us' should only carry the changes from 'en',
falling back to 'en' if a code is not defined in 'en-us'.

yes, this is definitely the way it should work. Sebastian, I suggest you raise this as an openEHR PR, because I don’t think we have it explicitly on the radar yet. See http://www.openehr.org/issues/browse/SPECPR

  • thomas

We handle it by adding this part of the problem description to the PR Sebastian is raising :wink:

  • thomas

Peter Gummer wrote:


If we use this approach, then we are doing a kind of filtering operation on an archetype which has all languages, to produce a copy which has fewer languages… this will pose some problems for managing such artefacts on file systems perhaps - but it should be manageable with tools.

But I think we need to think about how translation tools would work efficiently - typically a translator will have at least the source and target language available, so that’s two languages; but a good multi-lingual translator might easily cross-reference other languages as well, e.g. they might work with en, es, pt and de all in the tool at once. Whereas I think Tim’s original requirement was to be able to choose which language to use in the runtime system. The latter requirement is already part of the emerging ADL 1.5 & Template specifications, and supported in some tools. I think the main lack is in the tooling support at the moment - we don’t quite yet have a full open source tooling chain from archetype → operational template.

  • thomas

Sam Heard wrote:


Created:
http://www.openehr.org/issues/browse/SPECPR-22
