Archetypes on CKM encoding problem (UTF-8 with BOM)

Hi all,

I’ve downloaded a couple of archetypes from the CKM to start playing around with ADL on PHP :slight_smile:

I’ve noticed those archetypes contain the BOM (byte order mark), the weird characters at the beginning of the ADL.

Shouldn’t the ADL files be UTF-8 without BOM?

[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.apgar.v1
[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.body_mass_index.v1
[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.body_weight.v1

For testing purposes I removed the BOM by changing the encoding in Notepad++ to UTF-8 without BOM.
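
For anyone who would rather script that clean-up than do it by hand, here is a minimal sketch in Python (chosen only for illustration, not because the thread uses it). The UTF-8 BOM is the three bytes EF BB BF; read as Latin-1 they display as ï»¿, which matches the characters in the dump above. The function name `strip_utf8_bom` is made up for this example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Sample bytes standing in for the start of a downloaded ADL file.
raw = codecs.BOM_UTF8 + b"archetype (adl_version=1.4)\n"

# The BOM bytes misread as Latin-1 give the three "weird" characters.
bom_as_latin1 = codecs.BOM_UTF8.decode("latin-1")

cleaned = strip_utf8_bom(raw)
```

Calling the function on data that has no BOM returns it unchanged, so it should be safe to apply to every file.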

Hi Pablo,

There was a lot of discussion about this five years ago. Back then, it was decided that ADL files would always be UTF-8.

The problem is that text editors don't know that we always encode ADL as UTF-8. Without the BOM, there is no way for a text editor to know that it is not in some other encoding. Different text editors behave differently; some of them infer that it's UTF-8, presumably by analysing the byte sequences, and they get it right most of the time; but other text editors assume that it's Latin-1, so they don't handle the ADL correctly.
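
To make the guessing problem concrete, here is a rough Python sketch of the kind of heuristic an editor might use. It is an assumption for illustration only, not how any particular editor works, and `guess_encoding` is a hypothetical name.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess an encoding the way a simple editor might (illustrative only)."""
    # A BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Without a BOM, check whether the bytes form valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Fall back to Latin-1, which accepts any byte sequence.
        return "latin-1"
```

An editor that skips the middle step and goes straight to the Latin-1 fallback is the case that mangles ADL files.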

I forget what the final decision was.

Peter

Also, Microsoft text editors (at least) put a BOM in the UTF-8 files they write, when they really shouldn't.

We decided that tools should tolerate but not require BOM in ADL files.
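
As a sketch of what "tolerate but not require" can look like in practice, Python's standard `utf-8-sig` codec drops a leading BOM when decoding and otherwise behaves like plain UTF-8. The helper name `decode_adl` is hypothetical and is not taken from any openEHR tool.

```python
def decode_adl(data: bytes) -> str:
    """Decode ADL bytes as UTF-8, accepting an optional leading BOM."""
    # "utf-8-sig" skips the BOM if it is there and does nothing special if not.
    return data.decode("utf-8-sig")

with_bom = b"\xef\xbb\xbfarchetype (adl_version=1.4)"
without_bom = b"archetype (adl_version=1.4)"

# Both inputs decode to the same text.
same = decode_adl(with_bom) == decode_adl(without_bom)
```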

- thomas

@Sebastian, @Thomas - is this something that can be picked up by the validators? It has certainly also caused us an occasional problem.

Ian

Hi all,

I am pretty sure that the discussion led to the conclusion that we should always have a BOM, at least when the archetype is served by CKM.
I guess ~5 years ago UTF-8 was still less widely accepted and there were problems when you didn’t have the BOM, e.g. people editing the file in a text editor where the non-ASCII characters in non-English languages broke, sometimes without the user noticing, and then uploading it to CKM and creating a character mess.
These problems may be less relevant now, I don’t know, but the conclusion then was that we should always provide the BOM.
While it obviously does not mark a byte order in UTF-8, the BOM is not illegal in UTF-8, and it signals that the file is UTF-8 without first having to look through the file or just assume that it is.
For CKM, it wouldn’t matter much as long as we can safely assume that an uploaded archetype is ALWAYS in UTF8.

Cheers
Sebastian

Hi Ian,

I believe we are actually adding the BOM if it is not there at the moment.
When it started to appear in archetypes, presumably added by the Archetype Editor, we actually changed the Java parser to be able to deal with the initial BOM.
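
For illustration only, "add the BOM if it is not there" could be sketched in Python as below. This is not CKM's actual code, and `ensure_utf8_bom` is a made-up name.

```python
import codecs

def ensure_utf8_bom(data: bytes) -> bytes:
    """Return data with exactly one leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return data  # already marked, leave untouched
    return codecs.BOM_UTF8 + data
```

Applying it twice gives the same result as applying it once, so a file never ends up with two BOMs.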

There are problems either way; the question is which causes fewer problems.

It is not illegal to have, and text editors can get it wrong if it is not present.
Removing it if it is causing a problem seems less problematic to me than messing up characters without realising it - e.g. in German there are only a few of these problematic characters (ä, ö, ü, ß) and you may not notice the problem immediately.
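
A small Python sketch of how that silent corruption happens, assuming an editor that reads BOM-less UTF-8 as Latin-1. The sample text is invented for the example.

```python
text = "Größe, Gewicht"

# The file on disk: UTF-8 bytes without a BOM.
utf8_bytes = text.encode("utf-8")

# An editor that assumes Latin-1 turns each two-byte character into two
# characters; "ö" shows up as "Ã¶".
misread = utf8_bytes.decode("latin-1")

# While nothing has been re-saved, the damage can still be undone.
recovered = misread.encode("latin-1").decode("utf-8")
```

Only the two non-ASCII characters change, which is why the problem is easy to miss in a mostly ASCII archetype.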

Cheers
Sebastian

Hi Sebastian, Ian & Thomas,

Thanks for all the info. I hadn’t considered what Sebastian said, and it makes a lot of sense: the BOM is needed for languages that don’t use the common characters, like Russian or Arabic, because editors can guess the wrong encoding if they find “weird” characters.

From now on I’ll assume that all archetypes downloaded from the CKM have the BOM.

Thanks again!