Archetypes on CKM encoding problem (UTF-8 with BOM)

pablo · 24 May 2013 01:13

Hi all,

I’ve downloaded a couple of archetypes from the CKM to start playing around with ADL in PHP.

I’ve noticed those archetypes contain the BOM (byte order mark), the weird characters at the beginning of the ADL.

Shouldn’t the ADL files be UTF-8 without BOM?

[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.apgar.v1

[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.body_mass_index.v1

[0] => ï»¿archetype (adl_version=1.4)
[1] => 	openEHR-EHR-OBSERVATION.body_weight.v1

For testing purposes I removed the BOM by using the Notepad++ Encoding menu to convert the files to UTF-8 without BOM.
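
For anyone who would rather do this clean-up in code than in an editor, here is a minimal Python sketch of the same idea (the thread is about PHP, but the logic carries over). The function name `strip_utf8_bom` is only illustrative. It drops the three BOM bytes EF BB BF if they start the data and leaves everything else untouched.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b"archetype (adl_version=1.4)\n"

# Decoded as Latin-1, the three BOM bytes become the "weird characters"
# shown in the output above.
latin1_view = raw[:3].decode("latin-1")

cleaned = strip_utf8_bom(raw)
```

Data that has no BOM passes through unchanged, so the function is safe to apply to every file.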

Peter_Gummer1 · 24 May 2013 02:03

Hi Pablo,

There was a lot of discussion about this five years ago. Back then, it was decided that ADL files would always be UTF-8.

The problem is that text editors don't know that we always encode ADL as UTF-8. Without the BOM, there is no way for a text editor to know that it is not in some other encoding. Different text editors behave differently; some of them infer that it's UTF-8, presumably by analysing the byte sequences, and they get it right most of the time; but other text editors assume that it's Latin-1, so they don't handle the ADL correctly.
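
The byte-sequence analysis mentioned above can be sketched roughly as follows in Python. This is only a guess at the kind of heuristic an editor might apply, not a description of any particular editor, and the function name `guess_encoding` is hypothetical.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Latin-1 the way a simple editor might."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # BOM present: no guessing needed
    try:
        data.decode("utf-8")    # strict decoding rejects malformed sequences
        return "utf-8"          # valid UTF-8 (pure ASCII also lands here)
    except UnicodeDecodeError:
        return "latin-1"        # fallback; Latin-1 accepts any byte sequence

sample = "Größe"
utf8_guess = guess_encoding(sample.encode("utf-8"))
latin1_guess = guess_encoding(sample.encode("latin-1"))
```

The weak point is the fallback: an editor that skips the validity check and simply assumes Latin-1 will misread every non-ASCII character in a BOM-less UTF-8 file.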

I forget what the final decision was.

Peter

thomas.beale · 24 May 2013 06:10

Also, Microsoft text editors (at least) put a BOM in the UTF-8 files they write, when they really shouldn't.

We decided that tools should tolerate but not require BOM in ADL files.
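
In Python, for example, "tolerate but not require" is roughly what the standard `utf-8-sig` codec does when decoding: it skips a leading BOM if there is one and otherwise behaves like plain UTF-8. A minimal sketch, with `read_adl_text` as a made-up helper name:

```python
def read_adl_text(data: bytes) -> str:
    """Decode ADL bytes as UTF-8, accepting input with or without a BOM."""
    return data.decode("utf-8-sig")

with_bom = b"\xef\xbb\xbfarchetype (adl_version=1.4)"
without_bom = b"archetype (adl_version=1.4)"

# Both inputs decode to the same text, so the parser never sees the BOM.
text_a = read_adl_text(with_bom)
text_b = read_adl_text(without_bom)
```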

- thomas

ian.mcnicoll · 24 May 2013 06:24

@Sebastian, @Thomas - is this something that can be picked up by the
validators? It has certainly also caused us an occasional problem.

Ian

system · 24 May 2013 06:47

Hi all,

I am pretty sure that the discussion led to the conclusion that we should always have a BOM, at least when the archetype is served by CKM.
I guess ~5 years ago UTF-8 was still less widely accepted and there were problems when you didn’t have the BOM: for example, people edited the file in a text editor that broke the non-ASCII characters of non-English languages, sometimes without noticing, and then uploaded it to CKM, creating a character mess.
These problems may be less relevant now, I don’t know, but the conclusion then was that we should always provide the BOM.
While it obviously does not mark a byte order in UTF-8, the BOM is not illegal in UTF-8, and it tells the reader that the file is UTF-8 without first having to look through the file or simply assume so.
For CKM, it wouldn’t matter much as long as we can safely assume that an uploaded archetype is ALWAYS in UTF-8.

Cheers
Sebastian

system · 24 May 2013 06:58

Hi Ian,

I believe that at the moment we actually add the BOM if it is not there.
When it started to appear in archetypes, presumably added by the Archetype Editor, we actually changed the Java parser to be able to deal with the initial BOM.
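
Adding the BOM only when it is missing could look something like the Python sketch below. It is not CKM's actual code, which is not shown in this thread, and `ensure_utf8_bom` is a hypothetical name.

```python
import codecs

def ensure_utf8_bom(data: bytes) -> bytes:
    """Prefix data with the UTF-8 BOM unless it already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

served = ensure_utf8_bom(b"archetype (adl_version=1.4)")
served_twice = ensure_utf8_bom(served)   # a second pass changes nothing
```

Checking first keeps the operation idempotent, so a file that passes through more than once never ends up with two BOMs.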

There are problems either way; the question is which causes fewer problems.

It is not illegal to have, and text editors can get it wrong if it is not present.
Removing it if it is causing a problem seems to be less problematic to me than messing up characters without realising - e.g. in German there are only a few of these problematic characters (ä, ö, ü, ß) and you may not realise the problem immediately.
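
To illustrate how quietly this can go wrong, here is a small Python sketch of a UTF-8 file being opened as Latin-1 and then saved again as UTF-8. The sample word is made up for the example.

```python
original = "Körpergröße"                 # sample text with two umlauts and an ß
utf8_bytes = original.encode("utf-8")

# An editor that wrongly assumes Latin-1 shows two characters
# for every non-ASCII letter in the file.
misread = utf8_bytes.decode("latin-1")

# Saving that view back as UTF-8 bakes the damage into the file.
resaved = misread.encode("utf-8")

# The ASCII letters survive untouched, which is why the text can still
# look mostly right at a glance.
ascii_kept = [c for c in misread if c.isascii()]
```

Only the three non-ASCII letters are damaged, and every ASCII letter survives, which is why such a file can be uploaded without anyone noticing.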

Cheers
Sebastian

pablo · 3 June 2013 21:32

Hi Sebastian, Ian & Thomas,

Thanks for all the info. I hadn’t considered what Sebastian said, and it makes a lot of sense: the BOM is needed to deal with languages that don’t use common characters, like Russian or Arabic, because editors can guess the wrong encoding if they find “weird” characters.

From now on I’ll assume that all archetypes downloaded from the CKM have the BOM.

Thanks again!
