I ask because ADL itself does not declare a character set anywhere, and we
have had a number of ADL files fail (either to be opened or to be
transformed into XML); on each occasion the cause has been the presence
of a byte order mark (hex bytes EF BB BF), e.g.
Exception in thread "main" se.acode.openehr.parser.TokenMgrError:
Lexical error at line 1, column 1. Encountered: "\u00ef" (239), after : ""
at
se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)
Equally, if a text editor opens an ADL file, assumes UTF-8 and prepends a
BOM, then the Archetype Editor dies, as do the Java ADL parser and the
Windows ADL2XML converter.
Our standard for XML is UTF-8, and I am wondering: if the standard for
ADL is ASCII, how does/will ADL support extended character sets?
e.g. in one of our ADL files there is some Dutch, including
"Een patiënt in rolstoel moet zonder hulp met hoeken en deuren kunnen
omgaan" ("A patient in a wheelchair must be able to manage corners and
doors without help"), where "patiënt" is mis-rendered (though given that
mail clients are pretty good, it will probably render correctly here).
With other tooling we have always had a problem with people pasting in
content from other tools (mostly Word) that use a non-UTF-8 character set.
Equally we have a fair amount of existing content which might also be
pasted into the ADL files via the Archetype editor.
How does the Archetype editor deal with non-ASCII chars if they are
pasted in?
Is there a possible loss of fidelity when converting between ADL (in
ASCII) and XML in UTF-8?
So...
A) In general, what are the standards etc. for character sets and ADL?
B) The various parsers etc. should not blow up upon encountering a
standard byte order mark.
C) Would encoding the ADL in a BOM-less UTF form (e.g. UTF-16LE) be OK?
Right now I can simply clean up each ADL file by running it through a
CharsetEncoder/Decoder; however, I would prefer this to be fixed at source.
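For what it's worth, the clean-up step I'm doing can be sketched in Java roughly as below. This is only an illustrative helper, not part of any openEHR tool; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomStripper {
    // The UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Removes a leading UTF-8 BOM from the file; returns true if one was removed. */
    public static boolean stripBom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2]) {
            // Rewrite the file without the first three bytes
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
            return true;
        }
        return false;
    }
}
```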
BOMs should not be used in UTF-8 files; they are intended for UTF-16
files. Lots of broken programs unfortunately add a BOM to UTF-8, but it
breaks things on Unix or Unix-like environments. The ADL parser in the
ADL Workbench detects BOMs and ignores them. To see Unicode working, you
can try the unicode/family_history archetype in the dev/test area on SVN
in the ADL Workbench - it is in Farsi. It works in all languages we have
tested.
See above, there is no problem in ADL files - they support UTF-8 and
have been tested extensively. If there are problems in other tools, we
need to look carefully at the actions that lead to the problems. I can't
answer off the top of my head whether there are cut and paste problems
in the Archetype Editor, but I don't believe there should be any
problems in its ADL serialisation. Perhaps there are inconsistencies in
the XML serialiser - it is a separate piece of code. We should certainly
get it fixed at source if there are any problems, but I am sure they
will be in the code, not the archetype files themselves.
Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
then the Archetype editor dies ...
I assume you mean the Java Archetype Editor, Adam. The Ocean Archetype
Editor accepts ADL files with or without the BOM.
There are pros and cons as to whether tools should put a BOM at the
start of ADL files.
* As Thomas pointed out, tools that are not Unicode-aware may blow up if
the BOM is present.
* On the other hand, if you omit the BOM then Unicode-aware tools have a
big problem when they open a file. What encoding should they assume?
Some tools like Windows Notepad seem to be very clever at figuring it
out, but others that I have tried in the past (Visual Studio 2005, Vim
6.4 and Mac TextEdit) misinterpret BOM-less UTF-8 files as Latin-1.
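One way a Unicode-aware tool can resolve that ambiguity is to sniff the leading bytes for a BOM and only fall back to a configured default when none is found. A minimal sketch in Java (the class name is hypothetical, and real detection logic would also consider UTF-32 BOMs):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /**
     * Guesses the charset of a text file from its leading bytes.
     * Returns the charset implied by a BOM, or the supplied fallback
     * when no BOM is present (the BOM-less case discussed above).
     */
    public static Charset detect(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;   // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE
        }
        return fallback; // no BOM: the tool has to assume something
    }
}
```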
I am pretty sure that recent versions of the Java Parser won't mind the
BOM being present.
The Java Archetype Editor may still use an old version of the parser?
That's why my assumption till now has been that the BOM is optional [but
not forbidden] for UTF-8 - and it is also of some additional value for
differentiating between UTF-8 and ISO-8859-1 (as long as you assume that
a text doesn't start with "ï»¿" - which is how the BOM bytes read in
ISO-8859-1 - one can safely differentiate between the two without
problems).
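That differentiation can also be done without a BOM: attempt a strict UTF-8 decode, and if the bytes are not valid UTF-8, reinterpret them as ISO-8859-1 (in which every byte sequence is decodable). A sketch in Java; the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /**
     * Decodes bytes as UTF-8 if they form valid UTF-8; otherwise falls
     * back to ISO-8859-1, which can decode any byte sequence.
     */
    public static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)     // fail fast on bad UTF-8
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```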
So just to be certain, if a user has some existing textual content (e.g.
a description of some construct) in a non-UTF-8 source (e.g. a Word
document or HTML page) and pastes into a text area in the Archetype editor:
A) All (mappable) non-UTF-8 chars would be transformed into UTF-8 chars?
B) What about unmappable chars etc ?
e.g. in our WYSIWYG XHTML editor for HL7 MIF we had to put in a decoder, e.g.

    String charSet = "UTF-8";
    <cut>
    Charset charset = Charset.forName(charSet);
    CharsetDecoder dec = charset.newDecoder();
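On question B), one common approach with this style of codec is to configure the encoder to substitute a replacement byte for any character the target charset cannot represent. A hedged sketch of that idea (the class and method are illustrative only, not the Archetype Editor's actual behaviour):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class PasteSanitiser {
    /**
     * Re-encodes pasted text into the target charset, substituting '?'
     * for any character the charset cannot represent.
     */
    public static String toTarget(String pasted, Charset target) throws CharacterCodingException {
        ByteBuffer bb = target.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '?' })
                .encode(CharBuffer.wrap(pasted));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return new String(bytes, target);
    }
}
```

With a UTF-8 target nothing is unmappable, so the text round-trips unchanged; with a narrow target like US-ASCII the "ë" in "patiënt" becomes "?".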
What happens in the Archetype Editor?
The Archetype Editor is built with standard .NET controls, which look
after those details. It's all just standard operating-system behaviour.
A lot of people from various cultures have been working with it, so this
has been well tested.
The only bug reported in 18 months has been in a DataGrid we are using,
where entering a special character by keying its code on the numeric
keypad while holding down the Alt key can cause the wrong character to
appear in the wrong row.