I ask because ADL itself does not declare a character set anywhere, and we
have had a number of ADL files fail (either to be opened or to be
transformed into XML); on each occasion the cause has been the presence
of a byte order mark (hex bytes EF BB BF), e.g.
Exception in thread "main" se.acode.openehr.parser.TokenMgrError:
Lexical error at line 1, column 1. Encountered: "\u00ef" (239), after : ""
at
se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)
Equally, if a text editor opens an ADL file, assumes UTF-8 and prepends a
BOM, then the Archetype Editor dies, as do the Java ADL parser and the
Windows ADL2XML converter.
Our standard for XML is UTF-8, and I am wondering: if the standard for
ADL is ASCII, how does/will ADL support extended character sets?
e.g. in one of our ADL files there is some Dutch, including
"Een patiënt in rolstoel moet zonder hulp met hoeken en deuren kunnen
omgaan" ("A patient in a wheelchair must be able to manage corners and
doors without help"), where "patiënt" is mis-rendered (though given that
mail clients are pretty good, it will probably render correctly here).
With other tooling we have always had a problem with people pasting in
content from other tools (mostly Word) that use a non-UTF-8 character set.
Equally we have a fair amount of existing content which might also be
pasted into the ADL files via the Archetype editor.
How does the Archetype editor deal with non-ASCII chars if they are
pasted in?
Is there a possible loss of fidelity when converting between ADL (in
ASCII) and XML in UTF-8?
So...
A) In general, what are the standards etc. for character sets and ADL?
B) The various parsers etc. should not blow up upon encountering a
standard byte order mark.
C) Would encoding the ADL in a BOM-less UTF form (e.g. UTF-16LE) be OK?
Right now I can simply clean up each ADL file by running it through a
CharsetEncoder/Decoder; however, I would prefer this to be fixed at source.
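For what it's worth, the clean-up step I'm doing can be sketched in Java roughly as below. This is only an illustrative helper, not part of any openEHR tool; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomStripper {
    // The UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Removes a leading UTF-8 BOM from the file; returns true if one was removed. */
    public static boolean stripBom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2]) {
            // Rewrite the file without the first three bytes
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
            return true;
        }
        return false;
    }
}
```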
BOMs should not be used in UTF-8 files; they are intended for UTF-16
files. Lots of broken programs unfortunately add a BOM to UTF-8, but it
breaks things on Unix or Unix-like environments. The ADL parser in the
ADL Workbench detects BOMs and ignores them. To see Unicode working, you
can try the unicode/family_history archetype in the dev/test area on SVN
in the ADL Workbench - it is in Farsi. It works in all languages we have
tested.
See above, there is no problem in ADL files - they support UTF-8 and
have been tested extensively. If there are problems in other tools, we
need to look carefully at the actions that lead to the problems. I can't
answer off the top of my head whether there are cut and paste problems
in the Archetype Editor, but I don't believe there should be any
problems in its ADL serialisation. Perhaps there are inconsistencies in
the XML serialiser - it is a separate piece of code. We should certainly
get it fixed at source if there are any problems, but I am sure they
will be in the code, not the archetype files themselves.
Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
then the Archetype editor dies ...
I assume you mean the Java Archetype Editor, Adam. The Ocean Archetype
Editor accepts ADL files with or without the BOM.
There are pros and cons as to whether tools should put a BOM at the
start of ADL files.
* As Thomas pointed out, tools that are not Unicode-aware may blow up if
the BOM is present.
* On the other hand, if you omit the BOM then Unicode-aware tools have a
big problem when they open a file. What encoding should they assume?
Some tools like Windows Notepad seem to be very clever at figuring it
out, but others that I have tried in the past (Visual Studio 2005, Vim
6.4 and Mac TextEdit) misinterpret BOM-less UTF-8 files as Latin-1.
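One way a Unicode-aware tool can resolve that ambiguity is to sniff the leading bytes for a BOM and only fall back to a configured default when none is found. A minimal sketch in Java (the class name is hypothetical, and real detection logic would also consider UTF-32 BOMs):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /**
     * Guesses the charset of a text file from its leading bytes.
     * Returns the charset implied by a BOM, or the supplied fallback
     * when no BOM is present (the BOM-less case discussed above).
     */
    public static Charset detect(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;   // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE
        }
        return fallback; // no BOM: the tool has to assume something
    }
}
```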
I am pretty sure that recent versions of the Java Parser won't mind the
BOM being present.
The Java Archetype Editor may still use an old version of the parser?
That's why my assumption till now has been that the BOM is optional [but
not forbidden] for UTF-8 - and it is also of some additional value for
differentiating between UTF-8 and ISO-8859-1 (as long as you assume that
a text doesn't start with "ï»¿" - which is how the BOM bytes read in
ISO-8859-1 - one can safely differentiate between the two without
problems).
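That differentiation can also be done without a BOM: attempt a strict UTF-8 decode, and if the bytes are not valid UTF-8, reinterpret them as ISO-8859-1 (in which every byte sequence is decodable). A sketch in Java; the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /**
     * Decodes bytes as UTF-8 if they form valid UTF-8; otherwise falls
     * back to ISO-8859-1, which can decode any byte sequence.
     */
    public static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)     // fail fast on bad UTF-8
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```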
So just to be certain, if a user has some existing textual content (e.g.
a description of some construct) in a non-UTF-8 source (e.g. a Word
document or HTML page) and pastes into a text area in the Archetype editor:
A) All (mappable) non-UTF-8 chars would be transformed into UTF-8 chars?
B) What about unmappable chars etc ?
e.g. in our WYSIWYG XHTML editor for HL7 MIF we had to put in a decoder, e.g.

    String charSet = "UTF-8";
    <cut>
    Charset charset = Charset.forName(charSet);
    CharsetDecoder dec = charset.newDecoder();
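On question B), one common approach with this style of codec is to configure the encoder to substitute a replacement byte for any character the target charset cannot represent. A hedged sketch of that idea (the class and method are illustrative only, not the Archetype Editor's actual behaviour):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class PasteSanitiser {
    /**
     * Re-encodes pasted text into the target charset, substituting '?'
     * for any character the charset cannot represent.
     */
    public static String toTarget(String pasted, Charset target) throws CharacterCodingException {
        ByteBuffer bb = target.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '?' })
                .encode(CharBuffer.wrap(pasted));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return new String(bytes, target);
    }
}
```

With a UTF-8 target nothing is unmappable, so the text round-trips unchanged; with a narrow target like US-ASCII the "ë" in "patiënt" becomes "?".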
What happens in the Archetype Editor?
The Archetype Editor is built with standard .NET controls, which look
after those details. It's all just standard operating-system behaviour.
A lot of people from various cultures have been working with it, so this
has been well tested.
The only bug reported in 18 months has been in a DataGrid we are using,
where entering a special character by keying its code on the numeric
keypad while holding down the Alt key can cause the wrong character to
appear in the wrong row.