# Byte Order Marks

**Category:** [Technical (archive)](https://discourse.openehr.org/c/technical-archive/156)
**Created:** 2008-10-25 09:27 UTC
**Views:** 2
**Replies:** 5
**URL:** https://discourse.openehr.org/t/byte-order-marks/14842

---

## Post #1 by @Adam_Flinton

What is the default CharSet for OpenEHR ADL? ASCII? UTF-8?

I ask because ADL itself does not anywhere declare a character set & we have had a number of ADL files which have failed (either to be opened or to be transformed into XML) & on each occasion the reason has been the presence of a byte order mark (hex bytes EF BB BF) e.g.

    Exception in thread "main" se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\u00ef" (239), after : ""
        at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)

Equally if a text editor opens an ADL file, assumes UTF-8 & puts on a BOM, then the Archetype Editor dies, in addition to the Java ADL parser & the Windows ADL2XML converter.

Our standard for XML is UTF-8 & I am wondering: if the standard for ADL is ASCII, then how does/will ADL support extended character sets?

e.g. in one of our ADL files there is some Dutch including

"Een patiënt in rolstoel moet zonder hulp met hoeken en deuren kunnen omgaan"

where "patiënt" is mis-rendered (though given mail clients are pretty good this will probably be correctly rendered).

With other tooling we always had a problem with people pasting in content from other tools (mostly Word) which had a non-UTF-8 codeset. Equally we have a fair amount of existing content which might also be pasted into the ADL files via the Archetype Editor.

How does the Archetype Editor deal with non-ASCII chars if they are pasted in?

Is there a possible loss of fidelity when converting between ADL (in ASCII) and XML in UTF-8?

So...

A) In general what are the standards etc. for character sets & ADL?
B) The various parsers etc. should not blow up upon running into a standard byte order mark.
C) Would setting the ADL to a non-BOM UTF (e.g. UTF-16LE) be OK?

Right now I can simply clean up the ADL by running each file through a CharsetEncoder/Decoder; however, I would prefer this to be fixed at source.

Adam

---

## Post #2 by @thomas.beale

Adam Flinton wrote:

> Byte Order Marks
>
> What is the default CharSet for OpenEHR ADL?
>
> ASCII? UTF-8?

Hi Adam,

UTF-8 is the preferred encoding. See section 3 of http://www.openehr.org/releases/1.0.1/architecture/am/adl.pdf

> I ask because ADL itself does not anywhere declare a character set & we
> have had a number of adl files which have failed (either to be opened or
> to be transformed into XML) & on each occasion the reason has been the
> presence of a byte order mark (hex bytes EF BB BF) e.g.
>
> Exception in thread "main" se.acode.openehr.parser.TokenMgrError:
> Lexical error at line 1, column 1. Encountered: "\u00ef" (239), after : ""
> at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)
>
> Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
> then the Archetype editor dies in addition to the Java ADL parser & the
> Windows ADL2XML converter.

BOMs should not be used in UTF-8 files; they are only for UTF-16 files. Lots of broken programs unfortunately seem to add a BOM for UTF-8, but it breaks things on Unix or Unix-like environments. The ADL parser in the ADL Workbench detects BOMs and ignores them.

To see Unicode working, you can try the unicode/family_history archetype (it is in Farsi) from the dev/test area on SVN in the ADL Workbench. It works in all languages we have tested.
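As an illustration of the "detect the BOM and ignore it" behaviour described above, here is a minimal Java sketch. It is not the ADL Workbench code (which is written in Eiffel) and not the Java ADL parser's implementation; the class name `BomStripper` and its methods are hypothetical, and the sketch assumes the whole file has already been read into a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: drops a leading UTF-8 BOM
// (bytes EF BB BF) before the content is handed to a parser.
final class BomStripper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data starts with the three-byte UTF-8 BOM.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Decodes the bytes as UTF-8, skipping the BOM if one is present.
    static String decodeWithoutBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? UTF8_BOM.length : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'a', 'r', 'c', 'h', 'e', 't', 'y', 'p', 'e'};
        // The BOM bytes are skipped, so the parser would never see "\u00ef".
        System.out.println(decodeWithoutBom(withBom));
    }
}
```

A caller would pass the result of `decodeWithoutBom` to the parser instead of the raw file contents, so files with and without a BOM are treated identically.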
If you really want proof (;-) you can see how the ADL parser (used inside the Archetype Editor and ADL Workbench) does its character matching - see

http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/structures/syntax/dadl/parser/dadl_scanner.l

and

http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/components/adl_parser/src/syntax/cadl/parser/cadl_scanner.l

These are the dADL and cADL lexers respectively. They parse all strings in a UTF-8 aware way. In addition, you can see here where the BOMs are removed from ADL files, if found:

http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/file_system/file_context.e

> Our standard for XML is UTF-8 & I am wondering that if the std is in ADL
> ASCII then how does/will adl support extended character sets?
>
> e.g. in one of our adl there is some Dutch including
>
> "Een patiënt in rolstoel moet zonder hulp met hoeken en deuren kunnen
> omgaan" where "patiënt" is mis-rendered (though given mail clients are
> pretty good this will probably be correctly rendered)

See above, there is no problem in ADL files - they support UTF-8 and have been tested extensively. If there are problems in other tools, we need to look carefully at the actions that lead to the problems. I can't answer off the top of my head whether there are cut and paste problems in the Archetype Editor, but I don't believe there should be any problems in its ADL serialisation. Perhaps there are inconsistencies in the XML serialiser - it is a separate piece of code. We should certainly get it fixed at source if there are any problems, but I am sure they will be in the code, not the archetype files themselves.

\- thomas beale

---

## Post #3 by @Peter_Gummer1

Adam Flinton wrote:

> Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
> then the Archetype editor dies ...

I assume you mean the Java Archetype Editor, Adam.
The Ocean Archetype Editor accepts ADL files with or without the BOM.

There are pros and cons as to whether tools should put a BOM at the start of ADL files.

* As Thomas pointed out, tools that are not Unicode-aware may blow up if the BOM is present.
* On the other hand, if you omit the BOM then Unicode-aware tools have a big problem when they open a file. What encoding should they assume? Some tools like Windows Notepad seem to be very clever at figuring it out, but others that I have tried in the past (Visual Studio 2005, Vim 6.4 and Mac TextEdit) misinterpret BOM-less UTF-8 files as Latin-1.

\- Peter

---

## Post #4 by @system

Peter Gummer wrote:

> Adam Flinton wrote:
>
>> Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
>> then the Archetype editor dies ...
>
> I assume you mean the Java Archetype Editor, Adam. The Ocean Archetype
> Editor accepts ADL files with or without the BOM.

I am pretty sure that recent versions of the Java parser won't mind the BOM being present. The Java Archetype Editor may still use an old version of the parser?

> There are pros and cons whether tools should put a BOM at the start of
> ADL files.
>
> * As Thomas pointed out, tools that are not Unicode-aware may blow up if
> the BOM is present.
>
> * On the other hand, if you omit the BOM then Unicode-aware tools have a
> big problem when they open a file. What encoding should they assume?
> Some tools like Windows Notepad seem to be very clever at figuring it
> out, but others that I have tried in the past (Visual Studio 2005, Vim
> 6.4 and Mac TextEdit) misinterpret BOM-less UTF-8 files as Latin-1.
>

That's why my assumption till now was that the BOM is optional [but not forbidden] for UTF-8, and that it is also of some additional value for differentiating between UTF-8 and ISO-8859-1 (as long as you assume that a text doesn't start with ï»¿ - the BOM as it appears in ISO-8859-1 - one can safely differentiate between the two without problems).

Sebastian

---

## Post #5 by @Adam_Flinton

> See above, there is no problem in ADL files - they support UTF-8 and
> have been tested extensively. If there are problems in other tools, we
> need to look carefully at the actions that lead to the problems. I can't
> answer off the top of my head whether there are cut and paste problems
> in the Archetype Editor, but I don't believe there should be any
> problems in its ADL serialisation. Perhaps there are inconsistencies in
> the XML serialiser - it is a separate piece of code. We should certainly
> get it fixed at source if there are any problems, but I am sure they
> will be in the code, not the archetype files themselves.

So just to be certain, if a user has some existing textual content (e.g. a description of some construct) in a non-UTF-8 source (e.g. a Word document or HTML page) and pastes it into a text area in the Archetype Editor:

A) All (mappable) non-UTF-8 chars would be transformed into UTF-8 chars?
B) What about unmappable chars etc.?

e.g. in our WYSIWYG XHTML editor for HL7 Mif we had to put in a decoder, e.g.

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    String charSet = "UTF-8";
    <cut>
    Charset charset = Charset.forName(charSet);
    CharsetDecoder dec = charset.newDecoder();
    // Silently drop input that is malformed or cannot be mapped.
    dec.onMalformedInput(CodingErrorAction.IGNORE);
    dec.onUnmappableCharacter(CodingErrorAction.IGNORE);

i.e. it just drops the chars which are either malformed or unmappable.

What happens wrt the Archetype Editor?

Adam

---

## Post #6 by @Peter_Gummer1

Adam Flinton wrote:

> So just to be certain, if a user has some existing textual content (e.g.
> a description of some construct) in a non-UTF-8 source (e.g. a Word
> document or HTML page) and pastes into a text area in the Archetype editor:
> A) All (mappable) non-UTF-8 chars would be transformed into UTF-8 chars?
> B) What about unmappable chars etc ?
>
> ...
>
> What happens wrt the Archetype editor?

Archetype Editor is built with standard .NET controls, which look after those details. It's all just standard operating system stuff. A lot of people from various cultures have been working with it, so this has been well tested. The only bug reported in 18 months has been in a DataGrid that we're using, where entering a special character by keying its code on the numeric keypad whilst holding down the Alt key can cause the wrong character to appear in the wrong row.

\- Peter

---

**Canonical:** https://discourse.openehr.org/t/byte-order-marks/14842
**Original content:** https://discourse.openehr.org/t/byte-order-marks/14842