# Parsing of UTF-8 archetypes

**Category:** [Reference Implementation: Java (archive)](https://discourse.openehr.org/c/reference-implementation-java-archive/154)
**Created:** 2007-10-19 06:34 UTC
**Views:** 5
**Replies:** 5
**URL:** https://discourse.openehr.org/t/parsing-of-utf-8-archetypes/14671

---

## Post #1 by @sebastian.garde

Dear all,

Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?

If it worked you were probably lucky to have the 'right' UTF-8, i.e. one that does not contain the optional (i.e. legal, yet totally superfluous) initial Byte Order Mark (see http://en.wikipedia.org/wiki/Byte_Order_Mark )... (Most Windows tools like Notepad, Notepad++, Babelpad add this BOM...)

I just learned the hard way that Java InputStreams/Readers insist on interpreting this BOM (when present in UTF-8) as a normal character, thus providing the wrong input to the parser - it (quite rightly so) doesn't parse a BOM and creates an error (\ufeff is the BOM)

se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : "":

Stack:

se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1.
Encountered: "\ufeff" (65279), after : ""
    at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27429)
    at se.acode.openehr.parser.ADLParser.jj_consume_token(ADLParser.java:7031)
    at se.acode.openehr.parser.ADLParser.archetype(ADLParser.java:212)
    at se.acode.openehr.parser.ADLParser.parse(ADLParser.java:100)

This is clearly a Java issue (I hate to say...), see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 and http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=b6c309b1448bb5610f8fe04e3c6df?bug_id=6378911 - but it looks like they don't want to fix it due to legacy systems relying on the wrong behaviour, really nice discussions here :-)

http://koti.mbnet.fi/akini/java/unicodereader/ provides an easy to use fix for this.

I use this class now to ensure that I have the right encoding:

    BufferedReader in = new BufferedReader(
            new UnicodeReader(
                    new FileInputStream("infilename"), "UTF-8"));

Instead of

    BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream("infilename"), "UTF-8"));

The question I have is whether something like this should be implemented directly in the ADL Parser, transparent to the user of the Parser, so that no matter what UTF-8 is obtained and no matter how incorrectly Java itself handles this, the Parser does the right thing (and skips the initial BOM if present)?

Cheers
Sebastian

Dr Sebastian Garde
Dr. sc. hum., Dipl.-Inform.
Med, FACHI
Faculty of Business and Informatics, Central Queensland University
Austin Centre for Applied Clinical Informatics, Austin Health
Heidelberg Vic 3084, Australia
s.garde@cqu.edu.au
Ph: +61 (0)3 9496 4040
Fax: +61 (0)3 9496 4224
http://healthinformatics.cqu.edu.au
http://www.acaci.org.au

Visit the new open access electronic Journal of Health Informatics (eJHI): http://ejhi.net/

---

## Post #2 by @system

> Dear all,
>
> Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?

Hi Sebastian,

Yes, in fact there is a test case dedicated to UTF-8 support. It verifies that text in both Chinese and Swedish is not distorted after being parsed by the ADLParser using UTF-8 encoding.

The TestCase class: [http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/java/se/acode/openehr/parser/UnicodeSupportTest.java](http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/java/se/acode/openehr/parser/UnicodeSupportTest.java)

The test ADL file: [http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/resources/adl-test-entry.unicode_support.test.adl](http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/resources/adl-test-entry.unicode_support.test.adl)

> If it worked you were probably lucky to have the 'right' UTF-8, i.e. one that does not contain the optional (i.e. legal, yet totally superfluous) initial Byte Order Mark (see [http://en.wikipedia.org/wiki/Byte_Order_Mark](http://en.wikipedia.org/wiki/Byte_Order_Mark) )... (Most Windows tools like Notepad, Notepad++, Babelpad add this BOM...)

That's probably why it worked in my case, since I used Eclipse when editing the test ADL file and viewed it in the Firefox browser just to be sure. Have you tried JEdit ([http://www.jedit.org/](http://www.jedit.org/))? I used to use it for editing files in different encodings.
> I just learned the hard way that Java InputStreams/Readers insist on interpreting this BOM (when present in UTF-8) as a normal character, thus providing the wrong input to the parser - it (quite rightly so) doesn't parse a BOM and creates an error (\ufeff is the BOM)
>
> se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : "":
>
> Stack:
>
> se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : ""
>     at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27429)
>     at se.acode.openehr.parser.ADLParser.jj_consume_token(ADLParser.java:7031)
>     at se.acode.openehr.parser.ADLParser.archetype(ADLParser.java:212)
>     at se.acode.openehr.parser.ADLParser.parse(ADLParser.java:100)
>
> This is clearly a Java issue (I hate to say...), see [http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058](http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058) and [http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=b6c309b1448bb5610f8fe04e3c6df?bug_id=6378911](http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=b6c309b1448bb5610f8fe04e3c6df?bug_id=6378911) - but it looks like they don't want to fix it due to legacy systems relying on the wrong behaviour, really nice discussions here :-)
>
> http://koti.mbnet.fi/akini/java/unicodereader/ provides an easy to use fix for this.
>
> I use this class now to ensure that I have the right encoding:
>
> ```
> BufferedReader in = new BufferedReader(
>         new UnicodeReader(
>                 new FileInputStream("infilename"), "UTF-8"));
> ```
>
> Instead of
>
> ```
> BufferedReader in = new BufferedReader(
>         new InputStreamReader(
>                 new FileInputStream("infilename"), "UTF-8"));
> ```
>
> The question I have is whether something like this should be implemented directly in the ADL Parser, transparent to the user of the Parser, so that no matter what UTF-8 is obtained and no matter how incorrectly Java itself handles this, the Parser does the right thing (and skips the initial BOM if present)?

I am inclined to add this support with a flag (instead of as the default) for two reasons: 1) hopefully the bug with the optional BOM will be fixed in the JDK library one day; 2) I am slightly afraid of potential bugs introduced by this workaround.

Cheers,
Rong

---

## Post #3 by @thomas.beale

Sebastian Garde wrote:

> Dear all,
>
> Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?
>
> If it worked you were probably lucky to have the 'right' UTF-8, i.e. one that does not contain the optional (i.e. legal, yet totally superfluous) initial Byte Order Mark (see [http://en.wikipedia.org/wiki/Byte_Order_Mark](http://en.wikipedia.org/wiki/Byte_Order_Mark) )... (Most Windows tools like Notepad, Notepad++, Babelpad add this BOM...)

It should only be added to UTF-16 and UTF-32 files... We also discovered this, and in the Eiffel parser we check if it exists and then we remove it. I suggest you do the same in the Java parser, even though it is annoying to have to compensate for a) dumb tools that do the wrong thing with UTF-8 and b) naively trusting Java unicode libraries....
- thomas

---

## Post #4 by @system

Sebastian,

Can you add this support into adl-parser on the TRUNK? It fits perfectly as your first task as a Java project committer =)

Cheers,
Rong

---

## Post #5 by @sebastian.garde

Rong,

Okay, I modified and committed it.

I chose the following solution. I think it is the easiest and probably even the cleanest (as you have to modify the adl.jj file no matter what you do anyway), and this one only requires one line...

I tested with UTF-8, and also with UTF-16BE and UTF-16LE, and it seems to work fine (they have different BOMs, and the BOM would be consumed by the Reader/InputStream anyway).

I created an additional test case for it based on your Unicode Support Test.

    <*> SKIP : /* WHITE SPACE */
    {
      " "
    | "\t"
    | "\n"
    | "\r"
    | "\f"
    | "\ufeff" /* UTF-8 Byte Order Mark */
    }

Hope this is ok...

Cheers
Sebastian

---

## Post #6 by @system

Well done! =)

/Rong

---

**Canonical:** https://discourse.openehr.org/t/parsing-of-utf-8-archetypes/14671
**Original content:** https://discourse.openehr.org/t/parsing-of-utf-8-archetypes/14671
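
For readers who cannot change the grammar file, the "check if it exists and then remove it" approach suggested in Post #3 can also be applied on the Java side, before the input reaches the parser. The following is only a minimal sketch under stated assumptions: the class name `BomSkipSketch` and the method `openUtf8SkippingBom` are hypothetical and are not part of the openEHR adl-parser, and the sketch handles UTF-8 input only. It relies on the behaviour described in Post #1, namely that Java's UTF-8 decoder passes a leading BOM through as the character `\ufeff`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the openEHR adl-parser.
public class BomSkipSketch {

    // The Byte Order Mark as it appears after UTF-8 decoding.
    private static final int BOM = 0xFEFF;

    /**
     * Wraps a UTF-8 byte stream in a reader that is positioned after
     * an optional leading BOM.
     */
    public static BufferedReader openUtf8SkippingBom(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);              // remember the start of the stream
        if (reader.read() != BOM) {  // first char is real content (or end of stream)
            reader.reset();          // rewind so the caller still sees it
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'd', 'l'};
        byte[] withoutBom = {'a', 'd', 'l'};

        // Both calls print "adl": the BOM, if present, never reaches the caller.
        System.out.println(openUtf8SkippingBom(new ByteArrayInputStream(withBom)).readLine());
        System.out.println(openUtf8SkippingBom(new ByteArrayInputStream(withoutBom)).readLine());
    }
}
```

The returned reader could then be handed to the parser in place of a plain `InputStreamReader`. Compared with the `SKIP` token committed in Post #5, this keeps the grammar untouched but has to be repeated at every place where input is opened.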