Parsing of UTF-8 archetypes

Dear all,

Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?

If it worked you were probably lucky to have the 'right' UTF-8, i.e. one
that does not contain the optional (i.e. legal, yet totally superfluous)
initial Byte Order Mark (see
http://en.wikipedia.org/wiki/Byte_Order_Mark )... (Most Windows tools
like Notepad, Notepad++, Babelpad add this BOM...)

I just learned the hard way that Java InputStreams/Readers insist on
interpreting this BOM (when present in UTF-8) as a normal character,
thus providing the wrong input to the parser. The parser (quite rightly)
doesn't accept a BOM and raises an error (\ufeff is the BOM):
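The behaviour is easy to reproduce outside the parser. A minimal sketch (the class name and temp-file handling are mine, not part of the parser): write a UTF-8 BOM followed by ordinary text, read it back through a plain InputStreamReader, and the BOM comes out as the first character.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("bom-demo", ".adl");
        f.deleteOnExit();
        // Write a UTF-8 BOM (EF BB BF) followed by ASCII text,
        // as Notepad & co. do when saving "UTF-8".
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            out.write("archetype".getBytes(StandardCharsets.US_ASCII));
        }
        try (Reader in = new InputStreamReader(new FileInputStream(f), "UTF-8")) {
            int first = in.read();
            // The BOM is handed to the caller as a normal character:
            System.out.println(Integer.toHexString(first)); // prints "feff"
        }
    }
}
```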

se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : ""

Stack:

se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : ""
        at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27429)
        at se.acode.openehr.parser.ADLParser.jj_consume_token(ADLParser.java:7031)
        at se.acode.openehr.parser.ADLParser.archetype(ADLParser.java:212)
        at se.acode.openehr.parser.ADLParser.parse(ADLParser.java:100)

This is clearly a Java issue (I hate to say...), see
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 and
http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=b6c309b1448bb5610f8fe04e3c6df?bug_id=6378911
- but it looks like they don't want to fix it because legacy systems
rely on the wrong behaviour. Really nice discussions there :-)

http://koti.mbnet.fi/akini/java/unicodereader/ provides an easy to use
fix for this.

I use this class now to ensure that I have the right encoding:

        BufferedReader in = new BufferedReader(
                               new UnicodeReader(
                                  new FileInputStream("infilename"), "UTF-8"));

Instead of
        BufferedReader in = new BufferedReader(
                               new InputStreamReader(
                                  new FileInputStream("infilename"), "UTF-8"));

The question I have is whether something like this should be implemented
directly in the ADL Parser, transparent to the user of the Parser, so that
no matter what UTF-8 is supplied, and no matter how incorrectly Java itself
handles it, the Parser does the right thing (and skips the initial BOM
if present)?
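For what it's worth, the core of such a transparent fix is small. A sketch of a BOM-skipping wrapper in the spirit of the UnicodeReader linked above (the class name is mine; this is only the UTF-8 case, not a full encoding detector):

```java
import java.io.*;

/** Sketch: peek at the first bytes with a PushbackInputStream, consume a
 *  UTF-8 BOM if present, and push everything else back untouched. */
public class Utf8BomSkippingReader {
    public static Reader open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        // EF BB BF is the UTF-8 encoding of U+FEFF
        boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                             && (head[1] & 0xFF) == 0xBB
                             && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n);  // no BOM: give the bytes back
        }
        return new InputStreamReader(in, "UTF-8");
    }
}
```

A parser constructor could wrap its input with this unconditionally; streams without a BOM pass through unchanged.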

Cheers
Sebastian

Dr Sebastian Garde
Dr. sc. hum., Dipl.-Inform. Med, FACHI

Faculty of Business and Informatics, Central Queensland University
Austin Centre for Applied Clinical Informatics, Austin Health
Heidelberg Vic 3084, Australia

s.garde@cqu.edu.au
Ph: +61 (0)3 9496 4040
Fax: +61 (0)3 9496 4224

http://healthinformatics.cqu.edu.au
http://www.acaci.org.au

Visit the new open access electronic Journal of Health Informatics
(eJHI): http://ejhi.net/


Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?

Hi Sebastian,

Yes, in fact there is a test case dedicated to UTF-8 support. It verifies that text in both Chinese and Swedish is not distorted after being parsed by the ADLParser using UTF-8 encoding.

The TestCase class:
http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/java/se/acode/openehr/parser/UnicodeSupportTest.java

The test ADL file:
http://svn.openehr.org/ref_impl_java/TRUNK/adl-parser/src/test/resources/adl-test-entry.unicode_support.test.adl

If it worked you were probably lucky to have the ‘right’ UTF-8, i.e. one that does not contain the optional (i.e. legal, yet totally superfluous) initial Byte Order Mark (see http://en.wikipedia.org/wiki/Byte_Order_Mark )… (Most Windows tools like Notepad, Notepad++, Babelpad add this BOM…)

That's probably why it worked in my case: I used Eclipse to edit the test ADL file and viewed it in the Firefox browser just to be sure. Have you tried jEdit ( http://www.jedit.org/ )? I used to use it for editing files in different encodings.

The question I have is, if something like this should be implemented directly in the ADL Parser, transparent to the user of the Parser, so no matter what UTF-8 is obtained and no matter how incorrectly Java itself handles this, the Parser does the right thing (and skips the initial BOM if present)?

I am inclined to add this support behind a flag (instead of as the default), for two reasons: 1) hopefully the bug with the optional BOM will be fixed in the JDK library one day; 2) I am slightly afraid of potential bugs introduced by this workaround.

Cheers,
Rong

Sebastian Garde wrote:

Dear all,

Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?

If it worked you were probably lucky to have the ‘right’ UTF-8, i.e. one that does not contain the optional (i.e. legal, yet totally superfluous) initial Byte Order Mark (see http://en.wikipedia.org/wiki/Byte_Order_Mark )… (Most Windows tools like Notepad, Notepad++, Babelpad add this BOM…)

The BOM should only be added to UTF-16 and UTF-32 files… We also discovered this, and in the Eiffel parser we check if it exists and then remove it. I suggest you do the same in the Java parser, even though it is annoying to have to compensate for a) dumb tools that do the wrong thing with UTF-8 and b) naively trusting Java unicode libraries…

  • thomas

Sebastian,

Can you add this support into adl-parser on the TRUNK? It fits perfectly as your first task as Java project committer =)

Cheers,
Rong


Rong,

Okay, I did modify and commit it.

I chose the following solution; I think it is the easiest and probably
even the cleanest (you have to modify the adl.jj file no matter what you
do), and it only requires one line...

I tested with UTF-8, and also with UTF-16BE and UTF-16LE, and it seems to
work fine (they have different BOMs, and there the BOM is consumed by the
Reader/InputStream anyway).
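The UTF-16 case can be checked in isolation (a quick sketch, not part of the parser): with the generic "UTF-16" charset, the Java decoder consumes the BOM itself, which is why only the UTF-8 BOM needs the extra grammar rule.

```java
import java.io.*;

public class Utf16BomDemo {
    public static void main(String[] args) throws IOException {
        // The "UTF-16" encoder prepends a BOM (FE FF); the matching decoder
        // consumes it again, so the first character read is 'a', not '\ufeff'.
        byte[] utf16 = "archetype".getBytes("UTF-16");
        Reader r = new InputStreamReader(new ByteArrayInputStream(utf16), "UTF-16");
        System.out.println((char) r.read()); // prints "a"
    }
}
```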

I created an additional Test Case for it based on your Unicode Support
Test.

< * > SKIP : /* WHITE SPACE */
{
    " "
  | "\t"
  | "\n"
  | "\r"
  | "\f"
  | "\ufeff" /* UTF-8 Byte Order Mark */
}

Hope this is ok...

Cheers

Sebastian


Well done! =)
/Rong
