Dear all,
Ever tried to parse a UTF-8-encoded archetype with the Java ADL Parser?
If it worked, you were probably lucky enough to have the 'right' UTF-8, i.e. one
that does not contain the optional (legal, yet totally superfluous)
initial Byte Order Mark (see
http://en.wikipedia.org/wiki/Byte_Order_Mark ). Most Windows tools,
such as Notepad, Notepad++ and Babelpad, add this BOM.
I just learned the hard way that Java InputStreams/Readers insist on
interpreting this BOM (when present in UTF-8) as a normal character,
and so pass the wrong input to the parser. The parser (quite rightly)
does not accept a BOM and reports an error (\ufeff is the BOM):
se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : ""
Stack:
se.acode.openehr.parser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\ufeff" (65279), after : ""
    at se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27429)
    at se.acode.openehr.parser.ADLParser.jj_consume_token(ADLParser.java:7031)
    at se.acode.openehr.parser.ADLParser.archetype(ADLParser.java:212)
    at se.acode.openehr.parser.ADLParser.parse(ADLParser.java:100)
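For anyone who wants to reproduce the underlying behaviour without the parser, here is a minimal sketch. The class name BomDemo and the helper firstChar are made up for illustration; only standard java.io and java.nio classes are used. It decodes a few bytes as UTF-8 and shows that the three BOM bytes EF BB BF come back as the single character \ufeff instead of being skipped.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: decodes the bytes as UTF-8 and returns the first character read.
    public static int firstChar(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF is the UTF-8 encoding of the BOM, followed by the letter 'a'.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        // The decoder hands the BOM to the caller as an ordinary character.
        System.out.println(Integer.toHexString(firstChar(withBom))); // prints "feff"
    }
}
```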
This is clearly a Java issue (I hate to say it), see
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 and
http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=b6c309b1448bb5610f8fe04e3c6df?bug_id=6378911
but it looks like they don't want to fix it because legacy systems rely
on the wrong behaviour (really nice discussions there).
http://koti.mbnet.fi/akini/java/unicodereader/ provides an easy-to-use
fix for this.
I now use this class to make sure the input is decoded correctly:
    // UnicodeReader (from the page above) detects and skips a leading BOM;
    // "UTF-8" is the default encoding used when no BOM is found.
    BufferedReader in = new BufferedReader(
        new UnicodeReader(
            new FileInputStream("infilename"), "UTF-8"));
instead of:
    BufferedReader in = new BufferedReader(
        new InputStreamReader(
            new FileInputStream("infilename"), "UTF-8"));
My question is whether something like this should be implemented
directly in the ADL Parser, transparently to the user of the Parser, so that
no matter which flavour of UTF-8 it receives and no matter how incorrectly
Java itself handles it, the Parser does the right thing (and skips the
initial BOM if present).
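To make the idea concrete, here is a rough sketch of what such transparent handling might look like using only the JDK. It is not code from the ADL Parser or from UnicodeReader; the class name BomSkippingReaderFactory and the method utf8ReaderSkippingBom are hypothetical. It peeks at the first three bytes, drops them if they are EF BB BF, and otherwise pushes them back untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BomSkippingReaderFactory {
    // Hypothetical helper: returns a UTF-8 Reader that omits a leading BOM if one is present.
    public static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // No BOM: put the bytes back so the decoder sees the stream unchanged.
            pb.unread(head, 0, n);
        }
        return new InputStreamReader(pb, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'r', 'c', 'h'};
        try (Reader r = utf8ReaderSkippingBom(new ByteArrayInputStream(withBom))) {
            System.out.println((char) r.read()); // prints "a"
        }
    }
}
```

Unlike UnicodeReader, this sketch only deals with the UTF-8 BOM and does not try to detect UTF-16 or UTF-32 input. If the Parser accepted an InputStream, a wrapper of this kind could be applied inside it before the token manager ever sees the first character.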
Cheers
Sebastian
Dr Sebastian Garde
Dr. sc. hum., Dipl.-Inform. Med, FACHI
Faculty of Business and Informatics, Central Queensland University
Austin Centre for Applied Clinical Informatics, Austin Health
Heidelberg Vic 3084, Australia
s.garde@cqu.edu.au
Ph: +61 (0)3 9496 4040
Fax: +61 (0)3 9496 4224
http://healthinformatics.cqu.edu.au
http://www.acaci.org.au
Visit the new open access electronic Journal of Health Informatics
(eJHI): http://ejhi.net/
