testParsingWithoutUTF8Encoding Error

Fabio_Nogueira_de_Lu · 7 February 2010 23:48

Hi,

the unit test testParsingWithoutUTF8Encoding fails on my Mac. After trying to make my Mac use UTF-8 by default instead of Mac Roman, I gave up. However, I am not sure the test is right. Shouldn’t the test work even with a different default encoding?

Thanks in advance.

Fábio

system · 8 February 2010 12:43

Hi Fábio,

I think the test itself is probably right.

public void testParsingWithUTF8Encoding() throws Exception {
    try {
        ADLParser parser = new ADLParser(loadFromClasspath(
                "adl-test-entry.unicode_BOM_support.test.adl"), "UTF-8");
        parser.parse();
    } catch (Throwable t) {
        fail("failed to parse BOM with UTF8 encoding..");
    }
}

If the test fails, I believe something is going wrong in the ADLParser in a Mac environment without UTF-8 as the default encoding, because the archetype cannot be parsed.
Unfortunately I don’t own a Mac, and I am not sure whether Rong does, but maybe you can send us the error message so we can see what the parser is expecting?

I also noticed that the archetype is in Windows format, i.e. it uses a CR LF pair to terminate lines, whereas Unix (including Mac OS X) uses an LF character only and classic Mac OS used a CR character only.
Maybe you can convert it to Unix or classic Mac format and see if that helps. Be sure to keep the invisible byte order mark (BOM) at the beginning of the file, otherwise the test may pass but no longer test what it should.
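A minimal sketch of such a conversion, assuming the file is UTF-8 and small enough to hold in memory. The names LineEndings, crlfToLf and startsWithUtf8Bom are illustrative only and are not part of the reference implementation. Working on raw bytes means the BOM (EF BB BF) passes through untouched:

```java
import java.io.ByteArrayOutputStream;

final class LineEndings {

    // Drops every CR that is immediately followed by LF; all other bytes,
    // including a leading BOM, are copied unchanged.
    static byte[] crlfToLf(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        for (int i = 0; i < input.length; i++) {
            boolean crBeforeLf = input[i] == '\r'
                    && i + 1 < input.length
                    && input[i + 1] == '\n';
            if (!crBeforeLf) {
                out.write(input[i]);
            }
        }
        return out.toByteArray();
    }

    // True if the data begins with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // BOM, then "a" CR LF "b" CR LF
        byte[] windows = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'a', '\r', '\n', 'b', '\r', '\n'};
        byte[] unix = crlfToLf(windows);
        System.out.println("BOM kept: " + startsWithUtf8Bom(unix));
        System.out.println("bytes before/after: " + windows.length + "/" + unix.length);
    }
}
```

A byte-level pass is safe for UTF-8 because the bytes 0x0D and 0x0A never occur inside a multi-byte sequence.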

Regards
Sebastian

Fabio_Nogueira_de_Lu · 8 February 2010 19:25

Hi Sebastian,

the method testParsingWithUTF8Encoding works fine as expected. The only test that fails is testParsingWithoutUTF8Encoding.

----------------- WHAT SEEMS TO BE HAPPENING ----------------------

testParsingWithoutUTF8Encoding calls ADLParser with only one argument, which in turn calls the constructor

public SimpleCharStream(java.io.InputStream dstream, String encoding, int startline,
int startcolumn, int buffersize)

with the parameter encoding == null. In this case the InputStreamReader constructor with just one argument is used. In other words, the inputStream instance variable of SimpleCharStream reads using the platform default encoding. On Macs, Java’s default encoding is Mac Roman. Instead of using the default encoding, maybe ADLParser should use UTF-8 (I am not sure whether this is right). When ADLParser uses UTF-8 on the Mac, everything works fine (testParsingWithUTF8Encoding passes).
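The effect can be reproduced without the parser. The sketch below is only an illustration: DefaultEncodingDemo and decode are made-up names, and ISO-8859-1 stands in for any non-UTF-8 default such as Mac Roman. It decodes the same BOM-prefixed UTF-8 bytes with an explicit charset and with the one-argument InputStreamReader:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingDemo {

    // Decodes the bytes with the given charset. A null charset mimics the
    // encoding == null case: the one-argument InputStreamReader is used,
    // which falls back to the platform default charset.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = (charset == null)
                ? new InputStreamReader(new ByteArrayInputStream(bytes))
                : new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by "Fábio" encoded as UTF-8.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'F', (byte) 0xC3, (byte) 0xA1, 'b', 'i', 'o'};

        System.out.println("platform default: " + Charset.defaultCharset());
        // UTF-8: the BOM becomes the single char U+FEFF, so 6 chars in total.
        System.out.println(decode(sample, StandardCharsets.UTF_8).length());
        // ISO-8859-1: every byte becomes one char, so 9 chars in total.
        System.out.println(decode(sample, StandardCharsets.ISO_8859_1).length());
        // Platform default: the result depends on the machine running this.
        System.out.println(decode(sample, null).length());
    }
}
```

With a single-byte default encoding, the three BOM bytes arrive at the parser as three unrelated characters, which would explain a parse failure at the very start of the file.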

system · 9 February 2010 10:34

Hi Fábio,

sorry, I looked at the wrong test.

I think you are right: we need to construct the parser specifying UTF-8 here.

ADLParser parser = new ADLParser(loadFromClasspath(
        "adl-test-entry.unicode_BOM_support.test.adl"), "UTF-8");

I have checked in the updated code. Can you please check whether it works for you?

Cheers
Sebastian

system · 9 February 2010 11:56

Hi Sebastian, Fábio

If “UTF-8” is specified in the parser constructor, the test will be identical to the previous one, testParsingWithUTF8Encoding. So perhaps we should just specify an encoding other than UTF-8, for example “ISO-8859-1”.

Cheers,
Rong

system · 9 February 2010 12:13

That’s a good idea.
Does that work for you on your Mac, Fábio?

Cheers
Sebastian

Fabio_Nogueira_de_Lu · 9 February 2010 12:30

Hi Sebastian, Rong

there is a third option to consider: when ADLParser is used with only one argument, the encoding can be UTF-8 instead of the platform default. In this case, no changes to the tests are needed. But the documentation should state clearly that UTF-8 is used when no specific encoding is provided. This is in accordance with “Support Information Model”, page 18, section 3.3.1.1, which states: “… In openEHR, UTF-8 encoding is assumed”.

Cheers,

Fábio

system · 9 February 2010 14:08

I like this idea! It can be implemented. What do others think about this proposal?

Cheers,
Rong

system · 9 February 2010 14:22

Hi,

I agree and like it too, but haven’t found a straightforward way of doing this:

The constructor taking only an InputStream is generated automatically by JavaCC and delegates to the two-argument constructor with encoding=null:

public ADLParser(java.io.InputStream stream) {
    this(stream, null);
}

This constructor therefore is fairly unsafe to use because it just assumes the system’s default encoding.

Maybe we can hide it using this approach, which I think would be a breaking change though:
http://www.pisolutions.eu/node/11

Rong may have a far better idea how to change this behaviour.

Cheers
Sebastian

system · 10 February 2010 08:37

Hi Sebastian,

This is a very good option. It would be nice if the new API didn’t break existing code in other components. Perhaps we could rename the original ADLParser class to something else and give the name ADLParser to the new class, with the same kinds of constructors. Then, in the constructor of the new class that takes no encoding, we instantiate the real parser with UTF-8 encoding.
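A rough sketch of that idea, under stated assumptions: GeneratedAdlParser is a hypothetical stand-in for the renamed JavaCC-generated class, and AdlParserFacade stands in for the new class that would take over the ADLParser name. Neither is the actual reference implementation code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

// Hypothetical new front class; in the proposal it would be named ADLParser.
final class AdlParserFacade {
    static final String DEFAULT_ENCODING = "UTF-8";

    private final GeneratedAdlParser delegate;

    // No encoding given: use UTF-8 rather than the platform default.
    AdlParserFacade(InputStream stream) {
        this(stream, DEFAULT_ENCODING);
    }

    // A null encoding is also mapped to UTF-8, so the delegate never
    // falls back to the platform default.
    AdlParserFacade(InputStream stream, String encoding) {
        this.delegate = new GeneratedAdlParser(
                stream, encoding == null ? DEFAULT_ENCODING : encoding);
    }

    String encoding() {
        return delegate.encoding;
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(new AdlParserFacade(empty).encoding());
        System.out.println(new AdlParserFacade(empty, "ISO-8859-1").encoding());
    }
}

// Hypothetical stand-in for the renamed generated parser. It keeps the
// behaviour described in the thread: a null encoding means platform default.
final class GeneratedAdlParser {
    final String encoding;
    final Reader reader;

    GeneratedAdlParser(InputStream stream, String encoding) {
        this.encoding = encoding;
        try {
            this.reader = (encoding == null)
                    ? new InputStreamReader(stream)
                    : new InputStreamReader(stream, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Existing callers that pass an explicit encoding would see no change. Only callers relying on the platform default would behave differently, which is why the documentation would need to state the UTF-8 default.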

Cheers,
Rong

system · 12 February 2010 12:18

Hi Rong,

that makes a lot of sense!

Cheers
Sebastian
