questions about string literals

I am having trouble with the exact definition of the
string literal.

I've been mulling over this for a while. It seems for quoting we have
the following possibilities:

- just have a couple of basic \ rules, and avoid quoted unicode
altogether, on the basis that we can use real unicode files (which we
already can in the next generation of the tools)
- use the ISO rules, that the spec currently indicates, i.e. &aaaa or
&#xHHHH
- use the \uNNNN approach Andrew suggests (is this hex or decimal?)

We have already built some unicode archetypes in Farsi, and no quoting
is needed. The current generation of tools are not quite up to
displaying them yet, but the next generation will do it. So - is there a
strong argument for quoted unicode at all? Is it that we need to cater
for tools or situations where only ascii is allowable in the saved form
of the file? I'm quite happy to go with the \uNNNN approach, but we need
to be clear what it's for. It seems to me that we need to state clearly
that we support two kinds of serialisation: (1) to ASCII, in which
anything not in the basic ISO Latin-1 charset is shown as quoted
unicode, and (2) to true unicode UTF-8 (which is what we have stated
elsewhere in openEHR we will use as the encoding).
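The two serialisations can be sketched roughly as follows (an illustrative helper, not part of any openEHR tool; the longer \U form for codepoints beyond the BMP comes up later in the thread):

```python
# Illustrative sketch: serialise a string to 7-bit-safe ASCII by quoting
# anything outside the printable ASCII range as \uHHHH (or \UHHHHHHHH
# for codepoints beyond the Basic Multilingual Plane).
def to_ascii_serialisation(s: str) -> str:
    out = []
    for ch in s:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F:            # printable ASCII passes through
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")   # BMP codepoint
        else:
            out.append(f"\\U{cp:08X}")   # supplementary-plane codepoint
    return "".join(out)

# The second serialisation is simply the string encoded as UTF-8 bytes,
# with no quoting needed at all.
```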

As for the other quoted characters, I don't see what the need for things
like \f (formfeed) is; what we need is to decide a minimum set which
might be:
- \r - carriage return
- \n - linefeed
- \t - tab
- \\ - backslash
- \" - literal "

Is anything else needed?
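Decoding that minimal set is a few lines in any language. A sketch (hypothetical helper, assuming exactly the five escapes listed above):

```python
# Minimal unescaper for the proposed set: \r \n \t \\ \"
# A backslash before any other character is passed through unchanged.
_ESCAPES = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def unescape_min(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in _ESCAPES:
            out.append(_ESCAPES[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```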

- thomas

Andrew Patterson wrote:

- just have a couple of basic \ rules, and avoid quoted unicode
altogether, on the basis that we can use real unicode files (which we
already can in the next generation of the tools)

fine by me.

- use the ISO rules, that the spec currently indicates, i.e. &aaaa or
&#xHHHH

I think this would be a nightmare - what happens to normal
&'s? These must then be quoted. Also, are all the symbolic
unicode entity names (&acute; etc.) supported?

- use the \uNNNN approach Andrew suggests (is this hex or decimal?)

This is hexadecimal (as per the Unicode spec for codepoints).
C# and Java use this notation - C# extends it to also have \UXXXXXXXX
for 32-bit codepoints (as per the newer Unicode versions).
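As a sketch, decoding these hex escape forms takes only a regular expression (a hypothetical helper; it assumes well-formed \uHHHH and \UHHHHHHHH sequences):

```python
import re

# Decode \uHHHH (4 hex digits, BMP) and \UHHHHHHHH (8 hex digits,
# full codepoint range) into the corresponding characters.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_unicode_escapes(s: str) -> str:
    return _U_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), s)
```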

As for the other quoted characters, I don't see what the need for things
like \f (formfeed) is; what we need is to decide a minimum set which
might be:
- \r - carriage return
- \n - linefeed
- \t - tab
- \\ - backslash
- \" - literal "

Is anything else needed?

In characters, \' for literal '

Andrew

Andrew Patterson wrote:

  

- use the \uNNNN approach Andrew suggests (is this hex or decimal?)
    
This is hexadecimal (as per the unicode spec for unicode codepoints).
C# and Java use this notation - C# extends it to also have \UXXXXXXXX
for 32 bit codepoints (as per the new unicode versions)

One of Andrew's issues I don't think matters too much - the quoting of
'&'; this is because &amp; is used to quote the '&' character. However, I
agree that using a more uniform '\'-based quoting approach will be
clearer, and make for easier parser construction. So let's say that we
will go for the \uHHHH \UHHHHHHHH approach.

Onto more important issues. When do we use real unicode, and when do we
use ASCII files containing quoted unicode? Currently we have made real
unicode work in the ADL workbench and Archetype Editor, and I would not
anticipate any problems in the Java Archetype tools. So it is most
likely a question not of archetype tools, but of sharing ADL files. With
no unicode, and assuming latin-1 based languages, ADL files are (as far
as I can tell) safe to transport as text files. However, even for
languages like Turkish (which has an odd situation to do with upper and
lower case), these files get broken, and unicode is needed; but then an
ADL file is no longer a "text file" from the point of view of file
sharing, mime-type and so on. We have not defined a mime-type, but it
would be one of the application ones I guess.

One problem is that a person receiving an ADL file under the quoting
proposal here might be receiving any of:
- a "safe" text file with only ASCII / latin-1 alphabet characters in it
("real" ascii)
- a "safe" text file with quoted unicode, that is in fact an archetype
written in say Turkish, Farsi, Chinese etc
- a binary file containing UTF-8 unicode characters, that will look like
a text file with some funny characters in it depending on how smart your
editor is...
- or....a UTF-8 encoded file that also contained \uHHHH encoded
characters (due to cut and paste in some editor environment)

There seem to be a couple of ways of dealing with this:
- include an "encoding" attribute at the top of ADL files, indicating
how to read the file
- create a new file extension and specify that .adl is for UTF-8 encoded
files, and that (say) .uadl is for ascii encoded files containing
unicode quoting...

The first is the more obvious thing to do, since it is what XML, HTML
and probably other formats (RTF?) do; this is easy to add to ADL
archetypes as a field. It would have to be an optional field, so that
all current ADL files are not invalidated. This means we (a) have to
choose the allowable encoding names (UTF-8 is the default in openEHR for
true unicode; the other will presumably be ISO-8859-1), and (b) need to
specify which encoding is assumed for an ADL file with no encoding
marker. I propose that it is UTF-8, since we have already "cracked" that
problem, and we say that it is ISO-8859-1 only if the file actually says so.
This might sound odd, but remember UTF-8 is a proper superset of ASCII
anyway, so for all us western language people wondering if our files
will look funny, they won't. However, we could do it the other way round
- I don't see any terribly strong arguments one way or the other.
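A sketch of how a reader might use such an optional field (the `encoding = <"...">` syntax here is hypothetical, not settled ADL syntax):

```python
import re

# Look for a hypothetical encoding marker near the top of the raw bytes.
# Both UTF-8 and ISO-8859-1 are ASCII-compatible, so the header can be
# scanned as ASCII before the real decode; absent a marker, fall back to
# the proposed UTF-8 default.
def sniff_encoding(raw: bytes, default: str = "UTF-8") -> str:
    head = raw[:200].decode("ascii", errors="replace")
    match = re.search(r'encoding\s*=\s*<"([\w-]+)">', head)
    return match.group(1) if match else default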

further thoughts anyone?

- thomas beale

One of Andrew's issues I don't think matters too much - the quoting of
'&'; this is because &amp; is used to quote the '&' character. However, I
agree that using a more uniform '\'-based quoting approach will be
clearer, and make for easier parser construction. So let's say that we
will go for the \uHHHH \UHHHHHHHH approach.

Are you saying that the \u quoting will be used instead of the
XML quoting or in addition to? If you are saying the first, please
ignore the following rant :-)

I still think the & is needlessly confusing and pointless. My
issues are:

1) it is completely non-obvious - as an ADL user I would never expect to use the
XML quoting rules in the string definition in ADL, because ADL
is clearly not an XML document. Sure, it has bits that are like XML, but
if you want it to be XML, then go the whole way. More importantly,
for one of the target groups of ADL, the clinicians, it is a behaviour
that I imagine could confuse them. They have never heard of XML
quoting rules and hence may just type in strings like
"term code meaning pain to head & chest" in their ADL strings.
Now this may be mitigated by the fact that they will often
be editing ADL in a tool, but if ADL is only going to be edited
by tools we should drop the human parseable format and
do the whole thing in XML.

2) It is a pain to implement - now every ADL parser needs to
have an XML entity converter built in as well - which entities are
included - just the XML ones (&lt; &gt; &amp;)? What about the
HTML/SGML ones (&acute; &grave;)? Does every ADL
implementation need to have the table of standard unicode
names built in to be able to parse strings? Do angle brackets
need to be quoted - they do in XML but that is because they
have special meaning. Yet within ADL strings they don't. Of course,
the two characters that do need to be quoted are the \ and the
quotation mark. Are these quoted in XML? Not by default, and
so now the XML programmers are confused :-)

choose the allowable encoding names (UTF-8 is the default in openEHR for
true unicode; the other will presumably be ISO-8859-1); we then need to
specify which encoding is assumed for an ADL file with no encoding
marker; I propose that it is UTF-8, since we already have "cracked" that
problem, and we say that it is only ISO-8859-1 if it actually says so.
This might sound odd, but remember UTF-8 is a proper superset of ASCII
anyway, so for all us western language people wondering if our files
will look funny, they won't. However, we could do it the other way round
- I don't see any terribly strong arguments one way or the other.

I think you are right that it should default to UTF-8. I am not sure of
the correct way of putting the encoding marker in - if it's a standard
archetype field, then the parser is obviously well into parsing the
file before it finds out what encoding the file is in. That rules out
encodings such as UTF-16, because it would be impossible
to write even the first "archetype" keyword in such a way that the
parser could parse it.
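The chicken-and-egg problem is easy to see by comparing the raw bytes of the leading keyword under different encodings:

```python
# UTF-8 keeps the leading "archetype" keyword byte-identical to ASCII,
# so a parser can read up to an in-band encoding marker before knowing
# the encoding. UTF-16 interleaves NUL bytes from the very first
# character, so the marker could never be reached.
utf8_bytes = "archetype".encode("utf-8")       # b'archetype'
utf16_bytes = "archetype".encode("utf-16-le")  # b'a\x00r\x00c\x00...'

assert utf8_bytes == b"archetype"
assert utf16_bytes[:4] == b"a\x00r\x00"
```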

I actually don't feel too strongly that ADL needs to be 7-bit safe
(i.e. I would be happy with UTF-8 as the default and leave it at that
- still including the \uxxxx rules to allow the insertion of characters
that are hard to _edit_, but assume UTF-8 can be read/transported).
Is there any web/email transport mechanism in existence now that
can't pass through an 8-bit stream untouched? Even more so, is there
any modern environment that can't parse UTF-8? (keeping in mind
that this is not saying that openEHR systems won't have to exchange
data with old legacy systems, but I doubt the openEHR system will be
sending the legacy systems ADL files to parse??)

Andrew

Andrew Patterson wrote:

One of Andrew's issues I don't think matters too much - the quoting of
'&'; this is because &amp; is used to quote the '&' character. However, I
agree that using a more uniform '\'-based quoting approach will be
clearer, and make for easier parser construction. So let's say that we
will go for the \uHHHH \UHHHHHHHH approach.
    
Are you saying that the \u quoting will be used instead of the
XML quoting or in addition to? If you are saying the first, please
ignore the following rant :-)
  

I am following your original suggestion, to replace the current XML
quoting rules with \u and \U (since we already use \ to quote anyway,
and as you point out, the & stuff is ugly.)

I still think the & is needlessly confusing and pointless. My
issues are:

1) it is completely non-obvious - as an ADL user I would never expect to use the
XML quoting rules in the string definition in ADL because ADL
is clearly not an XML document.. sure, it has bits that are like XML, but
if you want it to be XML, then go the whole way. More importantly,
for one of the target groups of ADL, the clinicians, it is a behaviour
that I imagine could confuse them. They have never heard of XML
quoting rules and hence may just type in strings like
"term code meaning pain to head & chest" in their ADL strings.
Now this may be mitigated by the fact that they will often
be editing ADL in a tool, but if ADL is only going to be edited
by tools we should drop the human parseable format and
do the whole thing in XML.
  

agreed. I personally don't see XML as useful for anything other than a
purely literal transfer syntax, i.e. a serialisation of objects. ADL is an abstract
syntax, which is both readable by humans, and for which abstract parsers
can be written; the parser that can read the XML form (which will be
supported fairly soon, but is completely unreadable) is a pure object
serialiser/deserialiser, not a language parser.

2) It is a pain to implement - now every ADL parser needs to
have an XML entity converter built in as well - which entities are
included - just the XML ones (&lt; &gt; &amp;)? What about the
HTML/SGML ones (&acute; &grave;)? Does every ADL
implementation need to have the table of standard unicode
names built in to be able to parse strings? Do angle brackets
need to be quoted - they do in XML but that is because they
have special meaning. Yet within ADL strings they don't. Of course,
the two characters that do need to be quoted are the \ and the
quotation mark. Are these quoted in XML? Not by default, and
so now the XML programmers are confused :-)
  

yes, I also agree with this ;-)

  

choose the allowable encoding names (UTF-8 is the default in openEHR for
true unicode; the other will presumably be ISO-8859-1); we then need to
specify which encoding is assumed for an ADL file with no encoding
marker; I propose that it is UTF-8, since we already have "cracked" that
problem, and we say that it is only ISO-8859-1 if it actually says so.
This might sound odd, but remember UTF-8 is a proper superset of ASCII
anyway, so for all us western language people wondering if our files
will look funny, they won't. However, we could do it the other way round
- I don't see any terribly strong arguments one way or the other.
    
I think you are right that it should default to UTF-8. I am not sure
the correct way of putting the encoding marker in - if its a standard
archetype field then the parser is obviously well into parsing the
file before it finds out what encoding the file is in? Which then
invalidates encodings such as UTF-16 because it would be impossible
to write even the first "archetype" keyword in such a way that the
parser could parse it.
  

It probably has to be on the first line, which is easy enough to deal
with. At this stage, I think it is reasonable to allow only UTF-8 and
ISO-8859-1. UTF-16 et al. need byte order markers at the start of
the file (which removes the need for the encoding indicator in the file
I guess); but let's not go there yet.
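Byte-order-mark sniffing is simple enough if it is ever needed (a sketch; note the 32-bit BOMs must be checked before the 16-bit ones, since the UTF-16-LE BOM is a prefix of the UTF-32-LE one):

```python
import codecs

# Map a leading byte-order mark to an encoding name; return None if the
# file has no BOM (e.g. plain UTF-8 or ISO-8859-1 text).
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),  # check 32-bit before 16-bit:
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),  # FF FE 00 00 starts with FF FE
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
]

def detect_bom(raw: bytes):
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```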

I actually don't feel too strongly that ADL needs to be 7-bit safe
(i.e. I would be happy with UTF-8 as the default and leave it at that
- still including the \uxxxx rules to allow the insertion of characters
that are hard to _edit_, but assume UTF-8 can be read/transported).
Is there any web/email transport mechanism in existence now that
can't pass through an 8-bit stream untouched? Even moreso, is there
any modern environment that can't parse UTF-8?? (keeping in mind
that this is not saying that openEHR systems won't have to exchange
data with old legacy systems, but I doubt the openEHR system will be
sending the legacy systems ADL files to parse??)
  

well, Notepad and gvim on Windows don't get it right....but that may
just be display...

- thomas

Thomas Beale wrote:

Andrew Patterson wrote:
  

- use the \uNNNN approach Andrew suggests (is this hex or decimal?)
    

This is hexadecimal (as per the unicode spec for unicode codepoints).
C# and Java use this notation - C# extends it to also have \UXXXXXXXX
for 32 bit codepoints (as per the new unicode versions)

One of Andrew's issues I don't think matters too much - the quoting of
'&'; this is because &amp; is used to quote the '&' character. However, I
agree that using a more uniform '\'-based quoting approach will be
clearer, and make for easier parser construction. So let's say that we
will go for the \uHHHH \UHHHHHHHH approach.

Onto more important issues. When do we use real unicode, and when do we
use ASCII files containing quoted unicode? Currently we have made real
unicode work in the ADL workbench and Archetype Editor, and I would not
anticipate any problems in the Java Archetype tools. So it is mostly
  

The Java ADL parser currently uses UTF-8 as the encoding for parsing.
Support for more encodings is planned, which is quite easy to add in
Java.

likely a question not of archetype tools, but of sharing ADL files. With
no unicode, and assuming latin-1 based languages, ADL files are (as far
as I can tell) safe to transport as text files. However, even for
languages like Turkish (which has an odd situation to do with upper and
lower case), these files get broken, and unicode is needed; but then an
ADL file is no longer a "text file" from the point of view of file
sharing, mime-type and so on. We have not defined a mime-type, but it
would be one of the application ones I guess.

One problem is that a person receiving an ADL file under the quoting
proposal here is that they might be receiving:
- a "safe" text file with only ASCII / latin-1 alphabet characters in it
("real" ascii)
- a "safe" text file with quoted unicode, that is in fact an archetype
written in say Turkish, Farsi, Chinese etc
- a binary file containing UTF-8 unicode characters, that will look like
a text file with some funny characters in it depending on how smart your
editor is...
- or....a UTF-8 encoded file that also contained \uHHHH encoded
characters (due to cut and paste in some editor environment)

There seem to be a couple of ways of dealing with this:
- include an "encoding" attribute at the top of ADL files, indicating
how to read the file
  

I like the idea of including an "encoding" attribute in the ADL,
probably in the archetype header section. It's also good to keep the
encoding information in the archetype (in AOM form) so that the ADL
serialiser can use the right encoding for output.

- create a new file extension and specify that .adl is for UTF-8 encoded
files, and that (say) .uadl is for ascii encoded files containing
unicode quoting...
  

This doesn't seem as flexible as the first option: we would need to
create a new file extension each time we support a new encoding. It's
better to keep all the metadata about the archetype, including the
encoding, in the header section.

The first is the more obvious thing to do, since it is what XML, HTML
and probably other formats (RTF?) do; this is easy to add to ADL
archetypes as a field. It would have to be an optional field, so that
all current ADL files are not invalidated. This means we a) have to
choose the allowable encoding names (UTF-8 is the default in openEHR for
true unicode; the other will presumably be ISO-8859-1); we then need to
specify which encoding is assumed for an ADL file with no encoding
marker; I propose that it is UTF-8, since we already have "cracked" that
problem, and we say that it is only ISO-8859-1 if it actually says so.
  

Agree. This is supported in the Java ADL parser.

Cheers,
Rong