# questions about string literals

**Category:** [Technical (archive)](https://discourse.openehr.org/c/technical-archive/156)
**Created:** 2006-09-18 12:59 UTC
**Views:** 5
**Replies:** 6
**URL:** https://discourse.openehr.org/t/questions-about-string-literals/14567

---

## Post #1 by @Andrew_Patterson

I am having trouble with the exact definition of the string literal\.\.

---

## Post #2 by @thomas.beale

I've been mulling over this for a while\. It seems for quoting we have the following possibilities:

- just have a couple of basic \\ rules, and avoid quoted unicode altogether, on the basis that we can use real unicode files \(which we already can in the next generation of the tools\)
- use the ISO rules, that the spec currently indicates, i\.e\. &aaaa or &\#xHHHH
- use the \\uNNNN approach Andrew suggests \(is this hex or decimal?\)

We have already built some unicode archetypes in Farsi, and no quoting is needed\. The current generation of tools are not quite up to displaying them yet, but the next generation will do it\.

So \- is there a strong argument for quoted unicode at all? Is it that we need to cater for tools or situations where only ascii is allowable in the saved form of the file? I'm quite happy to go with the \\uNNNN approach, but we need to be clear what it's for; and it seems to me that we need to state clearly that we support 2 kinds of serialisation: 1 to ASCII, in which anything not in the basic ISO latin\-1 charset is shown as quoted unicode, and 2, to true unicode UTF\-8 \(which is what we have stated elsewhere in openEHR we will use as the encoding\)\.

As for the other quoted characters, I don't see what the need for things like \\f \(formfeed\) is; what we need is to decide a minimum set which might be:

- \\r \- carriage return
- \\n \- linefeed
- \\t \- tab
- \\\\ \- backslash
- \\" \- literal "

Is anything else needed?
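To make the two serialisations concrete, here is a minimal sketch of a string escaper and unescaper for the minimum set listed above plus quoted unicode. It is not taken from any ADL tool: the function names are hypothetical, the `\uHHHH` digits are assumed to be hexadecimal (the question is still open at this point in the thread), an 8-digit `\UHHHHHHHH` form is assumed for code points above U+FFFF, and the ASCII form here quotes everything above U+007F rather than everything outside latin\-1.

```python
import re

# Minimum escape set proposed in the post above (hypothetical table name).
SIMPLE_ESCAPES = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\", '"': '"'}
_REVERSE = {char: "\\" + key for key, char in SIMPLE_ESCAPES.items()}

# A backslash followed by uHHHH, UHHHHHHHH, or any other single character.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)", re.DOTALL)


def unescape_adl_string(body: str) -> str:
    """Decode escapes in the text found between the quotes of a literal.

    A lone backslash at the very end of the input is left untouched.
    """
    def replace(match: re.Match) -> str:
        token = match.group(1)
        if token[0] in "uU" and len(token) > 1:
            # Assumption: the digits are hexadecimal code point values.
            return chr(int(token[1:], 16))
        if token in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[token]
        raise ValueError(f"unsupported escape: \\{token}")

    return _ESCAPE.sub(replace, body)


def escape_adl_string(text: str) -> str:
    """Produce an ASCII-only form of text using the same escapes."""
    out = []
    for char in text:
        code_point = ord(char)
        if char in _REVERSE:
            out.append(_REVERSE[char])
        elif code_point < 0x80:
            out.append(char)
        elif code_point <= 0xFFFF:
            out.append(f"\\u{code_point:04X}")
        else:
            out.append(f"\\U{code_point:08X}")
    return "".join(out)
```

Escaping and then unescaping should return the original text, so the same archetype content could be written either as real UTF\-8 or as an ASCII file with quoted unicode.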
\- thomas

---

## Post #3 by @Andrew_Patterson

> \- just have a couple of basic \\ rules, and avoid quoted unicode
> altogether, on the basis that we can use real unicode files \(which we
> already can in the next generation of the tools\)

fine by me\.

> \- use the ISO rules, that the spec currently indicates, i\.e\. &aaaa or
> &\#xHHHH

I think this would be a nightmare \- what happens to normal &'s \- these must then be quoted\. Also, are all the symbolic unicode names supported \(\&acute; etc\)?

> \- use the \\uNNNN approach Andrew suggests \(is this hex or decimal?\)

This is hexadecimal \(as per the unicode spec for unicode codepoints\)\. C\# and Java use this notation \- C\# extends it to also have \\UXXXXXXXX for 32 bit codepoints \(as per the new unicode versions\)

> As for the other quoted characters, I don't see what the need for things
> like \\f \(formfeed\) is; what we need is to decide a minimum set which
> might be:
>
> - \\r \- carriage return
> - \\n \- linefeed
> - \\t \- tab
> - \\\\ \- backslash
> - \\" \- literal "
>
> Is anything else needed?

In characters, \\' for literal '

Andrew

---

## Post #4 by @thomas.beale

Andrew Patterson wrote:

>> \- use the \\uNNNN approach Andrew suggests \(is this hex or decimal?\)
>
> This is hexadecimal \(as per the unicode spec for unicode codepoints\)\.
> C\# and Java use this notation \- C\# extends it to also have \\UXXXXXXXX
> for 32 bit codepoints \(as per the new unicode versions\)

One of Andrew's issues I don't think matters too much \- the quoting of &; this is because \&amp; is used to quote the '&' character\. However, I agree that using a more uniform '\\'\-based quoting approach will be clearer, and make for easier parser construction\. So let's say that we will go for the \\uHHHH \\UHHHHHHHH approach\.

Onto more important issues\. When do we use real unicode, and when do we use ASCII files containing quoted unicode?
Currently we have made real unicode work in the ADL workbench and Archetype Editor, and I would not anticipate any problems in the Java Archetype tools\. So it is most likely a question not of archetype tools, but of sharing ADL files\. With no unicode, and assuming latin\-1 based languages, ADL files are \(as far as I can tell\) safe to transport as text files\. However, even for languages like Turkish \(which has an odd situation to do with upper and lower case\), these files get broken, and unicode is needed; but then an ADL file is no longer a "text file" from the point of view of file sharing, mime\-type and so on\. We have not defined a mime\-type, but it would be one of the application ones I guess\.

One problem is that a person receiving an ADL file under the quoting proposal here might be receiving:

- a "safe" text file with only ASCII / latin\-1 alphabet characters in it \("real" ascii\)
- a "safe" text file with quoted unicode, that is in fact an archetype written in say Turkish, Farsi, Chinese etc
- a binary file containing UTF\-8 unicode characters, that will look like a text file with some funny characters in it depending on how smart your editor is\.\.\.
- or\.\.\.\.a UTF\-8 encoded file that also contained \\uHHHH encoded characters \(due to cut and paste in some editor environment\)

There seem to be a couple of ways of dealing with this:

- include an "encoding" attribute at the top of ADL files, indicating how to read the file
- create a new file extension and specify that \.adl is for UTF\-8 encoded files, and that \(say\) \.uadl is for ascii encoded files containing unicode quoting\.\.\.

The first is the more obvious thing to do, since it is what XML, HTML and probably other formats \(RTF?\) do; this is easy to add to ADL archetypes as a field\. It would have to be an optional field, so that all current ADL files are not invalidated\.
This means we a\) have to choose the allowable encoding names \(UTF\-8 is the default in openEHR for true unicode; the other will presumably be ISO\-8859\-1\); we then need to specify which encoding is assumed for an ADL file with no encoding marker; I propose that it is UTF\-8, since we already have "cracked" that problem, and we say that it is only ISO\-8859\-1 if it actually says so\. This might sound odd, but remember UTF\-8 is a proper superset of ASCII anyway, so for all us western language people wondering if our files will look funny, they won't\. However, we could do it the other way round \- I don't see any terribly strong arguments one way or the other\.

further thoughts anyone?

\- thomas beale

---

## Post #5 by @Andrew_Patterson

> One of Andrew's issues I don't think matters too much \- the quoting of
> &; this is because \&amp; is used to quote the '&' character\. However, I
> agree that using a more uniform '\\'\-based quoting approach will be
> clearer, and make for easier parser construction\. So let's say that we
> will go for the \\uHHHH \\UHHHHHHHH approach\.

Are you saying that the \\u quoting will be used instead of the XML quoting or in addition to? If you are saying the first, please ignore the following rant :\)

I still think the & is needlessly confusing and pointless\. My issues are:

1\) it is completely non\-obvious \- as an ADL user I would never expect to use the XML quoting rules in the string definition in ADL because ADL is clearly not an XML document\.\. sure, it has bits that are like XML, but if you want it to be XML, then go the whole way\. More importantly, for one of the target groups of ADL, the clinicians, it is a behaviour that I imagine could confuse them\. They have never heard of XML quoting rules and hence may just type in strings like "term code meaning pain to head & chest" in their ADL strings\.
Now this may be mitigated by the fact that they will often be editing ADL in a tool, but if ADL is only going to be edited by tools we should drop the human parseable format and do the whole thing in XML\.

2\) It is a pain to implement \- now every ADL parser needs to have an XML entity converter built in as well \- which entities are included \- just the XML ones \(\&lt; \&gt; \&amp;\)? What about the HTML/SGML ones \(\&acute; \&grave;\)?? Does every ADL implementation need to have the table of standard unicode names built in to be able to parse strings? Do angle brackets need to be quoted \- they do in XML but that is because they have special meaning\. Yet within ADL strings they don't\. Of course, the two characters that do need to be quoted are the \\ and the quotation mark\. Are these quoted in XML? Not by default, and so now the XML programmers are confused :\)

> choose the allowable encoding names \(UTF\-8 is the default in openEHR for
> true unicode; the other will presumably be ISO\-8859\-1\); we then need to
> specify which encoding is assumed for an ADL file with no encoding
> marker; I propose that it is UTF\-8, since we already have "cracked" that
> problem, and we say that it is only ISO\-8859\-1 if it actually says so\.
> This might sound odd, but remember UTF\-8 is a proper superset of ASCII
> anyway, so for all us western language people wondering if our files
> will look funny, they won't\. However, we could do it the other way round
> \- I don't see any terribly strong arguments one way or the other\.

I think you are right that it should default to UTF\-8\. I am not sure of the correct way of putting the encoding marker in \- if it's a standard archetype field then the parser is obviously well into parsing the file before it finds out what encoding the file is in? Which then invalidates encodings such as UTF\-16 because it would be impossible to write even the first "archetype" keyword in such a way that the parser could parse it\.
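The UTF\-16 point can be checked with a few lines of Python. This is only an illustration of the byte\-level difference, not parser code; the variable and function names are made up for the example.

```python
KEYWORD = "archetype"

# ASCII characters map to the same single byte in UTF-8 and ISO-8859-1,
# so a parser can read the keyword before it knows which of the two it has.
utf8_bytes = KEYWORD.encode("utf-8")
latin1_bytes = KEYWORD.encode("iso-8859-1")

# UTF-16 uses two bytes per character here, so the same keyword
# is a different byte sequence altogether.
utf16_bytes = KEYWORD.encode("utf-16-le")


def looks_like_keyword(data: bytes) -> bool:
    """Naive byte-level check a single-byte-oriented parser might make."""
    return data.startswith(b"archetype")
```

With these values, `looks_like_keyword` accepts the UTF\-8 and ISO\-8859\-1 bytes and rejects the UTF\-16 bytes, which is why an in\-file encoding field only helps for encodings that agree on ASCII.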

I actually don't feel too strongly that ADL needs to be 7\-bit safe \(i\.e\. I would be happy with UTF\-8 as the default and leave it at that \- still including the \\uxxxx rules to allow the insertion of characters that are hard to \_edit\_, but assume UTF\-8 can be read/transported\)\. Is there any web/email transport mechanism in existence now that can't pass through an 8\-bit stream untouched? Even more so, is there any modern environment that can't parse UTF\-8?? \(keeping in mind that this is not saying that openEHR systems won't have to exchange data with old legacy systems, but I doubt the openEHR system will be sending the legacy systems ADL files to parse??\)

Andrew

---

## Post #6 by @thomas.beale

Andrew Patterson wrote:

>> One of Andrew's issues I don't think matters too much \- the quoting of
>> &; this is because \&amp; is used to quote the '&' character\. However, I
>> agree that using a more uniform '\\'\-based quoting approach will be
>> clearer, and make for easier parser construction\. So let's say that we
>> will go for the \\uHHHH \\UHHHHHHHH approach\.
>
> Are you saying that the \\u quoting will be used instead of the
> XML quoting or in addition to? If you are saying the first, please
> ignore the following rant :\)

I am following your original suggestion, to replace the current XML quoting rules with \\u and \\U \(since we already use \\ to quote anyway, and as you point out, the & stuff is ugly\.\)

> I still think the & is needlessly confusing and pointless\. My
> issues are:
>
> 1\) it is completely non\-obvious \- as an ADL user I would never expect to use the
> XML quoting rules in the string definition in ADL because ADL
> is clearly not an XML document\.\. sure, it has bits that are like XML, but
> if you want it to be XML, then go the whole way\. More importantly,
> for one of the target groups of ADL, the clinicians, it is a behaviour
> that I imagine could confuse them\.
> They have never heard of XML
> quoting rules and hence may just type in strings like
> "term code meaning pain to head & chest" in their ADL strings\.
> Now this may be mitigated by the fact that they will often
> be editing ADL in a tool, but if ADL is only going to be edited
> by tools we should drop the human parseable format and
> do the whole thing in XML\.

agreed\. I personally don't see XML as useful other than a purely literal transfer syntax, i\.e\. a serialisation of objects\. ADL is an abstract syntax, which is both readable by humans, and for which abstract parsers can be written; the parser that can read the XML form \(which will be supported fairly soon, but is completely unreadable\) is a pure object serialiser/deserialiser, not a language parser\.

> 2\) It is a pain to implement \- now every ADL parser needs to
> have an XML entity converter built in as well \- which entities are
> included \- just the XML ones \(\&lt; \&gt; \&amp;\)? What about the
> HTML/SGML ones \(\&acute; \&grave;\)?? Does every ADL
> implementation need to have the table of standard unicode
> names built in to be able to parse strings? Do angle brackets
> need to be quoted \- they do in XML but that is because they
> have special meaning\. Yet within ADL strings they don't\. Of course,
> the two characters that do need to be quoted are the \\ and the
> quotation mark\. Are these quoted in XML? Not by default, and
> so now the XML programmers are confused :\)

yes, I also agree with this ;\-\)

>> choose the allowable encoding names \(UTF\-8 is the default in openEHR for
>> true unicode; the other will presumably be ISO\-8859\-1\); we then need to
>> specify which encoding is assumed for an ADL file with no encoding
>> marker; I propose that it is UTF\-8, since we already have "cracked" that
>> problem, and we say that it is only ISO\-8859\-1 if it actually says so\.
>> This might sound odd, but remember UTF\-8 is a proper superset of ASCII
>> anyway, so for all us western language people wondering if our files
>> will look funny, they won't\. However, we could do it the other way round
>> \- I don't see any terribly strong arguments one way or the other\.
>
> I think you are right that it should default to UTF\-8\. I am not sure
> the correct way of putting the encoding marker in \- if its a standard
> archetype field then the parser is obviously well into parsing the
> file before it finds out what encoding the file is in? Which then
> invalidates encodings such as UTF\-16 because it would be impossible
> to write even the first "archetype" keyword in such a way that the
> parser could parse it\.

It probably has to be on the first line, which is easy enough to deal with\. At this stage, I think it's reasonable to just allow UTF\-8 and ISO\-8859\-1 only\. UTF\-16 et al need byte order markers at the start of the file \(which removes the need for the encoding indicator in the file I guess\); but let's not go there yet\.

> I actually don't feel too strongly that ADL needs to be 7\-bit safe
> \(i\.e\. I would be happy with UTF\-8 as the default and leave it at that
> \- still including the \\uxxxx rules to allow the insertion of characters
> that are hard to \_edit\_, but assume UTF\-8 can be read/transported\)\.
> Is there any web/email transport mechanism in existence now that
> can't pass through an 8\-bit stream untouched? Even moreso, is there
> any modern environment that can't parse UTF\-8?? \(keeping in mind
> that this is not saying that openEHR systems won't have to exchange
> data with old legacy systems, but I doubt the openEHR system will be
> sending the legacy systems ADL files to parse??\)

well, Notepad and gvim on Windows don't get it right\.\.\.\.but that may just be display\.\.\.
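A first\-line encoding marker with a UTF\-8 default could be handled roughly as follows. This is a sketch under stated assumptions: the thread does not define any marker syntax, so the `encoding=NAME` form and the header line used in the tests are invented for illustration, and the function names are hypothetical. Only UTF\-8 and ISO\-8859\-1 are accepted, and a UTF\-16 byte order mark is rejected rather than handled.

```python
import codecs
import re

SUPPORTED = {"utf-8", "iso-8859-1"}
DEFAULT_ENCODING = "utf-8"  # assumed when the first line carries no marker

# Hypothetical marker syntax: "encoding=NAME" anywhere on the first line.
_MARKER = re.compile(r"encoding\s*=\s*([A-Za-z0-9_-]+)")


def detect_adl_encoding(data: bytes) -> str:
    """Return the encoding name declared on the first line, or the default."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("UTF-16 input is outside the scope of this sketch")
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    first_line = data.split(b"\n", 1)[0]
    # ASCII bytes mean the same thing in UTF-8 and ISO-8859-1, so the
    # marker can be read before the real encoding is known.
    header = first_line.decode("ascii", errors="replace")
    match = _MARKER.search(header)
    if match is None:
        return DEFAULT_ENCODING
    name = match.group(1).lower()
    if name not in SUPPORTED:
        raise ValueError(f"unsupported encoding: {name}")
    return name


def read_adl_text(data: bytes) -> str:
    """Decode the whole file using the detected encoding."""
    encoding = detect_adl_encoding(data)
    # "utf-8-sig" also drops a leading UTF-8 byte order mark if present.
    codec = "utf-8-sig" if encoding == "utf-8" else encoding
    return data.decode(codec)
```

Because the marker is optional and the default is UTF\-8, files written before such a field existed would still be read as UTF\-8, which also covers plain ASCII files.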
\- thomas

---

## Post #7 by @system

Thomas Beale wrote:

> Andrew Patterson wrote:
>
>>> \- use the \\uNNNN approach Andrew suggests \(is this hex or decimal?\)
>>
>> This is hexadecimal \(as per the unicode spec for unicode codepoints\)\.
>> C\# and Java use this notation \- C\# extends it to also have \\UXXXXXXXX
>> for 32 bit codepoints \(as per the new unicode versions\)
>
> One of Andrew's issues I don't think matters too much \- the quoting of
> &; this is because \&amp; is used to quote the '&' character\. However, I
> agree that using a more uniform '\\'\-based quoting approach will be
> clearer, and make for easier parser construction\. So let's say that we
> will go for the \\uHHHH \\UHHHHHHHH approach\.
>
> Onto more important issues\. When do we use real unicode, and when do we
> use ASCII files containing quoted unicode? Currently we have made real
> unicode work in the ADL workbench and Archetype Editor, and I would not
> anticipate any problems in the Java Archetype tools\. So it is most

The Java ADL parser currently uses UTF\-8 as the encoding for parsing\. There are plans to support more encodings later, which is quite easy to do in Java\.

> likely a question not of archetype tools, but of sharing ADL files\. With
> no unicode, and assuming latin\-1 based languages, ADL files are \(as far
> as I can tell\) safe to transport as text files\. However, even for
> languages like Turkish \(which has an odd situation to do with upper and
> lower case\), these files get broken, and unicode is needed; but then an
> ADL file is no longer a "text file" from the point of view of file
> sharing, mime\-type and so on\. We have not defined a mime\-type, but it
> would be one of the application ones I guess\.
>
> One problem is that a person receiving an ADL file under the quoting
> proposal here might be receiving:
>
> - a "safe" text file with only ASCII / latin\-1 alphabet characters in it
> \("real" ascii\)
> - a "safe" text file with quoted unicode, that is in fact an archetype
> written in say Turkish, Farsi, Chinese etc
> - a binary file containing UTF\-8 unicode characters, that will look like
> a text file with some funny characters in it depending on how smart your
> editor is\.\.\.
> - or\.\.\.\.a UTF\-8 encoded file that also contained \\uHHHH encoded
> characters \(due to cut and paste in some editor environment\)
>
> There seem to be a couple of ways of dealing with this:
>
> - include an "encoding" attribute at the top of ADL files, indicating
> how to read the file

I like the idea of including an "encoding" attribute in the ADL, probably in the archetype header section\. It's also good to keep the encoding information in the archetype \(in AOM form\) so that the ADL serializer can use the right encoding for output\.

> - create a new file extension and specify that \.adl is for UTF\-8 encoded
> files, and that \(say\) \.uadl is for ascii encoded files containing
> unicode quoting\.\.\.

This doesn't seem as flexible as the first one\. It seems that we would need to create a new file extension each time we support a new encoding\. It's good to keep all the meta data about the archetype, including the encoding, in the header section\.

> The first is the more obvious thing to do, since it is what XML, HTML
> and probably other formats \(RTF?\) do; this is easy to add to ADL
> archetypes as a field\. It would have to be an optional field, so that
> all current ADL files are not invalidated\.
> This means we a\) have to
> choose the allowable encoding names \(UTF\-8 is the default in openEHR for
> true unicode; the other will presumably be ISO\-8859\-1\); we then need to
> specify which encoding is assumed for an ADL file with no encoding
> marker; I propose that it is UTF\-8, since we already have "cracked" that
> problem, and we say that it is only ISO\-8859\-1 if it actually says so\.

Agree\. This is supported in the Java ADL parser\.

Cheers,
Rong

---

**Canonical:** https://discourse.openehr.org/t/questions-about-string-literals/14567
**Original content:** https://discourse.openehr.org/t/questions-about-string-literals/14567