character sets and languages in openEHR

Hoylen Sue wrote:

Thomas Beale <thomas@deepthought.com.au> writes:

I wonder if this is true for people using openEHR-based components via
an API rather than communicating via data messages. I assume that the
Unicode implementation used in the String type in most of today's
languages makes it easy to determine what width of Unicode characters you
have in the data?
   
In all the cases I know of, once the data has been read into
the native string type, discovering its "width" is no longer
an issue. This is because the native string type is defined
to support only a single encoding. Data is converted into
that encoding when it is read in.
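
A minimal sketch of the point above, using Python only as an illustration: the same text stored in three different encodings has three different byte lengths, but once decoded into the native string type the copies are indistinguishable. The sample text and the `read_text` helper are assumptions made for the example, not anything defined by openEHR.

```python
# Sample text containing ASCII, Latin-1 and CJK characters: "naïve 血圧".
SAMPLE = "na\u00efve \u8840\u5727"

# The same text as three hypothetical systems might have stored it.
stored = {
    "utf-8": SAMPLE.encode("utf-8"),
    "utf-16-le": SAMPLE.encode("utf-16-le"),
    "utf-32-be": SAMPLE.encode("utf-32-be"),
}


def read_text(raw: bytes, encoding: str) -> str:
    """Convert stored bytes into the single native string representation."""
    return raw.decode(encoding)


# Byte length differs per encoding; the decoded strings do not.
byte_lengths = {name: len(raw) for name, raw in stored.items()}
decoded = {name: read_text(raw, name) for name, raw in stored.items()}
```

Note that `read_text` has to be told the encoding; the native string type only hides the "width" question after that conversion has happened.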

So if the data is stored in UTF-16, say, the library in a Unix application will detect the relevant byte pattern and dispatch the appropriate conversion routine (UTF-16 -> UTF-8, or UTF-32 -> UTF-8)? If I remember correctly, this works because the actual binary data reveals which width of encoding is being used. That matters, since there is no guarantee that the data is in XML form, with its encoding stated.

If the above is true, we still presumably need to mark the data as being Unicode version 2, 3, 4, etc.? This would only be necessary if the earlier versions were not pure subsets of the later ones. Can you clarify this, Hoylen?
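
As a rough illustration of the byte-pattern detection asked about above: one common mechanism is the byte order mark (BOM) that may precede Unicode text. The sketch below, in Python, checks for the standard BOM signatures and converts to UTF-8. It assumes a BOM is actually present; without one the encoding cannot be read off reliably and the caller has to supply it, which is what the hypothetical `fallback` parameter stands for. The function names are invented for the example.

```python
import codecs

# Check the 4-byte UTF-32 signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def to_utf8(raw: bytes, fallback: str = "utf-8") -> bytes:
    """Convert BOM-marked data to UTF-8; use `fallback` when there is no BOM."""
    encoding, skip = sniff_bom(raw)
    text = raw[skip:].decode(encoding or fallback)
    return text.encode("utf-8")
```

For example, `to_utf8(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))` yields the UTF-8 bytes for "hi". Data with no BOM falls through to the fallback, which is one argument for recording the encoding explicitly alongside the data.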

P.S. As an aside, Java 1.1 and later use the Unicode 2.0
character set. So if Unicode 3.0 or Unicode 4.0 is the
target character set, implementations may be forced to
implement their own string class rather than using the
native java.lang.String. This is something to consider when
picking which Unicode version to use as the standard character set.
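
To make the version concern concrete, here is a small sketch of how an implementation might check which Unicode version its runtime's character database covers before committing to a target version. It uses Python's standard `unicodedata` module purely as an example of such a check; the helper names are assumptions, and other platforms expose this information differently, if at all.

```python
import unicodedata


def runtime_unicode_version() -> tuple:
    """Unicode version of the runtime's character database, e.g. (15, 0, 0)."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def supports(required: str) -> bool:
    """True if the runtime's database is at least the required version."""
    needed = tuple(int(part) for part in required.split("."))
    return runtime_unicode_version() >= needed


def is_assigned(ch: str) -> bool:
    """True if the runtime's database knows this code point as assigned."""
    # General category "Cn" means "not assigned" in the runtime's database.
    return unicodedata.category(ch) != "Cn"
```

A string type can usually still carry code points its database does not know about; what is lost is character properties, case mapping, normalisation and similar processing for those code points.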

I would suggest that the requirements with respect to string representation are:

    * systems get to store their data in whatever is most convenient locally (e.g. whatever the dbms wants to use)
    * it must be possible for a 3rd party application which is openEHR compliant to read the data in a system, even if the data is not in that application's "preferred" form
       (vertical interoperability)
    * it must be possible for the data to be exported in a way that it can be universally read or transformed into a readable form for use in another system
       (horizontal interoperability)
    * the specifications commit implementors to as little as possible, while allowing the above requirements to be met.

Based on Hoylen's more recent info, a draft modelling solution seems to be:

1. openEHR states that all strings are in Unicode in its abstract specifications.

Question: if there are no "simple strings" at all in the data and everything is a Unicode string, is this safe?

2. that the following abstract model is used (improved from the version of a few weeks ago):
    ENTRY class has
        - a mandatory language attribute
        - a mandatory character encoding attribute (states which version and which flavour of Unicode).
            This forces the whole ENTRY to be encoded the same way no matter what,
            but also allows distinct ENTRYs to be encoded in e.g. Unicode 3.0/UTF-8, Unicode 4.0/UTF-16.

    DV_TEXT class has
        - an optional language attribute, which is understood to override the one from its enclosing ENTRY.
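
The draft model above could be sketched roughly as follows. This is only an illustration in Python: the class and attribute names (`Entry`, `DvText`, `unicode_version`, `encoding`, `effective_language`) are assumptions for the example rather than names from any openEHR specification, and the single "character encoding attribute" is split into two fields here for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DvText:
    value: str
    # Optional; when set, it overrides the language of the enclosing Entry.
    language: Optional[str] = None


@dataclass
class Entry:
    language: str          # mandatory, e.g. "en"
    unicode_version: str   # mandatory, e.g. "3.0"
    encoding: str          # mandatory, e.g. "UTF-8"; applies to the whole Entry
    items: List[DvText] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the "mandatory" part of the draft model.
        if not (self.language and self.unicode_version and self.encoding):
            raise ValueError("language, unicode_version and encoding are mandatory")

    def effective_language(self, item: DvText) -> str:
        """Language of a text item: its own if present, else the Entry's."""
        return item.language or self.language
```

With an Entry whose language is "en" holding one plain `DvText` and one marked "de", `effective_language` returns "en" for the first and "de" for the second, while both share the Entry's encoding.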

3. Implementation specifications like XML-schemas and software APIs are required to make the character
    encoding and Unicode version attributes visible, so that clients can process / convert the data properly.

Further comments to clean this up will be much appreciated.

- thomas