loss of type information in ID classes

Andrew_Patterson · 7 February 2007 04:52

I am looking at a round trip of information
through an XML serialization. Normally, type
information can be retained in an xml
serializer through the use of xsi:type
(so where an attribute is defined as type
  PARTY_PROXY, we can assign an object
  of type PARTY_SELF to that attribute
  and do a round trip serialization and it will
  still be PARTY_SELF).
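
As an illustration of the round trip described above, here is a minimal Python sketch. The element name `subject`, the `CLASSES` lookup and the two stub classes are assumptions made for the example; they are not taken from the openEHR XML schema.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)
XSI_TYPE = f"{{{XSI}}}type"  # ElementTree's "{namespace}local" attribute key


class PartyProxy:
    """Stand-in for the declared (static) type of the attribute."""


class PartySelf(PartyProxy):
    """Stand-in for the more specific runtime type."""


# Hypothetical lookup between xsi:type names and classes.
CLASSES = {"PARTY_PROXY": PartyProxy, "PARTY_SELF": PartySelf}
NAMES = {cls: name for name, cls in CLASSES.items()}


def serialize(obj, tag="subject"):
    # Record the runtime type in xsi:type so it survives the trip.
    elem = ET.Element(tag, {XSI_TYPE: NAMES[type(obj)]})
    return ET.tostring(elem, encoding="unicode")


def deserialize(xml_text):
    elem = ET.fromstring(xml_text)
    # Fall back to the declared type when no xsi:type is present.
    name = elem.get(XSI_TYPE, "PARTY_PROXY")
    return CLASSES[name]()


restored = deserialize(serialize(PartySelf()))
print(type(restored).__name__)
```

Without the `xsi:type` attribute the deserializer can only fall back to the declared type, which is the situation the rest of this post describes for UID values embedded in a string.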

The OBJECT_ID hierarchy defines an
abstract root class with a single attribute
called 'value' that is a string. It then defines
other types of ID's such as VERSION_TREE_ID,
ARCHETYPE_ID etc that vary only in their
interpretation of the 'value' attribute. This is fine
because the typing information (i.e. which ID
we are dealing with) is retained in serialization.

However, there is also a UID hierarchy that
defines classes such as INTERNET_ID
(com.microsoft.vista) and UUID (ae435abc-3424-...).
It is objects of this hierarchy that VERSION_TREE_ID,
ARCHETYPE_ID etc are supposed to return in
interpreting their 'value'.

For example, an OBJECT_VERSION_ID could be
created with 'value' set to

F7C5C7B7-75DB-4b39-9A1E-C0BA9BFDBDEC::
87284370-2D4B-4e3d-A3F3-F303D2F4F34B::
2

The function 'creating_system_id' on this type is
defined to return a UID from this value. However,
without some magic, it has no way of knowing that
87284370-2D4B-4e3d.. is of type UUID and not an
ISO_OID or INTERNET_ID.
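
To make the problem concrete, here is a minimal Python sketch that pulls the example value apart. The function name `split_object_version_id` and the names given to the first and third segments are assumptions for illustration; only `creating_system_id` is named in this post.

```python
def split_object_version_id(value: str):
    """Split a '::'-separated OBJECT_VERSION_ID value into its three segments."""
    parts = value.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '::'-separated segments, got {len(parts)}")
    return tuple(parts)


value = (
    "F7C5C7B7-75DB-4b39-9A1E-C0BA9BFDBDEC::"
    "87284370-2D4B-4e3d-A3F3-F303D2F4F34B::"
    "2"
)
object_id, creating_system_id, version_tree_id = split_object_version_id(value)

# The middle segment comes back as a plain string; nothing in the
# serialized value says whether it is a UUID, an ISO_OID or an INTERNET_ID.
print(type(creating_system_id).__name__, creating_system_id)
```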

The choices I see are:
a) is a magic routine needed here that can correctly
    guess the type based on the string value? (which is probably not that
    hard to write - there aren't that many UID types)
b) should systems not need to know the types in these
    cases and therefore the loss of typing is not a problem
   (though if this is the case, UID needs to be changed to
    a concrete class rather than an abstract one)
c) is this a problem for the XML ITS rather than openehr in
    general? (the xml serialization could be augmented with extra
    typing info)

thoughts?

Andrew

thomas.beale · 7 February 2007 13:55

Andrew Patterson wrote:

I am looking at a round trip of information
through an XML serialization. Normally, type
information can be retained in an xml
serializer through the use of xsi:type
(so where an attribute is defined as type
  PARTY_PROXY, we can assign an object
  of type PARTY_SELF to that attribute
  and do a round trip serialization and it will
  still be PARTY_SELF).

The OBJECT_ID hierarchy defines an
abstract root class with a single attribute
called 'value' that is a string. It then defines
other types of ID's such as VERSION_TREE_ID,
ARCHETYPE_ID etc that vary only in their
interpretation of the 'value' attribute. This is fine
because the typing information (i.e. which ID
we are dealing with) is retained in serialization.

However, there is also a UID hierarchy that
defines classes such as INTERNET_ID
(com.microsoft.vista) and UUID (ae435abc-3424-...).
It is objects of this hierarchy that VERSION_TREE_ID,
ARCHETYPE_ID etc are supposed to return in
interpreting their 'value'.

For example, an OBJECT_VERSION_ID could be
created with 'value' set to

F7C5C7B7-75DB-4b39-9A1E-C0BA9BFDBDEC::
87284370-2D4B-4e3d-A3F3-F303D2F4F34B::
2

The function 'creating_system_id' on this type is
defined to return a UID from this value. However,
without some magic, it has no way of knowing that
87284370-2D4B-4e3d.. is of type UUID and not an
ISO_OID or INTERNET_ID.

The choices I see are:
a) is a magic routine needed here that can correctly
    guess the type based on the string value? (which is probably not that
    hard to write - there aren't that many UID types)

this is what I have envisaged; as far as I know:

* Guids/uuids always follow the same pattern (I don’t have it to hand, but everyone knows it): a fixed number of segments of hexadecimal digits and only ‘-’ separators
* ISO oids only have ‘.’ separators and numeric segments
* domain names have ‘.’ separators and only alpha-numeric segments, and the top-level domain names are all alphabetic names (no numerics). I don’t believe there is any real danger of software getting an oid and a domain name confused, although I have not read all the relevant rules on this
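
The three rules above can be sketched as regular expressions. This is a rough Python illustration only: the function name `guess_uid_kind` is hypothetical, the 8-4-4-4-12 layout is the usual textual form of a UUID, and the domain-name pattern is deliberately simplified (it ignores hyphens, which real host names allow).

```python
import re
from typing import Optional

# 8-4-4-4-12 hexadecimal digits separated by '-'.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
# Two or more purely numeric segments separated by '.'.
OID_RE = re.compile(r"\d+(\.\d+)+")
# Two or more alphanumeric segments separated by '.'.
# All-numeric strings never get here because OID_RE is tried first.
INTERNET_ID_RE = re.compile(r"[A-Za-z0-9]+(\.[A-Za-z0-9]+)+")


def guess_uid_kind(s: str) -> Optional[str]:
    """Return 'UUID', 'ISO_OID' or 'INTERNET_ID', or None if nothing matches."""
    if UUID_RE.fullmatch(s):
        return "UUID"
    if OID_RE.fullmatch(s):
        return "ISO_OID"
    if INTERNET_ID_RE.fullmatch(s):
        return "INTERNET_ID"
    return None


for candidate in ("87284370-2D4B-4e3d-A3F3-F303D2F4F34B",
                  "1.2.840.113619",
                  "com.microsoft.vista"):
    print(candidate, "->", guess_uid_kind(candidate))
```

The order of the tests matters: a string of numeric segments also satisfies the looser alphanumeric pattern, so the OID test has to run before the domain-name test.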

b) should systems not need to know the types in these
    cases and therefore the loss of typing is not a problem
   (though if this is the case, UID needs to be changed to
    a concrete class rather than an abstract one)

this is also realistic, since the use of each kind of identifier is likely to be systematic in wide jurisdictions, e.g. whole of Australia, whole of UK. In fact, the current openEHR model is a bit of a cop-out since there are few if any standards for what kind of ids will be used in e-health networks around the world; this may change over the next few years, in which case openEHR may adjust.

Note that an openEHR EHR has an EHR_STATUS object which might be a smart place to put some kind of clues about identifier use in the EHR. We may also need some kind of application profile object to hold meta-information.

c) is this a problem for the XML ITS rather than openehr in
    general? (the xml serialization could be augmented with extra
    typing info)

although I am not an XML specialist, my general feeling is to use XML (particularly XSD) in pretty much its expected way; being too esoteric gets one into trouble. Since these fields are string fields with typing only visible in the object model, I guess completely orthodox XSD won’t help (and it doesn’t - see our current XSD - http://svn.openehr.org/specification/BRANCHES/Release-1.1-candidate/publishing/its/XML-schema/documentation/BaseTypes.xsd.html_h1689213245.html).

But I don’t know if this matters. I think it is reasonable to assume that openEHR XML will always be processed by software that knows at least the reference model classes it is expecting, so it will always do the right thing. Magic guessing or a priori site-based or EHR-based clues will still be required based on syntax in that case, which I think is safe enough if my statements above are correct.

thomas

Andrew_Patterson · 8 February 2007 03:57

a) is a magic routine needed here that can correctly
guess the type based on the string value? (which is probably not that
hard to write - there aren't that many UID types)

this is what I have envisaged; as far as I know:

Guids/uuids always follow the same pattern (I don't have it to hand, but
everyone knows it) - fixed number of segments of hexadecimal digits and only
'-' separators
ISO oids only have '.' separators and numeric segments
domain names have '.' separators and only alpha-numeric segments, and the
top-level domain names are all alphabetic names (no numerics). I don't
believe there is any real danger of software getting an oid and a domain
name confused, although I have not read all the relevant rules on this

ok - this makes sense.

Any thoughts on UID as an abstract class/vs concrete? In a system that
doesn't care about the type of UID, a receiving system still has to make
a choice as to the type of the UID just to construct one (because UID
is declared abstract). I can't see any downside to allowing UID to be
a concrete instantiable class..

Andrew

thomas.beale · 28 February 2007 14:35

Andrew Patterson wrote:

a) is a magic routine needed here that can correctly
guess the type based on the string value? (which is probably not that
hard to write - there aren't that many UID types)

this is what I have envisaged; as far as I know:

Guids/uuids always follow the same pattern (I don't have it to hand, but
everyone knows it) - fixed number of segments of hexadecimal digits and only
'-' separators
ISO oids only have '.' separators and numeric segments
domain names have '.' separators and only alpha-numeric segments, and the
top-level domain names are all alphabetic names (no numerics). I don't
believe there is any real danger of software getting an oid and a domain
name confused, although I have not read all the relevant rules on this

ok - this makes sense.

Any thoughts on UID as an abstract class/vs concrete? In a system that
doesn't care about the type of UID, a receiving system still has to make
a choice as to the type of the UID just to construct one (because UID
is declared abstract). I can't see any downside to allowing UID to be
a concrete instantiable class..

Andrew,
sorry to take so long to respond. Not sure what you want to achieve
here: even if the UID type were concrete, you still have to instantiate
it following some particular model of an id; all that we have done is
limit those to GUID, Oid and Internet name. Are you suggesting 'anything
goes' is preferable?

- thomas

Andrew_Patterson · 28 February 2007 23:45

sorry to take so long to respond. Not sure what you want to achieve
here: even if the UID type were concrete, you still have to instantiate
it following some particular model of an id; all that we have done is
limit those to GUID, Oid and Internet name. Are you suggesting 'anything
goes' is preferable?

I'm suggesting you could instantiate it with no constraints on
the model of the id - this may be what a receiving system needs
to do because it had no knowledge of the actual correct model
(given that the information has been lost in transit and it may not
want to guess). Otherwise, even though the receiving system has
no information to base it on, you force it to _choose_ one of the
concrete instantiations even though the system may want to deal
with it using only the semantics of a UID i.e. the sending system
has given me a unique identifier - I don't care how it came up with
the identifier string or what format it is I just want to use it.

Andrew

thomas.beale · 1 March 2007 13:31

Andrew Patterson wrote:

sorry to take so long to respond. Not sure what you want to achieve
here: even if the UID type were concrete, you still have to instantiate
it following some particular model of an id; all that we have done is
limit those to GUID, Oid and Internet name. Are you suggesting 'anything
goes' is preferable?

I'm suggesting you could instantiate it with no constraints on
the model of the id - this may be what a receiving system needs
to do because it had no knowledge of the actual correct model
(given that the information has been lost in transit and it may not
want to guess). Otherwise, even though the receiving system has
no information to base it on, you force it to _choose_ one of the
concrete instantiations even though the system may want to deal
with it using only the semantics of a UID i.e. the sending system
has given me a unique identifier - I don't care how it came up with
the identifier string or what format it is I just want to use it.

But you don't need to have a concrete class to do that - you can
statically declare a reference to be of an abstract type, and simply
access whatever features are defined in that class - which is just the
attribute 'value' in the case of UID. So although your variable (say
my_uid) will actually be attached to a UUID object, it will be of type
UID, and will act accordingly.
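
A minimal Python sketch of this point follows. The abstract method `kind` is a hypothetical placeholder added only so that `UID` cannot be instantiated directly; it is not part of the openEHR model.

```python
from abc import ABC, abstractmethod


class UID(ABC):
    """Abstract type: clients see only 'value'."""

    def __init__(self, value: str):
        self.value = value

    @abstractmethod
    def kind(self) -> str:
        """Hypothetical abstract feature; keeps UID itself uninstantiable."""


class UUID(UID):
    def kind(self) -> str:
        return "UUID"


def describe(my_uid: UID) -> str:
    # Uses only what UID declares; the concrete subtype is irrelevant here.
    return f"id={my_uid.value}"


# Declared as UID, attached to a UUID object.
my_uid: UID = UUID("87284370-2D4B-4e3d-A3F3-F303D2F4F34B")
print(describe(my_uid))

try:
    UID("anything")
except TypeError as err:
    print("cannot instantiate abstract UID:", err)
```

The last lines show the other side of the discussion: some concrete subtype still has to be chosen at the moment the object is created.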

- thomas

Andrew_Patterson · 1 March 2007 14:12

But you don't need to have a concrete class to do that - you can
statically declare a reference to be of an abstract type, and simply
access whatever features are defined in that class - which is just the
attribute 'value' in the case of UID. So although your variable (say
my_uid) will actually be attached to a UUID object, it will be of type
UID, and will act accordingly.

Yes, I understand once you have a live object graph in
whatever environment, it is not important then what
the actual object type is. My specific use case is at the
boundary to a system accepting XML RM objects - the
deserializer needs to construct in memory a concrete
instance of a UID class, based purely on the XML structure
it is presented with. Given that the typing information has
been lost in the serialization _to_ XML, it is forced to
guess at the proper concrete class to instantiate based on a
magic algorithm. I don't have a problem with that, but thought
it might also be useful if it could alternatively say "hey, I
don't really care whether I was sent a UUID etc, I'll just
instantiate this concrete UID class with the relevant value
and carry on".

Andrew

thomas.beale · 1 March 2007 14:42

Andrew Patterson wrote:

But you don't need to have a concrete class to do that - you can
statically declare a reference to be of an abstract type, and simply
access whatever features are defined in that class - which is just the
attribute 'value' in the case of UID. So although your variable (say
my_uid) will actually be attached to a UUID object, it will be of type
UID, and will act accordingly.

Yes, I understand once you have a live object graph in
whatever environment, it is not important then what
the actual object type is. My specific use case is at the
boundary to a system accepting XML RM objects - the
deserializer needs to construct in memory a concrete
instance of a UID class, based purely on the XML structure
it is presented with. Given that the typing information has
been lost in the serialization _to_ XML, it is forced to
guess at the proper concrete class to instantiate based on a
magic algorithm. I don't have a problem with that, but thought
it might also be useful if it could alternatively say "hey, I
don't really care whether I was sent a UUID etc, I'll just
instantiate this concrete UID class with the relevant value
and carry on".

Sorry, my mistake - I had forgotten that this was the root question you
were asking in the original mail. My natural response is: why should the
XML tail wag the dog? It doesn't usually do software any good....but
thinking practically...the real problem is that we are using 'efficient'
XSD as shown in
http://svn.openehr.org/specification/BRANCHES/Release-1.1-candidate/publishing/its/XML-schema/documentation/BaseTypes.xsd.html_h619733846.html
- this is space-efficient, but loses typing at the leaf level as you
say. Solutions seem to be:

I am still not convinced this is a problem however; the deserialiser
will deserialise an entire ObjectId in one go, and for that it does have
typing information. So let's say a HIER_OBJECT_ID is found; the string
value is given to a constructor of the HIER_OBJECT_ID class which pulls
it apart, according to the syntax of that class (see online spec
http://svn.openehr.org/specification/BRANCHES/Release-1.1-candidate/publishing/architecture/rm/support_im.pdf
for details). The constructor will have to figure out what to do with
what it sees - in root. I don't see how it could help for it to
instantiate a UID when it is likely to have to know what it has - it
will have to use the magic code to determine what it is looking at and
create the right thing. At least that magic code will be shared across
all OBJECT_ID subtypes (or more).

- thomas

Heath_Frankel2 · 1 March 2007 14:47

Tom,
I can see what Andrew is saying here. We either need to have some fancy
logic to determine which sub-class of UID to construct, make UID concrete or
just treat the attributes of type UID as strings. I guess the patterns for
the 3 sub-types are reasonably well known and different enough to
determine which sub-class to create in a UID factory.

The alternative is to change OBJECT_ID to not have a value attribute and
specify the more precise attributes in the sub-classes so that the UID type
can be provided in the XML. However this will make the XML more verbose.

I really wonder what is the value of having the UID subtypes at all apart
from pattern validation?

Heath

thomas.beale · 1 March 2007 21:46

Heath Frankel wrote:

Tom,
I can see what Andrew is saying here. We either need to have some fancy
logic to determine which sub-class of UID to construct, make UID concrete or
just treat the attributes of type UID as strings. I guess the patterns for
the 3 sub-types are reasonably well known and different enough to
determine which sub-class to create in a UID factory.

The alternative is to change OBJECT_ID to not have a value attribute and
specify the more precise attributes in the sub-classes so that the UID type
can be provided in the XML. However this will make the XML more verbose.

well, we don't want to do this - the whole idea is that we have
efficient representation without losing semantics. The real problem is
that XML is not good at doing these two together, and that's where the
problem has to be solved in my view. The object model is 'right' in its
own terms.

I really wonder what is the value of having the UID subtypes at all apart
from pattern validation?

well at some point in the system, the logic is going to need to be able
to create a new Guid, dereference an Oid etc. The alternative would be
to have a type UID that was concrete, just having a String value, and a
bunch of functions of the form is_uuid, is_iso_oid, is_internet_id etc.
This is a hack, and is non-extensible (since adding a new subtype means
changing and re-deploying the UID class), whereas the current solution
is extensible (just add a new subclass).

Currently the only solution I can see that doesn't break the object
model (and remember, XML is just one serialised form - maybe the hype
will be over in 5 years time and we can get on using something that
actually works;-) - is to use string based pattern matching as
discussed earlier in the thread. It seems solid to me. If we don't do
that then we get this:

* system A has an OBJECT_ID containing a UUID value
* the object network of the Composition gets serialised and sent to system B
* system B deserialises but all the OBJECT_ID.values end up just as
UIDs, not UUIDs, ISO_OIDs etc.

So we lose information. I don't see this as acceptable....

- thomas

Andrew_Patterson · 1 March 2007 23:23

well, we don't want to do this - the whole idea is that we have
efficient representation without losing semantics. The real problem is
that XML is not good at doing these two together, and that's where the
problem has to be solved in my view. The object model is 'right' in its
own terms.

I'm not sure that this is purely an XML problem - a Java implementation
will have to go to reasonably extraordinary lengths to internally
maintain the correct type information as well. From a typing point
of view, the OBJECT_ID hierarchy has the attributes and functions
the wrong way around (the problem is not so much in the UID
hierarchy, it's in the classes that reference the UID hierarchy)

OBJECT_ID
attribute value : string

UID_BASED_ID
function root : UID
function extension : string

The implementation of UID_BASED_ID has to duplicate the storage
of data, both setting the 'value' attribute to be xxxxx::yyyyy and
also maintaining the actual object reference for 'root' so that it can
be returned in the function call.

The XML serializer has a problem here because it has no way of
storing the 'meta' information of the UID type (which is why the
problem is most noticeable in XML).

I would suggest the correct model should be

OBJECT_ID
abstract function value : string

UID_BASED_ID
         attribute root : UID
         attribute extension : string
         redefine function value : string
            (to return the value of 'root' and 'extension' separated by '::')
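
A minimal Python sketch of the model proposed above, with `root` and `extension` stored as attributes and `value` derived from them. The class names are Python renderings of the model names, and leaving out the '::' when the extension is empty is an assumption of this sketch.

```python
from abc import ABC, abstractmethod


class UID:
    def __init__(self, value: str):
        self.value = value


class UUID(UID):
    pass


class ObjectId(ABC):
    @property
    @abstractmethod
    def value(self) -> str:
        """String form of the identifier."""


class UidBasedId(ObjectId):
    def __init__(self, root: UID, extension: str = ""):
        self.root = root            # stored as a typed object
        self.extension = extension  # stored as a plain string

    @property
    def value(self) -> str:
        # Derived on demand: root and extension separated by '::'.
        if self.extension:
            return f"{self.root.value}::{self.extension}"
        return self.root.value


oid = UidBasedId(UUID("87284370-2D4B-4e3d-A3F3-F303D2F4F34B"), "2")
print(oid.value)
print(type(oid.root).__name__)
```

In memory the root keeps its type. A serializer that writes only `value` would still drop it, so this layout helps only if `root` is serialized as its own typed element.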

Andrew

Andrew_Patterson · 1 March 2007 23:28

btw just looking at the draft for Support, and INTERNET_ID
seems to be interspersed with all the OBJECT_ID
types, when it is actually a UID subtype (so probably should
be moved earlier in the section - assuming all the UID types
are meant to be documented together)

Andrew

thomas.beale · 2 March 2007 03:10

Andrew Patterson wrote:

well, we don't want to do this - the whole idea is that we have
efficient representation without losing semantics. The real problem is
that XML is not good at doing these two together, and that's where the
problem has to be solved in my view. The object model is 'right' in its
own terms.

I'm not sure that this is purely an XML problem - a Java implementation
will have to go to reasonably extraordinary lengths to internally
maintain the correct type information as well. From a typing point
of view, the OBJECT_ID hierarchy has the attributes and functions
the wrong way around (the problem is not so much in the UID
hierarchy, it's in the classes that reference the UID hierarchy)

well, let's go back to the requirements. The design intent is to have
String identifiers that are efficient for storage and serialisation,
while being able to treat them (or subparts) as properly typed
artefacts. Doing it the other way round means that there is no way in
openEHR to treat ids as Strings - they are always multi-attribute items.
In XML this will make for a lot of unnecessary volume. So the choice we
made quite a long time ago was to use String representation and internal
parsing to access the bits and pieces - just like for the ISO 8601
date/time types. The current model does this - I wouldn't say it is the
wrong way round - it is just a different design decision than for
higher-level objects.

OBJECT_ID
      attribute value : string

UID_BASED_ID
      function root : UID
      function extension : string

The implementation of UID_BASED_ID has to duplicate the storage
of data, both setting the 'value' attribute to be xxxxx::yyyyy and
also maintaining the actual object reference for 'root' so that it can
be returned in the function call.

I must be missing something here; all it does in my implementation is
extract the piece before (or after for extension) the '::' when you call
the function.

The XML serializer has a problem here because it has no way of
storing the 'meta' information of the UID type (which is why the
problem is most noticeable in XML).

but all it has to do is inspect the string. I have a dirty bit of code
as follows:

    string_to_uid(s: STRING): UID is
            -- The identifier of the conceptual namespace in which the object exists,
            -- within the identification scheme. Returns the part to the left of the
            -- first '::' separator, if any, or else the whole string.
        require
            string_valid: s /= Void and then not s.is_empty
        do
            create {UUID} Result.default_create
            if Result.valid_id (s) then
                create {UUID} Result.make(s)
            else
                create {ISO_OID} Result.default_create
                if Result.valid_id (s) then
                    create {ISO_OID} Result.make(s)
                else
                    create {INTERNET_ID} Result.default_create
                    if Result.valid_id (s) then
                        create {INTERNET_ID} Result.make(s)
                    else
                        -- error
                    end
                end
            end
        end

(there are nicer ways to do this obviously).
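
One possible "nicer way" is a table-driven factory, sketched here in Python rather than Eiffel. The class names, the `UID_TYPES` list and the validation rules are assumptions made for illustration; the rules follow the syntax descriptions earlier in the thread and are not the normative ones.

```python
import uuid


class UID:
    def __init__(self, value: str):
        self.value = value

    @classmethod
    def valid_id(cls, s: str) -> bool:
        raise NotImplementedError


class Uuid(UID):
    @classmethod
    def valid_id(cls, s: str) -> bool:
        # Accept only the canonical 36-character hyphenated form.
        try:
            return len(s) == 36 and str(uuid.UUID(s)) == s.lower()
        except ValueError:
            return False


class IsoOid(UID):
    @classmethod
    def valid_id(cls, s: str) -> bool:
        segs = s.split(".")
        return len(segs) > 1 and all(g.isascii() and g.isdigit() for g in segs)


class InternetId(UID):
    @classmethod
    def valid_id(cls, s: str) -> bool:
        segs = s.split(".")
        return (len(segs) > 1
                and all(g.isascii() and g.isalnum() for g in segs)
                and any(g.isalpha() for g in segs))


# Tried in order; supporting a new UID subtype means adding one entry here.
UID_TYPES = [Uuid, IsoOid, InternetId]


def string_to_uid(s: str) -> UID:
    for uid_type in UID_TYPES:
        if uid_type.valid_id(s):
            return uid_type(s)
    raise ValueError(f"unrecognised UID syntax: {s!r}")


print(type(string_to_uid("1.2.840.113619")).__name__)
```

Unlike the nested version, this raises an error when no syntax matches instead of silently returning the last object created.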

I would suggest the correct model should be

OBJECT_ID
         abstract function value : string

UID_BASED_ID
         attribute root : UID
         attribute extension : string
         redefine function value : string
            (to return the value of 'root' and 'extension' separated by '::')

this is exactly what we are trying to avoid. But I don't have any
difficulty implementing it either so maybe there is a misunderstanding.

- thomas

Andrew_Patterson · 2 March 2007 03:45

artefacts. Doing it the other way round means that there is no way in
openEHR to treat ids as Strings - they are always multi-attribute items.

Well, for serialisation it might not be able to treat them as strings
but the abstract UID class could always have a 'value' function
that returns the data in string form.. other aspects of the system could
use that regardless of whether internally they were stored as
multi attribute items..

In XML this will make for a lot of unnecessary volume. So the choice we
made quite a long time ago was to use String representation and internal
parsing to access the bits and pieces - just like for the ISO 8601
date/time types. The current model does this - I wouldn't say it is the
wrong way round - it is just a different design decision than for
higher-level objects.

Fair enough - the result of this decision is that typing information is
lost - I think the trade-off needs to be documented explicitly in
the spec.

I must be missing something here; all it does in my implementation is
extract the piece before (or after for extension) the '::' when you call
the function.

If you're not worried about guessing at the UID type, this is
the way to do it..

but all it has to do is inspect the string. I have a dirty bit of code
as follows:

We seem to have come to the agreement then that
some form of string_to_uid() function is not just one
way of implementing an openehr system, but is actually
_required_ in any openehr system. I think some mention
of this should be in the section on UIDs.

Andrew

thomas.beale · 2 March 2007 05:27

Andrew Patterson wrote:

artefacts. Doing it the other way round means that there is no way in
openEHR to treat ids as Strings - they are always multi-attribute items.

Well, for serialisation it might not be able to treat them as strings
but the abstract UID class could always have a 'value' function
that returns the data in string form.. other aspects of the system could
use that regardless of whether internally they were stored as
multi attribute items..

In XML this will make for a lot of unnecessary volume. So the choice we
made quite a long time ago was to use String representation and internal
parsing to access the bits and pieces - just like for the ISO 8601
date/time types. The current model does this - I wouldn't say it is the
wrong way round - it is just a different design decision than for
higher-level objects.

Fair enough - the result of this decision is that typing information is
lost - I think the trade-off needs to be documented explicitly in
the spec.

There is another design reason for using Strings that I forgot to mention: it
allows identification schemes to change over time, without invalidating
existing data. This might happen with Archetype_ids, and we will need a
Template_id, which we have not defined yet - but if we follow the
current design approach, it won't matter - the Ids will just be strings
as stored.

We seem to have come to the agreement then that
some form of string_to_uid() function is not just one
way of implementing an openehr system, but is actually
_required_ in any openehr system. I think some mention
of this should be in the section on UIDs.

Sure - but I don't see this as controversial - it seems pretty minor.
But it is no problem to add some implementation notes.

- thomas

Andrew_Patterson · 2 March 2007 05:54

Sure - but I don't see this as controversial - it seems pretty minor.

Yes, no problems from me - I just like arguing..

Andrew

thomas.beale · 2 March 2007 15:50

Andrew Patterson wrote:

Sure - but I don't see this as controversial - it seems pretty minor.

Yes, no problems from me - I just like arguing..

remind me never to be in court with you ;-0 (unless you are defending me;-)

I will make some additions to the text in the Support IM identification
package around this thread and upload in the next day or two.

thanks for the input.

- thomas

Gerke_Geurts · 3 March 2007 23:10

Hello all,

It seems to me that various standard URN schemes can be used to unambiguously define unique identifiers in a single string, for example:

urn:uuid: scheme for UUIDs
urn:oid: scheme for OIDs

Using URIs in the XML documents is the XML way for the ‘fairly concise’ but extensible serialisation of identifiers and for that reason seems a feasible solution for openEHR XML serialisations.
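
A minimal Python sketch of how a receiver might read such prefixed identifiers. The function name `parse_uid_urn` and the returned kind labels are hypothetical; only the `urn:uuid:` and `urn:oid:` prefixes come from the schemes mentioned above.

```python
# Map of URN prefixes to the UID subtype they announce.
PREFIXES = {"urn:uuid:": "UUID", "urn:oid:": "ISO_OID"}


def parse_uid_urn(s: str):
    """Return (kind, bare_value) for a 'urn:uuid:' or 'urn:oid:' string."""
    lowered = s.lower()  # the 'urn:' prefix and namespace id are case-insensitive
    for prefix, kind in PREFIXES.items():
        if lowered.startswith(prefix):
            return kind, s[len(prefix):]
    raise ValueError(f"no recognised URN prefix in {s!r}")


print(parse_uid_urn("urn:uuid:87284370-2D4B-4e3d-A3F3-F303D2F4F34B"))
print(parse_uid_urn("urn:oid:1.2.840.113619"))
```

With the type carried in the prefix, the receiver no longer has to infer it from the shape of the string, at the cost of a few extra characters per identifier.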

Kind Regards,
Gerke Geurts.

thomas.beale · 4 March 2007 00:27

I have now uploaded a version of the Support IM that contains some
design & implementation paragraphs on the topic of this thread. See
http://svn.openehr.org/specification/BRANCHES/Release-1.1-candidate/publishing/architecture/rm/support_im.pdf

- thomas
