Suggestion wrt XML Archetypes & Templates

Dear All,

1) Problem statement
2) Solution
3) Points to Note
4) XSLT Sheet
5) Summary

1) Problem statement

I have been writing an OpenEHR publishing & QA routine for the NHS. It is basically an Ant build which runs a series of XSLT tasks.

There is a problem with the current structure of the XML archetypes & templates: values are held as a text() child of an element, and sometimes as the text() child of a value child of that element.

This is dangerous & (IMHO) wrong, for two reasons:
A) A single value of that sort should be contained in an attribute.
B) It leads to a world of pain wrt "pretty-print"/indentation.

As an example, XMLSpy will automatically pretty print XML because that makes it readable to the (human) reader. Equally, XSLT sheets often use indent="yes" in the output declaration:

    <xsl:output method="xml" version="1.0" encoding="utf-8"
        indent="yes" />

Firstly it means that what looks like
<rm_type_name>
                            ELEMENT
</rm_type_name>

is actually:

&#xA;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;ELEMENT&#xA;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;

As a really quick example of this, get an XML Archetype, open it in XMLSpy, press save, now open it in the Ocean Archetypes editor.

Admire the way the text is now all over the place and has empty square boxes for the line endings (i.e. &#xA;).

Now try and save as an ADL.

If you save as an XML, the formatting etc is retained. Basically open up an xml archetype in XMLSpy, click save and you have a corrupt archetype.

Before people decry pretty printing per se bear in mind that :

i) a single long string is not readable
ii) the ADL is pretty printed, i.e. your ADL files do not come as one long string but are formatted in much the same way as XML is pretty printed. The ADL copes with this in basically the same way I am going to suggest the XML should, i.e.

description = <"Clinical description of the meconium">
vs
description = Clinical description of the meconium
or
description =
                                Clinical description of the meconium

etc.

Any XSLT which tries to extract values from the present structure must resort to code such as:

       <!-- select="..." keeps the quote characters out of the variable values -->
       <xsl:variable name="tab" select="'&#x9;'" />
       <xsl:variable name="nl" select="'&#xA;'" />
       <xsl:variable name="v_rm_type_name_no_pp"
           select="translate(translate($v_rm_type_name/text(),$tab,''),$nl,'')" />

& that in itself is dangerous as some editor might put in some formatting chars which are not being filtered out.
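
The effect described above can be reproduced outside XSLT. The following is a minimal sketch in Python (standard library only), not part of the original routine; the two fragments are hypothetical and only mirror the rm_type_name example. It shows that indentation inside an element becomes part of the text value, while whitespace around an attribute does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the same value held as element text and as an attribute.
pretty_text = "<rm_type_name>\n\t\t\tELEMENT\n</rm_type_name>"
as_attribute = '<rm_type_name\n\t\t\tvalue="ELEMENT"\n/>'

text_value = ET.fromstring(pretty_text).text
attr_value = ET.fromstring(as_attribute).get("value")

print(repr(text_value))   # the indentation is part of the text node
print(repr(attr_value))   # whitespace between tag name and attribute is not data

# A consumer of the element form has to normalize before comparing.
normalized = " ".join(text_value.split())
print(text_value == "ELEMENT", normalized == "ELEMENT", attr_value == "ELEMENT")
```

Whether a given tool actually inserts such whitespace is a separate question; the sketch only shows what a reader sees once it is there.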

2) Solution:

Instead of using a text child, any value should go in a value attribute e.g.
<items id="description">
Clinical description of the meconium
</items>

becomes:

  <items value="Clinical description of the meconium" id="description"/>

3) Points to Note:

A) The result is actually closer to the adl e.g.

         <items code="at0061">
            <items value="Clinical description of the meconium" id="description"/>
            <items value="Description" id="text"/>
         </items>
         <items code="at0062">
            <items value="Colour of meconium" id="description"/>
            <items value="Colour" id="text"/>
         </items>

vs

                ["at0061"] = <
                    description = <"Clinical description of the meconium">
                    text = <"Description">
                >
                ["at0062"] = <
                    description = <"Colour of meconium">
                    text = <"Colour">
                >

B) The files are approximately two-thirds the size of the originals. This could be reduced further by using a smaller attribute name (e.g. val or even v).

C) The Archetypes are much more readable to the average human e.g.

<details>
         <language>
            <terminology_id value="ISO_639-1"/>
            <code_string value="en"/>
         </language>
         <purpose value="To describe body fluids and secretions"/>
         <use/>
         <misuse/>
</details>

vs:

<details>
        <language>
                <terminology_id>
                    <value>ISO_639-1</value>
                </terminology_id>
                <code_string>en</code_string>
            </language>
            <purpose>To describe body fluids and secretions</purpose>
            <use/>
            <misuse/>
</details>

or
     <occurrences>
         <lower_included value="true"/>
         <upper_included value="true"/>
         <lower_unbounded value="false"/>
         <upper_unbounded value="false"/>
         <lower value="1"/>
         <upper value="1"/>
      </occurrences>

vs:

        <occurrences>
            <lower_included>true</lower_included>
            <upper_included>true</upper_included>
            <lower_unbounded>false</lower_unbounded>
            <upper_unbounded>false</upper_unbounded>
            <lower>1</lower>
            <upper>1</upper>
        </occurrences>

4) XSLT Sheet

I have attached a mini-XSLT sheet which takes a template or XML Archetype & renders it into this format.

Run the XSLT with Saxon, as Xalan shows how fragile the current situation is: it picks up the "pretty-print" chars as text children & puts them in where there is no text child except the formatting chars.
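
The attached stylesheet is not reproduced in this thread. As a rough illustration of the kind of transformation being proposed, here is a sketch in Python rather than XSLT. The function name text_to_value and the sample document are assumptions made for the example, and this is not the author's setTextAsVal.xslt. It moves the text of leaf elements into a value attribute and drops whitespace-only text; it does not fold a nested value child into its parent.

```python
import xml.etree.ElementTree as ET


def text_to_value(elem):
    """Recursively move leaf-element text into a 'value' attribute (in place)."""
    children = list(elem)
    for child in children:
        text_to_value(child)
        if child.tail and not child.tail.strip():
            child.tail = None          # drop indentation between elements
    text = " ".join((elem.text or "").split())  # collapse pretty-print whitespace
    if not text:
        elem.text = None               # whitespace-only text is just formatting
    elif not children:
        elem.set("value", text)        # leaf with a real value: move it
        elem.text = None
    return elem


source = """<details>
    <language>
        <code_string>en</code_string>
    </language>
    <purpose>
        To describe body fluids and secretions
    </purpose>
    <use/>
</details>"""

root = text_to_value(ET.fromstring(source))
print(ET.tostring(root, encoding="unicode"))
```

Elements with mixed content (text plus child elements) are deliberately left alone in this sketch.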

5) Summary

i) The present situation/structure is dangerous.
ii) Pretty-print is the norm & even the ADL is pretty printed and has adopted a similar method to cope.
iii) The solution simplifies the XML in terms of both processing and human readability.
iv) The solution shrinks the file sizes.

Yours

Adam Flinton

(attachments)

setTextAsVal.xslt (1.55 KB)

Adam,

i) The present situation/structure is dangerous.

You need to get a better tool, Oxygen never splits an element value over
multiple lines or adds whitespace. A tool that automatically does this is
dangerous. I used to use XMLSpy and never experienced this, but after
hearing this I am glad I was convinced to move to Oxygen.

ii) Pretty-print is the norm & even the ADL is pretty printed and has adopted
a similar method to cope.

Sure, but the tool should never add whitespace to a value, that is not the
norm, it is simply wrong.

iii) The solution simplifies the XML in terms of both processing and human
readability.

I do not see this at all, in fact your solution breaks much processing which
is derived directly from the Archetype Model.

Microsoft uses a lot of XML documents in its products and many of them use
elements to contain values. In fact if you go to W3Schools you will see
the majority of examples using element values, and this is a resource
teaching the basics of XML.

iv) The solution shrinks the file sizes.

Turning an element value into an attribute with name value saves a very
minimal set of characters, I find it hard to see how you save a third. In
some cases you might save a third (such as lower_included) but in others your
solution actually increases the size. Take your example of lower and upper: a
start tag of 5 characters, add the angle brackets and you have 7 characters.
Using your solution, you have the attribute name of value, which is 5, plus 2
quotes, an equals sign and a space between the tag and the attribute,
totalling 9 characters.
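
The character counts can be checked directly. This is a small sketch in Python; the helper names element_form and attribute_form are made up for the illustration, and it counts whole elements (start tag, value, end tag) on a single line, ignoring indentation and any nested value wrapper.

```python
def element_form(name, value):
    # <name>value</name>
    return f"<{name}>{value}</{name}>"


def attribute_form(name, value):
    # <name value="value"/>
    return f'<{name} value="{value}"/>'


sizes = {}
for name, value in [("lower", "1"), ("lower_included", "true")]:
    e, a = element_form(name, value), attribute_form(name, value)
    sizes[name] = (len(e), len(a))
    print(f"{name}: element form {len(e)} chars, attribute form {len(a)} chars")
```

Under these assumptions the element form costs 2n + 5 characters plus the value and the attribute form n + 12 plus the value, where n is the length of the element name, so the attribute form is shorter only for names longer than 7 characters.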

In the case of occurrences (or DV_INTERVALs in general), I think we should
treat the unbounded and included properties as attributes because they
provide meta data about how to interpret the real data, lower and upper.
You will never utilise the unbounded and included values in isolation, they
are always used in conjunction with the lower and upper. So I would suggest
a change as follows:

        <occurrences>
            <lower included="true" >1</lower>
            <upper unbounded="true"/>
        </occurrences>

The included and unbounded attributes exist for both lower and upper with
default values of false. Due to the openEHR assertions, you will never need
more than 1 attribute on each element as included and unbounded cannot be
both true.

The thing is, if we start entertaining these kinds of changes we will end up
in endless debates based on the religious beliefs of XML style. Xml is just
another computer language, all computer professionals have different styles
when using those languages. There is no right and wrong style, just
guidelines, but these are usually employed for consistency purposes
assisting the readability, not that one style is more ready than another.
Currently, the schema is as consistent as you will ever get.

If anything is going to be changed, then the representation of INTERVAL is
probably the only candidate (there may be another one or two in similar
vein, meta data assisting in the interpretation of the value).

Regards

Heath

Dear Adam,

I totally understand the XML issues that you described in your previous
email. However, this problem doesn't exist if you use the oXygen XML editor. I
just downloaded Altova XMLSpy 2008, opened an archetype XML file with it,
pretty-printed it and saved it. I had no issues opening the saved XML in the
Ocean Archetype Editor (Release 1 candidate (1241)). Additionally, putting an
element's text value into an attribute value would make the XML file look very
ugly when the value is a long string, e.g. people can put a very long string
(100 or 200 words or even more) in the purpose, description, and use fields.

Regards,

Chunlan

Heath Frankel wrote:

Adam,

i) The present situation/structure is dangerous.
    
You need to get a better tool, Oxygen never splits an element value over
multiple lines or adds whitespace. A tool that automatically does this is
dangerous. I used to use XMLSpy and never experienced this, but after
hearing this I am glad I was convinced to move to Oxygen.

I like oxygen but

A) XMLSpy is our std tool
B) http://www.oxygenxml.com/xml_pretty_print.html
C) Anything doing pretty print (inc Oxygen) does the same things.

To quote from the oxygen xml page above:

"Although writing documents with no indentation is a perfectly
acceptable practice, it makes editing difficult and is error prone. It
also makes the identification of exact error positions difficult.
Formatting and Indenting, also called "Pretty Print", enables the XML
documents to be neatly arranged in a manner that is consistent and
promotes easier reading."

ii) Pretty-print is the norm & even the ADL is pretty printed and has adopted
a similar method to cope.

Sure, but the tool should never add whitespace to a value, that is not the
norm, it is simply wrong.

Not true.

See above wrt Oxygen XML's view. I can quote you the relevant sections
from the XML docs e.g.

http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace

iii) The solution simplifies the XML in terms of both processing and human
readability.
    

I do not see this at all, in fact your solution breaks much processing which
is derived directly from the Archetype Model.

Why is that?

Microsoft uses a lot of XML documents in its products and many of them use
elements to contain values. In fact if you go to W3Schools you will see
the majority of examples using element values, and this is a resource
teaching the basics of XML.

For instructional documents aimed at those learning XML it is nice and
simple.

If however you are looking to create a bullet proof serialization in XML
where the values matter then it is a poor design.

iv) The solution shrinks the file sizes.
    

Turning an element value into an attribute with name value saves a very
minimal set of characters, I find it hard to see how you save a third. In
some cases you might save a third (such as lower_included) but in others your
solution actually increases the size. Take your example of lower and upper: a
start tag of 5 characters, add the angle brackets and you have 7 characters.
Using your solution, you have the attribute name of value, which is 5, plus 2
quotes, an equals sign and a space between the tag and the attribute,
totalling 9 characters.

Run the XSLT on a set of files so as to get a reasonable average. I have
done so on the NHS ones; it is about two-thirds.

In the case of occurrences (or DV_INTERVALs in general), I think we should
treat the unbounded and included properties as attributes because they
provide meta data about how to interpret the real data, lower and upper.
You will never utilise the unbounded and included values in isolation, they
are always used in conjunction with the lower and upper. So I would suggest
a change as follows:

        <occurrences>
            <lower included="true" >1</lower>
            <upper unbounded="true"/>
        </occurrences>
  
How about in a template e.g.

<Items archetype_id="openEHR-EHR-CLUSTER.symptom.v2"
path="/data[at0001]/events[at0002]/data[at0003]/items[at0005]/items"
xsi:type="CLUSTER">

vs say in an archetype where the same thing would be shown as:

<archetype_id><value>openEHR-EHR-ACTION.procedure.v1draft</value></archetype_id>

So are templates wrong & archetypes right or vice versa?

The included and unbounded attributes exist for both lower and upper with
default values of false. Due to the openEHR assertions, you will never need
more than 1 attribute on each element as included and unbounded cannot be
both true.

The thing is, if we start entertaining these kinds of changes we will end up
in endless debates based on the religious beliefs of XML style.

This is not about style it's about safety.

I have been involved in many large scale XML projects. I have seen this
before & it ends up with ugly situations. You can not assume whitespace
will not be added as it is legitimate to pretty print a document.

If you are serious about a singular value it goes in an attribute.

Xml is just
another computer language, all computer professionals have different styles
when using those languages. There is no right and wrong style, just
guidelines, but these are usually employed for consistency purposes
assisting the readability, not that one style is more ready than another.
Currently, the schema is as consistent as you will ever get.

If anything is going to be changed, then the representation of INTERVAL is
probably the only candidate (there may be another one or two in similar
vein, meta data assisting in the interpretation of the value).

Regards

Heath

Then at each stage involving the use of Archetypes and templates you are
going to have to build in text normalization routines as per:

http://www.w3.org/TR/xpath#function-normalize-space

&

http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

"[Definition:] The *normalized value* of an element or attribute
information item is an ·initial value· whose white space, if any, has
been normalized according to the value of the whiteSpace facet of the
simple type definition used in its ·validation·:

*preserve*
   No normalization is done, the value is the ·normalized value·
*replace*
   All occurrences of |#x9| (tab), |#xA| (line feed) and |#xD|
   (carriage return) are replaced with |#x20| (space).
*collapse*
   Subsequent to the replacements specified above under *replace*,
   contiguous sequences of |#x20|s are collapsed to a single |#x20|,
   and initial and/or final |#x20|s are deleted."
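
As an illustration of the three facet values quoted above, here is a minimal sketch in Python. The function name normalize_whitespace is an assumption for this example; a schema validator does this internally and this is not any particular library's API.

```python
import re


def normalize_whitespace(value, facet):
    """Apply the XML Schema whiteSpace facet rules to a raw string value."""
    if facet == "preserve":
        return value                              # no normalization at all
    replaced = re.sub(r"[\t\n\r]", " ", value)    # #x9, #xA, #xD -> #x20
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # collapse runs of spaces, then delete leading and trailing spaces
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet: {facet!r}")


raw = "\n\t\tELEMENT\n\t"
for facet in ("preserve", "replace", "collapse"):
    print(facet, repr(normalize_whitespace(raw, facet)))
```

Only collapse recovers the bare value from pretty-printed text, and it does so at the cost of leading, trailing and repeated spaces.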

Also:

http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace

&

http://www.w3.org/TR/REC-xml/#NT-S

"2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters,
carriage returns, line feeds, or tabs.

White Space

[3] S ::= (#x20 | #x9 | #xD | #xA)+

*Note:*

The presence of #xD in the above production is maintained purely for
backward compatibility with the First Edition
<http://www.w3.org/TR/1998/REC-xml-19980210>. As explained in *2.11
End-of-Line Handling* <http://www.w3.org/TR/REC-xml/#sec-line-ends>, all
#xD characters literally present in an XML document are either removed
or replaced by #xA characters before any other processing is done. The
only way to get a #xD character to match this production is to use a
character reference in an entity value literal."

Adam

Note that that means that you would almost certainly have to specify
collapse & that no values could ever start or end with a space or
contain more than one contiguous space.

My question:

  • What is your justification for your statement?

"| Microsoft uses a lot of XML documents in its products and many of them use
elements to contain values. In fact if you go to W3Schools you will see
the majority of examples using element values, and this is a resource
teaching the basics of XML.

For instructional documents aimed at those learning XML it is nice and
simple. If however you are looking to create a bullet proof serialization in XML
where the values matter then it is a poor design."

In my opinion things can equally be expressed in attributes and in ‘elements’, as this is subject to (local) agreements.
Although CEN/tc251 published a report (CEN/tc251 TS 15211) some years ago in which they proposed to express data values as attributes, I have my doubts.
I think it is more correct to reserve attributes for meta-data about the data value in the ‘XML-element’:
attributes to express language, coding system, precision, etc.

Gerard Freriks


Note that that means that you would almost certainly have to specify
collapse & that no values could ever start or end with a space or
contain more than one contiguous space.

This is what is specified in the openehr schema for most of the elements
you are talking about.

For instance, all types derived from OBJECT_ID have a value element
with type xs:token - this has the whitespace facet set to collapse and is
why I presume XMLSpy is feeling free to pretty print using extra spaces
and tabs. In the case of where the type is xs:string, I'm guessing XMLSpy
wouldn't add extra leading or trailing spaces (as that would be clearly
changing the meaning of any element with whitespace set to
preserve).

I don't use oxygen or xmlspy so I can't really test this out.

Rather than the question of whether to use attributes or elements, my
question is - do we have xs:token or xs:string set correctly as
the type in the schema for all elements - I'm not sure anyone has really
gone through and systematically determined what the whitespace
ramifications are for each element (should LOCATABLE_REF/path be
a string or a token for instance? - without having the formal spec here
with me I would think that a path would normally ignore leading and
trailing spaces so perhaps it should be a token)

Andrew

Adam Flinton wrote:

To quote from the oxygen xml page above:

"Although writing documents with no indentation is a perfectly
acceptable practice, it makes editing difficult and is error prone. It
also makes the identification of exact error positions difficult.
Formatting and Indenting, also called "Pretty Print", enables the XML
documents to be neatly arranged in a manner that is consistent and
promotes easier reading."
  

but no-one is advocating creating documents with no whitespace,
particularly, although many tools do, since the XML is intended for
consumption by computers, not people. But whitespace between Elements is
not the same as white space in an Element value.

Sure, but the tool should never add whitespace to a value, that is not the
norm, it is simply wrong.

Not true.

See above wrt Oxygen XML's view. I can quote you the relevant sections
from the XML docs e.g.

http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace
  

well what this tells me is that if the whitespace facet of the type in a
schema is set to 'preserve' then the whitespace is not changed. What
happens to whitespace _between_ Elements doesn't matter too much (i.e.
between tag end and new tag start), since this is just a question of
indented formatting. What the debate here is about, as far as I
understand, is about whitespace within textual Element values - which
should of course be preserved, else XML can't be used to send normal
documentary text around.

If however you are looking to create a bullet proof serialization in XML
where the values matter then it is a poor design.
  

well - let's have some evidence of that. If it is true, then change
needs to be considered. But let's have the hard evidence first.

- thomas beale

Just for info, I have the latest version of XMLSpy 2008 and cannot reproduce the problem with pretty-printing adding whitespace to element values. Although XMLSpy rather nicely word-breaks long text lines and indents appropriately, none of this whitespace appears to be saved.

Incidentally, I personally find element-preponderant XML easier to read than the attribute-laden equivalent. Chacun à son goût!

Ian

Chunlan Ma wrote:

Dear Adam,

I totally understand the XML issues that you described in your previous
email. However, this problem doesn't exist if you use the oXygen XML editor. I
just downloaded Altova XMLSpy 2008, opened an archetype XML file with it,
pretty-printed it and saved it. I had no issues opening the saved XML in the
Ocean Archetype Editor (Release 1 candidate (1241)). Additionally, putting an
element's text value into an attribute value would make the XML file look very
ugly when the value is a long string, e.g. people can put a very long string
(100 or 200 words or even more) in the purpose, description, and use fields.

Regards,

Chunlan

A) If you don't care about the layout of the text e.g. the introduction
of line endings & tabs etc then wrt text that is true.
If you do care about the layout then it would be best to have a <text>
</text> child which contains a markup/layout dialect such as XHTML.
B) The ADL is pretty printed & deals with this by in effect using the
same markup as an XML attribute:
e.g.:

               ["at0002"] = <
                   description = <"*">
                   text = <"Procedure started date time">

C) Yes Oxygen can/does do pretty print. It is a std part of XML & has
been since before 1.0.

e.g.

http://www.oxygenxml.com/xml_pretty_print.html

Adam

Gerard Freriks wrote:

My question:

- What is your justification for your statement?

a) Safety
b) Efficiency
c) Best practice

"| Microsoft uses a lot of XML documents in its products and many of them use
elements to contain values. In fact if you go to W3Schools you will see
the majority of examples using element values, and this is a resource
teaching the basics of XML.

For instructional documents aimed at those learning XML it is nice and
simple. If however you are looking to create a bullet proof
serialization in XML where the values matter then it is a poor design."

In my opinion things can equally be expressed in attributes and in
'elements', as this is subject to (local) agreements.
Although CEN/tc251 published a report (CEN/tc251 TS 15211) some
years ago in which they proposed to express data values as attributes,
I have my doubts.
I think it is more correct to reserve attributes for meta-data
about the data value in the 'XML-element':
attributes to express language, coding system, precision, etc.

   <definition
archetype_id="openEHR-EHR-EVALUATION.check_list-condition-third_party.v1"
xsi:type="EVALUATION">
       <Rule name="Has anyone in your family had:"
path="/data[at0001]/items[at0004]"/>
       <Rule name="Diabetes" path="/data[at0001]/items[at0004 and
name/value='Question group']/items[at0002]"/>
   </definition>

???

From an OpenEHR Template.

Is this wrong? I would argue this is much better XML than that found in
the XML serialization of an Archetype.

Adam

Gerard Freriks wrote:

Thanks.

But I'm curious in:
Why?

Why is your solution more safe?

A) You are definitively bookending the string.

This is exactly the same as you do within the ADL e.g.

                ["at0002"] = <
                    description = <"*">
                    text = <"Procedure started date time">
                >

The adl above does not say:

description = *
text = Procedure started date time.

etc.

why is that?

& Would that be the same as:

description = *
text =

Procedure started date time.

?

B) Even worse is the fact that an XML element can contain many text
children even where it may look like there is just one. This can cause
all sorts of fun.

e.g.

http://www.informit.com/articles/article.aspx?p=31273&seqNum=12&rl=1

"The text of an element is considered *normalized* when it contains no
two adjacent Text nodes, as was shown above. In general, deserializing
an XML document into a DOM will yield normalized elements. However, when
new Text nodes are inserted into the hierarchy, one can wind up with a
denormalized element. While completely legal, various XML technologies
have a difficult time handling denormalized elements. XPath, for
example, depends on a normalized document tree structure to behave
properly. Performing an XPath traversal against a document with
denormalized elements would yield unexpected results. This can be
prevented using the Node.normalize method, which recursively normalizes
all ancestor Text nodes. Consider the following Java code:

import org.w3c.dom.*;

class AppendTextExample {
  static void appendText(Document doc, Node elem) {
    int nChildren = elem.getChildNodes().getLength();
    Text text1 = doc.createTextNode("hello ");
    // Declared as Text, not Node: splitText is only defined on Text.
    Text text2 = doc.createTextNode("world");
    elem.appendChild(text1);
    elem.appendChild(text2);
    text2.splitText(2);  // "world" becomes "wo" + "rld"
    assert elem.getChildNodes().getLength() == nChildren + 3;
    elem.normalize();    // merges adjacent Text nodes
    assert elem.getChildNodes().getLength() == nChildren + 1;
  }
}

As shown in Figure 2.12, after the call to Text.splitText, there are
three new Text node children. However, after the call to Node.normalize,
the three adjacent Text nodes are folded into a single node containing
the string "hello world"."
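
The same behaviour can be observed with Python's standard xml.dom.minidom, which implements the DOM calls used in the quoted Java. This is a sketch for illustration only; the empty elem document is an assumption.

```python
from xml.dom.minidom import parseString

doc = parseString("<elem/>")
elem = doc.documentElement

text1 = doc.createTextNode("hello ")
text2 = doc.createTextNode("world")
elem.appendChild(text1)
elem.appendChild(text2)
text2.splitText(2)                 # "world" becomes "wo" + "rld"

before = [node.data for node in elem.childNodes]
elem.normalize()                   # fold adjacent Text nodes together
after = [node.data for node in elem.childNodes]

print(before)
print(after)
```

Code that reads only the first text child of the denormalized element would see "hello " rather than the full value.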

Why is your solution more efficient?

A) File sizes are smaller/the XML is less verbose.
B) The fact that you know exactly where the string starts and finishes
means that using Sax etc can be much faster as there is no need to
normalize.
i.e. at present you would already have more verbose xml & then the only
safe option is to always normalize the whole document before processing it.
C) In most of the XML processing languages (e.g. XSLT or XPath) an XML
attribute value is addressed structurally, whereas a text value is
fetched via a function-style step.

e.g. compare /a/b/@c vs /a/b/text() or /a/b[@c="bob"] vs /a/b[text() =
"bob"]

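The difference between the two kinds of lookup shows up as soon as the text form has been pretty printed. Below is a minimal sketch using Python's xml.etree.ElementTree, which supports only a subset of XPath; the document is hypothetical and simply carries the same value in both forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "bob" as an attribute and as pretty-printed text.
doc = ET.fromstring('<a><b c="bob">\n\t\tbob\n\t</b></a>')

# Equivalent of /a/b[@c="bob"]
by_attribute = doc.findall("b[@c='bob']")

# Equivalent of /a/b[text()="bob"]: an exact comparison on the text child
by_text = [b for b in doc.findall("b") if b.text == "bob"]

# The text lookup only works once the value has been normalized
by_normalized_text = [b for b in doc.findall("b") if (b.text or "").strip() == "bob"]

print(len(by_attribute), len(by_text), len(by_normalized_text))
```
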
Why is your solution a better Best Practice?

In part for the reasons above.
In part because experiences of failures because of the ambiguities wrt
the text child in XML have driven people to be pretty careful about
using text unless you really need to.

If you want a single string containing a value which will not contain
child elements e.g.

Good use for text child:

some <strong>bold text</strong> in some documentation

Bad use for text child:

at003

Again I would refer you to your very own ADL, which in essence has
adopted the exact same solution to avoid any textual ambiguities, via
markup such as:

                ["at0030"] = <
                    description = <"*">
                    text = <"Material used">
                >
                ["at0031"] = <
                    description = <"*">
                    text = <"Procedure comments">
                >
                ["at0032"] = <
                    description = <"*">
                    text = <"Procedure comments">
                >
                ["at0033"] = <
                    description = <"*">
                    text = <"Procedure end date time">
                >

Were, say, the first element above to be rewritten, it could be seen as:

<at0030 description="*" text="Material used"/>

Adam

Ian McNicoll wrote:

Just for info, I have the latest version of XMLSpy 2008 and cannot
reproduce the problem with pretty-printing adding whitespace to
element values. Although XMLSpy rather nicely word-breaks long text
lines and indents appropriately, none of this whitespace appears to be
saved.

It can pretty print as that's a std part of XML e.g.

http://www.altova.com/manual2008/XMLSpy/spyenterprise/pretty_printxmltext.htm

Incidentally, I personally find element-preponderant XML easier to
read than the attribute-laden equivalent. Chacun à son goût!

It's not about human readability. It is about having to normalize every
Archetype & template prior to loading it.

e.g. right now the Ocean Archetype editor doesn't normalize & thus it
breaks.

Adam

Thomas Beale wrote:

Adam Flinton wrote:
  

To quote from the oxygen xml page above:

"Although writing documents with no indentation is a perfectly
acceptable practice, it makes editing difficult and is error prone. It
also makes the identification of exact error positions difficult.
Formatting and Indenting, also called "Pretty Print", enables the XML
documents to be neatly arranged in a manner that is consistent and
promotes easier reading."
  

but no-one is advocating creating documents with no whitespace,
particularly, although many tools do, since the XML is intended for
consumption by computers, not people. But whitespace between Elements is
not the same as white space in an Element value.
  
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#dt-whiteSpace

"*whiteSpace* is applicable to all ·atomic· and ·list· datatypes. For
all ·atomic· datatypes other than string (and types ·derived· by
·restriction· from it) the value of *whiteSpace* is |collapse| and
cannot be changed by a schema author; for string the value of
*whiteSpace* is |preserve|; for any type ·derived· by ·restriction·
from string the value of *whiteSpace* can be any of the three legal
values. For all datatypes ·derived· by ·list· the value of *whiteSpace*
is |collapse| and cannot be changed by a schema author. For all
datatypes ·derived· by ·union· *whiteSpace* does not apply directly;
however, the normalization behavior of ·union· types is controlled by
the value of *whiteSpace* on that one of the ·memberTypes· against
which the ·union· is successfully validated."

So for string it's fine to format the text for readability by
introducing tabs, linefeeds etc.

Sure, but the tool should never add whitespace to a value, that is not the
norm, it is simply wrong.

Not true.

See above wrt Oxygen XML's view. I can quote you the relevant sections
from the XML docs e.g.

http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace
  

well what this tells me is that if the whitespace facet of the type in a
schema is set to 'preserve' then the whitespace is not changed. What
happens to whitespace _between_ Elements doesn't matter too much (i.e.
between tag end and new tag start), since this is just a question of
indented formatting. What the debate here is about, as far as I
understand, is about whitespace within textual Element values - which
should of course be preserved, else XML can't be used to send normal
documentary text around.
  

Or it has to be normalized at every point it is read in.

That means (for example) that you can never have text starting with or
ending with a space & possibly you can never include tabs, linefeeds etc
in your text.

If however you are looking to create a bullet proof serialization in XML
where the values matter then it is a poor design.
  

well - let's have some evidence of that. If it is true, then change
needs to be considered. But let's have the hard evidence first.

Look at every other major standard for XML.

e.g. XMI:

  <eAnnotations xmi:id="_2IHQUQ3aEdy0fvloa5NWrg" source="uml2.diagrams"/>
  <ownedComment xmi:id="_ABfwQA3bEdy0fvloa5NWrg" body="Advanced Trace
allows an external system to identify a Patient based on an NHS Number
or a variety of search criteria including name, address and a date range
for either birth or death. Historic data may optionally also be
searched, and/or returned. Current and future dated data is always
returned. Multiple matching records may be returned, along with a
MatchingLevel (this will only be populated for an algorithmic search)
for each match, indicating the confidence of the match.&#xD;&#xA;See the
PDS [SSRS] for details of the responses returned by Advanced
Trace.&#xD;&#xA;" annotatedElement="_2IHQUA3aEdy0fvloa5NWrg">
    <eAnnotations xmi:id="_ABfwQQ3bEdy0fvloa5NWrg"
source="appliedStereotypes">
      <contents xmi:type="Default_0:Default__Documentation"
xmi:id="_ABfwQg3bEdy0fvloa5NWrg"/>
    </eAnnotations>
  </ownedComment>
  <packageImport xmi:type="uml:ProfileApplication"
xmi:id="_2IHQVA3aEdy0fvloa5NWrg">
    <eAnnotations xmi:id="_2IHQVQ3aEdy0fvloa5NWrg" source="attributes">
      <details xmi:id="_2IHQVg3aEdy0fvloa5NWrg" key="version" value="0"/>
    </eAnnotations>
    <importedPackage xmi:type="uml:Profile"
href="pathmap://UML2_PROFILES/Basic.profile.uml2#_6mFRgK86Edih9-GG5afQ0g"/>
    <importedProfile
href="pathmap://UML2_PROFILES/Basic.profile.uml2#_6mFRgK86Edih9-GG5afQ0g"/>
  </packageImport>

Note how the documentation includes the formatting entities but that
they are intended to be there.

SVG:

<g fill="rgb(232,232,255)" stroke-miterlimit="0" font-family="'Arial'"
stroke-linejoin="round" stroke="rgb(232,232,255)">
<rect x="10" y="10" clip-path="url(#clipPath1)" width="1500"
height="202" stroke="none"/>
<rect x="10" y="10" clip-path="url(#clipPath1)" fill="none" width="1500"
height="202" stroke="black"/>
<image stroke="black" transform="matrix(1,0,0,1,21,97)" width="15"
xlink:show="embed" xlink:type="simple" fill="black"
clip-path="url(#clipPath2)" preserveAspectRatio="none" height="23" x="0"
y="0"
xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA8AAAAXCAYAAADUUxW8AAAAgUlEQVR42s1UwQnA&#13;&#10;MAiMwSGy/3TZIq0PQcRqVQoV8hBz5jw1sPc5o2hzNKwFRumsBSHgLtMGU4ASyAsp&#13;&#10;2pygXHP55ZbaRJmPFXPVlpTZZ5AuByNqnvoz09efjqelqPbd8fyEttV7jAAeK4wA&#13;&#10;Xp/x7UCEYL2OUSLwPsB0zU+ts5bjArh6RxD5kW1tAAAAAElFTkSuQmCC"
xlink:actuate="onLoad"/>
<rect x="130" y="89" clip-path="url(#clipPath3)" fill="white"
width="111" rx="4.5" ry="4.5" height="61" stroke="none"/>
<rect x="130" y="89" clip-path="url(#clipPath3)" fill="none" width="110"
rx="4" ry="4" height="60" stroke="black"/>
<text x="153" y="124" clip-path="url(#clipPath4)" fill="black"
stroke="none" xml:space="preserve">AR1_Task1</text>
<circle clip-path="url(#clipPath5)" fill="white" r="14.5" cx="70.5"
cy="119.5" stroke="none"/>
<circle clip-path="url(#clipPath5)" fill="none" r="14.5" cx="70.5"
cy="119.5" stroke="black"/>
<text x="58" y="147" clip-path="url(#clipPath6)" fill="black"
stroke="none" xml:space="preserve">Start</text>
<line clip-path="url(#clipPath7)" fill="none" x1="36" x2="36" y1="11"
y2="210" stroke="rgb(169,169,169)"/>
<rect x="10" y="10" clip-path="url(#clipPath1)" fill="none" width="1500"
height="202" stroke="rgb(169,169,169)"/>
<rect x="10" y="265" clip-path="url(#clipPath8)" width="1500"
height="200" stroke="none"/>
<rect x="10" y="265" clip-path="url(#clipPath8)" fill="none"
width="1500" height="200" stroke="black"/>
<image stroke="black" transform="matrix(1,0,0,1,21,351)" width="15"
xlink:show="embed" xlink:type="simple" fill="black"
clip-path="url(#clipPath2)" preserveAspectRatio="none" height="23" x="0"
y="0"
xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA8AAAAXCAYAAADUUxW8AAAAhElEQVR42s1TQQ7A&#13;&#10;IAgbpo/w/6/zF25eFsKQyuZhJBwEiy2ItNb78dKKDdQqt3s5bbDJi8njso6FL2sb&#13;&#10;oOEeCwrWRbwCZeXSjLps7fZ/RqX7YHWDUbMFU5pnwP0NyxhYR+1Zy8Cqvk+0vdmD&#13;&#10;ASJWYIBozlj9EBRsV5IVClcyrXk2Om85TlGWUZ+e/kVBAAAAAElFTkSuQmCC"
xlink:actuate="onLoad"/>
<rect x="293" y="294" clip-path="url(#clipPath9)" fill="white"
width="153" rx="4.5" ry="4.5" height="94" stroke="none"/>
<rect x="293" y="294" clip-path="url(#clipPath9)" fill="none"
width="152" rx="4" ry="4" height="93" stroke="black"/>

XSLT:

    <xsl:strip-space elements="*" />
    <xsl:param name="p_FileListDoc" />
    <xsl:param name="p_ProcFileListDoc" />
    <xsl:param name="p_ReportDoc" />

    <xsl:template match="/">
        <!-- <xsl:message>
            p_configDoc = <xsl:value-of select="$p_configDoc" />
            </xsl:message> -->
        <xsl:element name="root" namespace="">

            <xsl:element name="errors" namespace="">
                <xsl:call-template name="ValidateADL">
                    <xsl:with-param name="p_FileListDoc"
                        select="$p_FileListDoc" />
                </xsl:call-template>
                <xsl:call-template
                    name="Process_Contains-draft-archetype">
                    <xsl:with-param name="p_rootNode" select="." />
                </xsl:call-template>
            </xsl:element>
            <xsl:element name="files" namespace="">
                <xsl:call-template name="ProcessFileNames">
                    <xsl:with-param name="p_rootNode" select="." />
                </xsl:call-template>
            </xsl:element>
        </xsl:element>
    </xsl:template>

XSD:

<xs:complexType name="ApplicationRolesType">
        <xs:annotation>
            <xs:documentation>Collects the application roles defined in
this file</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="ApplicationRole" type="RimArtefactType"
minOccurs="2" maxOccurs="unbounded">
                <xs:annotation>
                    <xs:documentation>A definition of an application
role.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>

ebXML:

  <BinaryCollaboration name="Request Catalog">
    <AuthorizedRole name="requestor"/>
    <AuthorizedRole name="provider"/>
    <BusinessTransactionActivity name="Catalog Request"
                                 businessTransaction="Catalog Request"
                                 fromAuthorizedRole="requestor"
                                 toAuthorizedRole="provider"/>
  </BinaryCollaboration>

Take an easy example which is (X)HTML

<a class="docSubTitle" href="#Table_of_Contents:"
    name="Getting_ANT">(1) Getting ANT </a>

looks exactly the same to the viewer as:

<a class="docSubTitle" href="#Table_of_Contents:"
    name="Getting_ANT">

(1)

            Getting

                                                        ANT
</a>

Because the value contained as the text child has no use for, or interest
in, any textual formatting at that level, i.e. a new line is a <br/> etc.
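
The same point as a small Python sketch, assuming the usual browser behaviour of collapsing runs of whitespace in ordinary inline text. The helper rendered_text is hypothetical and only approximates that behaviour.

```python
import xml.etree.ElementTree as ET

compact = ('<a class="docSubTitle" href="#Table_of_Contents:" '
           'name="Getting_ANT">(1) Getting ANT </a>')

spread = """<a class="docSubTitle" href="#Table_of_Contents:"
    name="Getting_ANT">

(1)

            Getting

                                                        ANT
</a>"""


def rendered_text(xml_source: str) -> str:
    """Collapse whitespace runs, roughly as a browser does for inline text."""
    return " ".join(ET.fromstring(xml_source).text.split())


# The raw text() children differ, but what the viewer sees does not.
raw_differs = ET.fromstring(compact).text != ET.fromstring(spread).text
looks_same = rendered_text(compact) == rendered_text(spread)
```

An archetype value has no such rendering step, so there the raw difference is the difference.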

Adam

Hi Adam and all

I think there might have been some misunderstanding regarding the problem you raised. Yes, Oxygen and XMLSpy will add unpredictable whitespace characters into an element when it is pretty printed and resaved. The problem is that the XML tools don’t know they are dealing with elements whose space must be preserved. That can be fixed using xml:space="preserve" attributes on all leaf elements which are values (http://www.w3.org/TR/2000/REC-xml-20001006#sec-white-space). Then the archetype XML can safely be pretty printed or collapsed back into ugly print without affecting the content or meaning.

There doesn’t appear to be any ‘best practice’ on this particular question. I don’t think this naturally means we should be using XML attributes to store the values. There is only a tiny ‘efficiency’ gain from using attributes (approx. 6 fewer characters per value), which I don’t think offsets the value of having elements used consistently throughout the serialisation.

What I think it does mean is that we should add xml:space="preserve" attributes to our free-text leaf elements at the time of serialisation so that the current problem is resolved.
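
A minimal sketch of what that could look like at serialisation time, using Python's ElementTree. The function mark_leaf_values and the <attribute> wrapper element are hypothetical, and marking every leaf that carries text (rather than only the free-text ones) is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# ElementTree addresses xml:space through the reserved XML namespace
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def mark_leaf_values(root: ET.Element) -> int:
    """Set xml:space="preserve" on every childless element that carries text."""
    marked = 0
    for elem in root.iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            elem.set(XML_SPACE, "preserve")
            marked += 1
    return marked


doc = ET.fromstring(
    "<attribute>"
    "<rm_type_name>ELEMENT</rm_type_name>"
    "<terminology_id><value>ISO_639-1</value></terminology_id>"
    "</attribute>"
)
count = mark_leaf_values(doc)
serialised = ET.tostring(doc, encoding="unicode")
```

Whether a given editor then honours the attribute when it pretty prints still has to be tested tool by tool.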

Lisa

Adam Flinton wrote:

I reserve my views wrt attributes vs text(); however, that would do, on
the proviso of a bit of testing with many tools, as xml:space used to be
patchily supported by different tools.

I accept that was a few years back & things may well have improved.

So the next question is: when will the tools support this?

Adam

looks like we have arrived at a useful point - first thing we need is an
analysis of changes to the XML-schemas. If Lisa's change is all that is
needed and someone wants to update the current schemas to make this work,
we can put it on the main TRUNK so that everyone can have access to it.

Further analysis will be needed for the tools, but I would not expect
big problems. Generally they are using orthodox XML parsers which I
assume respect the whitespace settings in an XML schema...

- thomas

No XML Schema changes are required; in fact the schema already indicates
that string data should have its space preserved, as per the W3C references
provided by Adam. The problem is that, because the schema already declares
an element to be a string type, the type is not required to be stated in the
XML document. When a tool such as XMLSpy reads the document without
referencing the schema, it doesn't know what type the element is, so it
doesn't apply the default space='preserve' attribute when it pretty-prints.
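
Until the tools behave, a QA step could at least detect the damage by comparing leaf values before and after a re-save. The following is only a sketch under assumptions: leaf_values and changed_leaves are hypothetical helpers, the <node> and <node_id> elements are made up for the example, and both documents are assumed to have the same element structure.

```python
import xml.etree.ElementTree as ET


def leaf_values(xml_source: str):
    """Return (tag, exact text) for every childless element, in document order."""
    root = ET.fromstring(xml_source)
    return [(e.tag, e.text or "") for e in root.iter() if len(e) == 0]


def changed_leaves(original: str, resaved: str):
    """List (tag, before, after) for each leaf whose text differs after a re-save."""
    pairs = zip(leaf_values(original), leaf_values(resaved))
    return [(tag, before, after)
            for (tag, before), (_, after) in pairs
            if before != after]


original = ("<node><rm_type_name>ELEMENT</rm_type_name>"
            "<node_id>at0001</node_id></node>")

# What a schema-unaware pretty print might write back
resaved = ("<node>\n\t<rm_type_name>\n\t\tELEMENT\n\t</rm_type_name>\n"
           "\t<node_id>at0001</node_id>\n</node>")

changed = changed_leaves(original, resaved)
```

Whitespace between elements is ignored here because only childless elements are compared; any whitespace that ends up inside a value is reported.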

So technically there is nothing wrong with the current XML. However, to
support these tools that apply pretty print before checking the schema to
determine if they are allowed to, we could explicitly add this space
attribute in the data (alternatively, we might be able to provide the type
attribute instead, but we haven't tested this yet). The problem is forcing
the XML serialiser to put these explicit attributes in the data. We will
explore this.

Stepping back a bit, would it be sufficient (in the short term at least) to
just have the XML pretty printed out of the tools rather than a single line
so that you are not inclined to use the problematic XMLSpy pretty print?

Heath

I would like though to enquire wrt the rationale of containing _id info
in a separate <value/> element.

If you are being consistent, then instead of:

       <terminology_id>
           <value>ISO_639-1</value>
       </terminology_id>

it should be simply:

       <terminology_id>ISO_639-1</terminology_id>

or <terminology_id value="ISO_639-1"/>
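
To show what a consumer has to do while more than one of these shapes is in circulation, here is a small Python sketch that reads the id from any of the three. The helper name terminology_id_of is hypothetical.

```python
import xml.etree.ElementTree as ET


def terminology_id_of(elem: ET.Element) -> str:
    """Read the id from the attribute form, the nested <value> form or the plain text form."""
    if "value" in elem.attrib:
        return elem.attrib["value"]
    child = elem.find("value")
    if child is not None:
        return child.text or ""
    return elem.text or ""


forms = [
    "<terminology_id><value>ISO_639-1</value></terminology_id>",
    "<terminology_id>ISO_639-1</terminology_id>",
    '<terminology_id value="ISO_639-1"/>',
]
ids = [terminology_id_of(ET.fromstring(form)) for form in forms]
```

Only the attribute form is immune to a pretty printer inserting line feeds and tabs around the value.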

Adam

Heath Frankel wrote:

Stepping back a bit, would it be sufficient (in the short term at
least) to just have the XML pretty printed out of the tools rather than
a single line so that you are not inclined to use the problematic XMLSpy
pretty print?

That might work. I say might as

A) XMLSpy pretty prints by default & it might still think that the
pretty printed doc isn't pretty enough.

B) Ditto wrt XSLT with indent="yes".

Adam

All - this is not my area of expertise, but please find attached a response
from a colleague in the UK - for consideration, and some information about
W3C direction.

Best regards,
Laura