Markdown in archetype definitions?

siljelb · 9 January 2025 10:17

Hi all!

Are there any plans for including markdown in archetype definitions? In general it would be very useful to be able to format the larger text blocks of a lot of archetypes, but my main use case right now is to be able to use subscript for things other than numbers.

For example in cancer staging with molecular classification, the molecular classification part is often styled in subscript, for example “IAm_POLEmut”.

joostholslag · 9 January 2025 11:45

Anything inside a dv_text can be markdown. Do you mean the name of an element? That inherits from locatable, and the name is a dv_text, so markdown should be possible, right?

siljelb · 9 January 2025 12:34

In the archetype definition, not in the data. In my example it would be in the description or comment of a data element, or in ‘Use’. I don’t think it’s a good idea to use markdown in a data element name

ian.mcnicoll · 9 January 2025 14:23

It was discussed before, with mixed views. I think it is a good idea, especially if we stick to the GitHub Markdown limits used by FHIR (no tables).

The main issue would clearly be in tooling - @sebastian.garde and I discussed the impact and it is likely to be doable in CKM without too much disruption, especially if, as an interim, we flagged an archetype as potentially containing Markdown in the descriptions/comments etc.

I think it is particularly useful in PROMS to be able to add simple formatting like bold/italics etc. to help keep aligned with formatting rules.

siljelb · 9 January 2025 14:45

I’d be happy with the Github limits

sebastian.garde · 9 January 2025 15:15

For DV_TEXT the spec says that for markdown the “use of CommonMark [is] strongly recommended”.

GitHub Flavoured Markdown is a bit more powerful - it has a few extensions (such as tables, strikethrough, autolinks).

From FHIR Datatypes - FHIR v6.0.0-ballot2

…requires and uses the GFM (Github Flavored Markdown) extensions on CommonMark format, with the exception of support for inline HTML which is not supported.

I guess nobody is stopping anyone from putting markdown in the archetype in these places (yes, probably not in the data element name!).

It is just a matter of how this is interpreted and rendered everywhere. But even if it is just rendered as plain text by all or some tooling at the moment, it may be ok.

There are likely some edge cases where things may go wrong if special chars such as *, &, #, [, <, >, and ’ are used and we don’t know for sure if this is meant to be markdown or plain text.

Whether tables, strikethrough etc. or not, I certainly agree with FHIR that inline HTML elements should not be used.
Tables in use, purpose, misuse etc. may be ok, in ontology/description or comment fields they may be a bit over the top (also considering typical screen real estate).

siljelb · 9 January 2025 15:40

I don’t think tables, headings or strikethrough are particularly relevant, and underline is IMHO a legacy thing from before we had bold and italic.

I think the following are likely relevant though:

Numbered lists
Bullet lists
italic text
bold text
bold italic text
^superscript
_subscript
links
inline preformatted text

The only ones of these that require html tags are subscript and superscript. Could those be allowed specifically, without allowing any other html tags?
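A minimal Java sketch of what "allow only those two tags" could look like in tooling. The class and method names are made up for illustration, and the approach (escaping every `<` that does not open `<sub>`, `</sub>`, `<sup>` or `</sup>`) is just one assumed way to do it, not something any openEHR tool is known to implement. Note that it would also escape autolinks such as `<https://...>` and any `<` inside code spans, so a real implementation would probably need to work on the parsed markdown rather than on raw text.

```java
import java.util.regex.Pattern;

/** Illustrative only: neutralise every HTML tag except sub/sup. */
final class SubSupOnlyFilter {

  // A "<" that is NOT the start of <sub>, </sub>, <sup> or </sup> (lower case only)
  private static final Pattern DISALLOWED_OPEN =
      Pattern.compile("<(?!/?(?:sub|sup)>)");

  /** Replaces each disallowed "<" with the entity "&lt;" so it is shown literally. */
  static String filter(String markdown) {
    return DISALLOWED_OPEN.matcher(markdown).replaceAll("&lt;");
  }
}
```

With this, `IAm<sub>POLEmut</sub>` passes through untouched, while `<b>x</b>` comes out as `&lt;b>x&lt;/b>`.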

ian.mcnicoll · 9 January 2025 16:03

The other html tag might be underline - quite common in PROMS documents. It is probably not a good idea visually because of the confusion with links, but we can’t control what the PROMS authors write.

grahamegrieve · 12 January 2025 02:54

there’s a lot of legacy italics and underlining out there, so the GFM extensions are useful.

With regard to markdown, special chars such as *, &, #, [, <, >, and ’ appear in some units etc, and there have been some problems in the FHIR ecosystem with people seeing something that might be markdown, and not being sure. I wrote this routine for use in the Java implementation community:


  /**
   * Returns true if this is intended to be processed as markdown
   * 
   * this is a guess, based on textual analysis of the content. 
   * 
   * Uses of this routine:
   *   In general, the main use of this is to decide to escape the string so erroneous markdown processing doesn't munge characters
   *   If it's a plain string, and it's being put into something that's markdown, then you should escape the content
   *   If it's markdown, but you're not sure whether to process it as markdown
   *   
   * The underlying problem is that markdown processing plain strings is problematic because some technical characters might 
   * get lost. So it's good to escape them... but if it's meant to be markdown, then it'll get trashed. 
   * 
   * This method works by looking for character patterns that are unlikely to occur outside markdown - but it's still only unlikely
   *  
   * @param content
   * @return
   */
  // todo: dialect dependency?
  public boolean isProbablyMarkdown(String content, boolean mdIfParagrapghs) {
    if (content == null) {
      return false;
    }
    if (mdIfParagrapghs && content.contains("\n")) {
      return true;
    }
    String[] lines = content.split("\\r?\\n");
    for (String s : lines) {
      if (s.startsWith("* ") || isHeading(s) || s.startsWith("1. ") || s.startsWith("    ")) {
        return true;
      }
      if (s.contains("```") || s.contains("~~~") || s.contains("[[[")) {
        return true;
      }
      if (hasLink(s)) {
        return true;
      }
      if (hasTextSpecial(s, '*') || hasTextSpecial(s, '_') ) {
        return true;
      }
    }
      
    return false;
  }
  
  private boolean isHeading(String s) {
    if (s.length() > 7 && s.startsWith("###### ") && !Character.isWhitespace(s.charAt(7))) {
      return true;
    }
    if (s.length() > 6 && s.startsWith("##### ") && !Character.isWhitespace(s.charAt(6))) {
      return true;
    }
    if (s.length() > 5 && s.startsWith("#### ") && !Character.isWhitespace(s.charAt(5))) {
      return true;
    }
    if (s.length() > 4 && s.startsWith("### ") && !Character.isWhitespace(s.charAt(4))) {
      return true;
    }
    if (s.length() > 3 && s.startsWith("## ") && !Character.isWhitespace(s.charAt(3))) {
      return true;
    }
    //
    // not sure about this one. # [string] is something that could easily arise in non-markdown, 
    // so this appearing isn't enough to call it markdown
    //
//    if (s.length() > 2 && s.startsWith("# ") && !Character.isWhitespace(s.charAt(2))) {
//      return true;
//    }
    return false;
  }


  private boolean hasLink(String s) {
    int left = -1;
    int mid = -1;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '[') {
        mid = -1;
        left = i;
      } else if (left > -1 && i < s.length()-1 && c == ']' && s.charAt(i+1) == '(') {
        mid = i;
      } else if (left > -1 && c == ']') {
        left = -1;
      } else if (left > -1 && mid > -1 && c == ')') {
        return true;
      } else if ((mid > -1 && c == '[') || c == ']' || (c == '(' && i > mid+1)) {
        left = -1;
        mid = -1;
      }
    }

    // Detect autolinks, which should start with a scheme, followed by a colon, followed by some content. Whitespace
    // is not allowed and for practical purposes, the scheme is considered to consist of lowercase ASCII characters
    // only.
    Pattern autolinkPattern = Pattern.compile("<[a-z]+:[^\\s]+>");
    Matcher autolinkMatcher = autolinkPattern.matcher(s);
    return autolinkMatcher.find();
  }


  private boolean hasTextSpecial(String s, char c) {
    boolean second = false;
    for (int i = 0; i < s.length(); i++) {
      char prev = i == 0 ? ' ' : s.charAt(i-1);
      char next = i < s.length() - 1 ? s.charAt(i+1) : ' ';
      if (s.charAt(i) != c) {
        // nothing
      } else if (second) {
        if (Character.isWhitespace(next) && (isPunctation(prev) || Character.isLetterOrDigit(prev))) {
          return true;
        }
        second = false;        
      } else {
        if (Character.isWhitespace(prev) && (isPunctation(next) || Character.isLetterOrDigit(next))) {
          second = true;
        }            
      }
    }
    return false;
  }


  private boolean isPunctation(char ch) {
    return Utilities.existsInList(ch, '.', ',', '!', '?');
  }
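
The "escape it if it's a plain string" use mentioned in the comment above could be sketched roughly as follows. This is not part of the routine above: `MarkdownEscaper`, `escape` and `prepare` are hypothetical names, the list of characters to escape is an assumption, and the detector is passed in as a `Predicate` so that something like `isProbablyMarkdown` can be plugged in. It relies only on the CommonMark rule that any ASCII punctuation character may be backslash-escaped.

```java
import java.util.function.Predicate;

/** Illustrative only: escape plain text before it is fed to a markdown renderer. */
final class MarkdownEscaper {

  // Characters escaped here: an assumed, deliberately small list
  private static final String SPECIALS = "\\`*_[]<>#&~";

  /** Prefixes each listed character with a backslash. */
  static String escape(String plain) {
    StringBuilder sb = new StringBuilder(plain.length() + 8);
    for (int i = 0; i < plain.length(); i++) {
      char c = plain.charAt(i);
      if (SPECIALS.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  /** Leaves content alone if the detector says it is markdown, otherwise escapes it. */
  static String prepare(String content, Predicate<String> looksLikeMarkdown) {
    return looksLikeMarkdown.test(content) ? content : escape(content);
  }
}
```

The design choice is the same trade-off described in the comment: escaping protects technical characters in plain strings, but would trash real markdown, hence the detector in front of it.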

ian.mcnicoll · 12 January 2025 10:58

@sebastian.garde and I had a wee look at this issue and the latest GFM does allow some limited HTML tags to support underline, subscript, superscript.

Supporting MD technically in tooling is not particularly an issue, and if more complex constructs like Tables are excluded, any MD text will still remain very human readable.

As @grahamegrieve has said in his post, the most likely area of confusion is where a new or existing archetype has some characters which mimic MD markup but incompletely, e.g. **hello, which will confuse an MD parser that is looking for the terminating ** to signify bold.

Perhaps we could introduce support, as follows…

A new tag in top-level other_details to signify usesMarkdown
Restrict use, for now, to Use, Misuse, Purpose
Add a new node-level annotation that can be used for licensed PROMS text but which allows markdown, and if present, is used in tooling/ UI instead of description.
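
As a rough illustration of the first two points, a tool could decide per field whether to render markdown along these lines. Everything here is an assumption for the sake of the sketch: `other_details` is modelled as a plain string map, the flag is assumed to hold "true"/"false", and the class, method and field names are invented.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only: decide whether a description field should be rendered as markdown. */
final class MarkdownFlag {

  // Key name from the proposal; the "true"/"false" value format is assumed
  static final String KEY = "usesMarkdown";

  // Fields the proposal would open up for now (names as used in this sketch)
  static final Set<String> ALLOWED_FIELDS = Set.of("use", "misuse", "purpose");

  static boolean renderAsMarkdown(Map<String, String> otherDetails, String fieldName) {
    boolean flagged = Boolean.parseBoolean(otherDetails.getOrDefault(KEY, "false"));
    return flagged && ALLOWED_FIELDS.contains(fieldName.toLowerCase(Locale.ROOT));
  }
}
```

An archetype without the flag, or a field outside the three listed, would simply keep being shown as plain text.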

We can also almost certainly run current CKM archetypes through markdown validation to see which are potentially problematic, and in future, I think MD should be the default for tooling, i.e. use of MD reserved characters is flagged.
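
A very crude Java sketch of one such check, for the `**hello` case: count the `**` tokens and flag text where they do not pair up. This is only a heuristic under invented names (`EmphasisBalanceCheck`, `hasUnbalancedBold`); real validation would need a proper markdown parser, and it ignores code spans, escapes and single `*`.

```java
/** Illustrative only: flag text with an odd number of "**" tokens. */
final class EmphasisBalanceCheck {

  /** Counts non-overlapping occurrences of token in text. */
  static int count(String text, String token) {
    int n = 0;
    int i = text.indexOf(token);
    while (i >= 0) {
      n++;
      i = text.indexOf(token, i + token.length());
    }
    return n;
  }

  /** True if a "**" has no partner, e.g. "**hello". */
  static boolean hasUnbalancedBold(String text) {
    return count(text, "**") % 2 != 0;
  }
}
```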

I should add that MD should never be supported in Node name/values

siljelb · 13 January 2025 08:13

Is this another limitation resulting from the flat json format?

ian.mcnicoll · 13 January 2025 09:49

In part yes, but whatever the merits or otherwise of the FLAT format, I feel we need to keep the node name brief and ‘semi-technical’ to allow it to be tokenised to a more technical format. AQL node name aliases are another use-case.
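
To make the tokenising point concrete, here is a small Java sketch. It is not the actual FLAT format algorithm, just an assumed snake_case conversion with invented names, but it shows how markup in a node name either disappears or leaks into the identifier.

```java
import java.util.Locale;

/** Illustrative only: turn a node name into a snake_case identifier. */
final class NodeNameTokenizer {

  static String toIdentifier(String nodeName) {
    return nodeName.toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9]+", "_")   // every run of other characters becomes one underscore
        .replaceAll("^_+|_+$", "");      // trim underscores at both ends
  }
}
```

With this sketch, "Inspired oxygen" and "**Inspired** oxygen" both give `inspired_oxygen`, while "H~2~O level" gives `h_2_o_level`.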

Do we allow Tables or Links in Node names?

If there’s a need to somehow support e.g. bold or italic for UI purposes, I would far prefer to use a separate annotation that could be picked up by UI tools.

grahamegrieve · 13 January 2025 10:00

You also said node value (as opposed to node name) which was more of a surprise, and I don’t think you’ve answered that part

ian.mcnicoll · 13 January 2025 10:16

I was not wrong, just confusing!!

The attribute we are talking about is actually ELEMENT/name/value which is inherited from LOCATABLE.

ELEMENT/name/value carries the textual representation of the node name.

whereas the ELEMENT value is carried at ELEMENT/value
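
In Java terms the distinction looks roughly like this. It is a stripped-down stand-in for the RM classes, not the real reference model or any particular library, and the example strings are made up.

```java
/** Illustrative only: minimal stand-ins for DV_TEXT and ELEMENT. */
final class ElementSketch {

  static final class DvText {
    final String value;
    DvText(String value) { this.value = value; }
  }

  static final class Element {
    final DvText name;   // ELEMENT/name, inherited from LOCATABLE; name.value is the node name text
    final Object value;  // ELEMENT/value, the recorded data value itself
    Element(DvText name, Object value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Node name kept plain; the markdown sits only in the data value. */
  static Element example() {
    return new Element(new DvText("Comment"), new DvText("Given *via* mask"));
  }
}
```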

thomas.beale · 16 January 2025 17:56

There’s nothing to stop it right now - if there were a common agreement across all tools to accept some flavour / subset of markdown, as discussed below (personally I would also allow tables).

Remember that the idea of markdown is that you can read it in its raw form as well as the rendered form (although links are pretty annoying in raw form). So a tool that didn’t support rendering should still just display the raw form with no problems.

The meta-data fields you are talking about are covered by the ODIN spec for string data.

This allows nearly anything but does have a couple of rules:

Section 3.2: some old skool backslash quoting is allowed, because everyone knows it e.g. ‘\t’ etc;
Section 3.1: non-ASCII chars have to be represented with UTF-8 char strings.

There’s nothing to say we couldn’t change any of these rules today, but I think they are sufficient to allow Markdown.

According to most markdown documentation, subscript and superscript are done with either <sub></sub> and <sup></sup> or (as with Asciidoc and pandoc) a pair of tilde characters for subscript and a pair of caret characters for superscript. Aesthetically, much nicer.

Raw form:

"`Well the H~2~O formula written on their whiteboard could be part
of a shopping list, but I don't think the local bodega sells
E=mc^2^,`" Lazarus replied.

Rendered:
“Well the H₂O formula written on their whiteboard could be part
of a shopping list, but I don’t think the local bodega sells
E=mc²,” Lazarus replied.

I would allow both forms.
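
If both forms were allowed, a tool could normalise the tilde/caret form to the tag form before rendering. A minimal Java sketch, with invented names and deliberately simple rules (no whitespace inside the markers, and `~~strikethrough~~` left alone):

```java
import java.util.regex.Pattern;

/** Illustrative only: rewrite ~x~ and ^x^ as sub/sup tags. */
final class SubSupSyntax {

  // A single ~...~ pair; the lookarounds skip GFM's ~~strikethrough~~
  private static final Pattern SUB = Pattern.compile("(?<!~)~([^~\\s]+)~(?!~)");

  // A ^...^ pair with no whitespace or caret inside
  private static final Pattern SUP = Pattern.compile("\\^([^\\^\\s]+)\\^");

  static String toHtmlTags(String text) {
    String out = SUB.matcher(text).replaceAll("<sub>$1</sub>");
    return SUP.matcher(out).replaceAll("<sup>$1</sup>");
  }
}
```

So `H~2~O` becomes `H<sub>2</sub>O`, `E=mc^2^` becomes `E=mc<sup>2</sup>`, and the staging example could be written as `IAm~POLEmut~`.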

Since we expect to replace ODIN in archetypes with JSON or YAML in the near future, we would want to make sure we know how to process ODIN encapsulated Markdown into those formats.
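
For the JSON side, the main thing to get right is that markdown's own backslashes and its significant newlines survive the string escaping. A hand-rolled Java sketch of standard JSON string escaping is below, purely to show what happens to the characters; in practice a JSON library would do this, and the class name is invented. The ODIN-side unescaping is not shown.

```java
/** Illustrative only: encode a markdown string as a JSON string literal. */
final class JsonStringSketch {

  static String toJsonString(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;  // a markdown escape such as \* gains a second backslash
        case '\n': sb.append("\\n"); break;   // newlines matter for markdown lists and paragraphs
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));  // remaining control characters
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }
}
```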

varntzen · 17 January 2025 15:28

Why not in data element name? It’s in the Cluster Archetype: Inspired oxygen [openEHR Clinical Knowledge Manager]

sebastian.garde · 17 January 2025 15:54

Hah - I was sure someone would find an example!
This one, though, is simply a Unicode char, I think: "₂" U+2082: Subscript Two (Unicode Character)
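
For digits that trick is easy to reproduce, because the subscript digits sit in one contiguous Unicode block (U+2080 to U+2089). A small Java sketch with an invented class name is below. It only covers digits: Unicode has just a limited set of subscript letters, so something like "POLEmut" cannot be written this way, which is the original use case for markup.

```java
/** Illustrative only: replace ASCII digits with Unicode subscript digits. */
final class UnicodeSubscript {

  /** Converts every digit in s; pass in only the part that should be subscripted. */
  static String subscriptDigits(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      // U+2080 is SUBSCRIPT ZERO; the other nine digits follow in order
      sb.append(c >= '0' && c <= '9' ? (char) ('\u2080' + (c - '0')) : c);
    }
    return sb.toString();
  }
}
```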