Regex pattern matching in ArchetypeHRID class

Most likely a question for @pieterbos or @MattijsK

The ArchetypeHRID class in Archie passes the following regex patterns to Pattern.compile():

    private static final Pattern namespacePattern = Pattern.compile("((?<namespace>.*)::)?");
    private static final Pattern publisherPattern = Pattern.compile("(?<publisher>[^.-]*)");
    private static final Pattern packagePattern = Pattern.compile("(?<package>[^.-]*)");
    private static final Pattern classPattern = Pattern.compile("(?<class>[^.-]*)");
    private static final Pattern conceptPattern = Pattern.compile("(?<concept>[^.]*)");
    private static final Pattern releaseVersionPattern = Pattern.compile("(\\.v(?<version>[^-+]*))?");
    private static final Pattern versionStatusPattern = Pattern.compile("(?<versionStatus>[^.\\d]*)?");
    private static final Pattern buildStatusPattern = Pattern.compile("(\\.?(?<buildCount>\\d*))");

How do the "?<namespace>" patterns work?

I am getting a stack overflow with a very benign Archetype HRID (“s2-EHR-Node.structured_address.v0”) and trying to work out why.

Regex looks good to me. I’ve tested it on https://regex101.com:

  • Regular expression:
    ((?<namespace>.*)::)?(?<publisher>[^.-]*)-(?<package>[^.-]*)-(?<class>[^.-]*)\.(?<concept>[^.]*)(\.v(?<version>[^-+]*))?(?<versionStatus>[^.\d]*)?(\.?(?<buildCount>\d*))$
  • Test string:
    s2-EHR-Node.structured_address.v0
    openEHR-EHR-CLUSTER.structured_address.v0.0.1

@thomas.beale Could the stack overflow be caused by some other code you added to Archie?

It could be the way the regexes are pre-compiled, because it happens during a run of 1400 archetypes. So it’s possibly related to call volume.

I discovered that these are .NET-flavor named back-references, which I was not previously aware of.

These are named capturing groups; see "Named capturing group: (?<name>...)" in the JavaScript reference on MDN.

Edit: …and I learned they are also called back-references as you said.
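
For anyone else wondering how the "?<namespace>" syntax behaves in Java, here is a minimal sketch. It assumes the individual patterns are joined into the single expression shown in the regex101 test above; the class name NamedGroupDemo and the parse() helper are hypothetical and are not Archie code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    // The combined expression from the regex101 test, escaped for a Java string literal.
    // (?<name>...) defines a capturing group that can be read back with m.group("name").
    private static final Pattern HRID = Pattern.compile(
            "((?<namespace>.*)::)?"
            + "(?<publisher>[^.-]*)-(?<package>[^.-]*)-(?<class>[^.-]*)"
            + "\\.(?<concept>[^.]*)"
            + "(\\.v(?<version>[^-+]*))?"
            + "(?<versionStatus>[^.\\d]*)?"
            + "(\\.?(?<buildCount>\\d*))");

    // Hypothetical helper: returns the named parts of an archetype HRID.
    static Map<String, String> parse(String value) {
        Matcher m = HRID.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException(value + " is not a valid archetype human readable id");
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"namespace", "publisher", "package", "class", "concept", "version"}) {
            // group(name) is null when an optional group did not take part in the match
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("s2-EHR-Node.structured_address.v0"));
        System.out.println(parse("openEHR-EHR-CLUSTER.structured_address.v0.0.1"));
    }
}
```

Both test strings match. For the first one, namespace is null because the optional "::" prefix is absent, and version is "0".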

If I do a common optimisation, which is to create the regex matcher once, and then reset it every time it is used, the stack overflow problem goes away - I can now compile 1500 archetypes with no errors.

The change looks like this:

    private static final Matcher m = archetypeHRIDPattern.matcher(""); // ADDED

    @JsonCreator
    public ArchetypeHRID(String value) {
        // Matcher m = archetypeHRIDPattern.matcher(value); // REMOVED
        m.reset(); // ADDED
        if(!m.matches()) {
            throw new IllegalArgumentException(value + " is not a valid archetype human readable id");
        }
        namespace = m.group("namespace");
        ...
    }

I read online that Matcher instances are not thread-safe, however (the compiled Pattern itself is), so sharing one static Matcher like this might not be safe for some users.
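
One way to reuse a matcher without sharing it across threads is to keep one per thread, as sketched below. This is only an illustration: the HridMatchers class, the conceptOf() helper and the shortened pattern are assumptions, not Archie code. Note that it calls reset(value), which sets the new input before matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HridMatchers {
    // A compiled Pattern is immutable and safe to share between threads.
    // Shortened pattern, for illustration only.
    private static final Pattern HRID = Pattern.compile(
            "(?<publisher>[^.-]*)-(?<package>[^.-]*)-(?<class>[^.-]*)"
            + "\\.(?<concept>[^.]*)(\\.v(?<version>[^-+]*))?");

    // A Matcher holds mutable match state, so each thread gets its own instance.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> HRID.matcher(""));

    // Hypothetical helper: extracts the concept part of an archetype HRID.
    static String conceptOf(String value) {
        // reset(CharSequence) replaces the input and clears any previous match state
        Matcher m = MATCHER.get().reset(value);
        if (!m.matches()) {
            throw new IllegalArgumentException(value + " is not a valid archetype human readable id");
        }
        return m.group("concept");
    }
}
```

Creating a new Matcher per call, as the original code does, is also thread-safe and is usually cheap compared with compiling the Pattern.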

Anyone on the Archie team got a better solution?

EDIT: looks like I spoke too soon. I missed the argument value to m.reset(). When I put that back, I get the same stack overflow error, with a different archetype id. The ids are clearly matching.

EDIT2: finally figured it out. It has nothing to do with regex matching; the stack just happens to run out while regex matching is going on. The real problem is OPT creation: our models contain recursive use_archetype references, so the depth to which those references are populated has to be limited.
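
A rough sketch of what limiting the depth could look like, for anyone hitting the same thing. Everything here is hypothetical: DepthLimitedExpansion, its reference map and the maxDepth parameter are stand-ins for illustration and do not reflect how Archie's OPT creation is actually structured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DepthLimitedExpansion {
    // Hypothetical stand-in for the model: archetype id -> ids it references via use_archetype
    private final Map<String, List<String>> references = new HashMap<>();
    private final int maxDepth;

    DepthLimitedExpansion(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    void addReference(String from, String to) {
        references.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Returns the ids visited in expansion order, never descending below maxDepth.
    List<String> expand(String rootId) {
        List<String> visited = new ArrayList<>();
        expand(rootId, 0, visited);
        return visited;
    }

    private void expand(String id, int depth, List<String> visited) {
        visited.add(id);
        if (depth >= maxDepth) {
            return; // stop here instead of following recursive references indefinitely
        }
        for (String child : references.getOrDefault(id, Collections.emptyList())) {
            expand(child, depth + 1, visited);
        }
    }
}
```

With a self-referencing archetype and maxDepth set to 3, expand() visits the same id four times (depths 0 to 3) and then returns, rather than recursing until the stack overflows.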