# Regex pattern matching in ArchetypeHRID class

**Category:** [Archie](https://discourse.openehr.org/c/archie-support/137)
**Created:** 2024-08-23 00:35 UTC
**Views:** 54
**Replies:** 4
**URL:** https://discourse.openehr.org/t/regex-pattern-matching-in-archetypehrid-class/5601

---

## Post #1 by @thomas.beale

Most likely a question for @pieterbos or @MattijsK ...

Here in the [ArchetypeHRID class in Archie](https://github.com/openEHR/archie/blob/8df5c640c614914e836e98c0354627c015c8761b/aom/src/main/java/com/nedap/archie/aom/ArchetypeHRID.java#L52) we find the following regex patterns as args to the compile() call:

```
private static final Pattern namespacePattern = Pattern.compile("((?.*)::)?");
private static final Pattern publisherPattern = Pattern.compile("(?[^.-]*)");
private static final Pattern packagePattern = Pattern.compile("(?[^.-]*)");
private static final Pattern classPattern = Pattern.compile("(?[^.-]*)");
private static final Pattern conceptPattern = Pattern.compile("(?[^.]*)");
private static final Pattern releaseVersionPattern = Pattern.compile("(\\.v(?[^-+]*))?");
private static final Pattern versionStatusPattern = Pattern.compile("(?[^.\\d]*)?");
private static final Pattern buildStatusPattern = Pattern.compile("(\\.?(?\\d*))");
```

How do the `"?"` patterns work? I am getting a stack overflow with a very benign Archetype HRID ("s2-EHR-Node.structured_address.v0") and trying to work out why.

---

## Post #2 by @borut.jures

Regex looks good to me. I've tested it on https://regex101.com:

* Regular expression:

```
((?.*)::)?(?[^.-]*)-(?[^.-]*)-(?[^.-]*)\.(?[^.]*)(\.v(?[^-+]*))?(?[^.\d]*)?(\.?(?\d*))$
```

* Test string:

```
s2-EHR-Node.structured_address.v0
openEHR-EHR-CLUSTER.structured_address.v0.0.1
```

@thomas.beale Could the stack overflow be caused by some other code you added to Archie?

---

## Post #3 by @thomas.beale

It could be the way the regexes are pre-compiled, because it happens during a run of 1400 archetypes.
So it's possibly related to call volume.

[quote="thomas.beale, post:1, topic:5601"]
How do the `"?"` patterns work?
[/quote]

I discovered this is .Net flavor named back-references, of which I was not previously aware.

---

## Post #4 by @borut.jures

[quote="borut.jures, post:2, topic:5601"]
`(?.*)::)`
[/quote]

These are named groups: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Named_capturing_group

Edit: ...and I learned they are also called back-references as you said.

---

## Post #5 by @thomas.beale

If I do a common optimisation, which is to create the regex matcher once, and then reset it every time it is used, the stack overflow problem goes away - I can now compile 1500 archetypes with no errors. The change looks like this:

```
private static final Matcher m = archetypeHRIDPattern.matcher(""); // ADDED

@JsonCreator
public ArchetypeHRID(String value) {
    // Matcher m = archetypeHRIDPattern.matcher(value); // REMOVED
    m.reset(); // ADDED

    if(!m.matches()) {
        throw new IllegalArgumentException(value + " is not a valid archetype human readable id");
    }
    namespace = m.group("namespace");
    ...
}
```

I read online that `Pattern.matcher()` is not thread-safe however, so this fix might not be safe for some users. Anyone on the Archie team got a better solution?

EDIT: looks like I spoke too soon. I missed the argument `value` to `m.reset()`. When I put that back, I get the same stack overflow error, with a different archetype id. The ids are clearly matching.

EDIT2: finally figured it out. It's nothing to do with regex matching, it's just that the stack runs out when regex matching is going on. The problem is OPT creation, due to our models containing recursive `use_archetype` referencing. The depth of populating those references has to be limited.
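
For anyone landing here with the same question about named groups and matcher reuse, here is a minimal sketch of how Java named capturing groups (`(?<name>...)`) are read back with `Matcher.group(String)`. It is a simplified pattern, not Archie's actual one: the class name `NamedGroupDemo` and the group names `publisher`, `pkg`, `cls`, `concept` and `release` are hypothetical, and only `namespace` is taken from the snippet above. It also creates a new `Matcher` on each call, on the understanding that compiled `Pattern` instances are safe to share between threads while `Matcher` instances are not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {

    // Simplified, illustrative pattern. Group names other than "namespace"
    // are assumptions made for this sketch.
    private static final Pattern ID_PATTERN = Pattern.compile(
            "((?<namespace>.*)::)?"                 // optional "namespace::" prefix
            + "(?<publisher>[^.-]*)-"               // e.g. "s2" or "openEHR"
            + "(?<pkg>[^.-]*)-"                     // e.g. "EHR"
            + "(?<cls>[^.-]*)"                      // e.g. "Node" or "CLUSTER"
            + "\\.(?<concept>[^.]*)"                // e.g. "structured_address"
            + "(\\.v(?<release>[^-+]*))?");         // optional ".v<release>"

    // Returns {namespace, publisher, pkg, cls, concept, release}.
    // A group that did not take part in the match is returned as null.
    public static String[] parse(String value) {
        // The Pattern is shared; the Matcher is created per call, so no
        // mutable matching state is shared between callers.
        Matcher m = ID_PATTERN.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    value + " is not a valid archetype human readable id");
        }
        return new String[] {
            m.group("namespace"),
            m.group("publisher"),
            m.group("pkg"),
            m.group("cls"),
            m.group("concept"),
            m.group("release")
        };
    }

    public static void main(String[] args) {
        String[] names = {"namespace", "publisher", "pkg", "cls", "concept", "release"};
        String[] parts = parse("s2-EHR-Node.structured_address.v0");
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

If a single `Matcher` is reused instead, it is `reset(CharSequence)` that supplies the new input. The no-argument `reset()` keeps matching against the input the matcher was created with.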