Revised AQL modular grammar?

thomas.beale · 4 September 2021 16:47

All,

Please read this post on new Antlr4 grammars for openEHR for background.

This post is for everyone, but particularly for @sebastian.iancu , @Teun , @pieterbos , @christian , @michael.boeckers , as listed on the AQL grammar authors/contribs list.

I have added AQL to the set of grammars, with the idea of making it re-use common lexer and rule sub-grammars from the main set, which includes ADL, CADL, ODIN, EL and various common & general pattern recognisers. As per the other post, the repo is here, but if you want to just see the AQL grammars, look here. RIght now, these are a copy of those published in the AQL 1.1.0 spec.

Unfortunately this review is coming a few months after the current Antlr4 AQL grammars were developed, and I did not have chance to do a sufficiently deep review then to get more commonality across the grammars. Anyway, there are obvious possibilities for reuse of existing grammars, but there are also associated issues. Here are a few that are initially obvious to me:

Base patterns reuse

The AqlLexer could reuse the BaseLexer grammar to get ISO8601 date times, URIs, and a fair bit of other stuff. However…

AQL ISO8601 date times include the non-extended form, i.e. the form without ‘-’ and ‘:’ characters. We don’t support these as naked values in places like ODIN or ADL because they can’t be distinguished from normal numbers. AQL gets around this by defining them as Antlr fragments, which require higher-level rules using them; in AQL, ISO8601 values are enclosed in "" and treated as special strings.

Probably we should support the compact format of ISO8601 in our base lexer, but we can’t do it the way the AQL lexer does it, which is like this:

fragment ISO8601_DATE_TIME
    : YEAR MONTH DAY ('T' HOUR MINUTE SECOND ('.' MICROSECOND)? TIMEZONE?)?
    | YEAR '-' MONTH '-' DAY ('T' HOUR ':' MINUTE ':' SECOND ('.' MICROSECOND)? TIMEZONE?)?
    ;

Instead, we’d need to distinguish something like ISO8601_DATE_TIME_EXTENDED and ISO8601_DATE_TIME_COMPACT. This would be easy enough to do, and would require fairly easy modifications to both AQL and Base, ODIN grammars, e.g. the following:

// ----------- Cadl parser -----------
dateTimeValue : ISO8601_DATE_TIME_EXTENDED ;
dateTimeListValue : dateTimeValue ( ( ',' dateTimeValue )+ | ',' SYM_LIST_CONTINUE ) ;
dateTimeIntervalValue :
      '|' SYM_GT? dateTimeValue '..' SYM_LT? dateTimeValue '|'
    | '|' relop? dateTimeValue '|'
    | '|' dateTimeValue SYM_PLUS_OR_MINUS durationValue '|'
    ;
dateTimeIntervalListValue : dateTimeIntervalValue ( ( ',' dateTimeIntervalValue )+ | ',' SYM_LIST_CONTINUE ) ;


// ----------- lexer rules ----------
STRING_DATE: '"' (ISO8601_DATE_TIME_COMPACT | ISO8601_DATE_TIME_EXTENDED ) '"' ;

fragment ISO8601_DATE_TIME_COMPACT: YEAR MONTH DAY ('T' HOUR MINUTE SECOND ('.' MICROSECOND)? TIMEZONE?)? ;

fragment ISO8601_DATE_TIME_EXTENDED:  YEAR '-' MONTH '-' DAY ('T' HOUR ':' MINUTE ':' SECOND ('.' MICROSECOND)? TIMEZONE?)? ;

openEHR Identifiers and patterns

Patterns like ARCHETYPE_HRID can straightforwardly be re-used from the OpenehrPatterns.g4 grammar.

Whitespace, comments & BOM handling

The AQL grammar handles this as follows:

// SKIP
WS: [ \t\r\n]+ -> skip;
UNICODE_BOM: (
    '\uEFBBBF' // UTF-8 BOM
    | '\uFEFF' // UTF16_BOM
    | '\u0000FEFF' // UTF32_BOM
    ) -> skip;
COMMENT: (
    SYM_DOUBLE_DASH ' ' ~[\r\n]* ('\r'? '\n' | EOF)
    | SYM_DOUBLE_DASH ('\r'? '\n' | EOF)
    ) -> channel(COMMENT_CHANNEL);

In the other grammars, we don’t handle BOMs, and we handle whitespace and comments more simply:

CMT_LINE   : '--' .*? EOL -> channel(COMMENT) ;
EOL        : '\r'? '\n'   -> channel(HIDDEN) ;  // throw out EOLs in default mode
WS         : [ \t\r]+     -> channel(HIDDEN) ;

We also handle EOF only at the top level.

Line handling is a bit less simple than one might think when you have embedded syntax blocks, which AQL has. When this happens, you have to go into a dedicated mode and handle WS and EOLs in a different way. Examples in the Adl2 lexer grammar.

I can propose improvements here, but I’d like to know from the Aql lexer grammar authors if they had specific reasons to do it as we see above.

Keyword & symbol definition

In the combined grammars, we do keywords as follows, using the name SYM_XXX for all symbols:

SYM_ARCHETYPE            : [Aa][Rr][Cc][Hh][Ee][Tt][Yy][Pp][Ee];

In the AQL grammar, the style is as follows:

// Common Keywords
SELECT: S E L E C T ;
AS: A S ;
FROM: F R O M ;

// ------------------- Fragment letters ---------------------
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
etc

I’m not sure that this gains anything, since the name of the symbol makes it obvious what it is, and I also wonder if the latter approach is likely to be slower at runtime because of all the extra fragment rules. I don’t have a strong preference however, but I do think we should use the same style.

URIs

URIs are fragments in the combined grammars, but not in the AQL grammar, but otherwise it’s the same spec, which either @pieterbos or I found online somewhere and re-used. We can easily make AQL re-use the same rules from BaseLexer here.

AQL Parser

There is a fair bit of re-use of the EL grammar that could be made in the AQL parser, but we’d need to be careful of the leaf type differences, i.e. things like stringified date/times etc. This is a bigger discussion however, which I will leave until later.

There’s lots more to discuss, but I think the above is a good start. All responses welcome.

Teun · 6 September 2021 07:40

Hi,

I was mostly involved in this on the level of ‘making this Antlr4’, and not so much in the decision making w.r.t. why we do certain things the way we do. I do agree that in particular the lexer does still have some issues we should resolve in the future. I don’t have a lot of time right now, but here are some initial comments:

Also I did not look at other grammars in the OpenEHR specification for this. I understand that some parts of the lexer were copied from other OpenEHR grammars, and I agree we should reuse as much as possible, in particular for things like Archetype IDs and date-time-formats.

I think we can just not handle BOMs, similar to the other grammars. They should not be used in UTF-8 anyway. Similarly, we should be able to handle EOFs top-level only.

As for your comment concerning line endings and embedded syntax blocks: could you give an example of where this would be an issue currently? My expertise is more in the field of ANTLR, rather than AQL, and when assisting with this I mostly worked of the examples in the specification, where this apparently is no issue. If there is a problem here, we should obviously fix that.

As far as your point related to keyword and symbol definition: the style used in the current AQL grammar is quite common (with fragments for [aA] etc.), and I personally find it more readable, and in particular much easier to write new symbols. Fragments are resolved when compiling the grammar, and not when parsing, so they have no performance impact.

thomas.beale · 6 September 2021 09:07

Hi Teun,

No criticism of the existing AQL parser/lexer (I reviewed the parser as well), just a question of improving maintainability via better reuse.

To handle an embedded block of another syntax, which more or less describes the top-level ADL syntax, but also the following embeddings:

ODIN, JSON etc in CADL, to represent default values
CADL in AQL and EL, to represent Boolean conditions defined as constraint matches (AQL example)

The CadlLexer handles embedded Odin, Json by line-oriented lexing, see at bottom. Here, whitespace and line ends are not thrown away but used to find lines and keep them intact for another parser. The ADL lexer has a lot more of this, used to pull an ADL file apart.

It might make sense to handle the embedded Cadl (as per the example above) by this approach, since currently the AQL parser has to replicate (some of) the rules of the CadlParser, which will handle any Cadl; on the other hand it may complicate things, if only a small subset of Cadl is supposed to be supported. In any case, embedded ODIN is also allowed, and this is much easier to capture as lines and send to the ODIN parser.

I am mainly interested in re-using the common patterns, aligning styles etc.

thomas.beale · 7 September 2021 15:02

Another query:
The AQL grammar has the following for integer and real matching:

INTEGER: DIGIT+;
REAL: DIGIT* '.' DIGIT+;
SCI_INTEGER: INTEGER E_SUFFIX;
SCI_REAL: REAL E_SUFFIX;
fragment E_SUFFIX: E [-+]? DIGIT+ ;

Here e-notation integers and reals are matched separately, so that the use of ‘INTEGER’ in a parser grammar would probably not have the desired result. Was there any reason to separate these out? Normally I think they would be both read by any scanf() style function with no problem, so I am not sure of the need. Maybe to preserve the string format for later use in serialising, which could make sense.

Currently the combined BaseLexer just does this.

INTEGER : DIGIT+ E_SUFFIX? ;
REAL :    DIGIT+ '.' DIGIT+ E_SUFFIX? ;
fragment E_SUFFIX : [eE][+-]? DIGIT+ ;

I can go either way.

thomas.beale · 7 September 2021 16:20

Some initial progress:

here are 2 commits with diffs showing how to re-use the BaseLexer and OpenehrPatterns lexers from the combined project. This shrinks down the Aql Lexer a lot, and also gives us just one place to maintain all those definitions for now.

Commit 2 - reuse rest of combined lexer patterns
Commit 1 - re-use Date/times

The above required me to add a few missing things to the BaseLexer from AqlLexer, so already a few benefits

In terms of going any further, there are various changes I would suggest. One is that symbols should be named in the lexer in the SYM_xxx style, because otherwise in the parser it is very hard to tell whether you are looking at symbols or values e.g. here (but also in most rules):

classExprOperand
    : IDENTIFIER variable=IDENTIFIER? pathPredicate?  #classExpression
    | VERSION variable=IDENTIFIER? (SYM_LBRACKET versionPredicate SYM_RBRACKET)?  #versionClassExpr
    ;

I think the following is easier to understand:

classExprOperand
    : IDENTIFIER variable=IDENTIFIER? pathPredicate?  
    | SYM_VERSION variable=IDENTIFIER? (SYM_LBRACKET versionPredicate SYM_RBRACKET)?  
    ;

I also tend to use inline literals for brackets etc to make the parser easier to read, e.g.

classExprOperand
    : IDENTIFIER variable=IDENTIFIER? pathPredicate?  
    | SYM_VERSION variable=IDENTIFIER? ( '(' versionPredicate ')' )? 
    ;

Similarly,

matchesOperand
    : SYM_LCURLY valueListItem (SYM_COMMA valueListItem)* SYM_RCURLY
    | terminologyFunction
    | SYM_LCURLY AQL_URI SYM_RCURLY
    ;

becomes

matchesOperand
    : '{' valueListItem ( ',' valueListItem)* '}'
    | terminologyFunction
    | '{' AQL_URI '}'
    ;

You still need the SYM_LCURLY etc in the lexer, but the parser rules are a lot easier to read.

The above are suggestions for discussion!