All,
Please read this post on new Antlr4 grammars for openEHR for background.
This post is for everyone, but particularly for @sebastian.iancu , @Teun , @pieterbos , @christian , @michael.boeckers , as listed on the AQL grammar authors/contribs list.
I have added AQL to the set of grammars, with the idea of making it re-use common lexer and rule sub-grammars from the main set, which includes ADL, CADL, ODIN, EL and various common & general pattern recognisers. As per the other post, the repo is here, but if you want to just see the AQL grammars, look here. RIght now, these are a copy of those published in the AQL 1.1.0 spec.
Unfortunately this review is coming a few months after the current Antlr4 AQL grammars were developed, and I did not have chance to do a sufficiently deep review then to get more commonality across the grammars. Anyway, there are obvious possibilities for reuse of existing grammars, but there are also associated issues. Here are a few that are initially obvious to me:
Base patterns reuse
The AqlLexer could reuse the BaseLexer grammar to get ISO8601 date times, URIs, and a fair bit of other stuff. However…
- AQL ISO8601 date times include the non-extended form, i.e. the form without ‘-’ and ‘:’ characters. We don’t support these as naked values in places like ODIN or ADL because they can’t be distinguished from normal numbers. AQL gets around this by defining them as Antlr fragments, which require higher-level rules using them; in AQL, ISO8601 values are enclosed in
""
and treated as special strings.
Probably we should support the compact format of ISO8601 in our base lexer, but we can’t do it the way the AQL lexer does it, which is like this:
fragment ISO8601_DATE_TIME
: YEAR MONTH DAY ('T' HOUR MINUTE SECOND ('.' MICROSECOND)? TIMEZONE?)?
| YEAR '-' MONTH '-' DAY ('T' HOUR ':' MINUTE ':' SECOND ('.' MICROSECOND)? TIMEZONE?)?
;
Instead, we’d need to distinguish something like ISO8601_DATE_TIME_EXTENDED
and ISO8601_DATE_TIME_COMPACT
. This would be easy enough to do, and would require fairly easy modifications to both AQL and Base, ODIN grammars, e.g. the following:
// ----------- Cadl parser -----------
dateTimeValue : ISO8601_DATE_TIME_EXTENDED ;
dateTimeListValue : dateTimeValue ( ( ',' dateTimeValue )+ | ',' SYM_LIST_CONTINUE ) ;
dateTimeIntervalValue :
'|' SYM_GT? dateTimeValue '..' SYM_LT? dateTimeValue '|'
| '|' relop? dateTimeValue '|'
| '|' dateTimeValue SYM_PLUS_OR_MINUS durationValue '|'
;
dateTimeIntervalListValue : dateTimeIntervalValue ( ( ',' dateTimeIntervalValue )+ | ',' SYM_LIST_CONTINUE ) ;
// ----------- lexer rules ----------
STRING_DATE: '"' (ISO8601_DATE_TIME_COMPACT | ISO8601_DATE_TIME_EXTENDED ) '"' ;
fragment ISO8601_DATE_TIME_COMPACT: YEAR MONTH DAY ('T' HOUR MINUTE SECOND ('.' MICROSECOND)? TIMEZONE?)? ;
fragment ISO8601_DATE_TIME_EXTENDED: YEAR '-' MONTH '-' DAY ('T' HOUR ':' MINUTE ':' SECOND ('.' MICROSECOND)? TIMEZONE?)? ;
openEHR Identifiers and patterns
Patterns like ARCHETYPE_HRID
can straightforwardly be re-used from the OpenehrPatterns.g4 grammar.
Whitespace, comments & BOM handling
The AQL grammar handles this as follows:
// SKIP
WS: [ \t\r\n]+ -> skip;
UNICODE_BOM: (
'\uEFBBBF' // UTF-8 BOM
| '\uFEFF' // UTF16_BOM
| '\u0000FEFF' // UTF32_BOM
) -> skip;
COMMENT: (
SYM_DOUBLE_DASH ' ' ~[\r\n]* ('\r'? '\n' | EOF)
| SYM_DOUBLE_DASH ('\r'? '\n' | EOF)
) -> channel(COMMENT_CHANNEL);
In the other grammars, we don’t handle BOMs, and we handle whitespace and comments more simply:
CMT_LINE : '--' .*? EOL -> channel(COMMENT) ;
EOL : '\r'? '\n' -> channel(HIDDEN) ; // throw out EOLs in default mode
WS : [ \t\r]+ -> channel(HIDDEN) ;
We also handle EOF only at the top level.
Line handling is a bit less simple than one might think when you have embedded syntax blocks, which AQL has. When this happens, you have to go into a dedicated mode and handle WS and EOLs in a different way. Examples in the Adl2 lexer grammar.
I can propose improvements here, but I’d like to know from the Aql lexer grammar authors if they had specific reasons to do it as we see above.
Keyword & symbol definition
In the combined grammars, we do keywords as follows, using the name SYM_XXX
for all symbols:
SYM_ARCHETYPE : [Aa][Rr][Cc][Hh][Ee][Tt][Yy][Pp][Ee];
In the AQL grammar, the style is as follows:
// Common Keywords
SELECT: S E L E C T ;
AS: A S ;
FROM: F R O M ;
// ------------------- Fragment letters ---------------------
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
etc
I’m not sure that this gains anything, since the name of the symbol makes it obvious what it is, and I also wonder if the latter approach is likely to be slower at runtime because of all the extra fragment rules. I don’t have a strong preference however, but I do think we should use the same style.
URIs
URIs are fragments in the combined grammars, but not in the AQL grammar, but otherwise it’s the same spec, which either @pieterbos or I found online somewhere and re-used. We can easily make AQL re-use the same rules from BaseLexer here.
AQL Parser
There is a fair bit of re-use of the EL grammar that could be made in the AQL parser, but we’d need to be careful of the leaf type differences, i.e. things like stringified date/times etc. This is a bigger discussion however, which I will leave until later.
There’s lots more to discuss, but I think the above is a good start. All responses welcome.