AQL spec review for stabilization, formalization and cleanup

pablo · 12 November 2019 06:11

Hi all, I’ve reviewed the current AQL spec and the open tickets related to AQL in JIRA. Please check the comments/questions/suggestions to see if you agree (mainly for SEC members but general opinion would be good feedback) https://docs.google.com/document/d/1g8zOh06LhSNi1yFZWKuBzUX0bJN88r7mKpAFqDNi2JI/edit?usp=sharing

I’ll start making some changes to the spec this week, reordering sections and adding missing information without changing the semantics of the operators and clauses.

We need to move this forward ASAP for conformance testing, and for the HiGHmed project.

Please let me know if you have any comments or questions about the review.

Thanks,
Pablo.

pablo · 16 November 2019 01:29

I have found some other issues on the spec and added these to my report:

bd. Predicate usage clarifications

At the end of the “predicates” section it says “A node predicate defines criteria on fine-grained data. It is only used within an identified path.”

From the predicates section, it is not clear that the predicates could be used only in the WHERE clause, and I think I’ve seen some predicates at the FROM, but in the specs there are no examples.

Then on: 3.8.1. Usage
“Identified paths are used to locate data items within an archetype RM class. Except for the FROM clause, an identified path could appear in any clauses of an AQL statement. For example, it can be used to indicate the data to be returned in a SELECT clause, or the data item on which query criteria are applied in a WHERE clause.”

So if predicates are only used within identified paths and identified paths can’t be used in the FROM, so the predicates could only be within the WHERE, is that correct? That is not explicitly stated in the spec.

de. Identified paths clarifications

In section 3.8 Identified paths there is no actual definition of what an identified path is.

Also says “Identified paths are used to locate data items within an archetype RM class. Except for the FROM clause…”

Then “AQL identified path can take any of the following forms: an AQL variable name defined within the FROM clause, …”

Aren’t those contradictory statements?

pablo · 27 November 2019 05:23

There are some items on the AQL spec improvements that need to be discussed in a SEC meeting, this is the current list of things I can’t resolve without input and consensus by the SEC: (we need to schedule some calls)

Use of abstract RM types like ENTRY or EVENT in AQL
Type checking rules for comparison operators:
a) check compatible/incompatible types for comparison and correspondent errors
b) error checking capabilities (design time vs. run time)
Rules for parameter substitution on AQL: which types will add or not quotes
Define required set of functions.
a) full specification of functions including parameter types and return types/formats
b) and function possible errors
Add missing spec for the matches operator, like if it supports regexes or not.
Definition of “identified path” and “identified expression” (it seems it is the same as just “path” and “expression”).
Review of pending AQL-related tickets
a) comment on the valid ones
b) disregard the ones that are not valid.
c) speed up the approval and specification for LIMIT/OFFSET keywords to be able to formally use pagination for AQL results
Discuss descendant paths: like obs//value
Define internal items for Terminology URI, considering current discussions about that area in AQL (mainly from Luis researching this part)
Add resolution of latest discussions about TIMEWINDOW to the spec.
Decide about supporting or not cADL expressions in the matches right operand.

Details are here: https://docs.google.com/document/d/1g8zOh06LhSNi1yFZWKuBzUX0bJN88r7mKpAFqDNi2JI/edit?usp=sharing

thomas.beale · 27 November 2019 11:39

Pablo, this is a good list. Hope you don’t mind, I just edited your message to change the mark-down to format the list properly so it is easier to read.

Can we separate out anything here that is nominally 1.1.0 into a different list - I’d really like to nail down the 1.0.1 version which just fixes typos, improves description of current semantics etc, but doesn’t change anything (not even the deficiencies). Then I think we need to aim for future releases something like:

1.1.0 which deals with ‘obvious deficiencies’;
1.2.0 which adds obvious ‘relatively simple’ enhancements;
1.3.0 and later which start to add more serious enhancements, e.g. server-side functions.

This is just my reaction to the work going on now, and considering the long-term list of things that the guys at Better have been after (e.g. server-side functions), that @Seref has previously worked on, and now @pablo has generated. You guys may have better ideas about release strategy, but to me it seems it is better to stage the changes rather than get stuck forever on making a giant ‘perfect release’.

Pablo, If you agree with that, can you cut the above list up according to notional release, and let’s initially discuss 1.0.1 and get it done. However if you prefer to just discuss the whole list and use the call to do the triage, I’m fine with that as well.

pablo · 27 November 2019 15:55

I think we need to get feedback, maybe brainstorm, about each item, get all documented, then discuss about all items. After that we can organize and prioritize the work, that I don’t really care how is done, just need to know what others think and what others already implemented, to be able to make informed decisions.

In fact I would prefer not to discuss the points on the first call, just get everyone’s views on each subject and document everything, also that could help to detect coincidences without putting much time on discussing small details. That might be for a second conversation I think.

Another idea is to create a wiki page for each item so SEC members and the community in general, can put their thoughts, but that doesn’t generate much engagement. When we schedule calls and have a concrete agenda, people tend to connect and talking we move forward faster.

pablo · 28 November 2019 00:37

Adding an item:

Agree on a “LIKE”-like operator for partial text matching, maybe a combination of "matches’ and regexes.

thomas.beale · 28 November 2019 16:16

The current syntax has this:

matchesOperand
 	: valueListItems | URIVALUE;

and if you follow that through, it does not include regex matching. So we can take that out of the text as far as I can see (even if some implementations have it, it’s not yet in the spec).

The current syntax has this ‘identified’ stuff all over the place, but really the only ‘identified’ thing is a path, i.e. the association of an identifier an a predicate and/or path:

identifiedPath
	 : IDENTIFIER predicate? ('/' objectPath)? -> ^(IDENTIFIER predicate? objectPath?);

The ‘identified expression’ is just an expression. I agree with you, take out ‘identified’ from the text, and just check that it is clear in the relevant place that the conjunction of an identifier and a predicate/path has the effect of enabling the identifier to be used instead of the predicate/ path.

pablo · 13 December 2019 00:24

Hi Thomas,

I can take the examples with the regexes out of the spec and move them to the wiki for further discussion.

BTW, I can see nodePredicateRegEx uses MATCHES and REGEXPATTERN, but I don’t see nodePredicateRegEx being used anywhere. Archetype Query Language (AQL)

Related to point 12. support a “LIKE”-like operator, I think allowing regexes in matches could be a good alternative for supporting that “LIKE”.

Another question: could path predicates be used to filter on other attributes than archetype ids and node ids? For instance, is this valid?

SELECT c FROM COMPOSITION c [uid/value='_this_is_a_version_id_']

Note: testing that on EHRScape, I get 400 Bad Request with no error message, and on EHRBASE I get 400 and an antlr exception “Could not process query, reason:org.antlr.v4.runtime.misc.ParseCancellationException: AQL Parse exception: line 1: char 29 mismatched input ‘uid’ expecting ARCHETYPEID”

Or the only way of doing such thing is by using a condition in the WHERE?

SELECT c FROM COMPOSITION c WHERE c/uid/value = '_this_is_a_version_id_'

thomas.beale · 18 December 2019 20:15

Regex as an altenative to LIKE is fine, but I guess we need to allow the simplified syntax with LIKE, with is AFAIK, something like unix glob syntax, i.e. wildcards that you can use in bash command line for matching file names.

On the last Q, I would normally expect to do this in the WHERE clause.

pablo · 19 December 2019 01:53

Do you mean that is not valid or that is an alternative to something that is commonly done in WHERE but is still valid?

I know regex is wider than the scope of LIKE, I guess my point is: if we allow regexes in matches, then that will include a kind-of LIKE, since a regex can emulate the like syntax and those wildcards, plus doing more interesting/complex stuff.

ian.mcnicoll · 19 December 2019 10:17

Agree that we need to support simple wild-carding. No objection to adding regex support, as long as it does not over-complicate as I suspect the value will be somewhat limited in practice.

thomas.beale · 19 December 2019 11:33

Well be careful - unix wildcards (file-name globbing) is not the same as regex. For example ‘filename.’ matches filenames of the form ‘filename.xxx’ where ‘xxx’ is any string coming after a dot. In regex, 'filename.’ matches ‘filename’ followed by anything at all. I am not even sure what logic ‘LIKE’ follows in SQL (from memory, things like ‘XXX%’, where the ‘%’ acts like a ‘*’ wildcard), but we should make it do the same thing.

On the uid matching, I don’t think it is valid in the grammar, but I need to go through it again to double check. I think it’s better to keep the FROM part to being archetype-match based, potentially plus other predicates, but not predicates that don’t include archetypes.

Seref · 19 December 2019 11:45

I’m now going through this and my early feedback re LIKE and regex is that having a pragmatic and well defined LIKE would probably fit the 80/20 rule better. I agree with @thomas.beale’s point re being careful about wildcards.

We need to define capabilities of LIKE, ideally considering capabilities of various implementation technologies but without favouring a particular tech. It is a tricky thing but the specs should be tech-agnostic but also should not be orthogonal to them. A common denominator would let us hit the 80/20 for LIKE

pablo · 22 December 2019 17:45

I don’t think I get the analogy: one thing is matching file names, another thing is regex, in file matching “" matches anything, in regex “.” matches any char and ".” matches anything.

But in regexes we can match specific things, it will not always be used to search for “.". Just when emulating the LIKE, yes, the anything in SQL LIKE is “%” in most DBMSs, so the ".” in regex can emulate the “%”.

The use case for supporting regexes in the matches operator is for sure matching strings with certain pattern. And considering my comments about type checking in AQL, we would be able to define if regexes apply to UIDs or not.

Another point is: LIKE just works for a specific set of use cases, searching in small strings. For more powerful search in strings, we need full-text search, because exact matching and LIKE matching are not enough to search for similar phrases in a long text: MySQL Full text search - w3resource

I think there are providers that have AQL extensions for this, and IMO we need to update the spec accordingly, since current AQL works only for structured data.

thomas.beale · 23 December 2019 10:59

My point (maybe not very clear) is that a) non-tech people intuitively understand filename globbing wildcads much better than true regex and b) the ‘like’ syntax in SQL is from memory more like filename globbing than true regex. Consequently, the capability of most AQL authors to write regex is likely to be low. I agree we should support it, but we most likely need the simplified wildcard mode as well. Which means tools would need some check-box to turn on ‘full regex’ etc. It might be sensible to treat the simple wildcard / Like patterns as just a UI thing that gets converted into proper regex, and have the back-end only do regex.

Agree on the full-text search.

pablo · 30 December 2019 14:41

@thomas.beale please check that ^ I have an open issue with the conformance platform related to that specific test case.

About the regex thing, now I understand your point, you were considering the people creating the the queries. I don’t think anyone, even an expert will create a regex just by writing it it it’s a little complex, that requires an editor and tester like https://regex101.com/

We need to separate requirements for AQL from the user experience requirements, since we can’t control from the spec how and which features are exposed to users. If we have a clear separation of concerns, we can move forward faster with the spec, and later add implementation considerations, suggestions or even ITS elements that help on implementation.

About the full-text search, I’ve been doing some research on the MySQL side, there are a couple of modes and ways to tokenize the input and sort the output based on a relevance index. Need to review how other technologies do that to have some kind of proposal for the AQL spec.