Conformance data validation question: counterintuitive C_TIME.range constraints

In the conformance specification I have a test case for checking the C_TIME.range constraint over DV_TIME to validate time data.

In another thread I was arguing that there are some date/time/datetime expressions that are not strictly comparable, even though the current spec says otherwise. I have double-checked the semantics against ISO 8601-1:2019 and this is a summary:

There is an open issue with the comparability of two date/time/datetime expressions that have different precisions but share values for the components they do contain. For instance, in ISO 8601-1:2019, the expression T23:20 refers to a specific hour and minute, and T23 refers to a specific hour. Numerically, it is not possible to say whether T23 < T23:20 or T23 > T23:20, though a string comparison could apply some criterion to decide that one expression is less than the other. But I think what matters is to compare what the expressions represent, not the syntax of the expressions (this is a huge difference IMO, and why this problem seems extremely trivial to some and difficult to others).

That is because the expressions represent different time components, which are really intervals of time, and one interval contains the other (the 23rd hour of the day contains the minute 23:20).

Following that logic, T23 is an interval/range of all the minutes and seconds in hour 23, while T23:20 is an interval of all the seconds in minute 20 of the 23rd hour of the day. Graphically:

T23    = 23:00:00 [=============================================) 00:00:00
T23:20 =             23:20:00 [===) 23:21:00

[ marks the inclusion of the beginning
) marks the exclusion of the end

However, when the precisions are not the same but there are no shared components, the expressions are comparable; for instance, we can say T22 < T23:20, because all the minutes and seconds in the 22nd hour of the day come before the minute 23:20. Graphically:

T22    = 22:00:00 [=============================================) 23:00:00
T23:20 =                                                                  23:20:00 [===) 23:21:00

Similarly we can say T22:45 < T23, since the whole minute 22:45 comes before all minutes and seconds in the 23rd hour of the day.
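To make that comparison rule concrete, here is a minimal sketch (purely illustrative, not openEHR or ISO code; the ReducedTime and less_than names are mine) that models a reduced-precision time as the half-open interval of seconds-of-day it represents, and only declares two expressions comparable when their intervals do not overlap:

```python
# Minimal sketch (not openEHR code): a reduced-precision ISO 8601 time
# modelled as the half-open interval of seconds-of-day it represents.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReducedTime:
    hour: int
    minute: Optional[int] = None   # None = reduced precision
    second: Optional[int] = None

    def interval(self):
        """Return (start, end) in seconds-of-day, end exclusive."""
        start = self.hour * 3600
        if self.minute is None:
            return start, start + 3600          # T23 = the whole hour
        start += self.minute * 60
        if self.second is None:
            return start, start + 60            # T23:20 = the whole minute
        start += self.second
        return start, start + 1                 # full precision = one second

def less_than(a: ReducedTime, b: ReducedTime) -> Optional[bool]:
    """Three-valued comparison: True, False, or None when not strictly comparable."""
    a_start, a_end = a.interval()
    b_start, b_end = b.interval()
    if a_end <= b_start:
        return True        # everything in a comes before everything in b
    if b_end <= a_start:
        return False
    return None            # overlapping intervals (e.g. T23 vs T23:20)

print(less_than(ReducedTime(22), ReducedTime(23, 20)))   # True:  T22 < T23:20
print(less_than(ReducedTime(22, 45), ReducedTime(23)))   # True:  T22:45 < T23
print(less_than(ReducedTime(23), ReducedTime(23, 20)))   # None:  not strictly comparable
```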

Now the question for C_TIME.range, and this is where it gets weird:

Besides noting that reduced-precision time expressions represent an interval or range, when those reduced-precision time expressions are used as limits of an openEHR Interval, the result is the interval defined by the beginning of the lower limit and the end of the upper limit (this is my interpretation).

For instance, T11 represents the whole 11th hour of the day, from start to end, and T22 represents the whole 22nd hour of the day, from start to end; then T11..T22 represents all the hours, minutes and seconds from the start of hour 11 to the end of hour 22 (yes, the end, not the start!). Graphically:

T11      = 11:00:00 [====) 12:00:00
T22      =                                            22:00:00 [====) 23:00:00
T11..T22 = 11:00:00 [===============================================) 23:00:00

So something that might be counterintuitive in this notation is that T22:30 would be contained in the T11..T22 interval, even though it is not strictly comparable to T22. In this case, since we are checking whether a time expression is contained in an interval, this works because we are not really checking whether one expression is less than the other: we are using the contains operator instead of the < operator.

Then we can say T22:30 is contained in T11..T22 (an interval), or even say T22:30 is contained in T22 (since this is actually also an interval), but we can't say whether T22:30 < T22. It is important to understand that we use two notations (Txx..Tyy and Tzz) that represent the same concept: an interval of time.
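Continuing the sketch above (again illustrative only, reusing ReducedTime and less_than from it), the range built from two reduced-precision limits runs from the start of the lower limit to the end of the upper limit, and membership in it is a containment test rather than a < comparison:

```python
# Continuation of the sketch above (illustrative only): an openEHR-style
# range whose limits are reduced-precision times covers from the start of
# the lower limit to the END of the upper limit.
def range_interval(lower: ReducedTime, upper: ReducedTime):
    lo_start, _ = lower.interval()
    _, hi_end = upper.interval()
    return lo_start, hi_end                     # half-open [lo_start, hi_end)

def contains(lower: ReducedTime, upper: ReducedTime, t: ReducedTime) -> bool:
    lo, hi = range_interval(lower, upper)
    t_start, t_end = t.interval()
    return lo <= t_start and t_end <= hi        # the whole of t lies inside the range

# T11..T22 spans 11:00:00 up to 23:00:00 (the end of hour 22)
print(range_interval(ReducedTime(11), ReducedTime(22)))                  # (39600, 82800)
print(contains(ReducedTime(11), ReducedTime(22), ReducedTime(22, 30)))   # True
print(less_than(ReducedTime(22, 30), ReducedTime(22)))                   # None: not comparable
```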

Again, all that is based on what the time expressions represent, not on the expression syntax itself.

I really don't know if I'm being stupid and just can't see how things work, or if I really discovered something worth discussing (or maybe a mix of both!).

Until reading your post I considered T22 to be T22:00 (similar to how date libraries set missing parts to 0).

As you explained it can get complicated and confusing if T22 is treated as [T22:00…T23:00).

Can't we define the meaning of T22 as T22:00, and if somebody wants to use it as all the minutes in hour 22, they can be explicit and declare it as an interval [T22:00..T23:00)?
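For what it's worth, a tiny sketch of the two readings being contrasted here (purely illustrative, plain seconds-of-day arithmetic): treating T22 as the point 22:00:00 versus treating it as the whole hour gives different answers to the same membership question:

```python
# Illustrative only: two possible readings of T22.
point_reading = 22 * 3600                 # reading 1: missing parts default to 0 -> the point 22:00:00
interval_reading = (22 * 3600, 23 * 3600) # reading 2: the whole hour -> [22:00:00, 23:00:00)

t_22_30 = 22 * 3600 + 30 * 60             # 22:30:00 as seconds-of-day

print(t_22_30 == point_reading)                               # False: 22:30 is not the point T22:00
print(interval_reading[0] <= t_22_30 < interval_reading[1])   # True:  22:30 is inside the hour 22
```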


I don't think you're being stupid :slight_smile: I think your point re comparing different precisions is valid and worth discussing. There are a bazillion docs and forum posts repeating "ISO 8601 is lexicographically sortable", but it's not easy to find the details for the case of comparing two values with different precisions. There is one document we can benefit from, though: this profile of ISO 8601 from W3C, which says:

Different standards may need different levels of granularity in the date and time, so this profile defines six levels. Standards that reference this profile should specify one or more of these granularities. If a given standard allows more than one granularity, it should specify the meaning of the dates and times with reduced precision, for example, the result of comparing two dates with different precisions.

Nice. I interpret this as comparison between different precisions being undefined according to ISO 8601, based on what W3C is telling us. Relying on W3C's implied interpretation, if we replace "a given standard" in the last sentence with openEHR and add a little precision (no pun intended):

If openEHR allows more than one granularity, it should specify the meaning of the dates and times with reduced precision in relevant contexts, for example, the result of comparing two dates with different precisions.

This is similar to the clarification I suggested, based on your comments re using dates prior to October 15, 1582, in the other thread.

Alternatively, we can do what W3C does, and define a profile with some granularities, plus clarifications as suggested above.


Hi Borut and Seref!

The issue is, what I understand from ISO 8601 is that lower-precision time/date expressions represent different things. So it's not that T22 is treated as [T22:00:00..T23:00:00); it's that both actually represent the same thing/concept in the real world: the hour 22.

There is also an issue of notation. To refer to the hour 22, both notations above could be used: the plain time expression T22, or the interval notation with full time expressions, [T22:00:00..T23:00:00). Now, if you want to represent one hour that doesn't start at minute and second 0, there is no way around the interval notation, for instance [22:35:06..23:35:06), which is still one hour but starts at a point in time that is not the start of a minute or second. So the interval notation is more consistent and doesn't have any hidden semantics, while the plain time expression with reduced precision does have some hidden semantics.
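A quick illustration of that last point, using Python's standard datetime just for the arithmetic (the chosen date is arbitrary): the interval below is exactly one hour long even though it cannot be written as any single reduced-precision time expression:

```python
from datetime import datetime, timedelta

# Illustrative only: an hour that does not start on a minute/second boundary
# can only be written with the explicit interval notation.
start = datetime(2022, 1, 1, 22, 35, 6)   # arbitrary date, only the time of day matters here
end   = datetime(2022, 1, 1, 23, 35, 6)
print(end - start == timedelta(hours=1))  # True: [22:35:06..23:35:06) is exactly one hour long
```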

This is gold! Thanks @Seref, I was going mad about this, believing I was over-complicating things (as always).

That is also related to our exchange about what are representation vs. computation items and the separation of those. So I like the profile definition idea, and I would really like us to define the semantics of:

a. what a lower-precision expression represents in the real world,
b. what is strictly comparable when reduced-precision expressions come into play,
c. how to compare expressions when they are comparable (whether we compare the syntactic expressions, transform them to a numeric time representation, or use interval/range logic),
d. how to use the semantics of containment/inclusion when comparison is not available and we need to deal with interval/range logic because of reduced-precision expressions.

I need to read through this thread carefully, but a thought outside the box… if we think in a fuzzy logic way, a query engine could return two categories of results:

  • results with certainty 100% (true results)
  • results that might be true, according to the possibility of time-points overlapping with intervals as Pablo has described.

This might be achieved by having a flag in AQL queries, something like FUZZY_MATCH_ON, which would be off by default. When it was set, the results would include possible matches as well.
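As a sketch of how that flag might behave (the classify_match helper and the interval encoding are my own assumptions, not an AQL feature), a query engine could classify each candidate as a certain match, a possible match, or no match, and only return the possible ones when the hypothetical FUZZY_MATCH_ON flag is set:

```python
# Sketch only (hypothetical helper, not an AQL feature): classify a stored
# reduced-precision time against a query time, both given as half-open
# (start, end) intervals in seconds-of-day.
def hour(h):
    """Thh as the half-open interval [h:00:00, h+1:00:00)."""
    return h * 3600, (h + 1) * 3600

def minute(h, m):
    """Thh:mm as the half-open interval of that whole minute."""
    return h * 3600 + m * 60, h * 3600 + (m + 1) * 60

def classify_match(stored, query):
    s_start, s_end = stored
    q_start, q_end = query
    if q_start <= s_start and s_end <= q_end:
        return "certain"    # stored interval lies entirely inside the query interval
    if s_end <= q_start or q_end <= s_start:
        return "no"         # the intervals cannot overlap at all
    return "possible"       # partial overlap: might match, might not

print(classify_match(minute(22, 30), hour(22)))   # certain:  a stored T22:30 matches a query for T22
print(classify_match(hour(22), minute(22, 30)))   # possible: a stored T22 might or might not be T22:30
print(classify_match(hour(21), hour(22)))         # no
```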

There are well-known algorithms for doing fuzzy matching on text (remember SOUNDEX?), and more modern ones in today's text-oriented DBs, not to mention ML tricks. I don't know of any standard approach for temporal data, though.

I didn't comment about querying date/times with reduced precision because it's a whole new beast. The RM review alone gives us enough to deal with :slight_smile:

I imagine the query syntax having a set of operations that work with the logic I described above: reduced-precision date/time expressions are really intervals, not points in time, so interval logic and an "includes/contains" operator might be handy for filtering date/time data in a query WHERE expression. We also need to take into account the magnitude_status and accuracy fields that DV_DATE, DV_TIME and DV_DATE_TIME inherit. If there is data in the database that is, for instance, < 2022, and the query has WHERE date = 2022, should that data match? (Note the DV_DATE value matches, but it also has magnitude_status <.)
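As one possible (hypothetical, not spec-defined) interpretation of that last question, here is a small sketch of how magnitude_status could feed a three-valued answer to WHERE date = 2022:

```python
# Sketch only: one possible way magnitude_status could affect matching a stored
# DV_DATE-like year against "WHERE date = query_year". Three-valued result,
# since a value recorded as "< 2022" is known NOT to be 2022, while "~ 2022"
# only might be. This is an interpretation, not what the spec defines.
def matches_equal(stored_year: int, status: str, query_year: int) -> str:
    if status in ("=", ""):
        return "certain" if stored_year == query_year else "no"
    if status == "<":                       # real value is strictly before stored_year
        return "possible" if query_year < stored_year else "no"
    if status == ">":                       # real value is strictly after stored_year
        return "possible" if query_year > stored_year else "no"
    if status == "<=":
        return "possible" if query_year <= stored_year else "no"
    if status == ">=":
        return "possible" if query_year >= stored_year else "no"
    if status == "~":                       # approximate value
        return "possible"
    raise ValueError(f"unknown magnitude_status: {status}")

print(matches_equal(2022, "<", 2022))   # no: the real date is strictly before 2022
print(matches_equal(2022, "<", 2021))   # possible: it could be 2021
print(matches_equal(2022, "~", 2022))   # possible: approximately 2022
print(matches_equal(2022, "=", 2022))   # certain
```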

I have more questions than answers in this area, and might need a couple of weeks of detailed analysis, use case exploration, etc. before having some kind of useful idea about querying. But this is a mandatory task IMO, because without it we would have a weak or inconsistent way of querying when all those elements come into play: partial dates, dates with magnitude_status and dates with accuracy.

It could be done that way, but that forces the query author to think about the problem of fuzzy matching, which they probably won't do, or they won't know where in the model it manifests. A query engine that knows where the fuzzy comparisons are hiding could do the checking, though.

I would argue that it's difficult to separate comparison operators from the semantics of the data operands. Having fuzzy data to filter requires the query author to consider it; if it's not considered, the query results might not be what they expect.

But again, I haven't done any proper analysis in that area.