I agree with Weston that ordering isn't in the scope for the Arrow format spec (*). For example, implementations are free to define UTF8 comparisons and ordering as they wish (some may want to invest in the complexity of the official Unicode collation algorithm, others may be content with a simple codepoint-wise lexicographic comparison). It doesn't prevent them from exchanging UTF8 data unambiguously using Arrow.

(*) It may be in the scope for a hypothetical Compute IR spec, however.

Regards

Antoine.


On 14/09/2021 at 07:16, QP Hou wrote:
Good point Weston. My proposal was written with the impression that
Arrow does want to define semantics for some of these temporal types,
based on the existing comments in the Schema.fbs file.

For example, here is a quote taken from the comments for the Time type:

/// This definition doesn't allow for leap seconds. Time values from
/// measurements with leap seconds will need to be corrected when ingesting
/// into Arrow (for example by replacing the value 86400 with 86399).

Here is another quote for the Date type:

/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
/// leap seconds), where the values are evenly divisible by 86400000

For the interval type, we have:

// A "calendar" interval which models types that don't necessarily
// have a precise duration without the context of a base timestamp (e.g.
// days can differ in length during day light savings time transitions).

I think pushing the responsibility to define these semantics to the
data producer side is also a perfectly fine design with its own
trade-offs. It would make data exchange between two different systems
a little bit harder because consumers need to be aware of the
semantics defined by the producer. On the other hand, it does make the
producer implementation easier. It also makes data exchange within the
same system more efficient if that system's temporal type semantics
differ from what's defined in Arrow's spec.

Either way, I think it would be good if we can be consistent about our
temporal type semantics in the spec. If we are making the claim that
leap seconds should not be taken into account for the Time, Timestamp
and Date types, then it seems natural to make this claim for the
Interval type as well. Alternatively, we could update the spec to make
all temporal types leap-second agnostic.

On Mon, Sep 13, 2021 at 12:03 PM Weston Pace <weston.p...@gmail.com> wrote:

One could define a sort order based on 30-day months, 365-day years, and
24-hour days.  It would be consistent but can lead to some surprising
results.  It appears that this is what Postgres does, as I got the
following ordering for a set of intervals:

359 days, 12 months, 360 days, 1 year, 365 days, 366 days
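
As a rough illustration of that kind of fixed-length ordering, here is a minimal Python sketch. It is not Postgres's or Arrow's actual implementation. The function name interval_sort_key and the (months, days, nanos) tuple layout are assumptions made for the example, and "1 year" is written as 12 months, which makes it collapse to 360 days rather than 365.

```python
# Hypothetical sketch: order intervals by collapsing them to nanoseconds
# with fixed unit lengths (30-day months, 24-hour days). No calendar,
# DST, or leap-second awareness is involved.
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000  # fixed 24-hour days
DAYS_PER_MONTH = 30                            # fixed 30-day months


def interval_sort_key(months, days, nanos=0):
    """Collapse an interval into a single nanosecond count for comparison."""
    return (months * DAYS_PER_MONTH + days) * NANOS_PER_DAY + nanos


# (label, (months, days)) pairs; "1 year" is stored as 12 months (assumption).
intervals = [
    ("366 days", (0, 366)),
    ("1 year", (12, 0)),
    ("359 days", (0, 359)),
    ("12 months", (12, 0)),
    ("365 days", (0, 365)),
    ("360 days", (0, 360)),
]

ordered = sorted(intervals, key=lambda item: interval_sort_key(*item[1]))
for label, _ in ordered:
    print(label)
```

Under these assumptions "12 months", "1 year" and "360 days" all compare equal, so their relative order depends only on the sort's tie-breaking, while "365 days" sorts strictly after "1 year".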

On the other hand, Joda time forbids comparison of periods (their
version of what we call an interval) and offers three ways to convert
to a duration.  There is toDurationFrom(instant),
toDurationTo(instant) which give durations from specific calendar
ranges and then there is toStandardDuration() which converts to a
duration based on 24 hour days.  However, toStandardDuration will
still fail if the period has >0 months or years (presumably because
months and years are too inconsistent).
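
The stricter behaviour described here can be sketched in Python as follows. This mimics the described idea only, not Joda's API: the function name to_standard_duration_nanos and the (months, days, nanos) arguments are hypothetical, and the sketch rejects any non-zero month count rather than only positive ones.

```python
# Hypothetical sketch: convert an interval to a fixed duration only when
# it has no month component, since months have no fixed length.
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000  # assumes 24-hour days


def to_standard_duration_nanos(months, days, nanos):
    """Return the interval as nanoseconds, refusing month-bearing intervals."""
    if months != 0:
        raise ValueError("cannot convert an interval with months to a duration")
    return days * NANOS_PER_DAY + nanos


print(to_standard_duration_nanos(0, 2, 500))  # 2 days plus 500 ns, in nanoseconds
try:
    to_standard_duration_nanos(1, 0, 0)
except ValueError as exc:
    print("rejected:", exc)
```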

I'm not sure though that this is something that Arrow needs to define.
We aren't specifying any invalid ranges of values.  I don't foresee
any interoperability concerns.  A system that treated intervals as
comparable (and didn't factor in DST, leap years, etc.) will read and
write intervals the same way as a system that considers intervals
incomparable.

This question seems to fall into the "compute" space inhabited by
topics like "is 'false && null' a false value or a null value" and
"should addition overflow or throw an exception".

On Mon, Sep 13, 2021 at 6:23 AM QP Hou <houqp....@gmail.com> wrote:

On Mon, Sep 13, 2021 at 6:18 AM Antoine Pitrou <anto...@python.org> wrote:
The Duration type is defined with a TimeUnit.  You are probably thinking
about the Interval type.


Oops, my bad, yes, it should be Interval type not Duration.

Ok.  How about daylight savings? I suppose they are taken into account
as well.


Yes, the day component in both DAY_TIME and MONTH_DAY_NANO takes
daylight savings into account.
