I opened this patch over 2 months ago to add some additional metadata
for intervals:

https://github.com/apache/arrow/pull/920

Java supports a two-component DAY_TIME interval type as a combo of
days and milliseconds:

https://github.com/apache/arrow/blob/402baa4ec391b61dd37c770ae7978d51b9b550fa/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L106

I propose that we change the interval representation to be a number of
elapsed units of time from a particular point in time. The unit
choices would be the same as our units for timestamps, so an interval
can be viewed as a delta between two timestamps of some resolution
(seconds through nanoseconds) [1].
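
To make the "delta between two timestamps" idea concrete, here is a
minimal sketch in plain Python. The helper names (to_epoch, interval)
and the UNITS_PER_SECOND table are hypothetical and only for
illustration; they are not part of Arrow or of the patch above. The
sketch assumes timestamps are stored as integer counts of the chosen
unit since the UNIX epoch.

```python
from datetime import datetime, timezone

# Hypothetical table of the timestamp units discussed above.
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch(ts: datetime, unit: str) -> int:
    """Return ts as an integer count of `unit` since the UNIX epoch."""
    delta = ts - EPOCH
    # Work in integer microseconds to avoid floating-point rounding.
    micros = (delta.days * 86_400 + delta.seconds) * 10**6 + delta.microseconds
    return micros * UNITS_PER_SECOND[unit] // 10**6


def interval(start: datetime, end: datetime, unit: str) -> int:
    """Interval as a signed integer delta between two timestamps."""
    return to_epoch(end, unit) - to_epoch(start, unit)


start = datetime(2000, 1, 1, tzinfo=timezone.utc)
end = datetime(2000, 1, 2, tzinfo=timezone.utc)

for unit in ("s", "ms", "us", "ns"):
    print(unit, interval(start, end, unit))
```

The same one-day span is a single signed integer at every resolution;
only the unit recorded in the metadata changes.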

As context, a number of systems I have worked with deal in absolute
time deltas. In pandas, for example, the difference of timestamps
(datetime64 values) is a timedelta:

In [1]: import pandas as pd

In [2]: dr1 = pd.date_range('1/1/2000', periods=5)

In [3]: dr2 = pd.date_range('1/2/2000', periods=5)

In [4]: dr1 - dr2
Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 days',
'-1 days'], dtype='timedelta64[ns]', freq=None)

In [5]: (dr1 - dr2).values
Out[5]:
array([-86400000000000, -86400000000000, -86400000000000, -86400000000000,
       -86400000000000], dtype='timedelta64[ns]')

We need to be able to represent this data coherently (up to nanosecond
resolution) with the Arrow metadata, and we will also at some point
need to perform analytics directly on this data type.

An alternative proposal to changing the DAY_TIME interval
representation is to add another kind of interval type, so instead of
only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of this,
of course, is the extra implementation complexity. DAY_TIME with the
current Java representation also seems to me to be a subset of what
you can represent with TIMEDELTA.
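
As a rough illustration of the subset claim, the sketch below collapses
a (days, milliseconds) pair into a single millisecond count and splits
it back. The function names are hypothetical, and the sketch assumes
that a day is always exactly 24 hours; I have not checked whether the
Java implementation makes that assumption.

```python
from typing import Tuple

# Assumption: every day is exactly 24 hours long.
MILLIS_PER_DAY = 86_400_000


def day_time_to_millis(days: int, millis: int) -> int:
    """Collapse a (days, milliseconds) pair into one millisecond count."""
    return days * MILLIS_PER_DAY + millis


def millis_to_day_time(total_millis: int) -> Tuple[int, int]:
    """Split a millisecond count into (days, milliseconds).

    Uses floor division, so the milliseconds part is always in
    the range [0, MILLIS_PER_DAY), even for negative deltas.
    """
    return divmod(total_millis, MILLIS_PER_DAY)


total = day_time_to_millis(1, 500)
print(total)
print(millis_to_day_time(total))
```

Note that the mapping is not one-to-one in the other direction: under
the 24-hour assumption, (0, 86400005) and (1, 5) collapse to the same
delta, so a TIMEDELTA value does not remember how it was originally
split into days and milliseconds.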

It would be great to make a decision about this so we can get this
metadata finalized in the 0.8.0 release.

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L135
