+1 on this one.

My reasoning is that this makes timestamp/interval calculations faster, i.e.,
"timestamp + interval < timestamp" should be faster without having to deal
with two components in the interval. That said, I am not quite sure about the
rationale behind the two-component representation, which seems to be what is
used in Spark:

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

I am interested in hearing the reasoning behind the two-component
representation.
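
As a rough sketch of the comparison I have in mind: the names below
(DayTimeInterval, before_single, before_day_time) are made up for
illustration and are not Arrow or Spark APIs. It also assumes a day is
always exactly 24 hours, which may not hold for calendar-aware intervals.

```python
from dataclasses import dataclass

MILLIS_PER_DAY = 86_400_000


@dataclass(frozen=True)
class DayTimeInterval:
    """Hypothetical two-component interval: whole days plus milliseconds."""
    days: int
    millis: int

    def to_millis(self) -> int:
        # Assumption: every day is exactly 24 hours (no DST, no leap seconds).
        return self.days * MILLIS_PER_DAY + self.millis


def before_single(ts_ms: int, delta_ms: int, limit_ms: int) -> bool:
    # Single-component interval: one integer add and one compare.
    return ts_ms + delta_ms < limit_ms


def before_day_time(ts_ms: int, iv: DayTimeInterval, limit_ms: int) -> bool:
    # Two-component interval: it has to be collapsed to one unit first.
    return ts_ms + iv.to_millis() < limit_ms


ts = 946_684_800_000              # 2000-01-01T00:00:00Z in epoch milliseconds
limit = ts + 2 * MILLIS_PER_DAY   # two days later
iv = DayTimeInterval(days=1, millis=500)

print(before_single(ts, iv.to_millis(), limit))
print(before_day_time(ts, iv, limit))
```

Both checks give the same answer; the two-component version just pays for
an extra multiply and add per value before the comparison.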

On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> I opened this patch over 2 months ago to add some additional metadata
> for intervals:
>
> https://github.com/apache/arrow/pull/920
>
> Java supports a two-component DAY_TIME interval type as a combo of
> days and milliseconds:
>
> https://github.com/apache/arrow/blob/402baa4ec391b61dd37c770ae7978d51b9b550fa/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L106
>
> I propose that we change the interval representation to be a number of
> elapsed units of time from a particular point in time. The unit
> choices would be the same as our units for timestamps, so an interval
> can be viewed as a delta between two timestamps of some resolution
> (seconds through nanoseconds) [1].
>
> As context, a number of systems I have worked with deal in absolute
> time deltas. In pandas, for example, the difference of timestamps
> (datetime64 values) is a timedelta:
>
> In [1]: import pandas as pd
>
> In [2]: dr1 = pd.date_range('1/1/2000', periods=5)
>
> In [3]: dr2 = pd.date_range('1/2/2000', periods=5)
>
> In [4]: dr1 - dr2
> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 days',
> '-1 days'], dtype='timedelta64[ns]', freq=None)
>
> In [5]: (dr1 - dr2).values
> Out[5]:
> array([-86400000000000, -86400000000000, -86400000000000, -86400000000000,
>        -86400000000000], dtype='timedelta64[ns]')
>
> We need to be able to represent this data coherently (up to nanosecond
> resolution) with the Arrow metadata, and we will also at some point
> need to perform analytics directly on this data type.
>
> An alternative proposal to changing the DAY_TIME interval
> representation is to add another kind of interval type, so instead of
> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of this,
> of course, is the extra implementation complexity. DAY_TIME with the
> current Java representation also seems to me to be a subset of what
> you can represent with TIMEDELTA.
>
> It would be great to make a decision about this so we can get this
> metadata finalized in the 0.8.0 release.
>
> Thanks
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L135
>
