I'm all for moving interval to the new definition. I think we should avoid
introducing a timedelta type until it is really important. We need several
users demanding a type before we should implement it. Otherwise, we have
huge amounts of type bloat (which means nothing will fully implement the
spec and be able to interoperate).

On Sat, Nov 4, 2017 at 3:46 PM, Julian Hyde <jh...@apache.org> wrote:

> As I understand it, the proposal is to have both an interval data type[1]
> and a timedelta type[2].  The interval is compatible with the SQL standard
> (but not Postgres) and can be implemented with a single numeric value
> representing a particular time unit (year, month, day, hour, minute,
> second, and possibly fractional seconds); timedelta is an array of numeric
> values, one for each of a set of time units.
>
> I think we should have both, and operators to convert between them.
> Interval is certainly efficient, and is what some applications need, but
> some applications need timedelta.
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/ARROW-352
>
> [2] https://issues.apache.org/jira/browse/ARROW-835
>
> > On Nov 4, 2017, at 1:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > It seems like we don't have enough input on this topic to make a
> > decision right now. I placed the JIRA ARROW-352 in the 0.9.0
> > milestone, but we really should try to get this done soon so that
> > downstream users are not blocked on using Arrow to send around
> > interval data.
> >
> > - Wes
> >
> > On Fri, Oct 20, 2017 at 12:34 AM, Li Jin <ice.xell...@gmail.com> wrote:
> >> +1 on this one.
> >>
> >> My reason is that this makes timestamp/interval calculation faster, i.e.,
> >> "timestamp + interval < timestamp" should be faster without dealing with
> >> two components in the interval. However, I am not quite sure about the
> >> rationale behind the two-component representation, which seems to be what
> >> is used in Spark:
> >>
> >> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
> >>
> >> I am interested in hearing the reasoning behind the two-component representation.
> >>
> >> On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >>> I opened this patch over 2 months ago to add some additional metadata
> >>> for intervals:
> >>>
> >>> https://github.com/apache/arrow/pull/920
> >>>
> >>> Java supports a two-component DAY_TIME interval type as a combo of
> >>> days and milliseconds:
> >>>
> >>> https://github.com/apache/arrow/blob/402baa4ec391b61dd37c770ae7978d51b9b550fa/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L106
> >>>
> >>> I propose that we change the interval representation to be a number of
> >>> elapsed units of time from a particular point in time. The unit
> >>> choices would be the same as our units for timestamps, so an interval
> >>> can be viewed as a delta between two timestamps of some resolution
> >>> (seconds through nanoseconds) [1].
> >>>
> >>> As context, a number of systems I have worked with deal in absolute
> >>> time deltas. In pandas, for example, the difference of timestamps
> >>> (datetime64 values) is a timedelta:
> >>>
> >>> In [1]: import pandas as pd
> >>>
> >>> In [2]: dr1 = pd.date_range('1/1/2000', periods=5)
> >>>
> >>> In [3]: dr2 = pd.date_range('1/2/2000', periods=5)
> >>>
> >>> In [4]: dr1 - dr2
> >>> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 days',
> >>> '-1 days'], dtype='timedelta64[ns]', freq=None)
> >>>
> >>> In [5]: (dr1 - dr2).values
> >>> Out[5]:
> >>> array([-86400000000000, -86400000000000, -86400000000000,
> >>>        -86400000000000, -86400000000000], dtype='timedelta64[ns]')
> >>>
> >>> We need to be able to represent this data coherently (up to nanosecond
> >>> resolution) with the Arrow metadata, and we will also at some point
> >>> need to perform analytics directly on this data type.
> >>>
> >>> An alternative proposal to changing the DAY_TIME interval
> >>> representation is to add another kind of interval type, so instead of
> >>> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of this,
> >>> of course, is the extra implementation complexity. DAY_TIME with the
> >>> current Java representation also seems to me to be a subset of what
> >>> you can represent with TIMEDELTA.
> >>>
> >>> It would be great to make a decision about this so we can get this
> >>> metadata finalized in the 0.8.0 release.
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L135
> >>>
>
>
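
To make the comparison in the thread concrete, here is a minimal sketch of
how a two-component (days, milliseconds) value could be collapsed into a
single count of nanoseconds. The names DayTime, day_time_to_nanos and
add_nanos are hypothetical and are not part of Arrow, Spark, or pandas. The
sketch also assumes that every day is exactly 86,400,000 milliseconds long,
which ignores calendar effects such as daylight saving time; that assumption
may be one reason a system keeps the two components separate.

```python
from dataclasses import dataclass

MILLIS_PER_DAY = 86_400_000   # assumption: fixed-length 24-hour days
NANOS_PER_MILLI = 1_000_000


@dataclass(frozen=True)
class DayTime:
    """Hypothetical stand-in for a two-component (days, milliseconds) interval."""
    days: int
    millis: int


def day_time_to_nanos(iv: DayTime) -> int:
    """Collapse (days, millis) into one signed count of nanoseconds."""
    return (iv.days * MILLIS_PER_DAY + iv.millis) * NANOS_PER_MILLI


def add_nanos(timestamp_ns: int, delta_ns: int) -> int:
    """With a single-value delta, timestamp + interval is one integer addition."""
    return timestamp_ns + delta_ns


minus_one_day = DayTime(days=-1, millis=0)
delta_ns = day_time_to_nanos(minus_one_day)
print(delta_ns)  # -86400000000000, the same value as in the pandas output above

# A comparison such as "timestamp + interval < timestamp" needs no
# per-component handling once the interval is a single integer.
ts_ns = 946_684_800_000_000_000  # 2000-01-01T00:00:00 UTC in nanoseconds
print(add_nanos(ts_ns, delta_ns) < ts_ns)  # True
```

Under the fixed-length-day assumption, every (days, milliseconds) pair maps
to exactly one nanosecond count, which is the sense in which DAY_TIME looks
like a subset of TIMEDELTA. The reverse mapping loses sub-millisecond
precision.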
