Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

Wes McKinney Wed, 08 Nov 2017 14:48:49 -0800

Makes sense. The key question is whether the data is represented as a
single 64-bit integer or as effectively a C struct


struct {
  int32_t days;
  int32_t milliseconds;
}

The struct representation cannot accommodate higher resolution units
like microseconds and nanoseconds. From my perspective, if we use the
64-bit integer representation (which is for the millis case days *
86400000 + milliseconds) then whether we call it Interval or Timedelta
is sort of immaterial to the immediate use cases I have, where users
have a timedelta64[UNIT] type (which results from any arithmetic
between timestamp values)

On Wed, Nov 8, 2017 at 5:38 PM, Julian Hyde <jh...@apache.org> wrote:
> I don't know many examples of interval being used in the real world.
> But here's the kind of thing: the policy is that an offer is open for
> 60 hours, so if the offer is made to a particular customer at 12:34pm
> on Sunday, you want to compute that it ends at 12:34am on Wednesday.
> The interval "60 hours" is really just syntactic sugar for 216,000
> seconds. You could write it as interval '60' hour, or interval '2:12'
> day to hour, or interval '129600' second, but the values and
> underlying representation are the same. (Interval '1:36' day to hour
> is not a valid value, because 36 is out of the valid hour range 0..23,
> but you could construct the value using interval '1' day + 36 *
> interval '1' hour.)
>
> My understanding is that a timedelta (2 day 12 hours) is different
> from timedelta (60 hours) and timedelta (1 day 36 hours), but all are
> valid timedelta values.
>
> For my offer expiration example the SQL-style interval is sufficient,
> because there is no material difference between 2:12 and 1:36.
>
> But I am sure you can provide use cases where timedelta is necessary.
>
> I don't claim one is better than the other, and I'm not volunteering
> to implement either of them, so I don't have a say which you should do
> first. But please keep the names "interval" and "timedelta", so the
> various communities aren't confused about semantics.
>
>
> On Wed, Nov 8, 2017 at 2:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> Pleading ignorance on use of the SQL interval type, my prior would be
>> that many algorithms would first convert the interval components into
>> an absolute timedelta. Is that not the case?
>>
>> My preference right now would be to have a single Interval type, where
>> the DAY_TIME type actually contains an absolute delta based on the
>> indicated unit. It is true that the divmod operation to decompose into
>> number of days and intraday units (milliseconds, nanoseconds, etc.) is
>> not the cheapest, but I don't know the use cases for the type well
>> enough to judge.
>>
>> On Wed, Nov 8, 2017 at 5:10 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> I'm all for moving interval to the new definition. I think we should avoid
>>> introducing a timedelta type until it is really important. We need several
>>> users demanding a type before we should implement it. Otherwise, we have
>>> huge amounts of type bloat (which means nothing will fully implement the
>>> spec and be able to interoperate).
>>>
>>> On Sat, Nov 4, 2017 at 3:46 PM, Julian Hyde <jh...@apache.org> wrote:
>>>
>>>> As I understand it, the proposal is to have both an interval data type[1]
>>>> and a timedelta type[2].  The interval is compatible with the SQL standard
>>>> (but not Postgres) and can be implemented with a single numeric value
>>>> representing a particular time unit (year, month, day, hour, minute,
>>>> second, and possibly fractional seconds); timedelta is an array of numeric
>>>> values, one for a set of time units.
>>>>
>>>> I think we should have both, and operators to convert between them.
>>>> Interval is certainly efficient, and is what some applications need, but
>>>> some applications need timedelta.
>>>>
>>>> Julian
>>>>
>>>> [1] https://issues.apache.org/jira/browse/ARROW-352 <
>>>> https://issues.apache.org/jira/browse/ARROW-352>
>>>>
>>>> [2] https://issues.apache.org/jira/browse/ARROW-835 <
>>>> https://issues.apache.org/jira/browse/ARROW-835>
>>>>
>>>> > On Nov 4, 2017, at 1:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>> >
>>>> > It seems like we don't have enough input on this topic to make a
>>>> > decision right now. I placed the JIRA ARROW-352 in the 0.9.0
>>>> > milestone, but we really should try to get this done soon so that
>>>> > downstream users are not blocked on using Arrow to send around
>>>> > interval data.
>>>> >
>>>> > - Wes
>>>> >
>>>> > On Fri, Oct 20, 2017 at 12:34 AM, Li Jin <ice.xell...@gmail.com> wrote:
>>>> >> +1 on this one.
>>>> >>
>>>> >> My reason is this makes timestamp/interval calculation faster, i.e,
>>>> >> "timestamp + interval < timestamp" should be faster without dealing with
>>>> >> two component in interval. Although I am not quite sure about the
>>>> rational
>>>> >> behind the two component representation, which seems to be what is used
>>>> in
>>>> >> Spark:
>>>> >>
>>>> >> https://github.com/apache/spark/blob/master/common/
>>>> unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
>>>> >>
>>>> >> I am interested in hearing reasoning behind two component.
>>>> >>
>>>> >> On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney <wesmck...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >>> I opened this patch over 2 months ago to add some additional metadata
>>>> >>> for intervals:
>>>> >>>
>>>> >>> https://github.com/apache/arrow/pull/920
>>>> >>>
>>>> >>> Java supports a two-component DAY_TIME interval type as a combo of
>>>> >>> days and milliseconds:
>>>> >>>
>>>> >>> https://github.com/apache/arrow/blob/402baa4ec391b61dd37c770ae7978d
>>>> >>> 51b9b550fa/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L106
>>>> >>>
>>>> >>> I propose that we change the interval representation to be a number of
>>>> >>> elapsed units of time from a particular point in time. This unit
>>>> >>> choices would be the same as our unit for timestamps, so an interval
>>>> >>> can be viewed as a delta between two timestamps of some resolution
>>>> >>> (second through nanoseconds) [1].
>>>> >>>
>>>> >>> As context, a number of systems I have worked with deal in absolute
>>>> >>> time deltas. In pandas, for example, the difference of timestamps
>>>> >>> (datetime64 values) is a timedelta:
>>>> >>>
>>>> >>> In [1]: import pandas as pd
>>>> >>>
>>>> >>> In [2]: dr1 = pd.date_range('1/1/2000', periods=5)
>>>> >>>
>>>> >>> In [3]: dr2 = pd.date_range('1/2/2000', periods=5)
>>>> >>>
>>>> >>> In [4]: dr1 - dr2
>>>> >>> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 days',
>>>> >>> '-1 days'], dtype='timedelta64[ns]', freq=None)
>>>> >>>
>>>> >>> In [5]: (dr1 - dr2).values
>>>> >>> Out[5]:
>>>> >>> array([-86400000000000, -86400000000000, -86400000000000,
>>>> -86400000000000,
>>>> >>>       -86400000000000], dtype='timedelta64[ns]')
>>>> >>>
>>>> >>> We need to be able to represent this data coherently (up to nanosecond
>>>> >>> resolution) with the Arrow metadata, and we will also at some point
>>>> >>> need to perform analytics directly on this data type.
>>>> >>>
>>>> >>> An alternative proposal to changing the DAY_TIME interval
>>>> >>> representation is to add another kind of interval type, so instead of
>>>> >>> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of this,
>>>> >>> of course, is the extra implementation complexity. DAY_TIME with the
>>>> >>> current Java representation also seems to me to be a subset of what
>>>> >>> you can represent with TIMEDELTA.
>>>> >>>
>>>> >>> It would be great to make a decision about this so we can get this
>>>> >>> metadata finalized in the 0.8.0 release.
>>>> >>>
>>>> >>> Thanks
>>>> >>> Wes
>>>> >>>
>>>> >>> [1]: https://github.com/apache/arrow/blob/master/format/
>>>> Schema.fbs#L135
>>>> >>>
>>>>
>>>>

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

Reply via email to