Thanks all. Can we scope the work that would be required to implement the described change (making interval a 64-bit integer for the DAY_TIME variety, and adding unit metadata)? I suppose mainly Dremio would be moderately disrupted by this
On Wed, Nov 8, 2017 at 6:05 PM, Jacques Nadeau <jacq...@apache.org> wrote: > My analysis previously (if I recall) was basically (I think I did it on the > similar Parquet PR) was that no system truly supported N fields for all > operations (postgres was closest I believe). Some basic operations would > maintain them but they would quickly not behave differently (e.g. 24 hours > is the same as one day in basically all cases). > > On Wed, Nov 8, 2017 at 2:54 PM, Julian Hyde <jh...@apache.org> wrote: > >> I have argued before on this list, and still believe, that you should >> represent an interval as you would a number. If intervals are 64 bit >> signed, then sure, use the 64 bit integer representation; if you were >> to allow intervals with fixed precision and scale, then use the same >> representation as for decimal(p, s). >> >> My understanding of timedelta is that you can't use a 2-field struct. >> You need timedelta (3 hour 20 minutes 7 seconds) to be distinguishable >> from timedelta (2 hour 79 minutes 67 seconds), correct? If so you need >> N fields. >> >> Julian >> >> >> On Wed, Nov 8, 2017 at 2:47 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> > Makes sense. The key question is whether the data is represented as a >> > single 64-bit integer or as effectively a C struct >> > >> > struct { >> > int32_t days; >> > int32_t milliseconds; >> > } >> > >> > The struct representation cannot accommodate higher resolution units >> > like microseconds and nanoseconds. From my perspective, if we use the >> > 64-bit integer representation (which is for the millis case days * >> > 86400000 + milliseconds) then whether we call it Interval or Timedelta >> > is sort of immaterial to the immediate use cases I have, where users >> > have a timedelta64[UNIT] type (which results from any arithmetic >> > between timestamp values) >> > >> > On Wed, Nov 8, 2017 at 5:38 PM, Julian Hyde <jh...@apache.org> wrote: >> >> I don't know many examples of interval being used in the real world. >> >> But here's the kind of thing: the policy is that an offer is open for >> >> 60 hours, so if the offer is made to a particular customer at 12:34pm >> >> on Sunday, you want to compute that it ends at 12:34am on Wednesday. >> >> The interval "60 hours" is really just syntactic sugar for 216,000 >> >> seconds. You could write it as interval '60' hour, or interval '2:12' >> >> day to hour, or interval '129600' second, but the values and >> >> underlying representation are the same. (Interval '1:36' day to hour >> >> is not a valid value, because 36 is out of the valid hour range 0..23, >> >> but you could construct the value using interval '1' day + 36 * >> >> interval '1' hour.) >> >> >> >> My understanding is that a timedelta (2 day 12 hours) is different >> >> from timedelta (60 hours) and timedelta (1 day 36 hours), but all are >> >> valid timedelta values. >> >> >> >> For my offer expiration example the SQL-style interval is sufficient, >> >> because there is no material difference between 2:12 and 1:36. >> >> >> >> But I am sure you can provide use cases where timedelta is necessary. >> >> >> >> I don't claim one is better than the other, and I'm not volunteering >> >> to implement either of them, so I don't have a say which you should do >> >> first. But please keep the names "interval" and "timedelta", so the >> >> various communities aren't confused about semantics. >> >> >> >> >> >> On Wed, Nov 8, 2017 at 2:15 PM, Wes McKinney <wesmck...@gmail.com> >> wrote: >> >>> Pleading ignorance on use of the SQL interval type, my prior would be >> >>> that many algorithms would first convert the interval components into >> >>> an absolute timedelta. Is that not the case? >> >>> >> >>> My preference right now would be to have a single Interval type, where >> >>> the DAY_TIME type actually contains an absolute delta based on the >> >>> indicated unit. It is true that the divmod operation to decompose into >> >>> number of days and intraday units (milliseconds, nanoseconds, etc.) is >> >>> not the cheapest, but I don't know the use cases for the type well >> >>> enough to judge. >> >>> >> >>> On Wed, Nov 8, 2017 at 5:10 PM, Jacques Nadeau <jacq...@apache.org> >> wrote: >> >>>> I'm all for moving interval to the new definition. I think we should >> avoid >> >>>> introducing a timedelta type until it is really important. We need >> several >> >>>> users demanding a type before we should implement it. Otherwise, we >> have >> >>>> huge amounts of type bloat (which means nothing will fully implement >> the >> >>>> spec and be able to interoperate). >> >>>> >> >>>> On Sat, Nov 4, 2017 at 3:46 PM, Julian Hyde <jh...@apache.org> wrote: >> >>>> >> >>>>> As I understand it, the proposal is to have both an interval data >> type[1] >> >>>>> and a timedelta type[2]. The interval is compatible with the SQL >> standard >> >>>>> (but not Postgres) and can be implemented with a single numeric value >> >>>>> representing a particular time unit (year, month, day, hour, minute, >> >>>>> second, and possibly fractional seconds); timedelta is an array of >> numeric >> >>>>> values, one for a set of time units. >> >>>>> >> >>>>> I think we should have both, and operators to convert between them. >> >>>>> Interval is certainly efficient, and is what some applications need, >> but >> >>>>> some applications need timedelta. >> >>>>> >> >>>>> Julian >> >>>>> >> >>>>> [1] https://issues.apache.org/jira/browse/ARROW-352 < >> >>>>> https://issues.apache.org/jira/browse/ARROW-352> >> >>>>> >> >>>>> [2] https://issues.apache.org/jira/browse/ARROW-835 < >> >>>>> https://issues.apache.org/jira/browse/ARROW-835> >> >>>>> >> >>>>> > On Nov 4, 2017, at 1:26 PM, Wes McKinney <wesmck...@gmail.com> >> wrote: >> >>>>> > >> >>>>> > It seems like we don't have enough input on this topic to make a >> >>>>> > decision right now. I placed the JIRA ARROW-352 in the 0.9.0 >> >>>>> > milestone, but we really should try to get this done soon so that >> >>>>> > downstream users are not blocked on using Arrow to send around >> >>>>> > interval data. >> >>>>> > >> >>>>> > - Wes >> >>>>> > >> >>>>> > On Fri, Oct 20, 2017 at 12:34 AM, Li Jin <ice.xell...@gmail.com> >> wrote: >> >>>>> >> +1 on this one. >> >>>>> >> >> >>>>> >> My reason is this makes timestamp/interval calculation faster, >> i.e, >> >>>>> >> "timestamp + interval < timestamp" should be faster without >> dealing with >> >>>>> >> two component in interval. Although I am not quite sure about the >> >>>>> rational >> >>>>> >> behind the two component representation, which seems to be what >> is used >> >>>>> in >> >>>>> >> Spark: >> >>>>> >> >> >>>>> >> https://github.com/apache/spark/blob/master/common/ >> >>>>> unsafe/src/main/java/org/apache/spark/unsafe/types/ >> CalendarInterval.java >> >>>>> >> >> >>>>> >> I am interested in hearing reasoning behind two component. >> >>>>> >> >> >>>>> >> On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney < >> wesmck...@gmail.com> >> >>>>> wrote: >> >>>>> >> >> >>>>> >>> I opened this patch over 2 months ago to add some additional >> metadata >> >>>>> >>> for intervals: >> >>>>> >>> >> >>>>> >>> https://github.com/apache/arrow/pull/920 >> >>>>> >>> >> >>>>> >>> Java supports a two-component DAY_TIME interval type as a combo >> of >> >>>>> >>> days and milliseconds: >> >>>>> >>> >> >>>>> >>> https://github.com/apache/arrow/blob/ >> 402baa4ec391b61dd37c770ae7978d >> >>>>> >>> 51b9b550fa/java/vector/src/main/codegen/data/ >> ValueVectorTypes.tdd#L106 >> >>>>> >>> >> >>>>> >>> I propose that we change the interval representation to be a >> number of >> >>>>> >>> elapsed units of time from a particular point in time. This unit >> >>>>> >>> choices would be the same as our unit for timestamps, so an >> interval >> >>>>> >>> can be viewed as a delta between two timestamps of some >> resolution >> >>>>> >>> (second through nanoseconds) [1]. >> >>>>> >>> >> >>>>> >>> As context, a number of systems I have worked with deal in >> absolute >> >>>>> >>> time deltas. In pandas, for example, the difference of timestamps >> >>>>> >>> (datetime64 values) is a timedelta: >> >>>>> >>> >> >>>>> >>> In [1]: import pandas as pd >> >>>>> >>> >> >>>>> >>> In [2]: dr1 = pd.date_range('1/1/2000', periods=5) >> >>>>> >>> >> >>>>> >>> In [3]: dr2 = pd.date_range('1/2/2000', periods=5) >> >>>>> >>> >> >>>>> >>> In [4]: dr1 - dr2 >> >>>>> >>> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 >> days', >> >>>>> >>> '-1 days'], dtype='timedelta64[ns]', freq=None) >> >>>>> >>> >> >>>>> >>> In [5]: (dr1 - dr2).values >> >>>>> >>> Out[5]: >> >>>>> >>> array([-86400000000000, -86400000000000, -86400000000000, >> >>>>> -86400000000000, >> >>>>> >>> -86400000000000], dtype='timedelta64[ns]') >> >>>>> >>> >> >>>>> >>> We need to be able to represent this data coherently (up to >> nanosecond >> >>>>> >>> resolution) with the Arrow metadata, and we will also at some >> point >> >>>>> >>> need to perform analytics directly on this data type. >> >>>>> >>> >> >>>>> >>> An alternative proposal to changing the DAY_TIME interval >> >>>>> >>> representation is to add another kind of interval type, so >> instead of >> >>>>> >>> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of >> this, >> >>>>> >>> of course, is the extra implementation complexity. DAY_TIME with >> the >> >>>>> >>> current Java representation also seems to me to be a subset of >> what >> >>>>> >>> you can represent with TIMEDELTA. >> >>>>> >>> >> >>>>> >>> It would be great to make a decision about this so we can get >> this >> >>>>> >>> metadata finalized in the 0.8.0 release. >> >>>>> >>> >> >>>>> >>> Thanks >> >>>>> >>> Wes >> >>>>> >>> >> >>>>> >>> [1]: https://github.com/apache/arrow/blob/master/format/ >> >>>>> Schema.fbs#L135 >> >>>>> >>> >> >>>>> >> >>>>> >>