> > However it seems a little unfortunate that there is now way to represent a > "common" interval like "1 week and 1 hour" with native arrow types
I might have misunderstood,but at least in postgres, I thought this boils down to "0 months, 7 days, 3600 seconds". Since months is 0, this seems like it fits squarely in the existing interval type Days_Mills. I thought what can't be represented today is "1 Year 1 Hour". It seems like none of the proposals so far cover weeks as an explicit type? On Fri, Apr 2, 2021 at 2:42 PM Andrew Lamb <[email protected]> wrote: > I think it is plausible that we use Arrow structs to create a synthetic > interval type for DataFusion (I don't have a compelling usecase to store > the intervals themselves, or to expose them outside of DataFusion). > > However it seems a little unfortunate that there is now way to represent a > "common" interval like "1 week and 1 hour" with native arrow types > > > > On Fri, Apr 2, 2021 at 4:38 PM Micah Kornfield <[email protected]> > wrote: > >> The real usecase I have is "postgres compatibility" >> >> >> Yeah, I'm a little conflicted on this. A broader analysis might be >> necessary and I'd welcome others thoughts, but at what point should we >> mostly consider the type system closed? Should we be aiming for full >> parity with ANSI SQL/Postgres SQL or something else? >> >> >>> I have no known need for the actual postgres timestamp internal >>> representation. >> >> >> I suppose there is an edge case that the seconds range is larger for >> microseconds compared to nanoseconds with the simple representation. But >> that seems minor. >> >> On Fri, Apr 2, 2021 at 1:25 PM Andrew Lamb <[email protected]> wrote: >> >>> The real usecase I have is "postgres compatibility" - in the sense that >>> we can write SQL queries / expressions that use postgres interval type [1] >>> and corresponding expressions with the full postgres interval range. I have >>> no known need for the actual postgres timestamp internal representation. >>> >>> A more vague usecase I would like (but do not strictly need) is for >>> expressing intervals up to the same precision as IOx's underlying >>> timestamps (which are stored as nanosecond precision). >>> >>> [1] >>> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT >>> >>> "Internally interval values are stored as months, days, and seconds. >>> This is done because the number of days in a month varies, and a day can >>> have 23 or 25 hours if a daylight savings time adjustment is involved. The >>> months and days fields are integers while the seconds field can store >>> fractions. Because intervals are usually created from constant strings or >>> timestamp subtraction, this storage method works well in most cases, >>> but can cause unexpected results: >>> " >>> >>> On Fri, Apr 2, 2021 at 3:53 PM Micah Kornfield <[email protected]> >>> wrote: >>> >>>> Andrew is the use-case you have simply postgres compatibility or is it >>>> more >>>> extensive? >>>> >>>> One potential problem with combining Month and Day fields, is that the >>>> type >>>> no longer has a defined sort order (the existing Day-Millisecond type >>>> without assumptions, in particular because I don't think today there is >>>> an >>>> explicit constraint on the bounds for the millisecond component). >>>> >>>> -Micah >>>> >>>> >>>> >>>> On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <[email protected]> >>>> wrote: >>>> >>>> > >>>> > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : >>>> > > Thanks for the feedback. A couple of points here and some responses >>>> > below. >>>> > > >>>> > > * One other question is whether the Nanoseconds should actually be >>>> > > configurable (i.e. use milliseconds or microseconds). I would lean >>>> > towards >>>> > > no. >>>> > >>>> > Same for me. >>>> > >>>> > > * I'm also still not 100% convinced we need this as a first class >>>> type in >>>> > > arrow or if we should be looking more closely at the Struct (in the >>>> Arrow >>>> > > sense) based implementation. In the future where alternative >>>> encodings >>>> > are >>>> > > supported, this could allow for much smaller footprints for this >>>> type. >>>> > >>>> > Having a "packed" first class type allows for better locality when >>>> > accessing data. It doesn't sound very likely that you'd access only >>>> one >>>> > component of the interval. >>>> > >>>> > But I have no idea how important this is, and temporal datetypes are >>>> > generally cumbersome to add support for (conversions, arithmetic, >>>> etc.), >>>> > so it would be nice to avoid adding too many of them :-) >>>> > >>>> > Regards >>>> > >>>> > Antoine. >>>> > >>>> > >>>> > >>>> > > >>>> > > The 3 >>>> > >> field implementation doesn't seem to have any way to represent >>>> integral >>>> > >> days, so I am also not sure about that one. >>>> > > >>>> > > >>>> > > Sorry this was an email gaffe. I intended Month (32 bit int), Day >>>> (32 >>>> > bit >>>> > > int), Nanosecond (64 bit int). >>>> > > >>>> > > OTOH I don't really understand the point of supporting "the most >>>> > >> reasonable ranges for Year, Month and Nanoseconds independently". >>>> What >>>> > >> does it bring to encode more than one month in the nanoseconds >>>> field? >>>> > > >>>> > > >>>> > > I'm happy with simplicity. In the past there has been some >>>> reference to >>>> > > people wanting to store very large timestamps (fall out of >>>> Nanoseconds >>>> > max >>>> > > representable value) but we've concluded that this wasn't something >>>> that >>>> > we >>>> > > wanted to really support. >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <[email protected]> >>>> > wrote: >>>> > > >>>> > >> >>>> > >> I would favour the following characteristics : >>>> > >> - support for nanoseconds (especially as other Arrow temporal types >>>> > >> support it) >>>> > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO) >>>> > >> >>>> > >> OTOH I don't really understand the point of supporting "the most >>>> > >> reasonable ranges for Year, Month and Nanoseconds independently". >>>> What >>>> > >> does it bring to encode more than one month in the nanoseconds >>>> field? >>>> > >> You can already use the Duration type for that. >>>> > >> >>>> > >> Regards >>>> > >> >>>> > >> Antoine. >>>> > >> >>>> > >> >>>> > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit : >>>> > >>> To follow-up on this conversation I did some analysis on interval >>>> > types: >>>> > >>> >>>> > >>> >>>> > >> >>>> > >>>> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit >>>> > >> Please feel free to add more details/systems I missed. >>>> > >>> >>>> > >>> Given the disparate requirements of different systems I think the >>>> > >> following might make sense for official types (if there isn't >>>> > consensus, I >>>> > >> might try to contributation extension Array implementations for >>>> them to >>>> > >> Java and C++/Python separately). >>>> > >>> >>>> > >>> 1. 3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) >>>> all >>>> > >> signed. >>>> > >>> 2. Postgres representation (Downside is it doesn't support >>>> > Nanoseconds, >>>> > >> only microseconds). >>>> > >>> 3. ZetaSQL implementation (Requires some bit manipulation) but >>>> > supports >>>> > >> the most reasonable ranges for Year, Month and Nanoseconds >>>> > independently. >>>> > >>> >>>> > >>> Thoughts? >>>> > >>> >>>> > >>> Micah >>>> > >>> >>>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: >>>> > >>>>> >>>> > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow >>>> > >> protocol, >>>> > >>>>> so can anyone point me to it or PR with email will be enough? >>>> > >>>> >>>> > >>>> That is enough to start discussion. Before formal acceptance and >>>> > >> merging >>>> > >>>> of the PR there needs to be a Java and C++ implementations for >>>> the >>>> > type >>>> > >>>> that pass integration tests. At the time this guideline was >>>> > instituted >>>> > >>>> Java and C++ were considered the "reference" implementations (I >>>> think >>>> > >> they >>>> > >>>> still have the most complete integration test coverage). >>>> > >>>> >>>> > >>>> My understanding is that the current modelling of intervals >>>> mimics SQL >>>> > >>>> standards (e.g. SQL Server [1]). So it would also be good to >>>> step >>>> > back >>>> > >> and >>>> > >>>> understand what problem DF is trying to solve and how it differs >>>> from >>>> > >> other >>>> > >>>> SQL implementations. I'd be hesitant to accept COMPLEX as a new >>>> type >>>> > >>>> without a much deeper analysis into calendar representations >>>> within >>>> > >> Arrow >>>> > >>>> and how they relate to other existing systems (e.g. Hive and some >>>> > >>>> assortment of existing SQL databases). For instance the current >>>> > >> modelling >>>> > >>>> of timestamps does not lend itself to constructing a COMPLEX >>>> interval >>>> > >> type >>>> > >>>> particularly well. (Duration was introduced for this reason). >>>> > >>>> >>>> > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of >>>> > >> composing >>>> > >>>> the with a struct are good stop-gaps. These obviously have >>>> different >>>> > >>>> trade-offs. Ultimately, it would be good to define common >>>> extension >>>> > >> types >>>> > >>>> that can represent this use-case if there really is demand for >>>> it (if >>>> > it >>>> > >>>> doesn't become a top level type). >>>> > >>>> >>>> > >>>> [1] >>>> > >>>> >>>> > >> >>>> > >>>> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15 >>>> > >>>> >>>> > >>>> -Micah >>>> > >>>> >>>> > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb < >>>> [email protected]> >>>> > >> wrote: >>>> > >>>> >>>> > >>>>> That is a great suggestion Wes, thank you. >>>> > >>>>> >>>> > >>>>> I wonder if we could get away with a 128 bit representation >>>> that is >>>> > the >>>> > >>>>> concatenation of the two existing interval types >>>> > (YearMonth)(DayTime). >>>> > >> Or >>>> > >>>>> maybe even define a `struct` type with those fields that is >>>> used by >>>> > >>>>> DataFusion. >>>> > >>>>> >>>> > >>>>> Basically, given our reading of the Arrow spec[1], it is >>>> currently >>>> > not >>>> > >>>>> possible to precisely represent an interval that has both >>>> monthly and >>>> > >>>>> sub-montly granularity. >>>> > >>>>> >>>> > >>>>> As Dmtry says, if you have an interval seemingly simple like 1 >>>> > month, >>>> > >> 1 >>>> > >>>>> day >>>> > >>>>> >>>> > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day >>>> > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as >>>> different >>>> > >> months >>>> > >>>>> have different numbers of days >>>> > >>>>> >>>> > >>>>> [1] >>>> > >>>>> >>>> > >> >>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260 >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney < >>>> [email protected]> >>>> > >> wrote: >>>> > >>>>> >>>> > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <[email protected]> wrote: >>>> > >>>>>>> >>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the >>>> > >>>>> top-level >>>> > >>>>>>> >>>> > >>>>>>> Similar thing to columnar format, How to store interval like 1 >>>> > month >>>> > >> 1 >>>> > >>>>>> day 1 hour? It’s not possible to do it without converting 1 >>>> month to >>>> > >> 30 >>>> > >>>>>> days, which is a bad way. >>>> > >>>>>>> >>>> > >>>>>> >>>> > >>>>>> Presumably you can represent a complex interval in a fixed >>>> number of >>>> > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You >>>> can >>>> > >>>>>> adorn this type with extension type metadata so that >>>> DataFusion can >>>> > >>>>>> then apply Interval semantics to it. This could also serve as >>>> an >>>> > >>>>>> interim strategy for you to proceed with implementation while >>>> > >>>>>> proposing a top-level type to the Arrow format (which may or >>>> may not >>>> > >>>>>> be accepting) so you aren't blocked on acceptance of changes >>>> into >>>> > >>>>>> Schema.fbs. >>>> > >>>>>> >>>> > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <[email protected]> >>>> > wrote: >>>> > >>>>>>>> >>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the >>>> > >>>>> top-level >>>> > >>>>>>>> columnar format without more analysis — have you considered >>>> > >>>>>>>> implementing this for DataFusion as an extension type for >>>> the time >>>> > >>>>>>>> being? >>>> > >>>>>>>> >>>> > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM [email protected] <mailto: >>>> > >> [email protected] >>>> > >>>>>> >>>> > >>>>>> <[email protected] <mailto:[email protected]>> wrote: >>>> > >>>>>>>>> >>>> > >>>>>>>>> Hi, >>>> > >>>>>>>>> >>>> > >>>>>>>>> For now, There are only two types of IntervalUnit inside >>>> Arrow: >>>> > >>>>>>>>> >>>> > >>>>>>>>> - YearMonth - month stored as int32 >>>> > >>>>>>>>> - DayTime - days as int32 and time in milliseconds as in32. >>>> > Total >>>> > >>>>>> (64 bites) >>>> > >>>>>>>>> >>>> > >>>>>>>>> Since DF is using Arrow, It’s not possible to store >>>> “Complex” >>>> > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR. >>>> > >>>>>>>>> I think, the best way to understand the problem will be to >>>> read a >>>> > >>>>>> comment from DF codebase: >>>> > >>>>>> >>>> > >>>>> >>>> > >> >>>> > >>>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 >>>> > >>>>>>>>> >>>> > >>>>>>>>> // Interval is tricky thing >>>> > >>>>>>>>> // 1 day is not 24 hours because timezones, 1 year >>>> != >>>> > >>>>> 365/364! >>>> > >>>>>> 30 days != 1 month >>>> > >>>>>>>>> // The true way to store and calculate intervals >>>> is to >>>> > >> store >>>> > >>>>>> it as it defined >>>> > >>>>>>>>> // Due the fact that Arrow supports only two types >>>> > >> YearMonth >>>> > >>>>>> (month) and DayTime (day, time) >>>> > >>>>>>>>> // It's not possible to store complex intervals >>>> > >>>>>>>>> // It's possible to do select (NOW() + INTERVAL '1 >>>> > year') + >>>> > >>>>>> INTERVAL '1 day'; as workaround >>>> > >>>>>>>>> if result_month != 0 && (result_days != 0 || >>>> > result_millis >>>> > >> != >>>> > >>>>>> 0) { >>>> > >>>>>>>>> return >>>> Err(DataFusionError::NotImplemented(format!( >>>> > >>>>>>>>> "DF does not support intervals that have >>>> both a >>>> > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. >>>> Hint: try >>>> > >>>>>> breaking the interval into two parts, one with Year/Month and >>>> the >>>> > >> other >>>> > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 >>>> year') + >>>> > >>>>> INTERVAL >>>> > >>>>>> '1 day'", >>>> > >>>>>>>>> value >>>> > >>>>>>>>> ))); >>>> > >>>>>>>>> } >>>> > >>>>>>>>> >>>> > >>>>>>>>> >>>> > >>>>>>>>> >>>> > >>>>>>>>> I prepared a PR >>>> https://github.com/apache/arrow/pull/9516/files >>>> > < >>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files> < >>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files < >>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that >>>> introduce a >>>> > >> new >>>> > >>>>>> type for IntervalUnit called Complex, that store both >>>> YearMonth and >>>> > >>>>> DayTime >>>> > >>>>>> to support complex interval. >>>> > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in >>>> Arrow >>>> > >>>>>> protocol, so can anyone point me to it or PR with email will be >>>> > >> enough? >>>> > >>>>>>>>> >>>> > >>>>>>>>> Thanks. >>>> > >>>>>>> >>>> > >>>>>> >>>> > >>>>> >>>> > >>>> >>>> > >> >>>> > > >>>> > >>>> >>>
