I agree with you that having fixed precision (e.g. postgres / zetasql) is reasonable. Variable precision fields (ala SQL Server/Oracle) seem less valuable to me.
I think support for nanosecond precision for intervals is important as there are nanosecond precision timestamps I don't think the postgres representation adds anything of practical use (it still requires 16 bytes but has only microsecond precision). The 3 field implementation doesn't seem to have any way to represent integral days, so I am also not sure about that one. Adding just the ZetaSQL Implementation would be good enough for my usecases. Andrew On Tue, Mar 30, 2021 at 11:48 PM Micah Kornfield <emkornfi...@apache.org> wrote: > To follow-up on this conversation I did some analysis on interval types: > > > https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit > Please feel free to add more details/systems I missed. > > Given the disparate requirements of different systems I think the > following might make sense for official types (if there isn't consensus, I > might try to contributation extension Array implementations for them to > Java and C++/Python separately). > > 1. 3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) all > signed. > 2. Postgres representation (Downside is it doesn't support Nanoseconds, > only microseconds). > 3. ZetaSQL implementation (Requires some bit manipulation) but supports > the most reasonable ranges for Year, Month and Nanoseconds independently. > > Thoughts? > > Micah > > On 2021/02/18 04:30:55 Micah Kornfield wrote: > > > > > > I didn’t find any page/documentation on how to do RFC in Arrow > protocol, > > > so can anyone point me to it or PR with email will be enough? > > > > That is enough to start discussion. Before formal acceptance and merging > > of the PR there needs to be a Java and C++ implementations for the type > > that pass integration tests. At the time this guideline was instituted > > Java and C++ were considered the "reference" implementations (I think > they > > still have the most complete integration test coverage). > > > > My understanding is that the current modelling of intervals mimics SQL > > standards (e.g. SQL Server [1]). So it would also be good to step back > and > > understand what problem DF is trying to solve and how it differs from > other > > SQL implementations. I'd be hesitant to accept COMPLEX as a new type > > without a much deeper analysis into calendar representations within Arrow > > and how they relate to other existing systems (e.g. Hive and some > > assortment of existing SQL databases). For instance the current > modelling > > of timestamps does not lend itself to constructing a COMPLEX interval > type > > particularly well. (Duration was introduced for this reason). > > > > I think both Wes's suggestion of FixedSizeBinary and Andrew's of > composing > > the with a struct are good stop-gaps. These obviously have different > > trade-offs. Ultimately, it would be good to define common extension > types > > that can represent this use-case if there really is demand for it (if it > > doesn't become a top level type). > > > > [1] > > > https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15 > > > > -Micah > > > > On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com> > wrote: > > > > > That is a great suggestion Wes, thank you. > > > > > > I wonder if we could get away with a 128 bit representation that is the > > > concatenation of the two existing interval types (YearMonth)(DayTime). > Or > > > maybe even define a `struct` type with those fields that is used by > > > DataFusion. > > > > > > Basically, given our reading of the Arrow spec[1], it is currently not > > > possible to precisely represent an interval that has both monthly and > > > sub-montly granularity. > > > > > > As Dmtry says, if you have an interval seemingly simple like 1 month, > 1 > > > day > > > > > > Using IntervalUnit(YEAR_MONTH) can't represent the 1 day > > > Using IntervalUnit(DAY_TIME) can't represent the month as different > months > > > have different numbers of days > > > > > > [1] > > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260 > > > > > > > > > On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > > > On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote: > > > > > > > > > > > It's unclear to me that this needs to be introduced into the > > > top-level > > > > > > > > > > Similar thing to columnar format, How to store interval like 1 > month 1 > > > > day 1 hour? It’s not possible to do it without converting 1 month to > 30 > > > > days, which is a bad way. > > > > > > > > > > > > > Presumably you can represent a complex interval in a fixed number of > > > > bytes, and then embed the data in a FixedSizeBinary type. You can > > > > adorn this type with extension type metadata so that DataFusion can > > > > then apply Interval semantics to it. This could also serve as an > > > > interim strategy for you to proceed with implementation while > > > > proposing a top-level type to the Arrow format (which may or may not > > > > be accepting) so you aren't blocked on acceptance of changes into > > > > Schema.fbs. > > > > > > > > > > On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > > > > > > > > It's unclear to me that this needs to be introduced into the > > > top-level > > > > > > columnar format without more analysis — have you considered > > > > > > implementing this for DataFusion as an extension type for the > time > > > > > > being? > > > > > > > > > > > > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto: > t...@dmtry.me > > > > > > > > <t...@dmtry.me <mailto:t...@dmtry.me>> wrote: > > > > > >> > > > > > >> Hi, > > > > > >> > > > > > >> For now, There are only two types of IntervalUnit inside Arrow: > > > > > >> > > > > > >> - YearMonth - month stored as int32 > > > > > >> - DayTime - days as int32 and time in milliseconds as in32. > Total > > > > (64 bites) > > > > > >> > > > > > >> Since DF is using Arrow, It’s not possible to store “Complex” > > > > intervals such 1 MONTH 1 DAY 1 HOUR. > > > > > >> I think, the best way to understand the problem will be to read > a > > > > comment from DF codebase: > > > > > > > > https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 > > > > > >> > > > > > >> // Interval is tricky thing > > > > > >> // 1 day is not 24 hours because timezones, 1 year != > > > 365/364! > > > > 30 days != 1 month > > > > > >> // The true way to store and calculate intervals is to > store > > > > it as it defined > > > > > >> // Due the fact that Arrow supports only two types > YearMonth > > > > (month) and DayTime (day, time) > > > > > >> // It's not possible to store complex intervals > > > > > >> // It's possible to do select (NOW() + INTERVAL '1 > year') + > > > > INTERVAL '1 day'; as workaround > > > > > >> if result_month != 0 && (result_days != 0 || > result_millis != > > > > 0) { > > > > > >> return Err(DataFusionError::NotImplemented(format!( > > > > > >> "DF does not support intervals that have both a > > > > Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try > > > > breaking the interval into two parts, one with Year/Month and the > other > > > > with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + > > > INTERVAL > > > > '1 day'", > > > > > >> value > > > > > >> ))); > > > > > >> } > > > > > >> > > > > > >> > > > > > >> > > > > > >> I prepared a PR https://github.com/apache/arrow/pull/9516/files > < > > > > https://github.com/apache/arrow/pull/9516/files> < > > > > https://github.com/apache/arrow/pull/9516/files < > > > > https://github.com/apache/arrow/pull/9516/files>> that introduce a > new > > > > type for IntervalUnit called Complex, that store both YearMonth and > > > DayTime > > > > to support complex interval. > > > > > >> I didn’t find any page/documentation on how to do RFC in Arrow > > > > protocol, so can anyone point me to it or PR with email will be > enough? > > > > > >> > > > > > >> Thanks. > > > > > > > > > > > > > > >