I agree with you that having fixed precision (e.g. postgres / zetasql) is
reasonable. Variable precision fields (ala SQL Server/Oracle) seem less
valuable to me.

I think support for nanosecond precision for intervals is important as
there are nanosecond precision timestamps

I don't think the postgres representation adds anything of practical use
(it still requires 16 bytes but has only microsecond precision). The 3
field implementation doesn't seem to have any way to represent integral
days, so I am also not sure about that one.

Adding just the ZetaSQL Implementation would be good enough for my
usecases.

Andrew

On Tue, Mar 30, 2021 at 11:48 PM Micah Kornfield <emkornfi...@apache.org>
wrote:

> To follow-up on this conversation I did some analysis on interval types:
>
>
> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
> Please feel free to add more details/systems I missed.
>
> Given the disparate requirements of different systems I think the
> following might make sense for official types (if there isn't consensus, I
> might try to contributation extension Array implementations for them to
> Java and C++/Python separately).
>
> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) all
> signed.
> 2.  Postgres representation (Downside is it doesn't support Nanoseconds,
> only microseconds).
> 3.  ZetaSQL implementation (Requires some bit manipulation) but supports
> the most reasonable ranges for Year, Month and Nanoseconds independently.
>
> Thoughts?
>
> Micah
>
> On 2021/02/18 04:30:55 Micah Kornfield wrote:
> > >
> > > I didn’t find any page/documentation on how to do RFC in Arrow
> protocol,
> > > so can anyone point me to it or PR with email will be enough?
> >
> > That is enough to start discussion.  Before formal acceptance and merging
> > of the PR there needs to be a Java and C++ implementations for the type
> > that pass integration tests.  At the time this guideline was instituted
> > Java and C++ were considered the "reference" implementations (I think
> they
> > still have the most complete integration test coverage).
> >
> > My understanding is that the current modelling of intervals mimics SQL
> > standards (e.g. SQL Server [1]).  So it would also be good to step back
> and
> > understand what problem DF is trying to solve and how it differs from
> other
> > SQL implementations.  I'd be hesitant to accept COMPLEX as a new type
> > without a much deeper analysis into calendar representations within Arrow
> > and how they relate to other existing systems (e.g. Hive and some
> > assortment of existing SQL databases).  For instance the current
> modelling
> > of timestamps does not lend itself to constructing a COMPLEX interval
> type
> > particularly well. (Duration was introduced for this reason).
> >
> > I think both Wes's suggestion of FixedSizeBinary and Andrew's of
> composing
> > the with a struct are good stop-gaps.  These obviously have different
> > trade-offs.  Ultimately, it would be good to define common extension
> types
> > that can represent this use-case if there really is demand for it (if it
> > doesn't become a top level type).
> >
> > [1]
> >
> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
> >
> > -Micah
> >
> > On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com>
> wrote:
> >
> > > That is a great suggestion Wes, thank you.
> > >
> > > I wonder if we could get away with a 128 bit representation that is the
> > > concatenation of the two existing interval types (YearMonth)(DayTime).
> Or
> > > maybe even define a `struct` type with those fields that is used by
> > > DataFusion.
> > >
> > > Basically, given our reading of the Arrow spec[1], it is currently not
> > > possible to precisely represent an interval that has both monthly and
> > > sub-montly granularity.
> > >
> > > As Dmtry says, if you have an interval seemingly simple like  1 month,
> 1
> > > day
> > >
> > > Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
> > > Using IntervalUnit(DAY_TIME) can't represent the month as different
> months
> > > have different numbers of days
> > >
> > > [1]
> > >
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
> > >
> > >
> > > On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > > > On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:
> > > > >
> > > > > > It's unclear to me that this needs to be introduced into the
> > > top-level
> > > > >
> > > > > Similar thing to columnar format, How to store interval like 1
> month 1
> > > > day 1 hour? It’s not possible to do it without converting 1 month to
> 30
> > > > days, which is a bad way.
> > > > >
> > > >
> > > > Presumably you can represent a complex interval in a fixed number of
> > > > bytes, and then embed the data in a FixedSizeBinary type. You can
> > > > adorn this type with extension type metadata so that DataFusion can
> > > > then apply Interval semantics to it. This could also serve as an
> > > > interim strategy for you to proceed with implementation while
> > > > proposing a top-level type to the Arrow format (which may or may not
> > > > be accepting) so you aren't blocked on acceptance of changes into
> > > > Schema.fbs.
> > > >
> > > > > > On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > > > >
> > > > > > It's unclear to me that this needs to be introduced into the
> > > top-level
> > > > > > columnar format without more analysis — have you considered
> > > > > > implementing this for DataFusion as an extension type for the
> time
> > > > > > being?
> > > > > >
> > > > > > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
> t...@dmtry.me
> > > >
> > > > <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >> For now, There are only two types of IntervalUnit inside Arrow:
> > > > > >>
> > > > > >> - YearMonth - month stored as int32
> > > > > >> - DayTime - days as int32 and time in milliseconds  as in32.
> Total
> > > > (64 bites)
> > > > > >>
> > > > > >> Since DF is using Arrow, It’s not possible to store “Complex”
> > > > intervals such 1 MONTH 1 DAY 1 HOUR.
> > > > > >> I think, the best way to understand the problem will be to read
> a
> > > > comment from DF codebase:
> > > >
> > >
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> > > > > >>
> > > > > >>        // Interval is tricky thing
> > > > > >>        // 1 day is not 24 hours because timezones, 1 year !=
> > > 365/364!
> > > > 30 days != 1 month
> > > > > >>        // The true way to store and calculate intervals is to
> store
> > > > it as it defined
> > > > > >>        // Due the fact that Arrow supports only two types
> YearMonth
> > > > (month) and DayTime (day, time)
> > > > > >>        // It's not possible to store complex intervals
> > > > > >>        // It's possible to do select (NOW() + INTERVAL '1
> year') +
> > > > INTERVAL '1 day'; as workaround
> > > > > >>        if result_month != 0 && (result_days != 0 ||
> result_millis !=
> > > > 0) {
> > > > > >>            return Err(DataFusionError::NotImplemented(format!(
> > > > > >>                "DF does not support intervals that have both a
> > > > Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try
> > > > breaking the interval into two parts, one with Year/Month and the
> other
> > > > with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') +
> > > INTERVAL
> > > > '1 day'",
> > > > > >>                value
> > > > > >>            )));
> > > > > >>        }
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> I prepared a PR https://github.com/apache/arrow/pull/9516/files
> <
> > > > https://github.com/apache/arrow/pull/9516/files> <
> > > > https://github.com/apache/arrow/pull/9516/files <
> > > > https://github.com/apache/arrow/pull/9516/files>> that introduce a
> new
> > > > type for IntervalUnit called Complex, that store both YearMonth and
> > > DayTime
> > > > to support complex interval.
> > > > > >> I didn’t find any page/documentation on how to do RFC in Arrow
> > > > protocol, so can anyone point me to it or PR with email will be
> enough?
> > > > > >>
> > > > > >> Thanks.
> > > > >
> > > >
> > >
> >
>

Reply via email to