OK, I think I have completed the initial changes for the new interval type
in https://github.com/apache/arrow/pull/10177

The code changes still need to be reviewed, but I don't think that should
stop a vote.  I'll start a vote on Monday unless there are more comments on
the format changes.

Thanks,
Micah

On Wed, Aug 11, 2021 at 1:38 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> As an update, I've gotten basic integration testing working in Java and
> C++ along with the format proposal updates [1].
>
> I have a little bit more work to do on the initial implementations (make
> CI happy, add unit tests in Java) but I think this is getting close to the
> point that we can vote on it.  For those interested, please peruse the
> implementations and leave any comments.
>
> I'm hoping to wrap up the CI and Java test sometime tomorrow and if
> reviewers for the implementations have bandwidth hopefully address any
> concerns and start a vote sometime next week.
>
> I plan on adding integration with Python/Pandas bindings in follow-up PRs
> but likely won't have bandwidth for much more work here.
>
>
> [1] https://github.com/apache/arrow/pull/10177
>
> On Thu, May 6, 2021 at 9:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Ah, that makes sense to wait then.
>>
>> On Thu, May 6, 2021 at 10:55 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> > I'll address the feedback.  I think in the past we've waited for
>> implementations in java and c++ with integration tests before formally
>> voting.  If there is no more feedback I can start looking at
>> implementations (happy to have help)
>> >
>> > On Thursday, May 6, 2021, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> The PR looks good. I just left some comments about typos. I would say
>> >> it's probably about time to call a vote. Anywhere else where we should
>> >> be soliciting feedback?
>> >>
>> >> On Mon, May 3, 2021 at 2:17 PM Jacek Pliszka <jacek.plis...@gmail.com>
>> wrote:
>> >> >
>> >> > Good idea, I've created JIRA issue:
>> >> >
>> >> > https://issues.apache.org/jira/browse/ARROW-12637
>> >> >
>> >> > And named it range to avoid confusion with intervals...
>> >> > Though confusion will stay as it is called interval in Pandas and in
>> >> > logic (Allen's interval algebra)
>> >> >
>> >> > BR,
>> >> >
>> >> > Jacek
>> >> >
>> >> > On Mon, May 3, 2021 at 18:05 Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >> > >
>> >> > > Hi Jacek,
>> >> > > This seems like reasonable functionality.  I think this probably
>> comes in
>> >> > > two parts:
>> >> > > 1.  This might be a good candidate for a "Well Known"/Officially
>> supported
>> >> > > Extension type. I can think of a few different representations but
>> I would
>> >> > > guess something like Struct[start: T, end: T] with well
>> defined
>> >> > > extension metadata to define open/closed on start and end might be
>> the best
>> >> > > (we should probably spin this off into a separate discussion
>> thread).
>> >> > > 2.  Adding the right computation Kernels to work with the type.
>> >> > >
>> >> > > Do you want to start a new thread or open up some JIRAs to track
>> this work?
>> >> > >
>> >> > > Thanks,
>> >> > > Micah
>> >> > >
>> >> > > On Mon, May 3, 2021 at 5:32 AM Jacek Pliszka <
>> jacek.plis...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > Sorry, my mistake.
>> >> > > >
>> >> > > > You are right - I meant anchored intervals as in pandas - ones
>> with
>> >> > > > defined start and end - and I think many future users will make
>> the
>> >> > > > same mistake.
>> >> > > >
>> >> > > > I would love to be able to do fast overlap joins on arrow level.
>> >> > > >
>> >> > > > Best Regards,
>> >> > > >
>> >> > > > Jacek
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > On Sun, May 2, 2021 at 23:06 Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >> > > > >
>> >> > > > > I also don't understand the comment about closed / open /
>> semi-open
>> >> > > > > intervals. Perhaps there is a confusion, since "interval" as
>> we mean
>> >> > > > > it here is called a "time delta" in some other projects. An
>> interval
>> >> > > > > here does not refer to a time span with a distinct start and
>> end point
>> >> > > > > (I understand this might be confusing to a pandas user since
>> pandas
>> >> > > > > has an interval data type where each value is a tuple of
>> arbitrary
>> >> > > > > start/end).
>> >> > > > >
>> >> > > > > On Sun, May 2, 2021 at 3:46 PM Micah Kornfield <
>> emkornfi...@gmail.com>
>> >> > > > wrote:
>> >> > > > > >
>> >> > > > > > Hi Jacek,
>> >> > > > > > I'm not sure I fully understand the proposal, could you
>> elaborate with
>> >> > > > more
>> >> > > > > > examples/details?  For instance DAY_TIME isn't just a
>> UINT64, it
>> >> > > > actually
>> >> > > > > > contains 2 separate fields (days and milliseconds).
>> >> > > > > >
>> >> > > > > > In terms of closed vs half-open, in my limited
>> understanding, that is
>> >> > > > more
>> >> > > > > > a concern of functions using interval types rather than the
>> type
>> >> > > > itself.
>> >> > > > > > For instance a quick search of postgres [1] docs only talks
>> about
>> >> > > > half-open
>> >> > > > > > in relation to the "Overlaps" operator
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > > -Micah
>> >> > > > > >
>> >> > > > > > [1]
>> https://www.postgresql.org/docs/9.1/functions-datetime.html
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka <
>> jacek.plis...@gmail.com
>> >> > > > >
>> >> > > > > > wrote:
>> >> > > > > >
>> >> > > > > > > Hi!
>> >> > > > > > >
>> >> > > > > > > I wonder if it would be possible to have a generic interval with
>> integers
>> >> > > > of
>> >> > > > > > > specified size just to have a common base for interval
>> arithmetic.
>> >> > > > > > >
>> >> > > > > > > Then user can convert their period to ordinals and use the
>> arithmetic
>> >> > > > > > > (joining, deoverlapping, common parts, explosion etc.).
>> >> > > > > > >
>> >> > > > > > > So YEAR_MONTH and DAY_TIME would be just special cases of
>> >> > > > > > > INTERVAL_UINT32 and INTERVAL_UINT64
>> >> > > > > > >
>> >> > > > > > > Also I believe it is worth stating whether there are only
>> closed
>> >> > > > > > > intervals or open/semi-open ones are allowed as well.
>> >> > > > > > >
>> >> > > > > > > I believe I am just one of many reinventing the wheel here
>> and
>> >> > > > writing
>> >> > > > > > > own versions of the above.
>> >> > > > > > >
>> >> > > > > > > BR,
>> >> > > > > > >
>> >> > > > > > > Jacek
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > > On Fri, Apr 2, 2021 at 21:53 Micah Kornfield <
>> emkornfi...@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > Andrew, is the use-case you have simply Postgres
>> compatibility or
>> >> > > > is it
>> >> > > > > > > more
>> >> > > > > > > > extensive?
>> >> > > > > > > >
>> >> > > > > > > > One potential problem with combining Month and Day
>> fields, is that
>> >> > > > the
>> >> > > > > > > type
>> >> > > > > > > > no longer has a defined sort order (the existing
>> Day-Millisecond
>> >> > > > type
>> >> > > > > > > > without assumptions, in particular because I don't think
>> today
>> >> > > > there is
>> >> > > > > > > an
>> >> > > > > > > > explicit constraint on the bounds for the millisecond
>> component).
>> >> > > > > > > >
>> >> > > > > > > > -Micah
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <
>> anto...@python.org
>> >> > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > On 31/03/2021 at 17:55, Micah Kornfield wrote:
>> >> > > > > > > > > > Thanks for the feedback.  A couple of points here
>> and some
>> >> > > > responses
>> >> > > > > > > > > below.
>> >> > > > > > > > > >
>> >> > > > > > > > > > * One other question is whether the Nanoseconds
>> should
>> >> > > > actually be
>> >> > > > > > > > > > configurable (i.e. use milliseconds or
>> microseconds).  I would
>> >> > > > lean
>> >> > > > > > > > > towards
>> >> > > > > > > > > > no.
>> >> > > > > > > > >
>> >> > > > > > > > > Same for me.
>> >> > > > > > > > >
>> >> > > > > > > > > > * I'm also still not 100% convinced we need this as
>> a first
>> >> > > > class
>> >> > > > > > > type in
>> >> > > > > > > > > > arrow or if we should be looking more closely at the
>> Struct
>> >> > > > (in the
>> >> > > > > > > Arrow
>> >> > > > > > > > > > sense) based implementation.  In the future where
>> alternative
>> >> > > > > > > encodings
>> >> > > > > > > > > are
>> >> > > > > > > > > > supported, this could allow for much smaller
>> footprints for
>> >> > > > this
>> >> > > > > > > type.
>> >> > > > > > > > >
>> >> > > > > > > > > Having a "packed" first class type allows for better
>> locality
>> >> > > > when
>> >> > > > > > > > > accessing data.  It doesn't sound very likely that
>> you'd access
>> >> > > > only
>> >> > > > > > > one
>> >> > > > > > > > > component of the interval.
>> >> > > > > > > > >
>> >> > > > > > > > > But I have no idea how important this is, and temporal
>> datetypes
>> >> > > > are
>> >> > > > > > > > > generally cumbersome to add support for (conversions,
>> arithmetic,
>> >> > > > > > > etc.),
>> >> > > > > > > > > so it would be nice to avoid adding too many of them
>> :-)
>> >> > > > > > > > >
>> >> > > > > > > > > Regards
>> >> > > > > > > > >
>> >> > > > > > > > > Antoine.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > The 3
>> >> > > > > > > > > >> field implementation doesn't seem to have any way
>> to represent
>> >> > > > > > > integral
>> >> > > > > > > > > >> days, so I am also not sure about that one.
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > Sorry this was an email gaffe.  I intended Month (32
>> bit int),
>> >> > > > Day
>> >> > > > > > > (32
>> >> > > > > > > > > bit
>> >> > > > > > > > > > int), Nanosecond (64 bit int).
>> >> > > > > > > > > >
>> >> > > > > > > > > > OTOH I don't really understand the point of
>> supporting "the
>> >> > > > most
>> >> > > > > > > > > >> reasonable ranges for Year, Month and Nanoseconds
>> >> > > > independently".
>> >> > > > > > > What
>> >> > > > > > > > > >> does it bring to encode more than one month in the
>> nanoseconds
>> >> > > > > > > field?
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > I'm happy with simplicity.   In the past there has
>> been some
>> >> > > > > > > reference to
>> >> > > > > > > > > > people wanting to store very large timestamps (fall
>> out of
>> >> > > > > > > Nanoseconds
>> >> > > > > > > > > max
>> >> > > > > > > > > > representable value) but we've concluded that this
>> wasn't
>> >> > > > something
>> >> > > > > > > that
>> >> > > > > > > > > we
>> >> > > > > > > > > > wanted to really support.
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <
>> >> > > > anto...@python.org>
>> >> > > > > > > > > wrote:
>> >> > > > > > > > > >
>> >> > > > > > > > > >>
>> >> > > > > > > > > >> I would favour the following characteristics :
>> >> > > > > > > > > >> - support for nanoseconds (especially as other
>> Arrow temporal
>> >> > > > types
>> >> > > > > > > > > >> support it)
>> >> > > > > > > > > >> - easy to handle (which excludes the ZetaSQL
>> representation
>> >> > > > IMHO)
>> >> > > > > > > > > >>
>> >> > > > > > > > > >> OTOH I don't really understand the point of
>> supporting "the
>> >> > > > most
>> >> > > > > > > > > >> reasonable ranges for Year, Month and Nanoseconds
>> >> > > > independently".
>> >> > > > > > > What
>> >> > > > > > > > > >> does it bring to encode more than one month in the
>> nanoseconds
>> >> > > > > > > field?
>> >> > > > > > > > > >> You can already use the Duration type for that.
>> >> > > > > > > > > >>
>> >> > > > > > > > > >> Regards
>> >> > > > > > > > > >>
>> >> > > > > > > > > >> Antoine.
>> >> > > > > > > > > >>
>> >> > > > > > > > > >>
>> >> > > > > > > > > >> On 31/03/2021 at 05:48, Micah Kornfield wrote:
>> >> > > > > > > > > >>> To follow-up on this conversation I did some
>> analysis on
>> >> > > > interval
>> >> > > > > > > > > types:
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > >
>> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
>> >> > > > > > > > > >> Please feel free to add more details/systems I
>> missed.
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>> Given the disparate requirements of different
>> systems I
>> >> > > > think the
>> >> > > > > > > > > >> following might make sense for official types (if
>> there isn't
>> >> > > > > > > > > consensus, I
>> >> > > > > > > > > >> might try to contribute extension Array
>> implementations
>> >> > > > for
>> >> > > > > > > them to
>> >> > > > > > > > > >> Java and C++/Python separately).
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>> 1.  3 fields: Year (32 bit), Month (32 bit),
>> Nanoseconds (64
>> >> > > > bit)
>> >> > > > > > > all
>> >> > > > > > > > > >> signed.
>> >> > > > > > > > > >>> 2.  Postgres representation (Downside is it
>> doesn't support
>> >> > > > > > > > > Nanoseconds,
>> >> > > > > > > > > >> only microseconds).
>> >> > > > > > > > > >>> 3.  ZetaSQL implementation (Requires some bit
>> manipulation)
>> >> > > > but
>> >> > > > > > > > > supports
>> >> > > > > > > > > >> the most reasonable ranges for Year, Month and
>> Nanoseconds
>> >> > > > > > > > > independently.
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>> Thoughts?
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>> Micah
>> >> > > > > > > > > >>>
>> >> > > > > > > > > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> I didn’t find any page/documentation on how to
>> do RFC in
>> >> > > > Arrow
>> >> > > > > > > > > >> protocol,
>> >> > > > > > > > > >>>>> so can anyone point me to it or PR with email
>> will be
>> >> > > > enough?
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> That is enough to start discussion.  Before formal
>> >> > > > acceptance and
>> >> > > > > > > > > >> merging
>> >> > > > > > > > > >>>> of the PR there needs to be a Java and C++
>> implementations
>> >> > > > for the
>> >> > > > > > > > > type
>> >> > > > > > > > > >>>> that pass integration tests.  At the time this
>> guideline was
>> >> > > > > > > > > instituted
>> >> > > > > > > > > >>>> Java and C++ were considered the "reference"
>> >> > > > implementations (I
>> >> > > > > > > think
>> >> > > > > > > > > >> they
>> >> > > > > > > > > >>>> still have the most complete integration test
>> coverage).
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> My understanding is that the current modelling of
>> intervals
>> >> > > > > > > mimics SQL
>> >> > > > > > > > > >>>> standards (e.g. SQL Server [1]).  So it would
>> also be good
>> >> > > > to step
>> >> > > > > > > > > back
>> >> > > > > > > > > >> and
>> >> > > > > > > > > >>>> understand what problem DF is trying to solve and
>> how it
>> >> > > > differs
>> >> > > > > > > from
>> >> > > > > > > > > >> other
>> >> > > > > > > > > >>>> SQL implementations.  I'd be hesitant to accept
>> COMPLEX as
>> >> > > > a new
>> >> > > > > > > type
>> >> > > > > > > > > >>>> without a much deeper analysis into calendar
>> representations
>> >> > > > > > > within
>> >> > > > > > > > > >> Arrow
>> >> > > > > > > > > >>>> and how they relate to other existing systems
>> (e.g. Hive
>> >> > > > and some
>> >> > > > > > > > > >>>> assortment of existing SQL databases).  For
>> instance the
>> >> > > > current
>> >> > > > > > > > > >> modelling
>> >> > > > > > > > > >>>> of timestamps does not lend itself to
>> constructing a COMPLEX
>> >> > > > > > > interval
>> >> > > > > > > > > >> type
>> >> > > > > > > > > >>>> particularly well. (Duration was introduced for
>> this
>> >> > > > reason).
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> I think both Wes's suggestion of FixedSizeBinary
>> and
>> >> > > > Andrew's of
>> >> > > > > > > > > >> composing
>> >> > > > > > > > > >>>> them with a struct are good stop-gaps.  These
>> obviously have
>> >> > > > > > > different
>> >> > > > > > > > > >>>> trade-offs.  Ultimately, it would be good to
>> define common
>> >> > > > > > > extension
>> >> > > > > > > > > >> types
>> >> > > > > > > > > >>>> that can represent this use-case if there really
>> is demand
>> >> > > > for it
>> >> > > > > > > (if
>> >> > > > > > > > > it
>> >> > > > > > > > > >>>> doesn't become a top level type).
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> [1]
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > >
>> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> -Micah
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <
>> >> > > > al...@influxdata.com
>> >> > > > > > > >
>> >> > > > > > > > > >> wrote:
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>>>> That is a great suggestion Wes, thank you.
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> I wonder if we could get away with a 128 bit
>> >> > > > representation that
>> >> > > > > > > is
>> >> > > > > > > > > the
>> >> > > > > > > > > >>>>> concatenation of the two existing interval types
>> >> > > > > > > > > (YearMonth)(DayTime).
>> >> > > > > > > > > >> Or
>> >> > > > > > > > > >>>>> maybe even define a `struct` type with those
>> fields that
>> >> > > > is used
>> >> > > > > > > by
>> >> > > > > > > > > >>>>> DataFusion.
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> Basically, given our reading of the Arrow
>> spec[1], it is
>> >> > > > > > > currently
>> >> > > > > > > > > not
>> >> > > > > > > > > >>>>> possible to precisely represent an interval that
>> has both
>> >> > > > > > > monthly and
>> >> > > > > > > > > >>>>> sub-monthly granularity.
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> As Dmtry says, if you have an interval seemingly
>> simple
>> >> > > > like  1
>> >> > > > > > > > > month,
>> >> > > > > > > > > >> 1
>> >> > > > > > > > > >>>>> day
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent
>> the 1 day
>> >> > > > > > > > > >>>>> Using IntervalUnit(DAY_TIME) can't represent the
>> month as
>> >> > > > > > > different
>> >> > > > > > > > > >> months
>> >> > > > > > > > > >>>>> have different numbers of days
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> [1]
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>
>> >> > > > > > >
>> >> > > >
>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <
>> >> > > > > > > wesmck...@gmail.com>
>> >> > > > > > > > > >> wrote:
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me>
>> wrote:
>> >> > > > > > > > > >>>>>>>
>> >> > > > > > > > > >>>>>>>> It's unclear to me that this needs to be
>> introduced
>> >> > > > into the
>> >> > > > > > > > > >>>>> top-level
>> >> > > > > > > > > >>>>>>>
>> >> > > > > > > > > >>>>>>> Similar thing to columnar format, How to store
>> interval
>> >> > > > like 1
>> >> > > > > > > > > month
>> >> > > > > > > > > >> 1
>> >> > > > > > > > > >>>>>> day 1 hour? It’s not possible to do it without
>> converting
>> >> > > > 1
>> >> > > > > > > month to
>> >> > > > > > > > > >> 30
>> >> > > > > > > > > >>>>>> days, which is a bad way.
>> >> > > > > > > > > >>>>>>>
>> >> > > > > > > > > >>>>>>
>> >> > > > > > > > > >>>>>> Presumably you can represent a complex interval
>> in a fixed
>> >> > > > > > > number of
>> >> > > > > > > > > >>>>>> bytes, and then embed the data in a
>> FixedSizeBinary type.
>> >> > > > You
>> >> > > > > > > can
>> >> > > > > > > > > >>>>>> adorn this type with extension type metadata so
>> that
>> >> > > > DataFusion
>> >> > > > > > > can
>> >> > > > > > > > > >>>>>> then apply Interval semantics to it. This could
>> also
>> >> > > > serve as an
>> >> > > > > > > > > >>>>>> interim strategy for you to proceed with
>> implementation
>> >> > > > while
>> >> > > > > > > > > >>>>>> proposing a top-level type to the Arrow format
>> (which may
>> >> > > > or
>> >> > > > > > > may not
>> >> > > > > > > > > >>>>>> be accepted) so you aren't blocked on
>> acceptance of
>> >> > > > changes
>> >> > > > > > > into
>> >> > > > > > > > > >>>>>> Schema.fbs.
>> >> > > > > > > > > >>>>>>
>> >> > > > > > > > > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <
>> >> > > > wesmck...@gmail.com>
>> >> > > > > > > > > wrote:
>> >> > > > > > > > > >>>>>>>>
>> >> > > > > > > > > >>>>>>>> It's unclear to me that this needs to be
>> introduced
>> >> > > > into the
>> >> > > > > > > > > >>>>> top-level
>> >> > > > > > > > > >>>>>>>> columnar format without more analysis — have
>> you
>> >> > > > considered
>> >> > > > > > > > > >>>>>>>> implementing this for DataFusion as an
>> extension type
>> >> > > > for the
>> >> > > > > > > time
>> >> > > > > > > > > >>>>>>>> being?
>> >> > > > > > > > > >>>>>>>>
>> >> > > > > > > > > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM
>> t...@dmtry.me <mailto:
>> >> > > > > > > > > >> t...@dmtry.me
>> >> > > > > > > > > >>>>>>
>> >> > > > > > > > > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> Hi,
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> For now, There are only two types of
>> IntervalUnit
>> >> > > > inside
>> >> > > > > > > Arrow:
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> - YearMonth - month stored as int32
>> >> > > > > > > > > >>>>>>>>> - DayTime - days as int32 and time in milliseconds as int32
>> >> > > > > > > > > >>>>>>>>>   (64 bits total)
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> Since DF is using Arrow, It’s not possible
>> to store
>> >> > > > “Complex”
>> >> > > > > > > > > >>>>>> intervals such as 1 MONTH 1 DAY 1 HOUR.
>> >> > > > > > > > > >>>>>>>>> I think, the best way to understand the
>> problem will
>> >> > > > be to
>> >> > > > > > > read a
>> >> > > > > > > > > >>>>>> comment from DF codebase:
>> >> > > > > > > > > >>>>>>
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > >
>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>>          // Interval is tricky thing
>> >> > > > > > > > > >>>>>>>>>          // 1 day is not 24 hours because
>> timezones, 1
>> >> > > > year
>> >> > > > > > > !=
>> >> > > > > > > > > >>>>> 365/364!
>> >> > > > > > > > > >>>>>> 30 days != 1 month
>> >> > > > > > > > > >>>>>>>>>          // The true way to store and
>> calculate
>> >> > > > intervals is
>> >> > > > > > > to
>> >> > > > > > > > > >> store
>> >> > > > > > > > > >>>>>> it as it defined
>> >> > > > > > > > > >>>>>>>>>          // Due the fact that Arrow supports
>> only two
>> >> > > > types
>> >> > > > > > > > > >> YearMonth
>> >> > > > > > > > > >>>>>> (month) and DayTime (day, time)
>> >> > > > > > > > > >>>>>>>>>          // It's not possible to store
>> complex
>> >> > > > intervals
>> >> > > > > > > > > >>>>>>>>>          // It's possible to do select
>> (NOW() +
>> >> > > > INTERVAL '1
>> >> > > > > > > > > year') +
>> >> > > > > > > > > >>>>>> INTERVAL '1 day'; as workaround
>> >> > > > > > > > > >>>>>>>>>          if result_month != 0 &&
>> (result_days != 0 ||
>> >> > > > > > > > > result_millis
>> >> > > > > > > > > >> !=
>> >> > > > > > > > > >>>>>> 0) {
>> >> > > > > > > > > >>>>>>>>>              return
>> >> > > > > > > Err(DataFusionError::NotImplemented(format!(
>> >> > > > > > > > > >>>>>>>>>                  "DF does not support
>> intervals that
>> >> > > > have
>> >> > > > > > > both a
>> >> > > > > > > > > >>>>>> Year/Month part as well as
>> Days/Hours/Mins/Seconds: {:?}.
>> >> > > > Hint:
>> >> > > > > > > try
>> >> > > > > > > > > >>>>>> breaking the interval into two parts, one with
>> Year/Month
>> >> > > > and
>> >> > > > > > > the
>> >> > > > > > > > > >> other
>> >> > > > > > > > > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() +
>> INTERVAL '1
>> >> > > > year')
>> >> > > > > > > +
>> >> > > > > > > > > >>>>> INTERVAL
>> >> > > > > > > > > >>>>>> '1 day'",
>> >> > > > > > > > > >>>>>>>>>                  value
>> >> > > > > > > > > >>>>>>>>>              )));
>> >> > > > > > > > > >>>>>>>>>          }
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> I prepared a PR
>> >> > > > > > > > > >>>>>>>>> https://github.com/apache/arrow/pull/9516/files that
>> >> > > > > > > > > >>>>>>>>> introduces a new type for IntervalUnit called Complex, which
>> >> > > > > > > > > >>>>>>>>> stores both YearMonth and DayTime to support complex intervals.
>> >> > > > > > > > > >>>>>>>>> I didn’t find any page/documentation on how
>> to do RFC
>> >> > > > in
>> >> > > > > > > Arrow
>> >> > > > > > > > > >>>>>> protocol, so can anyone point me to it or PR
>> with email
>> >> > > > will be
>> >> > > > > > > > > >> enough?
>> >> > > > > > > > > >>>>>>>>>
>> >> > > > > > > > > >>>>>>>>> Thanks.
>> >> > > > > > > > > >>>>>>>
>> >> > > > > > > > > >>>>>>
>> >> > > > > > > > > >>>>>
>> >> > > > > > > > > >>>>
>> >> > > > > > > > > >>
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > >
>>
>
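
For readers skimming the thread, here is a minimal Python sketch of the
three-field layout discussed above (Month as a 32-bit int, Day as a 32-bit
int, Nanosecond as a 64-bit int) and of why such an interval has no defined
sort order. The little-endian byte order, the function names, and the
end-of-month clamping rule are assumptions made for illustration only; they
are not taken from the Arrow format or from the PR.

```python
import calendar
import struct
from datetime import date, timedelta

# Assumed layout for illustration: int32 months, int32 days, int64 nanoseconds,
# little-endian with no padding, 16 bytes in total.
_LAYOUT = struct.Struct("<iiq")


def pack_interval(months, days, nanos):
    """Pack one interval value into 16 bytes (hypothetical helper)."""
    return _LAYOUT.pack(months, days, nanos)


def unpack_interval(buf):
    """Inverse of pack_interval; returns (months, days, nanos)."""
    return _LAYOUT.unpack(buf)


def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month.

    The clamping rule is an assumption; systems differ on this point.
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def apply_interval(d, months, days):
    """Apply the month part first, then the day part (nanoseconds omitted)."""
    return add_months(d, months) + timedelta(days=days)


if __name__ == "__main__":
    one_month = (1, 0, 0)
    thirty_days = (0, 30, 0)
    # The same two intervals compare differently depending on the start date,
    # so the type itself carries no total order.
    for start in (date(2021, 1, 1), date(2021, 2, 1)):
        a = apply_interval(start, one_month[0], one_month[1])
        b = apply_interval(start, thirty_days[0], thirty_days[1])
        print(start, "+1 month ->", a, "| +30 days ->", b)
```

Starting from January 1, one month lands later than 30 days; starting from
February 1 in a non-leap year, it lands earlier. That is the same reason a
month cannot be folded into a fixed number of days.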
