Le 31/03/2021 à 17:55, Micah Kornfield a écrit :
Thanks for the feedback.  A couple of points here and some responses below.

* One other question is whether the Nanoseconds should actually be
configurable (i.e. use milliseconds or microseconds).  I would lean towards
no.

Same for me.

* I'm also still not 100% convinced we need this as a first class type in
arrow or if we should be looking more closely at the Struct (in the Arrow
sense) based implementation.  In the future where alternative encodings are
supported, this could allow for much smaller footprints for this type.

Having a "packed" first class type allows for better locality when accessing data. It doesn't sound very likely that you'd access only one component of the interval.

But I have no idea how important this is, and temporal datetypes are generally cumbersome to add support for (conversions, arithmetic, etc.), so it would be nice to avoid adding too many of them :-)

Regards

Antoine.




The 3
field implementation doesn't seem to have any way to represent integral
days, so I am also not sure about that one.


Sorry this was an email gaffe.  I intended Month (32 bit int), Day (32 bit
int), Nanosecond (64 bit int).

OTOH I don't really understand the point of supporting "the most
reasonable ranges for Year, Month and Nanoseconds independently".  What
does it bring to encode more than one month in the nanoseconds field?


I'm happy with simplicity.   In the past there has been some reference to
people wanting to store very large timestamps (fall out of Nanoseconds max
representable value) but we've concluded that this wasn't something that we
wanted to really support.






On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org> wrote:


I would favour the following characteristics :
- support for nanoseconds (especially as other Arrow temporal types
support it)
- easy to handle (which excludes the ZetaSQL representtaion IMHO)

OTOH I don't really understand the point of supporting "the most
reasonable ranges for Year, Month and Nanoseconds independently".  What
does it bring to encode more than one month in the nanoseconds field?
You can already use the Duration type for that.

Regards

Antoine.


Le 31/03/2021 à 05:48, Micah Kornfield a écrit :
To follow-up on this conversation I did some analysis on interval types:


https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
Please feel free to add more details/systems I missed.

Given the disparate requirements of different systems I think the
following might make sense for official types (if there isn't consensus, I
might try to contributation extension Array implementations for them to
Java and C++/Python separately).

1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) all
signed.
2.  Postgres representation (Downside is it doesn't support Nanoseconds,
only microseconds).
3.  ZetaSQL implementation (Requires some bit manipulation) but supports
the most reasonable ranges for Year, Month and Nanoseconds independently.

Thoughts?

Micah

On 2021/02/18 04:30:55 Micah Kornfield wrote:

I didn’t find any page/documentation on how to do RFC in Arrow
protocol,
so can anyone point me to it or PR with email will be enough?

That is enough to start discussion.  Before formal acceptance and
merging
of the PR there needs to be a Java and C++ implementations for the type
that pass integration tests.  At the time this guideline was instituted
Java and C++ were considered the "reference" implementations (I think
they
still have the most complete integration test coverage).

My understanding is that the current modelling of intervals mimics SQL
standards (e.g. SQL Server [1]).  So it would also be good to step back
and
understand what problem DF is trying to solve and how it differs from
other
SQL implementations.  I'd be hesitant to accept COMPLEX as a new type
without a much deeper analysis into calendar representations within
Arrow
and how they relate to other existing systems (e.g. Hive and some
assortment of existing SQL databases).  For instance the current
modelling
of timestamps does not lend itself to constructing a COMPLEX interval
type
particularly well. (Duration was introduced for this reason).

I think both Wes's suggestion of FixedSizeBinary and Andrew's of
composing
the with a struct are good stop-gaps.  These obviously have different
trade-offs.  Ultimately, it would be good to define common extension
types
that can represent this use-case if there really is demand for it (if it
doesn't become a top level type).

[1]

https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15

-Micah

On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com>
wrote:

That is a great suggestion Wes, thank you.

I wonder if we could get away with a 128 bit representation that is the
concatenation of the two existing interval types (YearMonth)(DayTime).
Or
maybe even define a `struct` type with those fields that is used by
DataFusion.

Basically, given our reading of the Arrow spec[1], it is currently not
possible to precisely represent an interval that has both monthly and
sub-montly granularity.

As Dmtry says, if you have an interval seemingly simple like  1 month,
1
day

Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
Using IntervalUnit(DAY_TIME) can't represent the month as different
months
have different numbers of days

[1]

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260


On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <wesmck...@gmail.com>
wrote:

On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:

It's unclear to me that this needs to be introduced into the
top-level

Similar thing to columnar format, How to store interval like 1 month
1
day 1 hour? It’s not possible to do it without converting 1 month to
30
days, which is a bad way.


Presumably you can represent a complex interval in a fixed number of
bytes, and then embed the data in a FixedSizeBinary type. You can
adorn this type with extension type metadata so that DataFusion can
then apply Interval semantics to it. This could also serve as an
interim strategy for you to proceed with implementation while
proposing a top-level type to the Arrow format (which may or may not
be accepting) so you aren't blocked on acceptance of changes into
Schema.fbs.

On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com> wrote:

It's unclear to me that this needs to be introduced into the
top-level
columnar format without more analysis — have you considered
implementing this for DataFusion as an extension type for the time
being?

On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
t...@dmtry.me

<t...@dmtry.me <mailto:t...@dmtry.me>> wrote:

Hi,

For now, There are only two types of IntervalUnit inside Arrow:

- YearMonth - month stored as int32
- DayTime - days as int32 and time in milliseconds  as in32. Total
(64 bites)

Since DF is using Arrow, It’s not possible to store “Complex”
intervals such 1 MONTH 1 DAY 1 HOUR.
I think, the best way to understand the problem will be to read a
comment from DF codebase:


https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148

         // Interval is tricky thing
         // 1 day is not 24 hours because timezones, 1 year !=
365/364!
30 days != 1 month
         // The true way to store and calculate intervals is to
store
it as it defined
         // Due the fact that Arrow supports only two types
YearMonth
(month) and DayTime (day, time)
         // It's not possible to store complex intervals
         // It's possible to do select (NOW() + INTERVAL '1 year') +
INTERVAL '1 day'; as workaround
         if result_month != 0 && (result_days != 0 || result_millis
!=
0) {
             return Err(DataFusionError::NotImplemented(format!(
                 "DF does not support intervals that have both a
Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try
breaking the interval into two parts, one with Year/Month and the
other
with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') +
INTERVAL
'1 day'",
                 value
             )));
         }



I prepared a PR https://github.com/apache/arrow/pull/9516/files <
https://github.com/apache/arrow/pull/9516/files> <
https://github.com/apache/arrow/pull/9516/files <
https://github.com/apache/arrow/pull/9516/files>> that introduce a
new
type for IntervalUnit called Complex, that store both YearMonth and
DayTime
to support complex interval.
I didn’t find any page/documentation on how to do RFC in Arrow
protocol, so can anyone point me to it or PR with email will be
enough?

Thanks.






Reply via email to