emkornfield commented on code in PR #496:
URL: https://github.com/apache/parquet-format/pull/496#discussion_r2158096228
##########
src/main/thrift/parquet.thrift:
##########
@@ -461,6 +461,29 @@ struct GeographyType {
2: optional EdgeInterpolationAlgorithm algorithm;
}
+/**
+ * Year-Month Interval logical type annotation
+ *
+ * The data is stored as an 4 byte signed integer which represents the number
+ * of months associated with the time interval. The value can be negative to
+ * indicate a backward duration.
+ *
+ * Allowed for physical type: INT32
+ */
+struct IntervalYearMonthType {
+}
+
+/**
+ * Month-Day Interval logical type annotation
+ *
+ * The data is stored as a 16-byte signed value, which represents the number
Review Comment:
> For example, looking at the
[CalendarInterval](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java)
spark looks like they almost match the arrow MonthDayNano (they use micros
instead of nanos).
int64 for microseconds aligns with the 10K years. The arrow choice for
Nano's was somewhat arbitrary IIRC to align with the finest granularity time in
Arrow at the the time. I think it was likely a pragmatic choice at the time
and we haven't had complaints about it to my knowledge (although I think
Interval in general is fairly esoteric type I think).
> Even using an int32 of months for the YearToMonth interval doesn't match
the SQL standard exactly, but spark and many other columnar compute engines use
int32 instead of int64.
Can you elaborate on this, which part says it is an int64. an int32 easily
holds +/- 10K years when expressed in months?
> I just think that there's more benefit to prioritizing format
compatibility and reducing/avoiding conversions then to necessarily match the
SQL standard. That's my personal opinion at least. Thoughts?
So I think there are potentially 2 problems to address in terms of
compatibility:
1. On some level, both Date Time and Year-month intervals are real world
types, and engines will want to write them (e.g. Spark added them in 4.0.0, and
I would need to do some JIRA digging but I thought the intent was to replace
CalendarInterval). We need some way to make an unambiguous recovery of the
types from parquet. I think this is solvable with appropriate metadata,
regardless of encoding, so not really a blocker.
2. Being able to store the full range of the SQL standard. Pragmatically, I
think the ranges provided by Arrow's MonthDayNanos type meet most practical
purposes, but I can also see the flip-side of users getting frustrated when
they can't store something that the SQL engine supports in memory but can't be
persisted.
A strawman proposal, if we really wanted to keep three distinct types we
could encode them all in MonthDayNanos (to shred or not becomes another
question with some design issues as well, I think it likely overlaps with
DecFloat). Then the metadata at the parquet level could look like:
```
enum IntervalTypeRepresentation {
YearMonth - Only the Month field is ever set
DayTime - The full range of nanoseconds is used if possible. If the
range exceeds the representable int64 nanos range, then
day field + nanos are used. In the latter case Nanosecond's are
always sub-day and day is always interpreted as 86,400,000,000 nanoseconds.
Calendar - All fields are indepedent (e.g. match Arrow semantics exactly)
}
```
Translating to arrow could go through any of the defined interval types
(subject to truncation).
Problems with this approach off the top of my head:
1. It feels inelegant/complicated (in theory DayTime intervals have a sort
order and the sorting becomes non-trivial)
2. will likely slightly higher computation overhead for deltas exceeding
nano second range
3. I'm still not really sure it gets us fully to Arrow compatibility because
of the fallback nature for Day Field for Nanos.
@RussellSpitzer @gh-yzou thoughts here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]