emkornfield commented on code in PR #496:
URL: https://github.com/apache/parquet-format/pull/496#discussion_r2158096228


##########
src/main/thrift/parquet.thrift:
##########
@@ -461,6 +461,29 @@ struct GeographyType {
   2: optional EdgeInterpolationAlgorithm algorithm;
 }
 
+/**
+ * Year-Month Interval logical type annotation
+ *
+ * The data is stored as an 4 byte signed integer which represents the number
+ * of months associated with the time interval. The value can be negative to
+ * indicate a backward duration.
+ *
+ * Allowed for physical type: INT32
+ */
+struct IntervalYearMonthType {
+}
+
+/**
+ * Month-Day Interval logical type annotation
+ *
+ * The data is stored as a 16-byte signed value, which represents the number

Review Comment:
   
   
   > For example, looking at the 
[CalendarInterval](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java)
 spark looks like they almost match the arrow MonthDayNano (they use micros 
instead of nanos).
   
   int64 for microseconds aligns with the 10K years.  The arrow choice for 
Nano's was somewhat arbitrary IIRC to align with the finest granularity time in 
Arrow at the the time.  I think it was likely a pragmatic choice at the time 
and we haven't had complaints about it to my knowledge (although I think 
Interval in general is fairly esoteric type I think).
   
   
   > Even using an int32 of months for the YearToMonth interval doesn't match 
the SQL standard exactly, but spark and many other columnar compute engines use 
int32 instead of int64.
   
   Can you elaborate on this, which part says it is an int64.  an int32 easily 
holds +/- 10K years when expressed in months?
   
   > I just think that there's more benefit to prioritizing format 
compatibility and reducing/avoiding conversions then to necessarily match the 
SQL standard. That's my personal opinion at least. Thoughts?
   
   So I think there are potentially 2 problems to address in terms of 
compatibility:
   
   1.  On some level, both Date Time and Year-month intervals are real world 
types, and engines will want to write them (e.g. Spark added them in 4.0.0, and 
I would need to do some JIRA digging but I thought the intent was to replace 
CalendarInterval).  We need some way to make an unambiguous recovery of the 
types from parquet.  I think this is solvable with appropriate metadata, 
regardless of encoding, so not really a blocker.
   
   2. Being able to store the full range of the SQL standard.  Pragmatically, I 
think the ranges provided by Arrow's MonthDayNanos type meet most practical 
purposes, but I can also see the flip-side of users getting frustrated when 
they can't store something that the SQL engine supports in memory but can't be 
persisted.  
   
   A strawman proposal, if we really wanted to keep three distinct types we 
could encode them all in MonthDayNanos (to shred or not becomes another 
question with some design issues as well, I think it likely overlaps with 
DecFloat).  Then the metadata at the parquet level could look like:
   
   ```
   enum IntervalTypeRepresentation {
      YearMonth - Only the Month field is ever set
      DayTime - The full range of nanoseconds is used if possible.  If the 
range exceeds the representable int64 nanos range, then 
          day field + nanos are used.  In the latter case Nanosecond's are 
always sub-day and day is always interpreted as 86,400,000,000 nanoseconds.
       Calendar - All fields are indepedent (e.g. match Arrow semantics exactly)
   }
   ```
   
   Translating to arrow could go through any of the defined interval types 
(subject to truncation).
   
   Problems with this approach off the top of my head:
   1.  It feels inelegant/complicated (in theory DayTime intervals have a sort 
order and the sorting becomes non-trivial)
   2. will likely slightly higher computation overhead for deltas exceeding 
nano second range
   3. I'm still not really sure it gets us fully to Arrow compatibility because 
of the fallback nature for Day Field for Nanos.
   
   @RussellSpitzer @gh-yzou thoughts here?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to