Re: [DISCUSS] Define calendar used in specification?

Micah Kornfield Thu, 12 Sep 2024 13:33:58 -0700

>
> The spec purposely avoids timestamp conversion. Iceberg returns values as
> they are passed from the engine and it is the engine's responsibility to do
> any date/time conversion. I don't think that we should change this and take
> responsibility in Iceberg.



I'm not sure I understand this, and I might be conflating the issues.  Two
questions:
1.   How can two different engines agree on the conversion of dates that
predate the Gregorian calendar without a specification?  (I guess the
alternative being proposed is to state explicitly that this is out of scope?)
2.   I could be misreading this, but the Java implementation seems to rely
implicitly on a calendar system both for transforms [1][2] and for reading
Parquet [3].  pyiceberg also appears to use the proleptic Gregorian
calendar [4][5].  Doesn't this imply that a specific calendar has already
been chosen?

Thanks,
Micah

[1]
https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L227
[2]
https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/transforms/Timestamps.java#L205
[3]
https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L421
[4]
https://github.com/apache/iceberg-python/blob/d8d509ff1bc33040b9f6c90c28ee47ac7437945d/pyiceberg/utils/datetime.py#L122C12-L122C27
[5] datetime.datetime(1582, 10, 20) +
datetime.timedelta(microseconds=-7*86400*1000*1000)
=> datetime.datetime(1582, 10, 13, 0, 0)
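
To make question 1 and footnote [5] concrete, here is a minimal Python
sketch (not taken from the Iceberg or pyiceberg code). It shows that
Python's datetime is proleptic Gregorian, and that the same stored
days-since-epoch value maps to different calendar dates depending on which
calendar an engine assumes. The helper names days_to_proleptic_gregorian
and days_to_julian are hypothetical, and the Julian conversion is the
standard Julian Day Number algorithm, not anything from the spec.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
# Julian Day Number of 1970-01-01 (proleptic Gregorian).
EPOCH_JDN = 2440588


def days_to_proleptic_gregorian(days):
    """Interpret days-since-epoch with Python's proleptic Gregorian calendar."""
    return EPOCH + datetime.timedelta(days=days)


def days_to_julian(days):
    """Interpret the same day count as a (year, month, day) Julian calendar date.

    Hypothetical helper: standard Julian Day Number to Julian calendar
    conversion, without the Gregorian century correction.
    """
    f = days + EPOCH_JDN + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return (year, month, day)


# Footnote [5]: no gap around October 1582, so the calendar is proleptic.
week_before = datetime.datetime(1582, 10, 20) + datetime.timedelta(
    microseconds=-7 * 86400 * 1000 * 1000
)
print(week_before)  # 1582-10-13 00:00:00

# The sentinel date from the original proposal, stored as days since epoch.
sentinel_days = (datetime.date(1, 1, 1) - EPOCH).days
print(days_to_proleptic_gregorian(sentinel_days))  # 0001-01-01
print(days_to_julian(sentinel_days))               # (1, 1, 3)
```

With these assumptions, a writer and a reader that disagree on the calendar
see the 0001-01-01 sentinel shifted by two days, which is the kind of
mismatch the question is about.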

On Thu, Sep 12, 2024 at 12:27 PM [email protected] <[email protected]> wrote:

> The spec purposely avoids timestamp conversion. Iceberg returns values as
> they are passed from the engine and it is the engine's responsibility to do
> any date/time conversion. I don't think that we should change this and take
> responsibility in Iceberg.
>
> On Thu, Sep 12, 2024 at 12:32 AM Bart Samwel <[email protected]>
> wrote:
>
>> I have some historical context that may or may not be relevant. I still
>> remember how we did the transition for Spark. This was ca. 2019, and there
>> were still many people mixing Spark 2.x and 3.0. Also, many other systems
>> were still using Java 7 which only supported Julian. As a result, Spark
>> 3.0+ can even still write with the Julian calendar, at least if using the
>> Spark-native parquet read and write path.
>>
>> 1) The parquet files written by Spark 3.0+ have metadata keys that
>> contain a Spark version ("org.apache.spark.version") and whether the
>> timestamps are in Julian a.k.a. Java 7 ("org.apache.spark.legacyDateTime").
>> There's also "org.apache.spark.legacyINT96" which is about whether INT96
>> timestamps have been written with Julian calendar in the date part.
>>
>> 2) Files that don't have a Spark version are interpreted as Julian or
>> proleptic Gregorian depending on a config
>> "spark.sql.parquet.datetimeRebaseModeInRead" /
>> "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
>> ORC and Avro.) This defaults to EXCEPTION, which means "if a date is
>> different in the two calendars, fail the read and force the users to
>> choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
>> at read time because Spark 3.0+ uses the Java 8 proleptic gregorian
>> calendar internally.
>>
>> 3) Writing mode is controlled by configs
>> "spark.sql.parquet.datetimeRebaseModeInWrite" and
>> "spark.sql.parquet.int96RebaseModeInWrite". These were also until recently
>> set to EXCEPTION (i.e., force the user to choose when a value is
>> encountered where it matters). See
>> https://issues.apache.org/jira/browse/SPARK-46440.
>>
>> I'm not sure if any of this matters for Iceberg though. It may matter if
>> any Iceberg implementation writes using the Spark native parquet/orc/avro
>> write path AND the user has configured it to use LEGACY dates. Or are there
>> paths where Iceberg can convert from Parquet files? Then you might
>> encounter these metadata flags. I'm not sure if it's worth complicating the
>> spec by supporting this. :)
>>
>> On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> At the moment, the specification is ambiguous on which calendar is used
>>> for temporal conversion/writing [1]. Reading the Java code, it appears to
>>> use Java's OffsetDateTime, which conforms to ISO8601 [2].  ISO8601 appears
>>> to explicitly disallow the Julian calendar (but only says proleptic
>>> Gregorian can be used by mutual consent [3]).
>>>
>>> Therefore I'd propose:
>>> 1. We make the ISO8601 + proleptic Gregorian + Gregorian calendars
>>> explicit in the specification.
>>> 2. Mention in an implementation note, that data migrated from other
>>> systems or data written by older systems might follow the Julian calendar
>>> (e.g. it looks like Spark only transitioned in 3.0 [4]).
>>>   *  Does anybody know of metadata available for systems to make this
>>> determination?
>>>   *  Or a recommendation on how to handle these?
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] This is esoteric, but a few systems use 0001-01-01 as a sentinel
>>> value for null, so this does have some wider applicability
>>> [2]
>>> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
>>> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
>>> [4] https://issues.apache.org/jira/browse/SPARK-26651
>>>
>>>
