Re: [DISCUSS] Define calendar used in specification?

Micah Kornfield Mon, 30 Sep 2024 16:26:25 -0700

I just wanted to follow up on this.  A compromise on language here could be
that Iceberg uses the ISO8601 calendar.  For dates prior to the
Julian/Gregorian calendar transition, implementations are encouraged to use
proleptic Gregorian, but this is left unspecified by the specification.
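
To make the ambiguity concrete, here is a minimal sketch (not Iceberg code)
of how one stored days-since-epoch value maps to different calendar dates
depending on the calendar an engine assumes. The function names are
hypothetical. Python's datetime module is documented to use the proleptic
Gregorian calendar, and the Julian conversion below is the commonly
published Julian Day Number algorithm.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
# Julian Day Number of 1970-01-01 (Gregorian), a well-known constant.
EPOCH_JDN = 2440588


def proleptic_gregorian(epoch_day):
    """Interpret days since 1970-01-01 in the proleptic Gregorian calendar."""
    d = EPOCH + timedelta(days=epoch_day)
    return (d.year, d.month, d.day)


def julian(epoch_day):
    """Interpret the same day count in the Julian calendar.

    Hypothetical helper for illustration; valid for positive Julian Day
    Numbers, which covers all dates from year 1 onward.
    """
    f = epoch_day + EPOCH_JDN + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (12 + 2 - month) // 12
    return (year, month, day)


# The epoch day that proleptic Gregorian reads as 0001-01-01,
# a value some systems use as a sentinel.
sentinel = (date(1, 1, 1) - EPOCH).days
print(proleptic_gregorian(sentinel), julian(sentinel))
```

For the sentinel value above, the two interpretations differ by two days, so
two engines that disagree on the calendar would disagree on the date they
show for the same stored value.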


Thoughts?

Micah





On Thu, Sep 12, 2024 at 1:33 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>> The spec purposely avoids timestamp conversion. Iceberg returns values as
>> they are passed from the engine and it is the engine's responsibility to do
>> any date/time conversion. I don't think that we should change this and take
>> responsibility in Iceberg.
>
>
> I'm not sure I understand this and I might be confusing the issues.  Two
> questions:
> 1.   How can two different engines agree on this conversion for
> pre-Gregorian calendar dates without a specification?  (I guess the
> alternative being proposed is to state explicitly that this is out of scope?)
> 2.   I could be misreading this, but it seems the Java implementation
> implicitly relies on a calendar system both for transforms [1][2] and
> reading parquet [3].  pyiceberg also seems to use proleptic
> Gregorian [4][5].  Doesn't this imply that there has been a specific choice
> in the calendar?
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L227
> [2]
> https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/transforms/Timestamps.java#L205
> [3]
> https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L421
> [4]
> https://github.com/apache/iceberg-python/blob/d8d509ff1bc33040b9f6c90c28ee47ac7437945d/pyiceberg/utils/datetime.py#L122C12-L122C27
> [5] datetime.datetime(1582, 10, 20) + datetime.timedelta(microseconds=-7*
> 86400*1000*1000) => datetime.datetime(1582, 10, 13, 0, 0)
>
> On Thu, Sep 12, 2024 at 12:27 PM rdb...@gmail.com <rdb...@gmail.com>
> wrote:
>
>> The spec purposely avoids timestamp conversion. Iceberg returns values as
>> they are passed from the engine and it is the engine's responsibility to do
>> any date/time conversion. I don't think that we should change this and take
>> responsibility in Iceberg.
>>
>> On Thu, Sep 12, 2024 at 12:32 AM Bart Samwel <b...@databricks.com.invalid>
>> wrote:
>>
>>> I have some historical context that may or may not be relevant. I still
>>> remember how we did the transition for Spark. This was ca. 2019, and there
>>> were still many people mixing Spark 2.x and 3.0. Also, many other systems
>>> were still using Java 7 which only supported Julian. As a result, Spark
>>> 3.0+ can even still write with the Julian calendar, at least if using the
>>> Spark-native parquet read and write path.
>>>
>>> 1) The parquet files written by Spark 3.0+ have metadata keys that
>>> contain a Spark version ("org.apache.spark.version") and whether the
>>> timestamps are in Julian a.k.a. Java 7 ("org.apache.spark.legacyDateTime").
>>> There's also "org.apache.spark.legacyINT96" which is about whether INT96
>>> timestamps have been written with Julian calendar in the date part.
>>>
>>> 2) Files that don't have a Spark version are interpreted as Julian or
>>> proleptic Gregorian depending on a config
>>> "spark.sql.parquet.datetimeRebaseModeInRead" /
>>> "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
>>> ORC and avro.) This defaults to EXCEPTION, which means "if a date is
>>> different in the two calendars, fail the read and force the users to
>>> choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
>>> at read time because Spark 3.0+ uses the Java 8 proleptic Gregorian
>>> calendar internally.
>>>
>>> 3) Writing mode is controlled by configs
>>> "spark.sql.parquet.datetimeRebaseModeInWrite" and
>>> "spark.sql.parquet.int96RebaseModeInWrite". These were also until recently
>>> set to EXCEPTION (i.e., force the user to choose when a value is
>>> encountered where it matters). See
>>> https://issues.apache.org/jira/browse/SPARK-46440.
>>>
>>> I'm not sure if any of this matters for Iceberg though. It may matter if
>>> any Iceberg implementation writes using the Spark native parquet/orc/avro
>>> write path AND the user has configured it to use LEGACY dates. Or are there
>>> paths where Iceberg can convert from Parquet files? Then you might
>>> encounter these metadata flags. I'm not sure if it's worth complicating the
>>> spec by supporting this. :)
>>>
>>> On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> At the moment, the specification is ambiguous on which calendar is used
>>>> for temporal conversion/writing [1]. Reading the java code it appears it is
>>>> using Java's OffsetDateTime which conforms to ISO8601 [2].  ISO8601 appears
>>>> to explicitly disallow the Julian calendar (but only says proleptic
>>>> gregorian can be used by mutual consent [3]).
>>>>
>>>> Therefore I'd propose:
>>>> 1. We make the  ISO8601 + proleptic Gregorian + Gregorian calendars
>>>> explicit in the specification.
>>>> 2. Mention in an implementation note, that data migrated from other
>>>> systems or data written by older systems might follow the Julian calendar
>>>> (e.g. it looks like Spark only transitioned in 3.0 [4]).
>>>>   *  Does anybody know of metadata available for systems to make this
>>>> determination?
>>>>   *  Or a recommendation on how to handle these?
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] This is esoteric, but a few systems use 0001-01-01 as a sentinel
>>>> value for null, so this does have some wider applicability
>>>> [2]
>>>> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
>>>> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
>>>> [4] https://issues.apache.org/jira/browse/SPARK-26651
>>>>
>>>>
