I have some historical context that may or may not be relevant. I still
remember how we did the transition for Spark. This was ca. 2019, and many
people were still mixing Spark 2.x and 3.0. Many other systems were also
still on the Java 7 date/time APIs, which use a hybrid calendar that is
Julian for dates before the 1582 Gregorian cutover (I'll just call it
Julian below). As a result, Spark 3.0+ can still write with the Julian
calendar, at least when using the Spark-native parquet read and write path.

1) The parquet files written by Spark 3.0+ carry metadata keys that record
the Spark version ("org.apache.spark.version") and whether the timestamps
are in Julian, a.k.a. Java 7 ("org.apache.spark.legacyDateTime"). There's
also "org.apache.spark.legacyINT96", which records whether INT96 timestamps
were written with the Julian calendar in the date part.

2) Files that don't have a Spark version are interpreted as Julian or
proleptic Gregorian depending on a config
"spark.sql.parquet.datetimeRebaseModeInRead" /
"spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
ORC and Avro.) This defaults to EXCEPTION, which means "if a date is
different in the two calendars, fail the read and force the users to
choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
at read time, because Spark 3.0+ uses the Java 8 proleptic Gregorian
calendar internally.

3) The write mode is controlled by the configs
"spark.sql.parquet.datetimeRebaseModeInWrite" and
"spark.sql.parquet.int96RebaseModeInWrite". Until recently, these also
defaulted to EXCEPTION (i.e., force the user to choose when a value is
encountered where it matters). See
https://issues.apache.org/jira/browse/SPARK-46440.
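
To make the read side of the points above a bit more concrete, here is a
rough Python sketch of how I understand the decision. It is not Spark's
code: the function names (resolve_read_mode, julian_label, read_date) are
made up for illustration, the "anything before 3.0 is legacy" rule is my
assumption, and it only covers DATE-style day counts, not INT96.

```python
from datetime import date, timedelta

SPARK_VERSION_KEY = "org.apache.spark.version"
LEGACY_DATETIME_KEY = "org.apache.spark.legacyDateTime"

EPOCH = date(1970, 1, 1)
# From 1582-10-15 on, the hybrid and proleptic Gregorian calendars agree.
CUTOVER_DAYS = (date(1582, 10, 15) - EPOCH).days


def resolve_read_mode(file_meta, config_mode):
    """Choose a rebase mode from a file's key-value metadata (hypothetical)."""
    version = file_meta.get(SPARK_VERSION_KEY)
    if version is None:
        return config_mode  # no Spark version: the read config decides
    # Assumption: anything written before Spark 3.0 used the legacy calendar.
    if int(version.split(".")[0]) < 3 or LEGACY_DATETIME_KEY in file_meta:
        return "LEGACY"
    return "CORRECTED"


def julian_label(days):
    """(year, month, day) in the Julian calendar for days since 1970-01-01."""
    c = days + 2440588 + 32082  # 2440588 is the Julian Day Number of the epoch
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    return d - 4800 + m // 10, m + 3 - 12 * (m // 10), e - (153 * m + 2) // 5 + 1


def read_date(days, mode):
    """Decode a stored day count under the given rebase mode."""
    if days >= CUTOVER_DAYS or mode == "CORRECTED":
        return EPOCH + timedelta(days=days)
    if mode == "EXCEPTION":
        raise ValueError("ambiguous pre-1582 date: choose LEGACY or CORRECTED")
    # LEGACY: keep the year-month-day label the legacy writer meant.
    # Not handled here: Julian-only leap days and years before 0001.
    return date(*julian_label(days))


# A legacy writer that meant 1500-01-01 stores the day count that the
# proleptic Gregorian calendar calls 1500-01-10 (the calendars are 9 days apart).
stored = (date(1500, 1, 10) - EPOCH).days
print(read_date(stored, "LEGACY"))     # 1500-01-01
print(read_date(stored, "CORRECTED"))  # 1500-01-10
```

The point of the example is that the stored integer is the same in both
cases; only the calendar used to turn it back into a year-month-day label
changes, which is why a reader needs either the file metadata or a config to
know what the writer meant.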

I'm not sure if any of this matters for Iceberg though. It may matter if
any Iceberg implementation writes using the Spark native parquet/orc/avro
write path AND the user has configured it to use LEGACY dates. Or are there
paths where Iceberg can convert from Parquet files? Then you might
encounter these metadata flags. I'm not sure if it's worth complicating the
spec by supporting this. :)

On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> At the moment, the specification is ambiguous on which calendar is used
> for temporal conversion/writing [1]. Reading the java code it appears it is
> using Java's OffsetDateTime which conforms to ISO8601 [2].  ISO8601 appears
> to explicitly disallow the Julian calendar (but only says proleptic
> gregorian can be used by mutual consent [3]).
>
> Therefore I'd propose:
> 1. We make the  ISO8601 + proleptic Gregorian + Gregorian calendars
> explicit in the specification.
> 2. Mention in an implementation note, that data migrated from other
> systems or data written by older systems might follow the Julian calendar
> (e.g. it looks like Spark only transitioned in 3.0 [4]).
>   *  Does anybody know of metadata available for systems to make this
> determination?
>   *  Or a recommendation on how to handle these?
>
> Thoughts?
>
> Thanks,
> Micah
>
> [1] This is esoteric but a few systems use 0001-01-01 as a sentinel value
> for null so does have some wider applicability
> [2]
> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
> [4] https://issues.apache.org/jira/browse/SPARK-26651
>
>
