I have some historical context that may or may not be relevant. I still remember how we did the transition for Spark. This was ca. 2019, and there were still many people mixing Spark 2.x and 3.0. Also, many other systems were still using the Java 7 date/time APIs, which only supported the hybrid Julian/Gregorian calendar (just "Julian" below). As a result, Spark 3.0+ can still write with the Julian calendar, at least when using the Spark-native Parquet read and write path.
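To make the calendar difference concrete, here is a small Python sketch (illustration only, not Spark or Iceberg code). It uses the standard Julian Day Number formulas to compute "days since 1970-01-01" for the same calendar label under both calendars. The function names are mine. For the 0001-01-01 sentinel mentioned in the quoted mail, the two interpretations are two days apart.

```python
# Illustration only: day counts for the same calendar label under the
# Julian and the proleptic Gregorian calendar, via Julian Day Numbers (JDN).

UNIX_EPOCH_JDN = 2440588  # JDN of 1970-01-01 in the Gregorian calendar


def _shifted(year, month):
    # Treat Jan/Feb as months 10/11 of the previous year (March-based year),
    # so that the leap day falls at the end of the year.
    a = (14 - month) // 12
    return year + 4800 - a, month + 12 * a - 3


def gregorian_epoch_day(year, month, day):
    """Days since 1970-01-01 for a proleptic Gregorian calendar label."""
    y, m = _shifted(year, month)
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045
    return jdn - UNIX_EPOCH_JDN


def julian_epoch_day(year, month, day):
    """Days since 1970-01-01 (Gregorian) for a Julian calendar label."""
    y, m = _shifted(year, month)
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn - UNIX_EPOCH_JDN


print(gregorian_epoch_day(1, 1, 1))  # -719162
print(julian_epoch_day(1, 1, 1))     # -719164
```

So a stored day count of -719164 reads as 0001-01-01 under the Julian calendar, but as 0000-12-30 under proleptic Gregorian, which is why a reader has to know which calendar the writer used.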
1) The Parquet files written by Spark 3.0+ have metadata keys that contain the Spark version ("org.apache.spark.version") and indicate whether the timestamps are in Julian a.k.a. Java 7 ("org.apache.spark.legacyDateTime"). There's also "org.apache.spark.legacyINT96", which indicates whether INT96 timestamps have been written with the Julian calendar in the date part.

2) Files that don't have a Spark version are interpreted as Julian or proleptic Gregorian depending on the configs "spark.sql.parquet.datetimeRebaseModeInRead" / "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for ORC and Avro.) These default to EXCEPTION, which means "if a date is different in the two calendars, fail the read and force the users to choose". If set to LEGACY, Spark will actually "rebase" the dates at read time, because Spark 3.0+ uses the Java 8 proleptic Gregorian calendar internally.

3) Writing mode is controlled by the configs "spark.sql.parquet.datetimeRebaseModeInWrite" and "spark.sql.parquet.int96RebaseModeInWrite". These were also set to EXCEPTION by default until recently (i.e., force the user to choose when a value is encountered where it matters). See https://issues.apache.org/jira/browse/SPARK-46440.

I'm not sure if any of this matters for Iceberg though. It may matter if any Iceberg implementation writes using the Spark-native Parquet/ORC/Avro write path AND the user has configured it to use LEGACY dates. Or are there paths where Iceberg can convert from Parquet files? Then you might encounter these metadata flags. I'm not sure if it's worth complicating the spec by supporting this. :)

On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> At the moment, the specification is ambiguous on which calendar is used
> for temporal conversion/writing [1]. Reading the java code it appears it is
> using Java's OffsetDateTime which conforms to ISO8601 [2]. ISO8601 appears
> to explicitly disallow the Julian calendar (but only says proleptic
> gregorian can be used by mutual consent [3]).
>
> Therefore I'd propose:
> 1. We make the ISO8601 + proleptic Gregorian + Gregorian calendars
> explicit in the specification.
> 2. Mention in an implementation note, that data migrated from other
> systems or data written by older systems might follow the Julian calendar
> (e.g. it looks like Spark only transitioned in 3.0 [4]).
>    * Does anybody know of metadata available for systems to make this
> determination?
>    * Or a recommendation on how to handle these?
>
> Thoughts?
>
> Thanks,
> Micah
>
> [1] This is esoteric but a few systems use 0001-01-01 as a sentinel value
> for null so does have some wider applicability
> [2]
> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
> [4] https://issues.apache.org/jira/browse/SPARK-26651
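
On the question in the quoted mail about metadata for making this determination: a reader could combine the Spark metadata keys and a fallback config roughly as in the Python sketch below. This is a hypothetical helper, not Spark or Iceberg code. The name `rebase_mode` and its parameters are made up for illustration. It assumes that the mere presence of the legacy key is the signal (I have not checked what value Spark stores there) and that any file carrying a Spark version was written by 3.0+. CORRECTED is Spark's name for the "no rebase, proleptic Gregorian" mode.

```python
# Hypothetical sketch: decide how to interpret dates/timestamps in a Parquet
# file from its key-value metadata. Not taken from Spark or Iceberg.

SPARK_VERSION_KEY = "org.apache.spark.version"
LEGACY_DATETIME_KEY = "org.apache.spark.legacyDateTime"
LEGACY_INT96_KEY = "org.apache.spark.legacyINT96"


def rebase_mode(file_metadata, mode_from_config="EXCEPTION", int96=False):
    """Return "LEGACY", "CORRECTED", or the configured fallback mode.

    file_metadata:    dict of the file's key-value metadata
    mode_from_config: what to do for files without a Spark version
    int96:            True when deciding for INT96 timestamps
    """
    if SPARK_VERSION_KEY not in file_metadata:
        # Unknown writer: the file itself cannot tell us, so fall back to config.
        return mode_from_config
    legacy_key = LEGACY_INT96_KEY if int96 else LEGACY_DATETIME_KEY
    # Assumption: key present means the writer used the Julian (legacy) calendar.
    return "LEGACY" if legacy_key in file_metadata else "CORRECTED"


print(rebase_mode({}))
print(rebase_mode({SPARK_VERSION_KEY: "3.5.0"}))
print(rebase_mode({SPARK_VERSION_KEY: "3.5.0", LEGACY_DATETIME_KEY: ""}))
```

The three calls cover a file from an unknown writer, a Spark 3.x file written without the legacy flag, and a Spark 3.x file written in LEGACY mode.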