> The spec purposely avoids timestamp conversion. Iceberg returns values as
> they are passed from the engine and it is the engine's responsibility to do
> any date/time conversion. I don't think that we should change this and take
> responsibility in Iceberg.

I'm not sure I understand this and I might be confusing the issues. Two questions:

1. How can two different engines agree on this conversion for pre-Gregorian calendar dates without a specification? (I guess the alternative being proposed is to state explicitly that this is out of scope?)
2. I could be misreading this, but it seems the Java implementation implicitly relies on a calendar system both for transforms [1][2] and for reading Parquet [3]. pyiceberg also seems to use proleptic Gregorian [4][5]. Doesn't this imply that a specific choice of calendar has already been made?

Thanks,
Micah

[1] https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L227
[2] https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/api/src/main/java/org/apache/iceberg/transforms/Timestamps.java#L205
[3] https://github.com/apache/iceberg/blob/ab0594bf71a6884ee0e196470bfe4b4d3baa58b9/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L421
[4] https://github.com/apache/iceberg-python/blob/d8d509ff1bc33040b9f6c90c28ee47ac7437945d/pyiceberg/utils/datetime.py#L122C12-L122C27
[5] datetime.datetime(1582, 10, 20) + datetime.timedelta(microseconds=-7 * 86400 * 1000 * 1000) => datetime.datetime(1582, 10, 13, 0, 0)

On Thu, Sep 12, 2024 at 12:27 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> The spec purposely avoids timestamp conversion. Iceberg returns values as
> they are passed from the engine and it is the engine's responsibility to do
> any date/time conversion. I don't think that we should change this and take
> responsibility in Iceberg.
>
> On Thu, Sep 12, 2024 at 12:32 AM Bart Samwel <b...@databricks.com.invalid>
> wrote:
>
>> I have some historical context that may or may not be relevant. I still
>> remember how we did the transition for Spark. This was ca. 2019, and there
>> were still many people mixing Spark 2.x and 3.0.
>> Also, many other systems
>> were still using Java 7, which only supported Julian. As a result, Spark
>> 3.0+ can even still write with the Julian calendar, at least if using the
>> Spark-native Parquet read and write path.
>>
>> 1) The Parquet files written by Spark 3.0+ have metadata keys that
>> contain a Spark version ("org.apache.spark.version") and whether the
>> timestamps are in Julian a.k.a. Java 7 ("org.apache.spark.legacyDateTime").
>> There's also "org.apache.spark.legacyINT96" which is about whether INT96
>> timestamps have been written with Julian calendar in the date part.
>>
>> 2) Files that don't have a Spark version are interpreted as Julian or
>> proleptic Gregorian depending on a config
>> "spark.sql.parquet.datetimeRebaseModeInRead" /
>> "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
>> ORC and Avro.) This defaults to EXCEPTION, which means "if a date is
>> different in the two calendars, fail the read and force the users to
>> choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
>> at read time because Spark 3.0+ uses the Java 8 proleptic Gregorian
>> calendar internally.
>>
>> 3) Writing mode is controlled by configs
>> "spark.sql.parquet.datetimeRebaseModeInWrite" and
>> "spark.sql.parquet.int96RebaseModeInWrite". These were also until recently
>> set to EXCEPTION (i.e., force the user to choose when a value is
>> encountered where it matters). See
>> https://issues.apache.org/jira/browse/SPARK-46440.
>>
>> I'm not sure if any of this matters for Iceberg though. It may matter if
>> any Iceberg implementation writes using the Spark native Parquet/ORC/Avro
>> write path AND the user has configured it to use LEGACY dates. Or are there
>> paths where Iceberg can convert from Parquet files? Then you might
>> encounter these metadata flags. I'm not sure if it's worth complicating the
>> spec by supporting this.
>> :)
>>
>> On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> At the moment, the specification is ambiguous on which calendar is used
>>> for temporal conversion/writing [1]. Reading the Java code it appears it is
>>> using Java's OffsetDateTime which conforms to ISO8601 [2]. ISO8601 appears
>>> to explicitly disallow the Julian calendar (but only says proleptic
>>> Gregorian can be used by mutual consent [3]).
>>>
>>> Therefore I'd propose:
>>> 1. We make the ISO8601 + proleptic Gregorian + Gregorian calendars
>>> explicit in the specification.
>>> 2. Mention in an implementation note, that data migrated from other
>>> systems or data written by older systems might follow the Julian calendar
>>> (e.g. it looks like Spark only transitioned in 3.0 [4]).
>>>    * Does anybody know of metadata available for systems to make this
>>> determination?
>>>    * Or a recommendation on how to handle these?
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] This is esoteric but a few systems use 0001-01-01 as a sentinel
>>> value for null so does have some wider applicability
>>> [2]
>>> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
>>> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
>>> [4] https://issues.apache.org/jira/browse/SPARK-26651
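
To make the ambiguity discussed in this thread concrete, here is a minimal sketch, assuming dates are stored as days since 1970-01-01. The helper names (`epoch_days`, `_julian_day_number`) are hypothetical and are not part of Iceberg, pyiceberg, or Spark; the arithmetic is the standard integer Julian Day Number formula, valid for years >= 1. It shows that the same date label maps to different stored integers depending on the calendar the writer assumed, and that Python's `datetime` (proleptic Gregorian) does not skip the October 1582 cutover gap.

```python
from datetime import date, datetime, timedelta

# Julian Day Number of 1970-01-01 in the proleptic Gregorian calendar.
UNIX_EPOCH_JDN = 2440588


def _julian_day_number(year: int, month: int, day: int, gregorian: bool) -> int:
    """Standard integer Julian Day Number formula (valid for years >= 1)."""
    a = (14 - month) // 12      # 1 for January/February, otherwise 0
    y = year + 4800 - a         # shifted year, counted from March
    m = month + 12 * a - 3      # March = 0, ..., February = 11
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if gregorian:
        # Gregorian leap-year correction: drop centuries, keep every 400th year.
        return jdn - y // 100 + y // 400 - 32045
    return jdn - 32083


def epoch_days(year: int, month: int, day: int, calendar: str) -> int:
    """Days since 1970-01-01 for a date label read in the given calendar.

    `calendar` is a hypothetical switch: "julian" or "proleptic_gregorian".
    """
    if calendar not in ("julian", "proleptic_gregorian"):
        raise ValueError(f"unknown calendar: {calendar}")
    gregorian = calendar == "proleptic_gregorian"
    return _julian_day_number(year, month, day, gregorian) - UNIX_EPOCH_JDN


if __name__ == "__main__":
    # The sentinel 0001-01-01 is stored as a different integer per calendar.
    print(epoch_days(1, 1, 1, "julian"))
    print(epoch_days(1, 1, 1, "proleptic_gregorian"))

    # Python's own date arithmetic agrees with the proleptic Gregorian value.
    print((date(1, 1, 1) - date(1970, 1, 1)).days)

    # No gap at the 1582 cutover: 1582-10-13 exists in proleptic Gregorian.
    print(datetime(1582, 10, 20) - timedelta(days=7))
```

For the label 0001-01-01 the two results differ by two days, so a reader that assumes the other calendar would decode a stored value written by a legacy system to a different date label than the writer intended.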