Trying to catch up on this. Serge's suggestion in the doc seems like the best way forward: https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?disco=AAABe5AUnWU. Spark would support the full ANSI SQL timestamp range, and Iceberg, Parquet, and other data sources would throw a runtime error when asked to write a value outside their supported range, until we get a wider timestamp type in Parquet (Iceberg's V3 timestamp_ns type is just built on top of Parquet's).
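To make the range gap concrete, here is a rough Python sketch (not Spark, Iceberg, or Parquet code). It assumes the data source stores the timestamp as a signed 64-bit count of nanoseconds since the Unix epoch, which is the int64 representation mentioned elsewhere in this thread. The names `int64_nanos_bounds` and `to_int64_nanos` are made up for illustration, and `OverflowError` just stands in for whatever runtime error a real writer would raise.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def int64_nanos_bounds():
    """Approximate first/last instants that fit in int64 nanoseconds.

    Python datetimes only carry microseconds, so both bounds are
    truncated toward the epoch to stay inside the int64 range.
    """
    lo = EPOCH - timedelta(microseconds=(2**63) // 1000)
    hi = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
    return lo, hi


def to_int64_nanos(seconds_since_epoch: int, nanos_of_second: int) -> int:
    """Hypothetical write-side check: convert, or fail if out of range."""
    total = seconds_since_epoch * NANOS_PER_SECOND + nanos_of_second
    if not INT64_MIN <= total <= INT64_MAX:
        raise OverflowError(
            f"timestamp {total} ns is outside the int64 nanosecond range"
        )
    return total


def epoch_seconds(dt: datetime) -> int:
    """Whole seconds between the Unix epoch and dt (floor division)."""
    return (dt - EPOCH) // timedelta(seconds=1)


if __name__ == "__main__":
    lo, hi = int64_nanos_bounds()
    print("int64 nanos covers roughly", lo.isoformat(), "to", hi.isoformat())

    # A value near the top of the ANSI SQL range (year 9999) does not fit.
    late = datetime(9999, 12, 31, tzinfo=timezone.utc)
    try:
        to_int64_nanos(epoch_seconds(late), 0)
    except OverflowError as err:
        print("write rejected:", err)
```

Under that assumption the engine can hold any value in years 0001 to 9999, while the writer rejects anything outside roughly 1677 to 2262 at write time.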
Thanks,
Szehon

On Thu, Mar 27, 2025 at 9:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> I think the key issue is the format. The proposed 10-byte format doesn't
>> seem like a standard and the one in Iceberg/Parquet does not support the
>> range required by ANSI SQL: year 0001 to year 9999. We should address this
>> issue first. Note that Parquet has an INT96 timestamp that supports
>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>> community to revive it?
>
> It would be great to discuss a plan for this in parquet. This has come up
> in passing in some of the recent parquet syncs. I don't think resurrecting
> int96 is necessarily a great idea since it is defined in terms of Julian
> days [1], and most systems these days are standardizing on
> proleptic-Gregorian.
>
> A fair number of the OSS implementations I've seen that interact with int96
> do the conversion assuming all timestamps are post-Unix-epoch, and
> therefore have errors/idiosyncrasies when translating dates prior to the
> Gregorian cutover.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/parquet-format/pull/49
>
> On Thu, Mar 27, 2025 at 7:02 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Maybe we should discuss the key issues on the dev list as it's easy to
>> lose track of Google Doc comments.
>>
>> I think all the proposals for adding new data types need to prove that
>> the new data type is common/standard in the ecosystem. This means 3 things:
>> - it has common/standard semantics. TIMESTAMP with nanosecond precision is
>> definitely a standard data type, in both ANSI SQL and mainstream databases.
>> - it has a common/standard storage format. Parquet/Iceberg supports
>> nanosecond timestamps using int64, which is different from what is proposed
>> here.
>> - it has common/standard processing methods. The Java datetime library
>> Spark is using now already supports nanoseconds, so we are fine here.
>>
>> I think the key issue is the format. The proposed 10-byte format doesn't
>> seem like a standard and the one in Iceberg/Parquet does not support the
>> range required by ANSI SQL: year 0001 to year 9999. We should address this
>> issue first. Note that Parquet has an INT96 timestamp that supports
>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>> community to revive it?
>>
>> On Fri, Mar 28, 2025 at 7:03 AM DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> Thanks!!!
>>>
>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>
>>> On Mar 27, 2025, at 3:56 PM, Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>
>>> Thanks DB,
>>>
>>> I just noticed a few more comments came in after I initiated the vote.
>>> I'm going to postpone the voting process and address those outstanding
>>> comments.
>>>
>>> Qi Tan
>>>
>>> DB Tsai <dbt...@dbtsai.com> wrote on Thu, Mar 27, 2025 at 15:12:
>>>
>>>> Hello Qi,
>>>>
>>>> I'm supportive of the NanoSecond Timestamps proposal; however, before
>>>> we initiate the vote, there are a few outstanding comments in the SPIP
>>>> document that haven't been addressed yet. Since the vote is on the document
>>>> itself, could we resolve these items beforehand?
>>>>
>>>> For example:
>>>>
>>>> - The default precision of TimestampNsNTZType is set to 6, which
>>>>   overlaps with the existing TimestampNTZ.
>>>> - The specified range exceeds the capacity of an int64, but the
>>>>   document doesn't clarify how this type will be represented in memory or
>>>>   serialized in data sources.
>>>> - Schema inference details for data sources are missing.
>>>>
>>>> These points still need discussion.
>>>>
>>>> I appreciate your efforts in putting the doc together and look forward
>>>> to your contribution!
>>>>
>>>> Thanks,
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Mar 27, 2025, at 1:24 PM, huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> On Thu, Mar 27, 2025 at 1:22 PM Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to start a vote on adding support for nanosecond
>>>>> timestamps.
>>>>>
>>>>> *Discussion thread:*
>>>>> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
>>>>> *SPIP:*
>>>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
>>>>> *JIRA:* https://issues.apache.org/jira/browse/SPARK-50532
>>>>>
>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because