Thanks all for the discussion! I agree that we first need to reach a
consensus on adding the TIMESTAMP(nanosecond) data type to Apache Spark. It
is a standard data type supported by major databases such as Oracle and IBM
DB2, so adding it to Spark keeps us aligned with industry practice.

For the storage format, Spark supports the full ANSI SQL range from year
0001 to 9999, which requires us to use a 10-byte format. Currently,
Parquet/Iceberg timestamps only cover a limited range around the Unix
epoch, so 8 bytes suffice for them.
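For reference, int64 nanoseconds since the epoch can only span roughly 292
years in either direction, which is why 8 bytes cannot reach the ANSI
bounds. A quick sanity check in a Scala REPL (purely illustrative, not part
of the proposal):

    import java.time.Instant

    // Largest/smallest instants representable as int64 nanoseconds since the epoch.
    val maxNanos = Long.MaxValue                 // 9223372036854775807
    val maxInstant = Instant.ofEpochSecond(maxNanos / 1000000000L,
                                           maxNanos % 1000000000L)
    println(maxInstant)  // 2262-04-11T23:47:16.854775807Z -- far short of 9999-12-31
    println(Instant.ofEpochSecond(Long.MinValue / 1000000000L))
                         // ~1677-09-21 -- far short of year 0001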
While adopting a unified 10-byte format in Parquet/Iceberg is worth
considering, it may not be essential at this moment. Instead, we can handle
timestamps that fall outside the Parquet/Iceberg range by throwing an
exception when they occur. This approach lets us move forward without being
blocked on external dependencies.
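To make the intent concrete, here is a rough write-side sketch (the helper
name and the conservative whole-second bounds are illustrative only, not an
actual Spark or connector API):

    import java.time.Instant

    // Conservative bounds of what int64 nanosecond timestamps (Parquet/Iceberg)
    // can represent: roughly 1677-09-21 to 2262-04-11.
    val MinInt64NanosInstant = Instant.ofEpochSecond(Long.MinValue / 1000000000L)
    val MaxInt64NanosInstant = Instant.ofEpochSecond(Long.MaxValue / 1000000000L)

    // Hypothetical helper a data source could call before writing a
    // nanosecond-precision timestamp to an int64-based format.
    def toInt64Nanos(ts: Instant): Long = {
      if (ts.isBefore(MinInt64NanosInstant) || ts.isAfter(MaxInt64NanosInstant)) {
        throw new ArithmeticException(
          s"Timestamp $ts is outside the range supported by the target format; " +
            "cast to a coarser precision or wait for a wider Parquet type.")
      }
      Math.addExact(Math.multiplyExact(ts.getEpochSecond, 1000000000L),
                    ts.getNano.toLong)
    }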
Thanks,
Huaxin

On Fri, Mar 28, 2025 at 2:11 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Trying to catch up on this. Serge's suggestion in the doc seems the best
> way forward:
> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?disco=AAABe5AUnWU
> Spark would support the full ANSI SQL timestamp range, and Iceberg,
> Parquet, and other data sources would throw a runtime error when asked to
> write a value outside their supported range, until we get a wider
> timestamp type in Parquet (Iceberg's V3 timestamp_ns type is just built
> on top of that).
>
> Thanks,
> Szehon
>
> On Thu, Mar 27, 2025 at 9:45 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> I think the key issue is the format. The proposed 10-byte format
>>> doesn't seem like a standard, and the one in Iceberg/Parquet does not
>>> support the range required by ANSI SQL: year 0001 to year 9999. We
>>> should address this issue first. Note that Parquet has an INT96
>>> timestamp that supports nanosecond precision, but it's deprecated.
>>> Shall we work with the Parquet community to revive it?
>>
>> It would be great to discuss a plan for this in Parquet. This has come
>> up in passing in some of the recent Parquet syncs. I don't think
>> resurrecting INT96 is necessarily a great idea, since it is defined in
>> terms of Julian days [1], and most systems these days are standardizing
>> on the proleptic Gregorian calendar.
>>
>> A fair number of the OSS implementations I've seen that do interact with
>> INT96 perform the conversion assuming all timestamps are after the Unix
>> epoch, and therefore have errors/idiosyncrasies when translating dates
>> prior to the Gregorian cutover.
>>
>> Cheers,
>> Micah
>>
>> [1] https://github.com/apache/parquet-format/pull/49
>>
>> On Thu, Mar 27, 2025 at 7:02 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Maybe we should discuss the key issues on the dev list, as it's easy to
>>> lose track of Google Doc comments.
>>>
>>> I think all proposals for adding new data types need to prove that the
>>> new data type is common/standard in the ecosystem. This means three
>>> things:
>>> - It has common/standard semantics. TIMESTAMP with nanosecond precision
>>> is definitely a standard data type, in both ANSI SQL and mainstream
>>> databases.
>>> - It has a common/standard storage format. Parquet/Iceberg support
>>> nanosecond timestamps using int64, which is different from what is
>>> proposed here.
>>> - It has common/standard processing methods. The Java datetime library
>>> Spark uses today already supports nanoseconds, so we are fine here.
>>>
>>> I think the key issue is the format. The proposed 10-byte format
>>> doesn't seem like a standard, and the one in Iceberg/Parquet does not
>>> support the range required by ANSI SQL: year 0001 to year 9999. We
>>> should address this issue first. Note that Parquet has an INT96
>>> timestamp that supports nanosecond precision, but it's deprecated.
>>> Shall we work with the Parquet community to revive it?
>>>
>>> On Fri, Mar 28, 2025 at 7:03 AM DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> Thanks!!!
>>>>
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Mar 27, 2025, at 3:56 PM, Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>
>>>> Thanks DB,
>>>>
>>>> I just noticed a few more comments came in after I initiated the vote.
>>>> I'm going to postpone the voting process and address those outstanding
>>>> comments.
>>>>
>>>> Qi Tan
>>>>
>>>> On Thu, Mar 27, 2025 at 3:12 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>>> Hello Qi,
>>>>>
>>>>> I'm supportive of the NanoSecond Timestamps proposal; however, before
>>>>> we initiate the vote, there are a few outstanding comments in the
>>>>> SPIP document that haven't been addressed yet. Since the vote is on
>>>>> the document itself, could we resolve these items beforehand?
>>>>>
>>>>> For example:
>>>>>
>>>>> - The default precision of TimestampNsNTZType is set to 6, which
>>>>> overlaps with the existing TimestampNTZ.
>>>>> - The specified range exceeds the capacity of an int64, but the
>>>>> document doesn't clarify how this type will be represented in memory
>>>>> or serialized in data sources.
>>>>> - Schema inference details for data sources are missing.
>>>>>
>>>>> These points still need discussion.
>>>>>
>>>>> I appreciate your efforts in putting the doc together and look
>>>>> forward to your contribution!
>>>>>
>>>>> Thanks,
>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>
>>>>> On Mar 27, 2025, at 1:24 PM, huaxin gao <huaxin.ga...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Mar 27, 2025 at 1:22 PM Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to start a vote on adding support for nanosecond
>>>>>> timestamps.
>>>>>>
>>>>>> *Discussion thread:*
>>>>>> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
>>>>>> *SPIP:*
>>>>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
>>>>>> *JIRA:* https://issues.apache.org/jira/browse/SPARK-50532
>>>>>>
>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don't think this is a good idea because ...