Xiaoxuan Li created SPARK-56159:
-----------------------------------

             Summary: Support Nano Second Timestamp Data Types
                 Key: SPARK-56159
                 URL: https://issues.apache.org/jira/browse/SPARK-56159
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Xiaoxuan Li


Add two new nanosecond-precision timestamp types to Spark SQL: 
{{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without 
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).

Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly 
matches Parquet {{{}TIMESTAMP(NANOS){}}}, Arrow {{{}Timestamp(NANOSECOND){}}}, 
Iceberg V3 {{{}timestamp_ns{}}}, DuckDB {{{}TIMESTAMP_NS{}}}, and ClickHouse 
{{DateTime64(9)}} -- enabling zero-conversion-overhead read/write. The INT64 
representation fits in UnsafeRow's 8-byte fixed-width slot, requiring no 
changes to Tungsten memory layout or CodeGen.

The representable range is ~1677-2262 (same as the Parquet INT64 nanos 
specification). Users needing wider range need to use existing microsecond 
types (0001-9999).

We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to 
existing microsecond types, p=7..9 maps to the new nanosecond types.

 
h3. *Milestone 1 -- Core type system (TimestampNSType / TimestampNTZNSType 
meets or exceeds all function of the existing TimestampType / 
TimestampNTZType):*
 * Add new DataType implementations for TimestampNSType and TimestampNTZNSType
 * Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9 
-> nanos)
 * TimestampNSType / TimestampNTZNSType literals
 * TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType, 
TimestampNSType - Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc
 * Cast to and from TimestampNSType / TimestampNTZNSType (Long, String, 
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the 
types
 * Support sorting and hashing TimestampNSType / TimestampNTZNSType
 * Type coercion rules for mixed-precision expressions
 * {{timestamp_nanos()}} and {{unix_nanos()}} functions

h3. *Milestone 2 -- Persistence:*
 * Ability to create tables of type TimestampNSType / TimestampNTZNSType
 * Ability to read/write Parquet files with {{TIMESTAMP(NANOS, true/false)}} 
columns
 * Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
 * Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType / 
TimestampNTZNSType
 * INSERT, SELECT, UPDATE, MERGE
 * Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}
 * Iceberg V3 {{TIMESTAMP_NANO}} type support
 * Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata 
pattern
 * Discovery (schema inference)

h3. *Milestone 3 -- Client support*
 * JDBC support
 * Hive Thrift server
 * Spark Connect protocol support

h3. *Milestone 4 -- PySpark integration*
 * Python UDF can take and return TimestampNSType / TimestampNTZNSType
 * PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
 * Pandas UDF support
 * DataFrame support
 * Dataset/Encoder support


 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to