[ 
https://issues.apache.org/jira/browse/SPARK-57102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57102:
-----------------------------------
    Labels: pull-request-available  (was: )

> Read Parquet TIMESTAMP(NANOS) via non-vectorized reader for NTZ and LTZ 
> nanosecond types
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-57102
>                 URL: https://issues.apache.org/jira/browse/SPARK-57102
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Summary
> Enable reading Parquet files that store timestamps as INT64 with logical type 
> TIMESTAMP(NANOS), produced by external tools (e.g. PyArrow or pandas), into 
> Spark's nanosecond timestamp types TimestampNTZNanosType and 
> TimestampLTZNanosType. The implementation is limited to the non-vectorized 
> (row-based) read path (ParquetRowConverter). It reuses the existing Parquet 
> datetime rebasing for TIMESTAMP_LTZ; TIMESTAMP_NTZ values are not rebased.
> Today, TIMESTAMP(NANOS) is either rejected (PARQUET_TYPE_ILLEGAL) or mapped 
> to LongType when spark.sql.legacy.parquet.nanosAsLong is true (SPARK-40819). 
> This issue adds a native read into the nanosecond types for real-world 
> interop files.
> h3. Background
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Physical row layer: SPARK-56981 (TimestampNanosVal, InternalRow / UnsafeRow 
> accessors)
> * Logical types: SPARK-56876 (TimestampNTZNanosType / TimestampLTZNanosType, 
> p in [7, 9])
> * On-wire Parquet: epoch nanoseconds as INT64; Spark internal value is 
> (epochMicros, nanosWithinMicro) in TimestampNanosVal
> * Existing test resource: test-data/timestamp-nanos.parquet (TIMESTAMP(NANOS, 
> true) only; nanosAsLong path)
> h3. What to do
> h4. 1. External Parquet test fixtures
> * Add committed Parquet file(s) under sql/core/src/test/resources (alongside 
> or extending existing timestamp-nanos.parquet).
> * Generate with an external tool (PyArrow recommended), not Spark 
> df.write.parquet.
> * Include at least:
> ** ts_ltz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true) -> 
> TimestampLTZNanosType(9)
> ** ts_ntz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=false) -> 
> TimestampNTZNanosType(9)
> * Row values should cover: sub-micro fractional part (non-zero 
> nanosWithinMicro), negative epoch-nanos, and at least one LTZ instant that 
> differs under LEGACY vs CORRECTED datetime rebase (same class of dates as 
> existing Parquet microsecond rebase tests).
> * Set Parquet file metadata keys Spark already uses for datetime rebase (e.g. 
> spark.sql.parquet.datetimeRebaseMode) so RebaseSpec is exercised.
> * Provide a small Python regeneration script (documented header; checked-in 
> files are the source of truth for CI).
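
One way to precompute oracle values for the fixture rows listed above is to derive epoch-nanos from java.time.Instant. This is only a sketch: the class name FixtureOracleSketch and the candidate instants are illustrative assumptions, not the contents of the committed fixture. Note that INT64 epoch-nanos only spans roughly 1677-09-21 to 2262-04-11, so any rebase-sensitive row has to fall inside that window, and whether a given old instant really differs between LEGACY and CORRECTED still needs to be verified against Spark.

```java
import java.time.Instant;

final class FixtureOracleSketch {
    private FixtureOracleSketch() {}

    /**
     * Converts an Instant to epoch nanoseconds.
     * Throws ArithmeticException on overflow; this check is conservative and
     * also rejects a few values in the lowest second of the INT64 range.
     */
    static long toEpochNanos(Instant t) {
        return Math.addExact(
            Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L), t.getNano());
    }

    /** Candidate row values (illustrative choices, not the committed fixture). */
    static final Instant[] CANDIDATES = {
        // Non-zero sub-microsecond part (nanosWithinMicro would be 789).
        Instant.parse("2020-01-01T00:00:00.123456789Z"),
        // Negative epoch-nanos: one nanosecond before the epoch.
        Instant.parse("1969-12-31T23:59:59.999999999Z"),
        // Old instant inside the INT64 range; ASSUMPTION: rebase sensitivity
        // under LEGACY vs CORRECTED must be checked, it is not guaranteed here.
        Instant.parse("1800-01-01T00:00:00.000000001Z"),
    };
}
```

The resulting long values can be pasted into the Python regeneration script and into the Scala test oracle, so both sides agree on the raw INT64 contents.
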
> h4. 2. Epoch-nanos conversion helpers
> * Package-private helpers, e.g. epochNanosToTimestampNanosVal(epochNanos: 
> Long): TimestampNanosVal and inverse for test oracles.
> * Use Math.floorDiv / floorMod for negative timestamps; nanosWithinMicro in 
> [0, 999].
> * Unit tests without Parquet I/O.
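
A minimal sketch of the conversion helpers described in section 2, in Java since the ticket names Math.floorDiv / Math.floorMod. NanosParts is a hypothetical stand-in for TimestampNanosVal (whose real API comes from SPARK-56981 and may differ), and EpochNanosSketch is likewise an invented name.

```java
/** Hypothetical stand-in for TimestampNanosVal (the real class may differ). */
final class NanosParts {
    final long epochMicros;
    final int nanosWithinMicro;

    NanosParts(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }
}

final class EpochNanosSketch {
    private static final long NANOS_PER_MICRO = 1_000L;

    private EpochNanosSketch() {}

    /** Splits epoch nanoseconds so that nanosWithinMicro is always in [0, 999]. */
    static NanosParts fromEpochNanos(long epochNanos) {
        // Floor semantics round toward negative infinity, so the remainder
        // stays non-negative even for pre-epoch timestamps.
        long micros = Math.floorDiv(epochNanos, NANOS_PER_MICRO);
        int nanos = (int) Math.floorMod(epochNanos, NANOS_PER_MICRO);
        return new NanosParts(micros, nanos);
    }

    /**
     * Inverse, intended for test oracles. Throws ArithmeticException on
     * overflow, which can happen for inputs within a microsecond of
     * Long.MIN_VALUE because the intermediate product is computed first.
     */
    static long toEpochNanos(NanosParts v) {
        return Math.addExact(
            Math.multiplyExact(v.epochMicros, NANOS_PER_MICRO), v.nanosWithinMicro);
    }
}
```

For example, epochNanos = -1 decomposes to (epochMicros = -1, nanosWithinMicro = 999), whereas truncating division (`/` and `%`) would give (0, -1) and violate the [0, 999] invariant.
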
> h4. 3. Schema mapping (ParquetSchemaConverter)
> When spark.sql.legacy.parquet.nanosAsLong is false (default):
> || Parquet logical type || Spark type (schema inference) ||
> | TIMESTAMP(NANOS, isAdjustedToUTC=true) | TimestampLTZNanosType (default 
> precision 9) |
> | TIMESTAMP(NANOS, isAdjustedToUTC=false) | TimestampNTZNanosType (default 
> precision 9) |
> * Keep nanosAsLong=true -> LongType behavior (SPARK-40819).
> * Apply preview / SQLConf gating from SPARK-56969 if required for user-facing 
> analysis; tests may enable the conf explicitly.
> * Update Parquet schema inference tests accordingly.
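
The mapping in section 3 boils down to a two-flag decision. The sketch below models it with plain strings rather than real Spark DataType objects; NanosSchemaMappingSketch and inferSparkType are hypothetical names, and the SPARK-56969 preview gating is deliberately left out.

```java
final class NanosSchemaMappingSketch {
    private NanosSchemaMappingSketch() {}

    /**
     * Returns the name of the Spark type inferred for an INT64 column with
     * logical type TIMESTAMP(NANOS, isAdjustedToUTC).
     */
    static String inferSparkType(boolean isAdjustedToUTC, boolean nanosAsLong) {
        if (nanosAsLong) {
            // Legacy behavior kept from SPARK-40819.
            return "LongType";
        }
        return isAdjustedToUTC
            ? "TimestampLTZNanosType(9)"
            : "TimestampNTZNanosType(9)";
    }
}
```
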
> h4. 4. Non-vectorized read (ParquetRowConverter)
> * Add ParquetPrimitiveConverter branches for TimeUnit.NANOS only on the row 
> converter path (not ParquetVectorUpdaterFactory / vectorized reader).
> * TimestampNTZNanosType: addLong(epochNanos) -> convert to TimestampNanosVal 
> -> updater; no timestampRebaseFunc (same policy as TimestampNTZType + MICROS).
> * TimestampLTZNanosType: addLong(epochNanos) -> decompose to epochMicros + 
> nanosWithinMicro -> apply existing timestampRebaseFunc from 
> ParquetRowConverter (DataSourceUtils.createTimestampRebaseFuncInRead / 
> datetimeRebaseSpec) on epochMicros -> reassemble TimestampNanosVal. Do not 
> add a separate rebase implementation.
> * Wire updaters for nanos types in nested converters as needed.
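
The two read paths in section 4 could look roughly like the sketch below. It is not the ParquetRowConverter code: Parts stands in for TimestampNanosVal, and the LongUnaryOperator parameter stands in for the existing timestampRebaseFunc, which in Spark is obtained from DataSourceUtils.createTimestampRebaseFuncInRead. All class and method names here are hypothetical.

```java
import java.util.function.LongUnaryOperator;

final class NanosReadSketch {
    private NanosReadSketch() {}

    /** Hypothetical stand-in for TimestampNanosVal. */
    static final class Parts {
        final long epochMicros;
        final int nanosWithinMicro;

        Parts(long epochMicros, int nanosWithinMicro) {
            this.epochMicros = epochMicros;
            this.nanosWithinMicro = nanosWithinMicro;
        }
    }

    /** NTZ path: decompose only, no rebase is applied. */
    static Parts readNtz(long epochNanos) {
        return new Parts(
            Math.floorDiv(epochNanos, 1_000L),
            (int) Math.floorMod(epochNanos, 1_000L));
    }

    /**
     * LTZ path: decompose, rebase the microsecond part with the supplied
     * function, and carry the sub-microsecond remainder over unchanged.
     */
    static Parts readLtz(long epochNanos, LongUnaryOperator rebaseMicros) {
        Parts raw = readNtz(epochNanos);
        return new Parts(
            rebaseMicros.applyAsLong(raw.epochMicros), raw.nanosWithinMicro);
    }
}
```

Passing the rebase function in as a parameter mirrors the requirement not to add a separate rebase implementation: the nanos path only adds the split and the reassembly around whatever microsecond rebase already exists.
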
> h4. 5. Integration tests
> * Force non-vectorized read: spark.sql.parquet.enableVectorizedReader=false 
> (and legacy.parquet.nanosAsLong=false).
> * Read LTZ and NTZ columns from fixtures; assert TimestampNanosVal matches 
> precomputed oracle.
> * Rebase: same file with datetimeRebaseMode LEGACY vs CORRECTED for LTZ 
> column; behavior aligned with microsecond LTZ Parquet rebase tests.
> * SPARK-40819: with nanosAsLong=false, read succeeds and returns nanos types; 
> with nanosAsLong=true, schema remains LongType.
> * Values readable via getTimestampNTZNanos / getTimestampLTZNanos on 
> collected rows.
> h3. Acceptance criteria
> * spark.read.parquet on committed external fixtures returns 
> TimestampNTZNanosType and TimestampLTZNanosType columns when nanosAsLong is 
> false.
> * Non-vectorized path populates TimestampNanosVal with correct (epochMicros, 
> nanosWithinMicro).
> * LTZ columns use existing Parquet datetime rebase spec; NTZ columns do not 
> rebase.
> * Tests pass with the vectorized reader disabled; supporting the vectorized 
> reader is not required in this issue.
> * spark.sql.legacy.parquet.nanosAsLong=true unchanged (LongType).
> * Microsecond TimestampType / TimestampNTZType Parquet behavior unchanged.
> h3. Out of scope
> * Parquet vectorized reader (ParquetVectorUpdaterFactory, 
> VectorizedParquetRecordReader) — follow-up after columnar ColumnVector 
> support for nanos types
> * Parquet write of TIMESTAMP(NANOS) native types
> * Cast matrix, string parsing, Dataset java.time encoders (SPARK-57032, 
> SPARK-57033)
> * INT96-as-timestamp nanos carrier (focus on TIMESTAMP(NANOS) INT64)
> * Changing UnsafeRow 16-byte payload layout
> h3. Depends on
> * SPARK-56981 (physical row storage and TimestampNanosVal)
> h3. Related
> * SPARK-56969 (preview SQLConf gating, if analysis must be enabled for ad-hoc 
> reads)
> * SPARK-40819 (existing timestamp-nanos.parquet and nanosAsLong behavior)
> h3. References
> * ParquetSchemaConverter — TIMESTAMP(NANOS) handling today
> * ParquetRowConverter — timestampRebaseFunc, TimestampNTZType / TimestampType 
> MICROS precedents
> * org.apache.spark.unsafe.types.TimestampNanosVal



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
