[
https://issues.apache.org/jira/browse/SPARK-57102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57102:
-----------------------------------
Labels: pull-request-available (was: )
> Read Parquet TIMESTAMP(NANOS) via non-vectorized reader for NTZ and LTZ
> nanosecond types
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-57102
> URL: https://issues.apache.org/jira/browse/SPARK-57102
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h3. Summary
> Enable reading Parquet files that store timestamps as INT64 with logical type
> TIMESTAMP(NANOS), produced by external tools (e.g. PyArrow or pandas), into
> Spark's nanosecond timestamp types TimestampNTZNanosType and
> TimestampLTZNanosType. The implementation is limited to the non-vectorized
> (row-based) read path (ParquetRowConverter). The TIMESTAMP_LTZ path reuses
> the existing Parquet datetime rebasing; the TIMESTAMP_NTZ path does not rebase.
> Today, TIMESTAMP(NANOS) is either rejected (PARQUET_TYPE_ILLEGAL) or mapped
> to LongType when spark.sql.legacy.parquet.nanosAsLong is true (SPARK-40819).
> This issue adds a native read into the nanosecond types for files written by
> external tools.
> h3. Background
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Physical row layer: SPARK-56981 (TimestampNanosVal, InternalRow / UnsafeRow
> accessors)
> * Logical types: SPARK-56876 (TimestampNTZNanosType / TimestampLTZNanosType,
> p in [7, 9])
> * On-wire Parquet: epoch nanoseconds as INT64; Spark internal value is
> (epochMicros, nanosWithinMicro) in TimestampNanosVal
> * Existing test resource: test-data/timestamp-nanos.parquet (TIMESTAMP(NANOS,
> true) only; nanosAsLong path)
> h3. What to do
> h4. 1. External Parquet test fixtures
> * Add committed Parquet file(s) under sql/core/src/test/resources (alongside
> or extending existing timestamp-nanos.parquet).
> * Generate with an external tool (PyArrow recommended), not Spark
> df.write.parquet.
> * Include at least:
> ** ts_ltz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true) ->
> TimestampLTZNanosType(9)
> ** ts_ntz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=false) ->
> TimestampNTZNanosType(9)
> * Row values should cover: sub-micro fractional part (non-zero
> nanosWithinMicro), negative epoch-nanos, and at least one LTZ instant that
> differs under LEGACY vs CORRECTED datetime rebase (same class of dates as
> existing Parquet microsecond rebase tests).
> * Set Parquet file metadata keys Spark already uses for datetime rebase (e.g.
> spark.sql.parquet.datetimeRebaseMode) so RebaseSpec is exercised.
> * Provide a small Python regeneration script (documented header; checked-in
> files are the source of truth for CI).
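
A regeneration script along the lines described above might look like the following sketch. It is not part of the issue. The function names, the output path, the chosen row values, and the `3.5.0` version string are all hypothetical. The two footer keys (`org.apache.spark.version`, `org.apache.spark.legacyDateTime`) are an assumption about what Spark inspects when it builds its rebase spec and should be confirmed against the Spark source. Whether the 1800 row actually differs between LEGACY and CORRECTED depends on the session time zone and needs to be verified. PyArrow is imported lazily so that the row oracle can be reused without it.

```python
from datetime import datetime, timezone

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_nanos(dt, extra_nanos=0):
    """Exact epoch nanoseconds for an aware datetime plus a sub-microsecond part."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000 + extra_nanos


def fixture_rows():
    """Hypothetical row values; INT64 epoch-nanos only span roughly 1677 to 2262."""
    return [
        # Non-zero nanosWithinMicro (789).
        epoch_nanos(datetime(2024, 1, 2, 3, 4, 5, 123456, tzinfo=timezone.utc), 789),
        # Negative epoch-nanos: one nanosecond before the epoch.
        epoch_nanos(datetime(1969, 12, 31, 23, 59, 59, 999999, tzinfo=timezone.utc), 999),
        # Pre-1900 instant, a candidate for a LEGACY vs CORRECTED difference (unverified).
        epoch_nanos(datetime(1800, 1, 1, tzinfo=timezone.utc), 1),
    ]


def write_fixture(path, legacy=False):
    """Write ts_ltz / ts_ntz columns with PyArrow (requires pyarrow to be installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = fixture_rows()
    # Assumed footer keys; the version value is an arbitrary placeholder.
    metadata = {"org.apache.spark.version": "3.5.0"}
    if legacy:
        metadata["org.apache.spark.legacyDateTime"] = ""
    ltz, ntz = pa.timestamp("ns", tz="UTC"), pa.timestamp("ns")
    schema = pa.schema(
        [pa.field("ts_ltz", ltz), pa.field("ts_ntz", ntz)], metadata=metadata
    )
    table = pa.table(
        {"ts_ltz": pa.array(rows, type=ltz), "ts_ntz": pa.array(rows, type=ntz)},
        schema=schema,
    )
    # Format version 2.6 keeps nanosecond precision instead of coercing to micros.
    pq.write_table(table, path, version="2.6")
```

Calling `write_fixture("timestamp-nanos-external.parquet")` and `write_fixture("timestamp-nanos-external-legacy.parquet", legacy=True)` would produce one file per rebase mode; both file names are placeholders.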
> h4. 2. Epoch-nanos conversion helpers
> * Package-private helpers, e.g. epochNanosToTimestampNanosVal(epochNanos:
> Long): TimestampNanosVal and inverse for test oracles.
> * Use Math.floorDiv / floorMod for negative timestamps; nanosWithinMicro in
> [0, 999].
> * Unit tests without Parquet I/O.
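
The floor-based split can be illustrated with a small sketch. It is written in Python rather than Scala only to keep it self-contained; Python's `divmod` floors toward negative infinity, which matches the behavior of `Math.floorDiv` / `Math.floorMod`. The function names are hypothetical and are not the helpers proposed in this issue.

```python
from typing import Tuple

NANOS_PER_MICRO = 1_000


def epoch_nanos_to_parts(epoch_nanos: int) -> Tuple[int, int]:
    """Split epoch nanos into (epochMicros, nanosWithinMicro) using floor semantics."""
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def parts_to_epoch_nanos(epoch_micros: int, nanos_within_micro: int) -> int:
    """Inverse of the split, usable as a test oracle."""
    if not 0 <= nanos_within_micro < NANOS_PER_MICRO:
        raise ValueError("nanosWithinMicro must be in [0, 999]")
    return epoch_micros * NANOS_PER_MICRO + nanos_within_micro


def truncating_split(epoch_nanos: int) -> Tuple[int, int]:
    """What truncating division (plain / and % on the JVM) would produce instead."""
    sign = -1 if epoch_nanos < 0 else 1
    quotient = sign * (abs(epoch_nanos) // NANOS_PER_MICRO)
    return quotient, epoch_nanos - quotient * NANOS_PER_MICRO


print(epoch_nanos_to_parts(-1))  # (-1, 999): remainder stays in [0, 999]
print(truncating_split(-1))      # (0, -1): negative remainder, out of range
```

The contrast on `-1` is the reason for the floor-based functions: one nanosecond before the epoch belongs to microsecond `-1` with 999 nanoseconds within it, not to microsecond `0` with a negative remainder.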
> h4. 3. Schema mapping (ParquetSchemaConverter)
> When spark.sql.legacy.parquet.nanosAsLong is false (default):
> || Parquet logical type || Spark type (schema inference) ||
> | TIMESTAMP(NANOS, isAdjustedToUTC=true) | TimestampLTZNanosType (default precision 9) |
> | TIMESTAMP(NANOS, isAdjustedToUTC=false) | TimestampNTZNanosType (default precision 9) |
> * Keep nanosAsLong=true -> LongType behavior (SPARK-40819).
> * Apply preview / SQLConf gating from SPARK-56969 if required for user-facing
> analysis; tests may enable the conf explicitly.
> * Update Parquet schema inference tests accordingly.
> h4. 4. Non-vectorized read (ParquetRowConverter)
> * Add ParquetPrimitiveConverter branches for TimeUnit.NANOS only on the row
> converter path (not ParquetVectorUpdaterFactory / vectorized reader).
> * TimestampNTZNanosType: addLong(epochNanos) -> convert to TimestampNanosVal
> -> updater; no timestampRebaseFunc (same policy as TimestampNTZType + MICROS).
> * TimestampLTZNanosType: addLong(epochNanos) -> decompose to epochMicros +
> nanosWithinMicro -> apply existing timestampRebaseFunc from
> ParquetRowConverter (DataSourceUtils.createTimestampRebaseFuncInRead /
> datetimeRebaseSpec) on epochMicros -> reassemble TimestampNanosVal. Do not
> add a separate rebase implementation.
> * Wire updaters for nanos types in nested converters as needed.
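
The difference between the two read paths above can be sketched as follows. This is an illustration in Python, not the converter code: `read_ntz`, `read_ltz`, and `toy_rebase` are hypothetical names, and `toy_rebase` is a made-up stand-in for the existing timestampRebaseFunc, not Spark's actual calendar rebase.

```python
from typing import Callable, Tuple


def read_ntz(epoch_nanos: int) -> Tuple[int, int]:
    """NTZ path: split only, no rebase."""
    return divmod(epoch_nanos, 1_000)


def read_ltz(epoch_nanos: int, rebase_micros: Callable[[int], int]) -> Tuple[int, int]:
    """LTZ path: split, rebase the microsecond part, carry the nanos unchanged."""
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, 1_000)
    return rebase_micros(epoch_micros), nanos_within_micro


def toy_rebase(epoch_micros: int) -> int:
    """Made-up rebase: shifts pre-epoch values by an arbitrary 7 microseconds."""
    return epoch_micros + 7 if epoch_micros < 0 else epoch_micros


def corrected(epoch_micros: int) -> int:
    """CORRECTED mode behaves as the identity on epochMicros."""
    return epoch_micros


print(read_ntz(-1))              # (-1, 999)
print(read_ltz(-1, corrected))   # (-1, 999)
print(read_ltz(-1, toy_rebase))  # (6, 999): micros shifted, nanos preserved
```

Passing the rebase function in as a parameter mirrors the requirement to reuse the existing function rather than add a second rebase implementation: only the microsecond component goes through it, and the sub-microsecond remainder is reattached afterwards.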
> h4. 5. Integration tests
> * Force non-vectorized read: spark.sql.parquet.enableVectorizedReader=false
> (and legacy.parquet.nanosAsLong=false).
> * Read LTZ and NTZ columns from fixtures; assert TimestampNanosVal matches
> precomputed oracle.
> * Rebase: same file with datetimeRebaseMode LEGACY vs CORRECTED for LTZ
> column; behavior aligned with microsecond LTZ Parquet rebase tests.
> * SPARK-40819: with nanosAsLong=false, read succeeds and returns nanos types;
> with nanosAsLong=true, schema remains LongType.
> * Values readable via getTimestampNTZNanos / getTimestampLTZNanos on
> collected rows.
> h3. Acceptance criteria
> * spark.read.parquet on committed external fixtures returns
> TimestampNTZNanosType and TimestampLTZNanosType columns when nanosAsLong is
> false.
> * Non-vectorized path populates TimestampNanosVal with correct (epochMicros,
> nanosWithinMicro).
> * LTZ columns use existing Parquet datetime rebase spec; NTZ columns do not
> rebase.
> * Tests pass with the vectorized reader disabled; supporting the vectorized
> reader is not required in this issue.
> * spark.sql.legacy.parquet.nanosAsLong=true unchanged (LongType).
> * Microsecond TimestampType / TimestampNTZType Parquet behavior unchanged.
> h3. Out of scope
> * Parquet vectorized reader (ParquetVectorUpdaterFactory,
> VectorizedParquetRecordReader) — follow-up after columnar ColumnVector
> support for nanos types
> * Parquet write of TIMESTAMP(NANOS) native types
> * Cast matrix, string parsing, Dataset java.time encoders (SPARK-57032,
> SPARK-57033)
> * INT96-as-timestamp nanos carrier (focus on TIMESTAMP(NANOS) INT64)
> * Changing UnsafeRow 16-byte payload layout
> h3. Depends on
> * SPARK-56981 (physical row storage and TimestampNanosVal)
> h3. Related
> * SPARK-56969 (preview SQLConf gating, if analysis must be enabled for ad-hoc
> reads)
> * SPARK-40819 (existing timestamp-nanos.parquet and nanosAsLong behavior)
> h3. References
> * ParquetSchemaConverter — TIMESTAMP(NANOS) handling today
> * ParquetRowConverter — timestampRebaseFunc, TimestampNTZType / TimestampType
> MICROS precedents
> * org.apache.spark.unsafe.types.TimestampNanosVal