andygrove opened a new issue, #4218:
URL: https://github.com/apache/datafusion-comet/issues/4218
## Description
When a Parquet file stores timestamps as INT96 (written from Spark's
`TimestampType`, which has LTZ semantics: an instant normalized to UTC) and the
read schema requests `TimestampNTZ`, the `native_datafusion` scan silently
returns wall-clock values that disagree with what was written.
Spark itself raises an error in this scenario (SPARK-36182) to prevent
silent reinterpretation of an LTZ instant as NTZ. Comet's native scan should
either match Spark's behavior by raising an error, or correctly handle the
timezone conversion.
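To make the size of the discrepancy concrete, here is a minimal sketch of the arithmetic using only `java.time`, with no Spark or Comet involved. The object name `Int96NtzShift` and its fields are hypothetical. It assumes the shifted value is the UTC wall clock of the stored instant; the issue itself only establishes that the result is off by the session timezone offset.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

// Hypothetical illustration only: plain java.time, no Spark or Comet APIs.
object Int96NtzShift {
  val sessionTz: ZoneId = ZoneId.of("America/Los_Angeles")

  // Wall-clock value the writer started from.
  val written: LocalDateTime = LocalDateTime.parse("2020-01-01T12:00:00")

  // LTZ semantics: the file stores the instant, normalized to UTC.
  // Los Angeles is UTC-8 on this date, so this is 2020-01-01T20:00:00Z.
  val storedInstant: Instant = written.atZone(sessionTz).toInstant

  // ASSUMPTION: a reader that relabels the stored instant as NTZ without
  // conversion ends up with the UTC wall clock (20:00, not 12:00).
  val naiveNtz: LocalDateTime = LocalDateTime.ofInstant(storedInstant, ZoneOffset.UTC)

  // An explicit conversion through the session zone recovers what was written.
  val convertedNtz: LocalDateTime = LocalDateTime.ofInstant(storedInstant, sessionTz)

  def main(args: Array[String]): Unit = {
    println(s"written      = $written")
    println(s"naiveNtz     = $naiveNtz")
    println(s"convertedNtz = $convertedNtz")
  }
}
```

Under that assumption the naive read is eight hours later than the written value, and only a conversion that knows the session timezone gets back to 12:00.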
## Steps to Reproduce
```scala
val sessionTz = "America/Los_Angeles"
val written = "2020-01-01 12:00:00"
withSQLConf(
  SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz,
  SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "INT96",
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write "2020-01-01 12:00:00" America/Los_Angeles as INT96.
    // The bits encode the UTC instant 2020-01-01 20:00:00.
    Seq(Timestamp.valueOf(written)).toDF("ts").write.parquet(path)
    // Spark refuses to read INT96 as TimestampNTZ (SPARK-36182)
    withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
      intercept[SparkException] {
        spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      }
    }
    // native_datafusion silently returns a shifted value
    withSQLConf(CometConf.COMET_NATIVE_SCAN_IMPL.key ->
      CometConf.SCAN_NATIVE_DATAFUSION) {
      val rows = spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      val actual = rows.head.getAs[LocalDateTime](0)
      // actual != LocalDateTime.parse("2020-01-01T12:00:00")
      // The value is silently wrong: shifted by the timezone offset
    }
  }
}
```
## Expected Behavior
Comet should match Spark's behavior and raise an error when asked to read
INT96 timestamps as TimestampNTZ, since the LTZ→NTZ reinterpretation cannot be
done safely without explicit conversion.
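As a rough sketch of the kind of check this implies, the snippet below rejects the INT96-to-NTZ combination up front. All names here (`PhysicalType`, `RequestedType`, `Int96NtzGuard`) are hypothetical stand-ins, not Comet or Spark APIs, and where such a check would actually live (planning or the native reader) is left open.

```scala
// Hypothetical types for illustration only; not Comet or Spark APIs.
sealed trait PhysicalType
case object Int96 extends PhysicalType
case object OtherPhysical extends PhysicalType

sealed trait RequestedType
case object TimestampLtz extends RequestedType
case object TimestampNtz extends RequestedType

object Int96NtzGuard {
  // Returns Left(message) for the unsafe combination described in this issue,
  // and Right(()) otherwise. Other combinations are out of scope for this sketch.
  def check(
      column: String,
      physical: PhysicalType,
      requested: RequestedType): Either[String, Unit] =
    (physical, requested) match {
      case (Int96, TimestampNtz) =>
        Left(s"Column '$column' is stored as INT96 (LTZ instant); " +
          "refusing to read it as TimestampNTZ without explicit conversion")
      case _ =>
        Right(())
    }
}
```

A caller could turn the `Left` into whatever exception type matches Spark's behavior for SPARK-36182; the sketch deliberately leaves that choice out.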
## Actual Behavior
The native DataFusion scan returns a result without error, but the timestamp
value is silently incorrect (shifted by the session timezone offset).
## Related
- SPARK-36182
- https://github.com/apache/datafusion-comet/issues/3720