andygrove opened a new pull request, #4084: URL: https://github.com/apache/datafusion-comet/pull/4084
## Which issue does this PR close? Closes #2209. ## Rationale for this change Spark 4.0's `PushVariantIntoScan` optimizer rewrites a `VariantType` column into a `StructType` whose fields each carry `__VARIANT_METADATA_KEY` metadata, then pushes `variant_get` paths down as ordinary struct field accesses. By the time `CometScanRule` runs, the `requiredSchema` looks like a normal struct of primitives, so Comet scans natively but does not honor the on-disk Parquet variant shredding layout. The result is silent data corruption: typed paths read back as nulls (or, with some shapes, a hard `FAILED_READ_FILE`). This is data correctness, so we need to fall back to Spark for these reads rather than continuing to ignore the suites. ## What changes are included in this PR? - `CometTypeShim` (Spark 4.0): add `isVariantStruct` that delegates to Spark's `VariantMetadata.isVariantStruct`, which checks for the `__VARIANT_METADATA_KEY` marker on every field. - `CometTypeShim` (Spark 3.x): stub returning false; variant shredding does not exist pre-4.0. - `CometScanTypeChecker.isTypeSupported`: add a `case s: StructType if isVariantStruct(s) => false` arm with a fallback reason, so both `auto`/`native_datafusion` and `native_iceberg_compat` scan paths fall back to Spark on shredded Variant reads. - `dev/diffs/4.0.1.diff`: stop ignoring `VariantShreddingSuite` and `ParquetVariantShreddingSuite`. ## How are these changes tested? Ran `sql/testOnly org.apache.spark.sql.VariantShreddingSuite org.apache.spark.sql.execution.datasources.parquet.ParquetVariantShreddingSuite` against patched Spark v4.0.1 with `ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`: - `COMET_PARQUET_SCAN_IMPL=auto`: 13/13 pass (was 5/13). - `COMET_PARQUET_SCAN_IMPL=native_iceberg_compat`: 13/13 pass (was 5/13). The Spark SQL test workflows already cover both scan impls on every PR, so the unignored suites give us ongoing protection against regressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
