schenksj opened a new pull request, #4366:
URL: https://github.com/apache/datafusion-comet/pull/4366

   ## Summary
   
   Native Delta Lake scan integration for Comet, structured as an Iceberg-style
   contrib (typed `OpStruct::DeltaScan` proto + a small set of feature-gated core
   touchpoints + a separate `contrib/delta/` tree). Replaces the SPI / registry
   design from #3932 per the feedback on that PR.
   
   **This PR supersedes #3932**, which is being closed in favor of this one.
   
   ## What's in here
   
   - **`contrib/delta/`** — full Delta integration (Scala + Rust + dev scripts).
     Scan rule, native scan exec, plan-data injector, kernel-rs engine cache,
     partition-value parsing, DV filtering, column-mapping, row-tracking, DPP
     partition pruning, multi-task packing, `input_file_name()` support.
   - **Typed proto variant** `delta_scan = 117` on `OpStruct` (no envelope op).
   - **~40 lines of core touchpoints**, all feature-gated:
     - `DeltaIntegration` reflection bridge in `org.apache.comet.rules`
     - One arm in `CometScanRule.transformV1Scan`
     - One arm in `CometExecRule.transform` for the Delta scan marker
     - One trait method (`opStructCase`) on `PlanDataInjector`
     - Per-partition file-path plumbing through `CometExecRDD` →
       `CometExecIterator` so `wrapNativeParquetError` and
       `InputFileBlockHolder` get the right path (used by Delta's
       UPDATE/DELETE/MERGE `input_file_name()` flows)
   - **Maven `-Pcontrib-delta` profile** in `spark/pom.xml` + a parallel Cargo
     `contrib-delta` feature in the native crates
   - **Standalone regression harness** at `contrib/delta/dev/run-regression.sh`,
     which runs the full Delta 4.1 Spark test suite against this branch
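
To illustrate the DV filtering item above: a Delta deletion vector marks physical row positions within a data file as deleted, and the scan has to drop those rows from each batch it reads. The following is only a minimal sketch of that idea, not the code in `contrib/delta/`. The function name `dv_keep_mask` is hypothetical, and it assumes the deleted positions are already decoded into a sorted slice (Delta stores them as compressed bitmaps).

```rust
/// Build a keep-mask for one batch of rows read from a data file.
///
/// `deleted_sorted` holds the file-level row positions marked deleted,
/// in ascending order (an assumption of this sketch). `batch_start` is
/// the file-level position of the first row in the batch.
fn dv_keep_mask(deleted_sorted: &[u64], batch_start: u64, batch_len: usize) -> Vec<bool> {
    let mut mask = vec![true; batch_len];
    let batch_end = batch_start + batch_len as u64;

    // Binary-search to the first deleted position that can fall in this batch.
    let first = deleted_sorted.partition_point(|&row| row < batch_start);

    for &row in &deleted_sorted[first..] {
        if row >= batch_end {
            break; // remaining positions belong to later batches
        }
        mask[(row - batch_start) as usize] = false;
    }
    mask
}
```

Because the mask is indexed by physical row position, it is only valid if it is applied before any operator that drops or reorders rows, which is presumably why DV filter ordering needs safeguards.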
   
   ## Notable fixes that landed during validation
   
   - `de9e0d3c` + `ed0d8acb` — `FAILED_READ_FILE.NO_HINT` wrapping for native parquet errors, with file path threaded from scan partitions (fixes `SnapshotManagementSuite` checkpoint-broken tests)
   - `effe5f76` — filesystem scheme allowlist on V1 scans (fixes `fake://` test fallback)
   - `56c2b011` — decline `CreateArray` when children have mismatched data types (fixes CDF `replaceWhere` panics; upstream issue filed as [apache/datafusion#22366](https://github.com/apache/datafusion/issues/22366))
   - `90969575` — key the engine cache by `(scheme, authority, config)` to bound OS-thread churn under high scan rates
   - `43768c1c` — review-fix bundle: missing `InputFileBlockHolder` hook, DV filter ordering safeguards, tighter `is_not_found` matching, multi-line parquet error regex
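
The engine-cache fix above can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual kernel-rs integration: `Engine`, `EngineKey`, `EngineCache`, and `engine_key` are hypothetical names, the URL parsing is deliberately naive, and the real engine type owns far more state. The point is only that scans whose paths share a scheme, authority, and config reuse one engine instead of each constructing their own.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, Mutex};

/// Cache key: everything that changes how an engine talks to storage.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct EngineKey {
    scheme: String,
    authority: String,
    // BTreeMap keeps entries sorted, so equal configs compare and hash
    // identically regardless of insertion order.
    config: BTreeMap<String, String>,
}

/// Stand-in for an expensive engine (thread pool, object-store client, ...).
#[derive(Debug)]
struct Engine {
    id: usize,
}

#[derive(Default)]
struct EngineCache {
    engines: Mutex<HashMap<EngineKey, Arc<Engine>>>,
}

impl EngineCache {
    /// Return the cached engine for `key`, creating it on first use.
    fn get_or_create(&self, key: EngineKey) -> Arc<Engine> {
        let mut engines = self.engines.lock().unwrap();
        let next_id = engines.len();
        engines
            .entry(key)
            .or_insert_with(|| Arc::new(Engine { id: next_id }))
            .clone()
    }

    fn len(&self) -> usize {
        self.engines.lock().unwrap().len()
    }
}

/// Naive `scheme://authority/...` split; returns None without a scheme.
fn engine_key(url: &str, config: &[(&str, &str)]) -> Option<EngineKey> {
    let (scheme, rest) = url.split_once("://")?;
    let authority = rest.split('/').next().unwrap_or("");
    Some(EngineKey {
        scheme: scheme.to_ascii_lowercase(),
        authority: authority.to_string(),
        config: config
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    })
}
```

With this shape, the number of live engines is bounded by the number of distinct `(scheme, authority, config)` triples rather than by the number of scans, which is what keeps thread creation in check.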
   
   ## Perf-sweep items addressed (vs. initial direct-port)
   
   - `7e9249f6` — cache resolved `Method` handle in `DeltaIntegration`
   - `fea28d7e` — `Map[OpStructCase, PlanDataInjector]` O(1) injector lookup
   - `fea28d7e` — pre-parsed `SessionTimezone` (one parse per scan, not per row)
   - `a805f813` — hoist `CometScanTypeChecker` out of per-scan loop
   - `e3467761` — O(1) partition-value lookup in `build_delta_partitioned_files`
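
As an illustration of the last item, the sketch below shows the general shape of replacing a per-column linear search with a hash index built once per file. It is not the body of `build_delta_partitioned_files`; the helper names and the `(String, Option<String>)` representation of partition values are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Index one file's partition values by column name in a single pass.
fn index_partition_values<'a>(
    values: &'a [(String, Option<String>)],
) -> HashMap<&'a str, Option<&'a str>> {
    values
        .iter()
        .map(|(name, value)| (name.as_str(), value.as_deref()))
        .collect()
}

/// Project the values into partition-schema order with one O(1) lookup per
/// column. A column that is absent and a column whose value is null both
/// come back as `None` in this sketch.
fn project_partition_values<'a>(
    partition_columns: &[&str],
    index: &HashMap<&'a str, Option<&'a str>>,
) -> Vec<Option<&'a str>> {
    partition_columns
        .iter()
        .map(|col| index.get(*col).copied().flatten())
        .collect()
}
```

For a table with `k` partition columns, this turns the per-file cost from roughly `k * k` string comparisons into one pass to build the index plus `k` hash lookups.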
   
   ## Test plan
   
   - [x] Targeted retest of all surfaced failure clusters passes against current branch:
     - `DescribeDeltaHistorySuite "replaceWhere on data column"` — 8/8 pass
     - `DeltaTableHadoopOptionsSuite "dropFeatureSupport - with filesystem options"` — 1/1 pass
     - `SnapshotManagementSuite "should not recover when the current checkpoint is broken..."` — 2/2 pass
   - [ ] Full Delta 4.1 regression in progress (relaunched after the review-fix bundle)
   - [x] Default builds (no `-Pcontrib-delta`) still build green: `mvn -pl spark -am test-compile`
   - [x] `-Pcontrib-delta` builds green
   - [x] Engine-cache fix verified to bound OS thread count (previously hitting `pthread_create EAGAIN`)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

