Thanks for updating the SPIP (B.2 and A.1). +1
Regards,
Viquar Khan

On Wed, 4 Mar 2026 at 14:03, Gengliang Wang <[email protected]> wrote:

> Sure, I've updated the SPIP doc with the semantic guarantee note in
> Appendix B.2 and expanded the deduplicationMode descriptions in A.1 to
> clarify all three modes.
> Regarding the warning for incomplete change logs — that is
> connector-specific behavior, so it's best left to each connector's
> implementation rather than prescribed at the Spark level.
>
> On Wed, Mar 4, 2026 at 11:06 AM vaquar khan <[email protected]> wrote:
>
>> Thanks Gengliang for the continued engagement, and thank you Anton for
>> the important clarification on containsCarryoverRows().
>>
>> 1. Capability Naming — Documentation Clarification Ask
>> I respect the DSv2 naming consistency. My ask is narrower: please add a
>> note in Appendix B.2 explicitly stating that returning false carries the
>> semantic guarantee that pre/post-images are fully materialized by the
>> connector, not lazily computed at scan time. No new method is needed,
>> just documentation clarity to prevent incorrect connector
>> implementations.
>>
>> 2. CoW I/O — Fully Withdrawn
>> Anton's clarification changes my position here entirely. If TableCatalog
>> loads Changelog with awareness of the specific range being scanned, and
>> the connector can inspect the actual commit history for that range to
>> determine whether CoW operations occurred, then containsCarryoverRows()
>> is effectively a range-scoped, commit-aware signal, not a coarse
>> table-level binary. That fully addresses my concern. I'm withdrawing
>> Item 2 entirely, not just as a blocker.
>>
>> 3. Audit Discoverability — Revised Ask
>> You're right that ALL CHANGES could mislead compliance engineers when
>> change logs are partially vacuumed. I withdraw that request.
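[For readers following the thread, here is a minimal Python sketch of the "pre/post-image derivation" under discussion, i.e. the work Catalyst would do when a connector encodes updates as raw delete/insert pairs, and which is skipped when representsUpdateAsDeleteAndInsert() returns false. This is illustrative only, not Spark's or the SPIP's implementation; the `_change_type` column and the `update_preimage`/`update_postimage` labels follow Delta Lake's CDF convention and are assumptions here.]

```python
# Illustrative sketch only. Assumes at most one delete and one insert per
# key within the processed batch; column names are not the SPIP's schema.
def derive_update_images(rows, key):
    """Pair a delete and an insert sharing the same key into
    update_preimage/update_postimage rows; pass other changes through."""
    deletes = {r[key]: r for r in rows if r["_change_type"] == "delete"}
    inserts = {r[key]: r for r in rows if r["_change_type"] == "insert"}
    out = []
    for k in deletes.keys() & inserts.keys():   # delete+insert => update
        out.append({**deletes[k], "_change_type": "update_preimage"})
        out.append({**inserts[k], "_change_type": "update_postimage"})
    for k in deletes.keys() - inserts.keys():   # plain deletes
        out.append(deletes[k])
    for k in inserts.keys() - deletes.keys():   # plain inserts
        out.append(inserts[k])
    return out

changes = [
    {"id": 1, "val": "a", "_change_type": "delete"},
    {"id": 1, "val": "b", "_change_type": "insert"},
    {"id": 2, "val": "c", "_change_type": "insert"},
]
result = derive_update_images(changes, "id")
```

[A connector that already materializes these images natively would return rows in the paired form directly, so Spark has nothing to derive.]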
>> My revised ask:
>> - Clearly state in the SQL documentation that deduplicationMode='none'
>> is the right way to get a full audit trail
>> - Show a warning if a user queries a table where some of the old change
>> logs have already been deleted
>>
>> With items 1 and 3 addressed in the SPIP text, count my +1.
>>
>> Regards,
>> Viquar Khan
>>
>> On Wed, 4 Mar 2026 at 12:35, Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> To add to Gengliang's point, TableCatalog would load Changelog
>>> knowing the range that is being scanned. This allows the connector to
>>> traverse the commit history and detect whether it had any CoW operation
>>> or not. In other words, it is not a blind flag at the table level. It is
>>> specific to the changelog range that is being requested.
>>>
>>> On Wed, 4 Mar 2026 at 09:17, Gengliang Wang <[email protected]> wrote:
>>>
>>>> Thanks for the follow-up — appreciate the rigor.
>>>>
>>>> *1.* *Capability Naming*: The naming is intentional —
>>>> representsUpdateAsDeleteAndInsert() mirrors the existing
>>>> SupportsDelta.representUpdateAsDeleteAndInsert()
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsDelta.java#L45>
>>>> in the DSv2 API. When it returns false, it means the connector's change
>>>> data already distinguishes updates from raw delete/insert pairs, so
>>>> there is nothing for Catalyst to derive.
>>>>
>>>> *2.* *Partition-Level CoW Hints*: A table-level flag is sufficient for
>>>> the common case. If a connector has partitions with mixed CoW behavior
>>>> and needs finer-grained control, it can simply return
>>>> containsCarryoverRows() = false and handle carry-over removal
>>>> internally within its ScanBuilder — the interface already supports
>>>> this. There is no need to complicate the Spark-level API for an edge
>>>> case that connectors can solve themselves.
>>>>
>>>> *3. Audit Discoverability*: The SPIP proposes only two options in the
>>>> WITH clause (deduplicationMode and computeUpdates) — this is a small,
>>>> well-documented surface, not a hidden knob. Adding an ALL CHANGES
>>>> grammar modifier introduces its own discoverability problem: it implies
>>>> the table retains a complete history of all changes, which is not
>>>> guaranteed — most formats discard old change data after
>>>> vacuum/expiration. A SQL keyword that suggests completeness but
>>>> silently returns partial results is arguably worse for compliance
>>>> engineers than an explicit option with clear documentation.
>>>>
>>>> On Tue, Mar 3, 2026 at 11:17 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Gengliang for the detailed follow-up. While the mechanics you
>>>>> laid out make sense on paper, I'm looking at how this will actually
>>>>> play out in production.
>>>>>
>>>>> 1. Capability Pushdown vs. Format Flag
>>>>> Returning representsUpdateAsDeleteAndInsert() = false just signals
>>>>> that the connector doesn't use raw delete/insert pairs. It doesn't
>>>>> explicitly tell Catalyst, "I already computed the pre/post images
>>>>> natively, trust my output and skip the window function entirely."
>>>>> Those are semantically different. A dedicated
>>>>> supportsNativePrePostImages() capability method would close this gap
>>>>> much more cleanly than overloading the format flag.
>>>>>
>>>>> 2. CoW I/O is a Table-Level Binary
>>>>> The ScanBuilder delegation is a fair point, but
>>>>> containsCarryoverRows() is still a table-level binary flag. For
>>>>> massive, partitioned CoW tables that have carry-overs in some
>>>>> partitions but not others, this interface forces Spark to apply
>>>>> carry-over removal globally or not at all. A partition-level or
>>>>> scan-level hint is a necessary improvement for mixed-mode CoW tables.
>>>>>
>>>>> 3. Audit Discoverability
>>>>> I agree deduplicationMode='none' is functionally correct, but my
>>>>> concern is discoverability. A compliance engineer or DBA writing SQL
>>>>> shouldn't need institutional knowledge of a hidden WITH clause option
>>>>> string to get audit-safe output. Having an explicit ALL CHANGES
>>>>> modifier in the grammar is crucial for enterprise adoption and
>>>>> auditing.
>>>>>
>>>>> I am highly supportive of the core architecture, but these are real
>>>>> production concerns for enterprise workloads. Items 1 and 3 are
>>>>> blockers I'd like addressed in the SPIP document; Item 2 is a real
>>>>> limitation that could reasonably be tracked as a follow-on
>>>>> improvement. Happy to cast my +1 once 1 and 3 are clarified.
>>>>>
>>>>> Regards,
>>>>> Viquar Khan
>>>>>
>>>>> On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Viquar,
>>>>>>
>>>>>> Thanks for the detailed review — all three concerns are already
>>>>>> accounted for in the current SPIP design (Appendix B.2 and B.6).
>>>>>>
>>>>>> 1. Capability Pushdown: The Changelog interface already exposes
>>>>>> declarative capability methods — containsCarryoverRows(),
>>>>>> containsIntermediateChanges(), and
>>>>>> representsUpdateAsDeleteAndInsert(). The ResolveChangelogTable rule
>>>>>> only injects post-processing when the connector declares it is
>>>>>> needed. If Delta Lake already materializes pre/post-images natively,
>>>>>> it returns representsUpdateAsDeleteAndInsert() = false and Spark
>>>>>> skips that work entirely. Catalyst never reconstructs what the
>>>>>> storage layer already provides.
>>>>>>
>>>>>> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
>>>>>> containsCarryoverRows() = true. If a connector eliminates carry-over
>>>>>> rows at the scan level, it returns false and Spark does nothing.
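[To make the gating above concrete, here is a hedged Python sketch of what "carry-over removal" means, assuming a copy-on-write rewrite re-emits unchanged rows as matching delete/insert pairs with identical data columns. The row shape and `_change_type` column are illustrative assumptions, not the SPIP's schema; Spark would apply logic like this only when containsCarryoverRows() returns true.]

```python
from collections import Counter

def remove_carryovers(rows):
    """Drop delete/insert pairs whose non-metadata columns are identical:
    those rows were merely rewritten by copy-on-write, not changed."""
    def data(r):  # the row minus its change-type metadata column
        return tuple(sorted((k, v) for k, v in r.items() if k != "_change_type"))
    deletes = Counter(data(r) for r in rows if r["_change_type"] == "delete")
    inserts = Counter(data(r) for r in rows if r["_change_type"] == "insert")
    carry = deletes & inserts                  # multiset intersection
    budget = {"delete": dict(carry), "insert": dict(carry)}
    out = []
    for r in rows:
        ct, k = r["_change_type"], data(r)
        if ct in budget and budget[ct].get(k, 0) > 0:
            budget[ct][k] -= 1                 # carry-over row: skip it
            continue
        out.append(r)
    return out

# A CoW rewrite deleted a whole file: row id=2 was re-inserted unchanged
# (carry-over), while row id=1 was genuinely deleted.
changes = [
    {"id": 1, "val": "gone", "_change_type": "delete"},
    {"id": 2, "val": "x", "_change_type": "delete"},
    {"id": 2, "val": "x", "_change_type": "insert"},
]
result = remove_carryovers(changes)
```

[A connector that filters carry-overs during scan planning would simply never emit the id=2 pair, which is why returning false lets Spark skip this step.]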
>>>>>> The connector also retains full control over scan planning via its
>>>>>> ScanBuilder, so I/O optimization stays in the storage layer.
>>>>>>
>>>>>> 3. Audit Fidelity: The deduplicationMode option already supports
>>>>>> none, dropCarryovers, and netChanges. Setting deduplicationMode =
>>>>>> 'none' returns the raw, unmodified change stream with every
>>>>>> intermediate state preserved. Net change collapsing happens only
>>>>>> when explicitly requested by the user.
>>>>>>
>>>>>> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1, really looking forward to this feature.
>>>>>>>
>>>>>>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Sorry for the late response. I know the vote is actively underway,
>>>>>>>> but reviewing the SPIP's Catalyst post-processing mechanics raised
>>>>>>>> a few systemic design concerns we need to clarify to avoid severe
>>>>>>>> performance regressions down the line.
>>>>>>>>
>>>>>>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>>>>>>> pre/post-images from raw insert/delete pairs. Storage layers like
>>>>>>>> Delta Lake already materialize these natively. If the Changelog
>>>>>>>> interface lacks state pushdown, Catalyst will burn CPU and memory
>>>>>>>> reconstructing what the storage layer already solved.
>>>>>>>>
>>>>>>>> 2. CoW I/O Bottlenecks: Mandating that Catalyst filter "carry-over"
>>>>>>>> rows for CoW tables is highly problematic. Without strict
>>>>>>>> connector-level row lineage, we will be dragging massive,
>>>>>>>> unmodified Parquet files across the network, forcing Spark into
>>>>>>>> heavy distributed joins just to discard unchanged data.
>>>>>>>>
>>>>>>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>>>>>>> changes." Collapsing intermediate states breaks enterprise audit
>>>>>>>> and compliance workflows that require full transactional history.
>>>>>>>> The SQL grammar needs an explicit ALL CHANGES execution path.
>>>>>>>>
>>>>>>>> I fully support unifying CDC and this SPIP is the right direction,
>>>>>>>> but abstracting it at the cost of storage-native optimizations and
>>>>>>>> audit fidelity is a dangerous trade-off. We need to clarify how
>>>>>>>> physical planning will handle these bottlenecks before formally
>>>>>>>> ratifying the proposal.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Viquar Khan
>>>>>>>>
>>>>>>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Thanks for the contribution!
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1!
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1, looking forward to it (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial
>>>>>>>>>>>>> Forensics & Metadata Analytics | Transaction Reconstruction |
>>>>>>>>>>>>> Audit & Evidence-Based Analytics
>>>>>>>>>>>>>
>>>>>>>>>>>>> view my LinkedIn profile
>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture
>>>>>>>>>>>>>> (CDC) Support*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Summary:*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work
>>>>>>>>>>>>>> across DSv2 connectors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Standardized User API
>>>>>>>>>>>>>> SQL:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> DataFrame API:
>>>>>>>>>>>>>> spark.read
>>>>>>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>>>>>>   .changes("my_table")
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer
>>>>>>>>>>>>>> will take over the CDC post-processing, including:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>>>>>>>> - Deriving pre-image/post-image updates from raw
>>>>>>>>>>>>>>   insert/delete pairs.
>>>>>>>>>>>>>> - Computing net changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Relevant Links:*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - *SPIP Doc:*
>>>>>>>>>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>>>>>>> - *Discuss Thread:*
>>>>>>>>>>>>>>   https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>>>>>>> - *JIRA:*
>>>>>>>>>>>>>>   https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Gengliang Wang
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
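[As an illustration of the "net changes" step listed in the proposal above, and of why deduplicationMode='none' matters for audit trails per the discussion in this thread, here is a minimal Python sketch that replays an ordered change stream and keeps only each key's net effect. The row shape, commit-ordering assumption, and `_change_type` labels are illustrative assumptions, not the SPIP's actual schema. Note how intermediate states vanish, which is exactly the audit-fidelity concern raised earlier.]

```python
def net_changes(rows, key):
    """Collapse an ordered change stream to net effects per key.
    Sketch only: assumes rows arrive in commit order and updates are
    encoded as delete/insert pairs."""
    state = {}            # key -> latest inserted row, or None if deleted
    pre_existing = set()  # keys that existed before the change range began
    for r in rows:
        k = r[key]
        if r["_change_type"] == "delete":
            if k not in state:       # first event for k is a delete,
                pre_existing.add(k)  # so k existed before the range
            state[k] = None
        else:  # "insert"
            state[k] = r
    out = []
    for k, row in state.items():
        if row is None and k in pre_existing:
            out.append({key: k, "_change_type": "delete"})  # net delete
        elif row is not None and k in pre_existing:
            out.append({**row, "_change_type": "update_postimage"})
        elif row is not None:
            out.append(row)          # net insert
        # inserted then deleted inside the range: no net change at all
    return out

changes = [
    {"id": 1, "val": "a", "_change_type": "insert"},
    {"id": 1, "val": "a", "_change_type": "delete"},
    {"id": 1, "val": "b", "_change_type": "insert"},   # update of id=1
    {"id": 2, "val": "tmp", "_change_type": "insert"},
    {"id": 2, "val": "tmp", "_change_type": "delete"}, # nets to nothing
    {"id": 3, "val": "old", "_change_type": "delete"},
]
net = net_changes(changes, "id")
```

[With deduplicationMode='none' all six raw rows above would be returned; the net-change view keeps only two, losing the id=2 row entirely, which is why the raw mode is the audit-safe choice.]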
