+1

On Wed, Mar 4, 2026 at 12:47 PM vaquar khan <[email protected]> wrote:
> Thanks for updating the SPIP (B.2 and A.1).
>
> +1
>
> Regards,
> Viquar Khan
>
> On Wed, 4 Mar 2026 at 14:03, Gengliang Wang <[email protected]> wrote:
>
>> Sure, I've updated the SPIP doc with the semantic guarantee note in
>> Appendix B.2 and expanded the deduplicationMode descriptions in A.1 to
>> clarify all three modes.
>> Regarding the warning for incomplete change logs — that is
>> connector-specific behavior, so it's best left to each connector's
>> implementation rather than prescribed at the Spark level.
>>
>> On Wed, Mar 4, 2026 at 11:06 AM vaquar khan <[email protected]> wrote:
>>
>>> Thanks Gengliang for the continued engagement, and thank you Anton for
>>> the important clarification on containsCarryoverRows().
>>>
>>> 1. Capability Naming — Documentation Clarification Ask
>>> I respect the DSv2 naming consistency. My ask is narrower: please add a
>>> note in Appendix B.2 explicitly stating that returning false carries the
>>> semantic guarantee that pre/post-images are fully materialized by the
>>> connector, not lazily computed at scan time. No new method needed, just
>>> documentation clarity to prevent incorrect connector implementations.
>>>
>>> 2. CoW I/O — Fully Withdrawn
>>> Anton's clarification changes my position here entirely. If TableCatalog
>>> loads Changelog with awareness of the specific range being scanned, and
>>> the connector can inspect the actual commit history for that range to
>>> determine whether CoW operations occurred, then containsCarryoverRows()
>>> is effectively a range-scoped, commit-aware signal, not a coarse
>>> table-level binary. That fully addresses my concern. I'm withdrawing
>>> Item 2 entirely, not just as a blocker.
>>>
>>> 3. Audit Discoverability — Revised Ask
>>> You're right that ALL CHANGES could mislead compliance engineers when
>>> change logs are partially vacuumed. I withdraw that request.
>>> My revised ask:
>>> - Clearly state in the SQL documentation that deduplicationMode='none'
>>> is the right way to get a full audit trail.
>>> - Show a warning if a user queries a table where some of the old change
>>> logs have already been deleted.
>>>
>>> With items 1 and 3 addressed in the SPIP text, count my +1.
>>>
>>> Regards,
>>> Viquar Khan
>>>
>>> On Wed, 4 Mar 2026 at 12:35, Anton Okolnychyi <[email protected]> wrote:
>>>
>>>> To add to Gengliang's point, TableCatalog would load Changelog
>>>> knowing the range that is being scanned. This allows the connector to
>>>> traverse the commit history and detect whether it had any CoW operation
>>>> or not. In other words, it is not a blind flag at the table level. It is
>>>> specific to the changelog range that is being requested.
>>>>
>>>> On Wed, Mar 4, 2026 at 09:17 Gengliang Wang <[email protected]> wrote:
>>>>
>>>>> Thanks for the follow-up — appreciate the rigor.
>>>>>
>>>>> *1. Capability Naming*: The naming is intentional —
>>>>> representsUpdateAsDeleteAndInsert() mirrors the existing
>>>>> SupportsDelta.representUpdateAsDeleteAndInsert()
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsDelta.java#L45>
>>>>> in the DSv2 API. When it returns false, it means the connector's change
>>>>> data already distinguishes updates from raw delete/insert pairs, so
>>>>> there is nothing for Catalyst to derive.
>>>>>
>>>>> *2. Partition-Level CoW Hints*: A table-level flag is sufficient
>>>>> for the common case. If a connector has partitions with mixed CoW
>>>>> behavior and needs finer-grained control, it can simply return
>>>>> containsCarryoverRows() = false and handle carry-over removal internally
>>>>> within its ScanBuilder — the interface already supports this. There is
>>>>> no need to complicate the Spark-level API for an edge case that
>>>>> connectors can solve themselves.
>>>>>
>>>>> *3. Audit Discoverability*: The SPIP proposes only two options in the
>>>>> WITH clause (deduplicationMode and computeUpdates) — this is a small,
>>>>> well-documented surface, not a hidden knob. Adding an ALL CHANGES
>>>>> grammar modifier introduces its own discoverability problem: it implies
>>>>> the table retains a complete history of all changes, which is not
>>>>> guaranteed — most formats discard old change data after
>>>>> vacuum/expiration. A SQL keyword that suggests completeness but silently
>>>>> returns partial results is arguably worse for compliance engineers than
>>>>> an explicit option with clear documentation.
>>>>>
>>>>> On Tue, Mar 3, 2026 at 11:17 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Thanks Gengliang for the detailed follow-up. While the mechanics
>>>>>> you laid out make sense on paper, I'm looking at how this will actually
>>>>>> play out in production.
>>>>>>
>>>>>> 1. Capability Pushdown vs. Format Flag
>>>>>> Returning representsUpdateAsDeleteAndInsert() = false just signals
>>>>>> that the connector doesn't use raw delete/insert pairs. It doesn't
>>>>>> explicitly tell Catalyst, "I already computed the pre/post images
>>>>>> natively, trust my output and skip the window function entirely."
>>>>>> Those are semantically different. A dedicated
>>>>>> supportsNativePrePostImages() capability method would close this gap
>>>>>> much more cleanly than overloading the format flag.
>>>>>>
>>>>>> 2. CoW I/O is a Table-Level Binary
>>>>>> The ScanBuilder delegation is a fair point, but
>>>>>> containsCarryoverRows() is still a table-level binary flag. For
>>>>>> massive, partitioned CoW tables that have carry-overs in some
>>>>>> partitions but not others, this interface forces Spark to apply
>>>>>> carry-over removal globally or not at all. A partition-level or
>>>>>> scan-level hint is a necessary improvement for mixed-mode CoW tables.
>>>>>>
>>>>>> 3. Audit Discoverability
>>>>>> I agree deduplicationMode='none' is functionally correct, but my
>>>>>> concern is discoverability. A compliance engineer or DBA writing SQL
>>>>>> shouldn't need institutional knowledge of a hidden WITH clause option
>>>>>> string to get audit-safe output. Having an explicit ALL CHANGES
>>>>>> modifier in the grammar is crucial for enterprise adoption and
>>>>>> auditing.
>>>>>>
>>>>>> I am highly supportive of the core architecture, but these are real
>>>>>> concerns for enterprise workloads. Items 1 and 3 are production
>>>>>> blockers I'd like addressed in the SPIP document; Item 2 is a real
>>>>>> limitation but could reasonably be tracked as a follow-on improvement.
>>>>>> Happy to cast my +1 once 1 and 3 are clarified.
>>>>>>
>>>>>> Regards,
>>>>>> Viquar Khan
>>>>>>
>>>>>> On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Viquar,
>>>>>>>
>>>>>>> Thanks for the detailed review — all three concerns are already
>>>>>>> accounted for in the current SPIP design (Appendix B.2 and B.6).
>>>>>>>
>>>>>>> 1. Capability Pushdown: The Changelog interface already exposes
>>>>>>> declarative capability methods — containsCarryoverRows(),
>>>>>>> containsIntermediateChanges(), and representsUpdateAsDeleteAndInsert().
>>>>>>> The ResolveChangelogTable rule only injects post-processing when the
>>>>>>> connector declares it is needed. If Delta Lake already materializes
>>>>>>> pre/post-images natively, it returns
>>>>>>> representsUpdateAsDeleteAndInsert() = false and Spark skips that work
>>>>>>> entirely. Catalyst never reconstructs what the storage layer already
>>>>>>> provides.
>>>>>>>
>>>>>>> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
>>>>>>> containsCarryoverRows() = true. If a connector eliminates carry-over
>>>>>>> rows at the scan level, it returns false and Spark does nothing.
>>>>>>> The connector also retains full control over scan planning via its
>>>>>>> ScanBuilder, so I/O optimization stays in the storage layer.
>>>>>>>
>>>>>>> 3. Audit Fidelity: The deduplicationMode option already supports
>>>>>>> none, dropCarryovers, and netChanges. Setting deduplicationMode =
>>>>>>> 'none' returns the raw, unmodified change stream with every
>>>>>>> intermediate state preserved. Net change collapsing happens only when
>>>>>>> explicitly requested by the user.
>>>>>>>
>>>>>>> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]> wrote:
>>>>>>>
>>>>>>>> +1, really looking forward to this feature.
>>>>>>>>
>>>>>>>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Sorry for the late response; I know the vote is actively underway,
>>>>>>>>> but reviewing the SPIP's Catalyst post-processing mechanics raised
>>>>>>>>> a few systemic design concerns we need to clarify to avoid severe
>>>>>>>>> performance regressions down the line.
>>>>>>>>>
>>>>>>>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>>>>>>>> pre/post-images from raw insert/delete pairs. Storage layers like
>>>>>>>>> Delta Lake already materialize these natively. If the Changelog
>>>>>>>>> interface lacks state pushdown, Catalyst will burn CPU and memory
>>>>>>>>> reconstructing what the storage layer already solved.
>>>>>>>>>
>>>>>>>>> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over"
>>>>>>>>> rows for CoW tables is highly problematic. Without strict
>>>>>>>>> connector-level row lineage, we will be dragging massive,
>>>>>>>>> unmodified Parquet files across the network, forcing Spark into
>>>>>>>>> heavy distributed joins just to discard unchanged data.
>>>>>>>>>
>>>>>>>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>>>>>>>> changes."
>>>>>>>>> Collapsing intermediate states breaks enterprise audit and
>>>>>>>>> compliance workflows that require full transactional history. The
>>>>>>>>> SQL grammar needs an explicit ALL CHANGES execution path.
>>>>>>>>>
>>>>>>>>> I fully support unifying CDC, and this SPIP is the right direction,
>>>>>>>>> but abstracting it at the cost of storage-native optimizations and
>>>>>>>>> audit fidelity is a dangerous trade-off. We need to clarify how
>>>>>>>>> physical planning will handle these bottlenecks before formally
>>>>>>>>> ratifying the proposal.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Viquar Khan
>>>>>>>>>
>>>>>>>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Thanks for the contribution!
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1!
>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, look forward to it (non-binding)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial
>>>>>>>>>>>>>> Forensics & Metadata Analytics | Transaction Reconstruction |
>>>>>>>>>>>>>> Audit & Evidence-Based Analytics
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> View my LinkedIn profile
>>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture
>>>>>>>>>>>>>>> (CDC) Support*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Summary:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work
>>>>>>>>>>>>>>> across DSv2 connectors.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Standardized User API
>>>>>>>>>>>>>>> SQL:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> DataFrame API:
>>>>>>>>>>>>>>> spark.read
>>>>>>>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>>>>>>>   .changes("my_table")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer
>>>>>>>>>>>>>>> will take over the CDC post-processing, including:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>>>>>>>>> - Deriving pre-image/post-image updates from raw
>>>>>>>>>>>>>>>   insert/delete pairs.
>>>>>>>>>>>>>>> - Computing net changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Relevant Links:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - *SPIP Doc:*
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>>>>>>>> - *Discuss Thread:*
>>>>>>>>>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>>>>>>>> - *JIRA:*
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Gengliang Wang
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
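[Editor's summary of the capability gating discussed in this thread. The sketch below is an illustration only: the Changelog method names (containsCarryoverRows, containsIntermediateChanges, representsUpdateAsDeleteAndInsert) and the deduplicationMode values are taken from the messages above, while the requiredPostProcessing helper and everything else is invented for the example and is not part of the proposed Spark API.]

```java
import java.util.ArrayList;
import java.util.List;

public class ChangelogSketch {

    // Capability surface a connector would declare, per the SPIP discussion.
    interface Changelog {
        boolean containsCarryoverRows();             // CoW carry-over rows present in this range?
        boolean containsIntermediateChanges();       // multiple versions of the same row?
        boolean representsUpdateAsDeleteAndInsert(); // updates emitted as raw delete/insert pairs?
    }

    // Hypothetical helper: which post-processing steps the engine would inject
    // for a scan, given the connector's declared capabilities and user options.
    static List<String> requiredPostProcessing(Changelog log, String deduplicationMode,
                                               boolean computeUpdates) {
        List<String> steps = new ArrayList<>();
        // Carry-over removal only when the connector cannot do it itself
        // and the user did not ask for the raw stream (deduplicationMode='none').
        if (log.containsCarryoverRows() && !deduplicationMode.equals("none")) {
            steps.add("dropCarryovers");
        }
        // Pre/post-image derivation only for connectors emitting raw D/I pairs.
        if (computeUpdates && log.representsUpdateAsDeleteAndInsert()) {
            steps.add("derivePrePostImages");
        }
        // Net-change collapsing is opt-in and only meaningful when
        // intermediate changes actually exist in the changelog.
        if (deduplicationMode.equals("netChanges") && log.containsIntermediateChanges()) {
            steps.add("collapseNetChanges");
        }
        return steps;
    }

    public static void main(String[] args) {
        // A Delta-like connector that already materializes pre/post-images
        // and removes carry-over rows at the scan level.
        Changelog nativeImages = new Changelog() {
            public boolean containsCarryoverRows() { return false; }
            public boolean containsIntermediateChanges() { return true; }
            public boolean representsUpdateAsDeleteAndInsert() { return false; }
        };
        // Raw stream requested: the engine injects nothing, trusting the connector.
        System.out.println(requiredPostProcessing(nativeImages, "none", true));
        // Net changes requested: only the opt-in collapsing step is injected.
        System.out.println(requiredPostProcessing(nativeImages, "netChanges", true));
    }
}
```

This mirrors the point made repeatedly above: every post-processing step is gated on a capability the connector declares, so a connector that already solved a problem in its storage layer returns false and Catalyst stays out of the way.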
