Hi Viquar,

Thanks for the detailed review. All three concerns are already accounted for in the current SPIP design (Appendix B.2 and B.6).

1. Capability Pushdown: The Changelog interface already exposes declarative capability methods: containsCarryoverRows(), containsIntermediateChanges(), and representsUpdateAsDeleteAndInsert(). The ResolveChangelogTable rule injects post-processing only when the connector declares that it is needed. If Delta Lake already materializes pre/post-images natively, it returns representsUpdateAsDeleteAndInsert() = false and Spark skips that work entirely. Catalyst never reconstructs what the storage layer already provides.

2. CoW I/O Bottlenecks: Carry-over removal is already gated on containsCarryoverRows() = true. If a connector eliminates carry-over rows at the scan level, it returns false and Spark does nothing. The connector also retains full control over scan planning via its ScanBuilder, so I/O optimization stays in the storage layer.

3. Audit Fidelity: The deduplicationMode option already supports none, dropCarryovers, and netChanges. Setting deduplicationMode = 'none' returns the raw, unmodified change stream with every intermediate state preserved. Net change collapsing happens only when the user explicitly requests it.

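
To make the gating concrete, here is a rough sketch of the decision logic in Python. It is a toy model, not the actual ResolveChangelogTable rule or the real Changelog interface: the ChangelogCapabilities class, the plan_post_processing function, and the step names are hypothetical, and the way deduplicationMode combines with the capability flags (including the use of containsIntermediateChanges() to gate net-change computation) is my assumption for illustration; the SPIP doc is the authority on the exact semantics.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the three capability methods a connector declares.
# Field names mirror the Changelog methods named in this thread.
@dataclass(frozen=True)
class ChangelogCapabilities:
    contains_carryover_rows: bool                 # containsCarryoverRows()
    contains_intermediate_changes: bool           # containsIntermediateChanges()
    represents_update_as_delete_and_insert: bool  # representsUpdateAsDeleteAndInsert()

# The three deduplicationMode values listed in the reply above.
DEDUP_MODES = ("none", "dropCarryovers", "netChanges")

def plan_post_processing(caps, dedup_mode):
    """Return the post-processing steps the engine would add (illustrative only)."""
    if dedup_mode not in DEDUP_MODES:
        raise ValueError(f"unknown deduplicationMode: {dedup_mode!r}")
    # Assumption: 'none' short-circuits everything and yields the raw change stream.
    if dedup_mode == "none":
        return []
    steps = []
    # Carry-over removal is added only if the connector says such rows exist.
    if caps.contains_carryover_rows:
        steps.append("remove_carryover_rows")
    # Pre/post-image derivation is skipped when the connector materializes them.
    if caps.represents_update_as_delete_and_insert:
        steps.append("derive_update_images")
    # Net-change collapsing is added only when explicitly requested.
    if dedup_mode == "netChanges" and caps.contains_intermediate_changes:
        steps.append("compute_net_changes")
    return steps

# A connector that handles everything natively gets no extra work.
native = ChangelogCapabilities(False, False, False)
print(plan_post_processing(native, "netChanges"))

# A copy-on-write connector that emits raw delete/insert pairs.
cow = ChangelogCapabilities(True, True, True)
print(plan_post_processing(cow, "dropCarryovers"))
```

In this model a connector that returns false for all three flags never triggers any engine-side step, and a user who sets deduplicationMode = 'none' gets an empty plan regardless of what the connector declares.
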
On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]> wrote:

> +1, really looking forward to this feature.
>
> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:
>
>> Hi everyone,
>>
>> Sorry for the late response, I know the vote is actively underway, but
>> reviewing the SPIP's Catalyst post-processing mechanics raised a few
>> systemic design concerns we need to clarify to avoid severe performance
>> regressions down the line.
>>
>> 1. Capability Pushdown: The proposal has Catalyst deriving
>> pre/post-images from raw insert/delete pairs. Storage layers like Delta
>> Lake already materialize these natively. If the Changelog interface lacks
>> state pushdown, Catalyst will burn CPU and memory reconstructing what the
>> storage layer already solved.
>>
>> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over" rows
>> for CoW tables is highly problematic. Without strict connector-level row
>> lineage, we will be dragging massive, unmodified Parquet files across the
>> network, forcing Spark into heavy distributed joins just to discard
>> unchanged data.
>>
>> 3. Audit Fidelity: The design explicitly targets computing "net changes."
>> Collapsing intermediate states breaks enterprise audit and compliance
>> workflows that require full transactional history. The SQL grammar needs an
>> explicit ALL CHANGES execution path.
>>
>> I fully support unifying CDC and this SIP is the right direction, but
>> abstracting it at the cost of storage-native optimizations and audit
>> fidelity is a dangerous trade-off. We need to clarify how physical planning
>> will handle these bottlenecks before formally ratifying the proposal.
>>
>> Regards,
>> Viquar Khan
>>
>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> Thanks for the contribution!
>>>
>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>>
>>>> +1!
>>>>
>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> +1, look forward to it (non binding)
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>>>> Analytics
>>>>>>>
>>>>>>> view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Spark devs,
>>>>>>>>
>>>>>>>> I'd like to call a vote on the SPIP*: Change Data Capture (CDC)
>>>>>>>> Support*
>>>>>>>>
>>>>>>>> *Summary:*
>>>>>>>>
>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across
>>>>>>>> DSv2
>>>>>>>> connectors.
>>>>>>>>
>>>>>>>> 1. Standardized User API
>>>>>>>> SQL:
>>>>>>>>
>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>
>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>
>>>>>>>> DataFrame API:
>>>>>>>> spark.read
>>>>>>>> .option("startingVersion", "10")
>>>>>>>> .option("endingVersion", "20")
>>>>>>>> .changes("my_table")
>>>>>>>>
>>>>>>>> 2. Engine-Level Post Processing Under the hood, this proposal
>>>>>>>> introduces a minimal Changelog interface for DSv2 connectors.
>>>>>>>> Spark's Catalyst optimizer will take over the CDC post-processing,
>>>>>>>> including:
>>>>>>>>
>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>>> pairs.
>>>>>>>> - Computing net changes.
>>>>>>>>
>>>>>>>> *Relevant Links:*
>>>>>>>>
>>>>>>>> - *SPIP Doc: *
>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>> - *Discuss Thread: *
>>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>> - *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>>>
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>
>>>>>>>> [ ] +0
>>>>>>>>
>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gengliang Wang
>>>>>>>>
>>>
>>> --
>>> John Zhuge
>>>
