Hi everyone,

Sorry for the late response. I know the vote is already underway, but reviewing the SPIP's Catalyst post-processing mechanics raised a few systemic design concerns that we should clarify to avoid severe performance regressions down the line.
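
To make the concerns below concrete, here is a toy, pure-Python sketch of two of the post-processing steps the SPIP lists: filtering copy-on-write carry-over rows and computing net changes. It is not the SPIP's algorithm and not Spark code. The tuple layout (version, change_type, row_id, value) and the function names drop_carry_over and net_changes are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw changelog rows: (commit_version, change_type, row_id, value).
# A copy-on-write rewrite emits a delete + insert for every row in the
# rewritten file, including rows whose value did not change ("carry-over").
RAW = [
    (11, "delete", 1, "a"), (11, "insert", 1, "a"),    # carry-over, unchanged
    (11, "delete", 2, "b"), (11, "insert", 2, "b2"),   # real update
    (12, "delete", 2, "b2"), (12, "insert", 2, "b3"),  # second update
    (12, "insert", 3, "c"),                            # new row
]


def drop_carry_over(rows):
    """Drop delete/insert pairs that share commit, row_id and value."""
    kinds_by_key = defaultdict(set)
    for version, kind, row_id, value in rows:
        kinds_by_key[(version, row_id, value)].add(kind)
    return [
        r for r in rows
        if kinds_by_key[(r[0], r[2], r[3])] != {"delete", "insert"}
    ]


def net_changes(rows):
    """Collapse each row_id to (value before range, value after range)."""
    events = defaultdict(list)
    # Order by version, with deletes before inserts inside one commit.
    for version, kind, row_id, value in sorted(
        rows, key=lambda r: (r[0], r[1] == "insert")
    ):
        events[row_id].append((kind, value))
    net = {}
    for row_id, evs in events.items():
        pre = evs[0][1] if evs[0][0] == "delete" else None
        post = evs[-1][1] if evs[-1][0] == "insert" else None
        if pre != post:
            net[row_id] = (pre, post)
    return net


all_changes = drop_carry_over(RAW)
net = net_changes(all_changes)
```

On this input the net view reports row 2 as going from "b" to "b3". The intermediate "b2" value from version 11 survives only in the full list, which is the case point 3 is about. The carry-over filter also has to see every rewritten row before it can discard the unchanged ones, which is the cost point 2 is about.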

1. Capability Pushdown: The proposal has Catalyst derive pre/post-images from raw insert/delete pairs. Storage layers like Delta Lake already materialize these natively. If the Changelog interface lacks state pushdown, Catalyst will burn CPU and memory reconstructing what the storage layer has already computed.

2. CoW I/O Bottlenecks: Requiring Catalyst to filter "carry-over" rows for CoW tables is highly problematic. Without strict connector-level row lineage, we will drag massive, unmodified Parquet files across the network and force Spark into heavy distributed joins just to discard unchanged data.

3. Audit Fidelity: The design explicitly targets computing "net changes." Collapsing intermediate states breaks enterprise audit and compliance workflows that require full transactional history. The SQL grammar needs an explicit ALL CHANGES execution path.

I fully support unifying CDC, and this SPIP is the right direction, but abstracting it at the cost of storage-native optimizations and audit fidelity is a dangerous trade-off. We need to clarify how physical planning will handle these bottlenecks before formally ratifying the proposal.

Regards,
Viquar Khan

On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:

> +1 (non-binding)
>
> Thanks,
> Cheng Pan
>
> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>
> +1 (non-binding)
>
> Thanks for the contribution!
>
> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>
>> +1!
>>
>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]> wrote:
>>
>>> +1, look forward to it (non binding)
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>> Analytics
>>>>>
>>>>> view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I'd like to call a vote on the SPIP*: Change Data Capture (CDC)
>>>>>> Support*
>>>>>>
>>>>>> *Summary:*
>>>>>>
>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>>>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>> connectors.
>>>>>>
>>>>>> 1. Standardized User API
>>>>>>
>>>>>> SQL:
>>>>>>
>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>
>>>>>> -- Streaming: Continuously process changes
>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>
>>>>>> DataFrame API:
>>>>>>
>>>>>> spark.read
>>>>>>   .option("startingVersion", "10")
>>>>>>   .option("endingVersion", "20")
>>>>>>   .changes("my_table")
>>>>>>
>>>>>> 2. Engine-Level Post Processing
>>>>>>
>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>> over the CDC post-processing, including:
>>>>>>
>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>> - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>   pairs.
>>>>>> - Computing net changes.
>>>>>>
>>>>>> *Relevant Links:*
>>>>>>
>>>>>> - *SPIP Doc: *
>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>> - *Discuss Thread: *
>>>>>>   https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>> - *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>
>>>>>> [ ] +0
>>>>>>
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>
>>>>>> Thanks,
>>>>>> Gengliang Wang
>>>>>>
>
> --
> John Zhuge
