Re: [VOTE] SPIP: Change Data Capture (CDC) Support

Yuming Wang Tue, 03 Mar 2026 22:27:15 -0800

+1, really looking forward to this feature.

On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:


> Hi everyone,
>
> Sorry for the late response. I know the vote is actively underway, but
> reviewing the SPIP's Catalyst post-processing mechanics raised a few
> systemic design concerns we need to clarify to avoid severe performance
> regressions down the line.
>
> 1. Capability Pushdown: The proposal has Catalyst deriving pre/post-images
> from raw insert/delete pairs. Storage layers like Delta Lake already
> materialize these natively. If the Changelog interface lacks state
> pushdown, Catalyst will burn CPU and memory reconstructing what the storage
> layer already solved.
>
> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over" rows for
> CoW tables is highly problematic. Without strict connector-level row
> lineage, we will be dragging massive, unmodified Parquet files across the
> network, forcing Spark into heavy distributed joins just to discard
> unchanged data.
>
> 3. Audit Fidelity: The design explicitly targets computing "net changes."
> Collapsing intermediate states breaks enterprise audit and compliance
> workflows that require full transactional history. The SQL grammar needs an
> explicit ALL CHANGES execution path.
>
> I fully support unifying CDC, and this SPIP is the right direction, but
> abstracting it at the cost of storage-native optimizations and audit
> fidelity is a dangerous trade-off. We need to clarify how physical planning
> will handle these bottlenecks before formally ratifying the proposal.
>
> Regards,
> Viquar Khan
>
> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cheng Pan
>>
>>
>>
>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks for the contribution!
>>
>>
>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>
>>> +1!
>>>
>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> +1, look forward to it (non binding)
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>>> Analytics
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture (CDC)
>>>>>>> Support*
>>>>>>>
>>>>>>> *Summary:*
>>>>>>>
>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>>> connectors.
>>>>>>>
>>>>>>> 1. Standardized User API
>>>>>>> SQL:
>>>>>>>
>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>
>>>>>>> -- Streaming: Continuously process changes
>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>
>>>>>>> DataFrame API:
>>>>>>> spark.read
>>>>>>>   .option("startingVersion", "10")
>>>>>>>   .option("endingVersion", "20")
>>>>>>>   .changes("my_table")
>>>>>>>
>>>>>>> 2. Engine-Level Post Processing
>>>>>>> Under the hood, this proposal introduces a minimal Changelog interface
>>>>>>> for DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>>>>>>> post-processing, including:
>>>>>>>
>>>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>>>    - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>>    pairs.
>>>>>>>    - Computing net changes.
>>>>>>>
>>>>>>> *Relevant Links:*
>>>>>>>
>>>>>>>    - *SPIP Doc:*
>>>>>>>    https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>    - *Discuss Thread:*
>>>>>>>    https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>    - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>
>>>>>>>
>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>
>>>>>>> [ ] +0
>>>>>>>
>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gengliang Wang
>>>>>>>
>>>>>>
>>
>> --
>> John Zhuge
>>
>>
>>
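
The first two engine-level post-processing steps quoted in the vote email above
(filtering copy-on-write carry-over rows, then deriving pre-image/post-image
updates from raw insert/delete pairs) can be illustrated with a rough sketch in
plain Python. This is not Spark or Catalyst code and not the SPIP's design: the
row fields (commit_version, op, key, data), the function name post_process, and
the labels update_preimage/update_postimage are all assumptions made up for
illustration.

```python
from collections import defaultdict


def post_process(raw_changes):
    """Drop carry-over rows and pair delete/insert rows into updates.

    raw_changes: list of dicts with the hypothetical fields
    commit_version, op ("insert" or "delete"), key, and data.
    """
    # Group raw rows by (commit, key): one group is one key's activity
    # inside a single commit.
    groups = defaultdict(lambda: {"delete": [], "insert": []})
    for row in raw_changes:
        groups[(row["commit_version"], row["key"])][row["op"]].append(row)

    out = []
    for version, key in sorted(groups):
        deletes = groups[(version, key)]["delete"]
        inserts = groups[(version, key)]["insert"]
        if len(deletes) == 1 and len(inserts) == 1:
            before, after = deletes[0]["data"], inserts[0]["data"]
            if before == after:
                # Carry-over: copy-on-write rewrote the row unchanged.
                continue
            # A delete/insert pair with different data is an update.
            out.append({"commit_version": version, "key": key,
                        "op": "update_preimage", "data": before})
            out.append({"commit_version": version, "key": key,
                        "op": "update_postimage", "data": after})
        else:
            # Unpaired inserts or deletes pass through untouched.
            out.extend(deletes + inserts)
    return out


# Illustrative input: key 1 is a carry-over, key 2 an update, key 3 an insert.
raw = [
    {"commit_version": 11, "op": "delete", "key": 1, "data": "a"},
    {"commit_version": 11, "op": "insert", "key": 1, "data": "a"},
    {"commit_version": 11, "op": "delete", "key": 2, "data": "b"},
    {"commit_version": 11, "op": "insert", "key": 2, "data": "B"},
    {"commit_version": 11, "op": "insert", "key": 3, "data": "c"},
]
result = post_process(raw)
summary = [(r["key"], r["op"]) for r in result]
```

Note that the sketch has to see both the delete and the insert row for every
rewritten key before it can discard a carry-over, which is the cost the CoW
concern in the thread is about.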

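On the audit-fidelity point raised in the thread above, the difference between
"net changes" and the full change history can be sketched the same way. This is
plain Python under assumed conventions, not the SPIP's semantics: the op values
("insert", "update", "delete"), the row fields, and the function name
net_changes are hypothetical.

```python
def net_changes(all_changes):
    """Collapse a per-commit change history into one net change per key.

    all_changes: list of dicts with the hypothetical fields
    commit_version, key, op ("insert", "update", "delete"), and data
    (the row value after the change, None for deletes).
    """
    first, last = {}, {}
    # sorted() is stable, so rows within one commit keep their input order.
    for row in sorted(all_changes, key=lambda r: r["commit_version"]):
        first.setdefault(row["key"], row)
        last[row["key"]] = row

    net = {}
    for key, head in first.items():
        tail = last[key]
        # A key whose first event is not an insert existed before the range.
        existed_before = head["op"] != "insert"
        exists_after = tail["op"] != "delete"
        if existed_before and exists_after:
            net[key] = ("update", tail["data"])
        elif exists_after:
            net[key] = ("insert", tail["data"])
        elif existed_before:
            net[key] = ("delete", None)
        # Otherwise the key was created and removed inside the range,
        # so it contributes no net change at all.
    return net


history = [
    {"commit_version": 11, "key": 1, "op": "insert", "data": "a"},
    {"commit_version": 12, "key": 1, "op": "update", "data": "b"},
    {"commit_version": 13, "key": 1, "op": "delete", "data": None},
    {"commit_version": 11, "key": 2, "op": "update", "data": "x"},
    {"commit_version": 14, "key": 2, "op": "update", "data": "y"},
    {"commit_version": 12, "key": 3, "op": "insert", "data": "n"},
]
net = net_changes(history)
```

In this example key 1 has three events in the history but none in the net view,
and key 2's intermediate value "x" is lost. That is the information an audit
workflow would need the full history to recover.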