Hi Mich,
The proposal actually covers all three categories:
- Time-based — CHANGES FROM TIMESTAMP '2026-01-01' TO TIMESTAMP '2026-02-01'
- Token-based (SCN/LSN) — CHANGES FROM VERSION '1234567' TO VERSION '2345678'
  — VERSION accepts strings, so connectors can interpret them as SCN, LSN,
  Kafka offset, etc.
- Format-version-based — CHANGES FROM VERSION 10 TO VERSION 20
This is a Data Source V2 connector interface. The syntax is
intentionally agnostic
— each connector interprets "version" and "timestamp" in its own domain.
Spark provides the unified syntax and post-processing; connectors provide
the domain-specific change data.
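
Since Spark owns the post-processing in this design, a toy example may help
make the semantics concrete. Below is a minimal, hypothetical Python sketch of
deriving pre-image/post-image updates from raw insert/delete pairs; the field
names and pairing rule are my own illustration, not the SPIP's implementation
(which is described in terms of window functions inside Catalyst):

```python
from collections import defaultdict

# Hypothetical sketch, NOT the SPIP's actual implementation: shows the intended
# semantics of collapsing a raw delete+insert pair for the same key at the same
# version into a single logical update with pre- and post-images.

def derive_updates(rows):
    """Collapse a delete+insert pair for the same (id, version) into
    update_preimage / update_postimage rows; pass other rows through."""
    by_key = defaultdict(list)
    for r in rows:
        by_key[(r["id"], r["version"])].append(r)
    out = []
    for group in by_key.values():
        ops = {r["op"] for r in group}
        if ops == {"delete", "insert"}:
            pre = next(r for r in group if r["op"] == "delete")
            post = next(r for r in group if r["op"] == "insert")
            out.append({**pre, "op": "update_preimage"})
            out.append({**post, "op": "update_postimage"})
        else:
            out.extend(group)
    # Stable sort keeps the pre-image before the post-image within a version.
    return sorted(out, key=lambda r: (r["version"], r["id"]))

raw = [
    {"op": "insert", "id": 1, "value": "a", "version": 10},
    {"op": "delete", "id": 1, "value": "a", "version": 12},
    {"op": "insert", "id": 1, "value": "b", "version": 12},
]
# Yields an insert of "a" at v10, then an update "a" -> "b" at v12.
changes = derive_updates(raw)
```

The same idea extends to filtering copy-on-write carry-over rows and computing
net changes, which is exactly the duplicated logic the SPIP wants to pull into
the engine.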
On Mon, Mar 2, 2026 at 8:44 AM Anton Okolnychyi <[email protected]>
wrote:
> Mich, the proposal already comes with built-in support for timestamp
> ranges and a generic meaning of version. In Delta, this is a log version
> (number). In Iceberg, this is going to be a snapshot ID (string). Each
> connector can treat the version differently.
>
> I had a chance to see this proposal before it landed and I think it will
> be a great addition to Spark. I like the approach with computing updates
> and deduplication using window functions but we will have to benchmark the
> performance. If it ends up slower than what external connectors like Delta
> and Iceberg do today, we will have to pivot. This will only be known once
> the implementation is done. That said, it has no impact on the proposed
> behavior and APIs.
>
> Great to see this.
>
> - Anton
>
> On Mon, Mar 2, 2026 at 03:54 Mich Talebzadeh <[email protected]>
> wrote:
>
>> One has to clarify that this is not all-inclusive CDC.
>>
>> So a realistic unified interface for CDC should support one of:
>>
>>
>> 1. time-based: “changes between T1 and T2”
>> 2. token-based: “changes between SCN/LSN tokens” (Oracle)
>> 3. format-version-based: “changes between snapshot/version IDs”
>> (Delta/Iceberg/Hudi)
>>
>> This solution seems to aim for (3) only.
>>
>>
>>
>> Dr Mich Talebzadeh,
>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>> Analytics
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:
>>
>>> @Holden Karau <[email protected]> Thanks for taking a look! I have
>>> actually synced with a few Delta Lake and Iceberg committers offline, and
>>> they were comfortable with the proposed SQL syntax and API. Because this
>>> introduces a new SQL syntax, it won't affect the functionality of the
>>> existing connectors.
>>>
>>> Many of the active Delta and Iceberg developers are also on this mailing
>>> list, so I'm hoping we can gather most of the initial feedback right here
>>> in this thread. However, if we need deeper connector-specific alignment as
>>> the discussion evolves, I'm definitely open to cross-posting it to their
>>> respective lists.
>>>
>>> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> This looks cool overall, would it maybe make sense to share with the
>>>> delta lake devs & iceberg devs for their input too? I have not had a chance
>>>> to dig into this closely yet though.
>>>>
>>>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> It looks like my original email might have landed in some spam
>>>>> folders, so I am just bumping this thread for visibility.
>>>>>
>>>>> For quick reference, here are the links to the proposal again:
>>>>>
>>>>> - SPIP Document:
>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>
>>>>> Looking forward to your thoughts and feedback!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gengliang
>>>>>
>>>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 (non binding)
>>>>>>
>>>>>> This is a great idea. I look forward to a standard user experience
>>>>>> for CDC across DSv2 data sources, and to centralizing the complicated
>>>>>> shared logic.
>>>>>>
>>>>>> Also, this somehow showed up in my Spam folder :) , hope this reply
>>>>>> brings it out.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>>>>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>>>>>
>>>>>>> - SPIP Document:
>>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> Currently, querying row-level changes (inserts, updates, deletes)
>>>>>>> from a table requires connector-specific syntax. This fragmentation
>>>>>>> breaks
>>>>>>> query portability across different storage formats and forces each
>>>>>>> connector to reinvent complex post-processing logic:
>>>>>>>
>>>>>>> - Delta Lake: Uses table_changes()
>>>>>>> - Iceberg: Uses .changes virtual tables
>>>>>>> - Hudi: Relies on custom incremental read options
>>>>>>>
>>>>>>> There is no universal, engine-level standard in Spark to ask "show
>>>>>>> me what changed."
>>>>>>>
>>>>>>> Proposal
>>>>>>>
>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>>> connectors.
>>>>>>>
>>>>>>> 1. Standardized User API
>>>>>>>
>>>>>>> SQL:
>>>>>>>
>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>
>>>>>>> -- Streaming: Continuously process changes
>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>
>>>>>>> DataFrame API:
>>>>>>>
>>>>>>> spark.read
>>>>>>>   .option("startingVersion", "10")
>>>>>>>   .option("endingVersion", "20")
>>>>>>>   .changes("my_table")
>>>>>>>
>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>
>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>>> over the CDC post-processing, including:
>>>>>>>
>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>>>> - Computing net changes.
>>>>>>>
>>>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>>>>>> for users.
>>>>>>>
>>>>>>> Please review the full SPIP for comprehensive design details, the
>>>>>>> proposed connector API, and deduplication semantics.
>>>>>>>
>>>>>>> Feedback and discussion are highly appreciated!
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>