Hi,

Thanks for your comments.

I agree that allowing connectors to define the meaning of “version” makes
the API extensible. My original concern was less about syntax and more
about semantic portability. In Delta and Iceberg, version identifiers map
naturally to snapshot-based storage. In a traditional RDBMS such as Oracle,
change tracking is token- or log-based (e.g. SCNs) and may not align
cleanly with snapshot semantics.

Let us take the example given:

SELECT * FROM table CHANGES FROM VERSION 5 TO VERSION 10;

Syntactically, that works everywhere.

But semantically in Oracle (if such a mapping existed), it would likely mean
rows whose change SCN falls between two SCNs, i.e. a log-range scan rather
than a snapshot diff.

So while the API is generic at the surface, the CDC guarantees (ordering,
completeness, retention, idempotency) will ultimately remain
connector-specific.
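To make the distinction concrete, here is a minimal, purely illustrative
Python sketch (the function and variable names are mine, not part of the
proposal): two backends expose the same changes(start, end) call, yet return
different change sets because one diffs snapshots while the other replays a
log range.

```python
# Illustrative only: same surface call, different underlying change models.

def snapshot_changes(snapshots, start, end):
    """Delta/Iceberg-style: diff two materialized snapshots -> net changes only."""
    before, after = snapshots[start], snapshots[end]
    changes = []
    for key, val in after.items():
        if key not in before:
            changes.append(("insert", key, val))
        elif before[key] != val:
            changes.append(("update", key, val))
    for key in before:
        if key not in after:
            changes.append(("delete", key, None))
    return changes

def log_changes(log, start_scn, end_scn):
    """Oracle-style: replay every logged operation whose SCN falls in the range."""
    return [(op, key, val) for scn, op, key, val in log
            if start_scn < scn <= end_scn]

# Row k1 is inserted at SCN 6 and deleted at SCN 8: a log scan reports both
# operations, while a snapshot diff between versions 5 and 10 reports nothing.
snapshots = {5: {}, 10: {}}
log = [(6, "insert", "k1", "v1"), (8, "delete", "k1", None)]

print(snapshot_changes(snapshots, 5, 10))  # []
print(log_changes(log, 5, 10))             # [('insert', 'k1', 'v1'), ('delete', 'k1', None)]
```

Both calls answer “changes from 5 to 10”, yet one is empty and the other is
not, which is exactly the kind of guarantee that stays connector-specific.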

It may be helpful to clarify in the proposal that this is a
connector-delegated CDC interface rather than a storage-agnostic CDC
mechanism. In that sense, the abstraction offers a unified CDC interface,
but not necessarily semantic portability across heterogeneous storage
engines. This distinction is subtle but important: unifying the surface
contract does not automatically unify the underlying change model.
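As a side note, the engine-level post-processing the SPIP describes
(deriving pre-image/post-image updates from raw insert/delete pairs) can be
sketched in a few lines of plain Python. This is purely illustrative, not
the proposed implementation, and the operation names are my own:

```python
# Illustrative sketch: pair a raw delete and insert for the same key within
# one commit into an update with pre-image and post-image rows.

def derive_updates(raw_changes):
    """raw_changes: list of (op, key, value) tuples from a single commit."""
    deletes = {key: val for op, key, val in raw_changes if op == "delete"}
    inserts = {key: val for op, key, val in raw_changes if op == "insert"}
    out = []
    for key in sorted(set(deletes) | set(inserts)):
        if key in deletes and key in inserts:
            # A delete+insert of the same key in one commit is an update.
            out.append(("update_preimage", key, deletes[key]))
            out.append(("update_postimage", key, inserts[key]))
        elif key in deletes:
            out.append(("delete", key, deletes[key]))
        else:
            out.append(("insert", key, inserts[key]))
    return out

raw = [("delete", "a", 1), ("insert", "a", 2), ("insert", "b", 9)]
print(derive_updates(raw))
# [('update_preimage', 'a', 1), ('update_postimage', 'a', 2), ('insert', 'b', 9)]
```

Pushing this kind of logic into Catalyst (as the SPIP proposes) is what
removes the duplicated effort across connectors, even if the change
guarantees themselves remain connector-defined.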

HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Mon, 2 Mar 2026 at 16:44, Anton Okolnychyi <[email protected]> wrote:

> Mich, the proposal already comes with built-in support for timestamp
> ranges and a generic meaning of version. In Delta, this is a log version
> (number). In Iceberg, this is going to be a snapshot ID (string). Each
> connector can treat the version differently.
>
> I had a chance to see this proposal before it landed and I think it will
> be a great addition to Spark. I like the approach with computing updates
> and deduplication using window functions but we will have to benchmark the
> performance. If it ends up slower than what external connectors like Delta
> and Iceberg do today, we will have to pivot. This will only be known once
> the implementation is done. That said, it has no impact on the proposed
> behavior and APIs.
>
> Great to see this.
>
> - Anton
>
> On Mon, 2 Mar 2026 at 03:54, Mich Talebzadeh <[email protected]>
> wrote:
>
>> one has to clarify that this is not an all-inclusive CDC
>>
>> So a realistic unified CDC interface should end up as one of:
>>
>>
>>    1. time-based: “changes between T1 and T2”
>>    2. token-based: “changes between (SCN/LSN)” (Oracle)
>>    3. format-version-based: “changes between snapshot/version IDs”
>>    (Delta/Iceberg/Hudi)
>>
>> this solution seems to aim for option 3 only
>>
>>
>>
>> Dr Mich Talebzadeh,
>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>> Analytics
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:
>>
>>> @Holden Karau <[email protected]> Thanks for taking a look! I have
>>> actually synced with a few Delta Lake and Iceberg committers offline, and
>>> they were comfortable with the proposed SQL syntax and API. Because this
>>> introduces a new SQL syntax, it won't affect the functionality of the
>>> existing connectors.
>>>
>>> Many of the active Delta and Iceberg developers are also on this mailing
>>> list, so I'm hoping we can gather most of the initial feedback right here
>>> in this thread. However, if we need deeper connector-specific alignment as
>>> the discussion evolves, I'm definitely open to cross-posting it to their
>>> respective lists.
>>>
>>> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> This looks cool overall, would it maybe make sense to share with the
>>>> delta lake devs & iceberg devs for their input too? I have not had a chance
>>>> to dig into this closely yet though.
>>>>
>>>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> It looks like my original email might have landed in some spam
>>>>> folders, so I am just bumping this thread for visibility.
>>>>>
>>>>> For quick reference, here are the links to the proposal again:
>>>>>
>>>>>    - *SPIP Document:*
>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>    - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>>>
>>>>> Looking forward to your thoughts and feedback!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gengliang
>>>>>
>>>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 (non binding)
>>>>>>
>>>>>> This is a great idea; I look forward to a standard user experience for
>>>>>> CDC across DSv2 data sources, and to centralizing the complicated shared
>>>>>> logic.
>>>>>>
>>>>>> Also, this somehow showed up in my Spam folder :), hope this brings it
>>>>>> back out.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>>>>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>>>>>
>>>>>>>    - SPIP Document:
>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>    - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> Currently, querying row-level changes (inserts, updates, deletes)
>>>>>>> from a table requires connector-specific syntax. This fragmentation 
>>>>>>> breaks
>>>>>>> query portability across different storage formats and forces each
>>>>>>> connector to reinvent complex post-processing logic:
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Delta Lake: Uses table_changes()
>>>>>>>    -
>>>>>>>
>>>>>>>    Iceberg: Uses .changes virtual tables
>>>>>>>    -
>>>>>>>
>>>>>>>    Hudi: Relies on custom incremental read options
>>>>>>>
>>>>>>> There is no universal, engine-level standard in Spark to ask "show
>>>>>>> me what changed."
>>>>>>> Proposal
>>>>>>>
>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>>> connectors.
>>>>>>>
>>>>>>> 1. Standardized User API
>>>>>>>
>>>>>>> SQL:
>>>>>>>
>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>
>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>
>>>>>>> -- Streaming: Continuously process changes
>>>>>>>
>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>
>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>
>>>>>>> DataFrame API:
>>>>>>>
>>>>>>> spark.read
>>>>>>>
>>>>>>>   .option("startingVersion", "10")
>>>>>>>
>>>>>>>   .option("endingVersion", "20")
>>>>>>>
>>>>>>>   .changes("my_table")
>>>>>>>
>>>>>>> 2. Engine-Level Post Processing Under the hood, this proposal
>>>>>>> introduces a minimal Changelog interface for DSv2 connectors.
>>>>>>> Spark's Catalyst optimizer will take over the CDC post-processing,
>>>>>>> including:
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Filtering out copy-on-write carry-over rows.
>>>>>>>    -
>>>>>>>
>>>>>>>    Deriving pre-image/post-image updates from raw insert/delete
>>>>>>>    pairs.
>>>>>>>    -
>>>>>>>
>>>>>>>    Computing net changes.
>>>>>>>
>>>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>>>>>> for users.
>>>>>>>
>>>>>>> Please review the full SPIP for comprehensive design details, the
>>>>>>> proposed connector API, and deduplication semantics.
>>>>>>>
>>>>>>> Feedback and discussion are highly appreciated!
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>
