Great point — I fully agree with the distinction. This proposal unifies the surface contract — the syntax, the API, and the post-processing framework — but intentionally delegates the underlying change model to each connector. I just added a clarification in the proposal to make this explicit. Thanks for the suggestion!
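To make the delegation concrete, here is a tiny illustrative sketch in Python. The class and method names are invented for this example (they are not the proposed DSv2 API); the point is only that the engine handles one generic version token while each connector decides what that token means for its own change model:

```python
# Hypothetical sketch, NOT the proposed DSv2 interface: the engine parses a
# generic version token and delegates its interpretation to the connector.
from abc import ABC, abstractmethod


class ChangelogConnector(ABC):
    """Unified surface contract: resolve a generic version token."""

    @abstractmethod
    def resolve_version(self, token: str):
        ...


class DeltaLikeConnector(ChangelogConnector):
    # Snapshot-based storage: the token is a numeric log version.
    def resolve_version(self, token: str):
        return ("log_version", int(token))


class OracleLikeConnector(ChangelogConnector):
    # Token/log-based change tracking: the token is an SCN, whose range
    # semantics (ordering, completeness, retention) the connector defines.
    def resolve_version(self, token: str):
        return ("scn", int(token))


def plan_changes(connector: ChangelogConnector, start: str, end: str):
    # The engine only sees opaque, connector-resolved boundaries.
    return (connector.resolve_version(start), connector.resolve_version(end))


print(plan_changes(DeltaLikeConnector(), "5", "10"))
print(plan_changes(OracleLikeConnector(), "5", "10"))
```

The same `CHANGES FROM VERSION 5 TO VERSION 10` query would thus plan differently per connector, which is exactly the "unified interface, connector-defined semantics" split discussed below.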
On Mon, Mar 2, 2026 at 10:47 AM Mich Talebzadeh <[email protected]> wrote:

> Hi,
>
> Thanks for your comments.
>
> I agree that allowing connectors to define the meaning of "version" makes
> the API extensible. My original concern was less about syntax and more
> about semantic portability. In Delta and Iceberg, version identifiers map
> naturally to snapshot-based storage. In traditional RDBMS systems such as
> Oracle (SCN), change tracking is token- or log-based and may not align
> cleanly with snapshot semantics.
>
> Let us take the example given:
>
> SELECT * FROM table CHANGES FROM VERSION 5 TO VERSION 10;
>
> Syntactically, that works everywhere. But semantically in Oracle (if
> mapped), it might mean rows whose SCN falls between two SCNs.
>
> So while the API is generic at the surface, the CDC guarantees (ordering,
> completeness, retention, idempotency) will ultimately remain
> connector-specific.
>
> It may be helpful to clarify in the proposal that this is a
> "connector-delegated CDC interface" rather than storage-agnostic CDC. In
> that sense, the abstraction offers a unified CDC interface, but not
> necessarily semantic portability across heterogeneous storage engines.
> This distinction is subtle but important: unifying the surface contract
> does not automatically unify the underlying change model.
>
> HTH
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Mon, 2 Mar 2026 at 16:44, Anton Okolnychyi <[email protected]> wrote:
>
>> Mich, the proposal already comes with built-in support for timestamp
>> ranges and a generic meaning of version. In Delta, this is a log version
>> (number). In Iceberg, this is going to be a snapshot ID (string). Each
>> connector can treat the version differently.
>>
>> I had a chance to see this proposal before it landed, and I think it
>> will be a great addition to Spark. I like the approach of computing
>> updates and deduplication using window functions, but we will have to
>> benchmark the performance. If it ends up slower than what external
>> connectors like Delta and Iceberg do today, we will have to pivot. This
>> will only be known once the implementation is done. That said, it has no
>> impact on the proposed behavior and APIs.
>>
>> Great to see this.
>>
>> - Anton
>>
>> On Mon, 2 Mar 2026 at 03:54, Mich Talebzadeh <[email protected]> wrote:
>>
>>> One has to clarify that this is not all-inclusive CDC.
>>>
>>> A realistic unified CDC interface should support change ranges expressed
>>> as one of:
>>>
>>> 1. time-based: "changes between T1 and T2"
>>> 2. token-based: "changes between two SCNs/LSNs" (Oracle)
>>> 3. format-version-based: "changes between snapshot/version IDs"
>>>    (Delta/Iceberg/Hudi)
>>>
>>> This solution seems to aim for 3 only.
>>>
>>> Dr Mich Talebzadeh,
>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>> Analytics
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:
>>>
>>>> @Holden Karau <[email protected]> Thanks for taking a look! I have
>>>> actually synced with a few Delta Lake and Iceberg committers offline,
>>>> and they were comfortable with the proposed SQL syntax and API. Because
>>>> this introduces new SQL syntax, it won't affect the functionality of
>>>> the existing connectors.
>>>>
>>>> Many of the active Delta and Iceberg developers are also on this
>>>> mailing list, so I'm hoping we can gather most of the initial feedback
>>>> right here in this thread.
>>>> However, if we need deeper connector-specific alignment as the
>>>> discussion evolves, I'm definitely open to cross-posting it to their
>>>> respective lists.
>>>>
>>>> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]> wrote:
>>>>
>>>>> This looks cool overall. Would it maybe make sense to share it with
>>>>> the Delta Lake devs & Iceberg devs for their input too? I have not had
>>>>> a chance to dig into this closely yet, though.
>>>>>
>>>>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> It looks like my original email might have landed in some spam
>>>>>> folders, so I am just bumping this thread for visibility.
>>>>>>
>>>>>> For quick reference, here are the links to the proposal again:
>>>>>>
>>>>>> - SPIP Document:
>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>> Looking forward to your thoughts and feedback!
>>>>>>
>>>>>> Thanks,
>>>>>> Gengliang
>>>>>>
>>>>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> This is a great idea. I look forward to a standard user experience
>>>>>>> for CDC for DSv2 data sources, and to centralizing the complicated
>>>>>>> shared logic.
>>>>>>>
>>>>>>> Also, this somehow showed up in my Spam folder :), so I hope this
>>>>>>> bump brings it out.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to open a discussion on a new SPIP to introduce Change
>>>>>>>> Data Capture (CDC) support to Apache Spark, targeting the Spark 4.2
>>>>>>>> release.
>>>>>>>>
>>>>>>>> - SPIP Document:
>>>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> Currently, querying row-level changes (inserts, updates, deletes)
>>>>>>>> from a table requires connector-specific syntax. This fragmentation
>>>>>>>> breaks query portability across different storage formats and
>>>>>>>> forces each connector to reinvent complex post-processing logic:
>>>>>>>>
>>>>>>>> - Delta Lake: uses table_changes()
>>>>>>>> - Iceberg: uses .changes virtual tables
>>>>>>>> - Hudi: relies on custom incremental read options
>>>>>>>>
>>>>>>>> There is no universal, engine-level standard in Spark to ask "show
>>>>>>>> me what changed."
>>>>>>>>
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across
>>>>>>>> DSv2 connectors.
>>>>>>>>
>>>>>>>> 1. Standardized User API
>>>>>>>>
>>>>>>>> SQL:
>>>>>>>>
>>>>>>>> -- Batch: what changed between version 10 and 20?
>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>
>>>>>>>> -- Streaming: continuously process changes
>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>
>>>>>>>> DataFrame API:
>>>>>>>>
>>>>>>>> spark.read
>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>   .changes("my_table")
>>>>>>>>
>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>
>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>>>> over the CDC post-processing, including:
>>>>>>>>
>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>>>>> - Computing net changes.
>>>>>>>>
>>>>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>>>>> duplicated effort across the ecosystem and ensuring consistent
>>>>>>>> semantics for users.
>>>>>>>>
>>>>>>>> Please review the full SPIP for comprehensive design details, the
>>>>>>>> proposed connector API, and deduplication semantics.
>>>>>>>>
>>>>>>>> Feedback and discussion are highly appreciated!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gengliang
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
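P.S. For anyone skimming the quoted proposal above: here is a rough, self-contained Python sketch of the engine-level post-processing the SPIP describes (filtering carry-over rows and deriving pre-image/post-image updates from raw insert/delete pairs). The row shape and the function name are invented for illustration; the actual proposal implements this with window functions in Catalyst rather than plain Python:

```python
# Illustrative sketch only, not the Catalyst implementation.
# Input rows: dicts with keys op ('insert'/'delete'), id, value, version.
from collections import defaultdict


def post_process(raw_rows):
    """Turn raw changelog rows into user-facing CDC rows."""
    out = []
    by_key = defaultdict(list)
    for row in raw_rows:
        # Group by (row identity, commit version), mimicking a window
        # partitioned on the key and the commit.
        by_key[(row["id"], row["version"])].append(row)

    for (rid, version), rows in sorted(by_key.items()):
        deletes = [r for r in rows if r["op"] == "delete"]
        inserts = [r for r in rows if r["op"] == "insert"]
        if deletes and inserts:
            d, i = deletes[0], inserts[0]
            if d["value"] == i["value"]:
                # Copy-on-write carry-over: the row was rewritten
                # unchanged, so it is not a real change.
                continue
            # A delete/insert pair for the same key in the same commit
            # becomes a pre-image/post-image update pair.
            out.append({"op": "update_preimage", "id": rid,
                        "value": d["value"], "version": version})
            out.append({"op": "update_postimage", "id": rid,
                        "value": i["value"], "version": version})
        else:
            out.extend(rows)  # plain inserts or deletes pass through
    return out
```

For example, a delete of `(1, "a")` plus an insert of `(1, "b")` in the same commit comes out as one update pair, while a delete/insert of identical values is dropped as a carry-over row.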
