One has to clarify that this is not all-inclusive CDC.

So a realistic unified interface for CDC should express its change boundaries as one of:


   1. time-based: “changes between T1 and T2”
   2. token-based: “changes between log positions” (e.g. Oracle SCN,
   Postgres/SQL Server LSN)
   3. format-version-based: “changes between snapshot/version IDs”
   (Delta/Iceberg/Hudi)

This proposal seems to target option 3 only.
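To make the distinction concrete, here is a hedged Python sketch of how a unified reader API could accept all three boundary kinds and normalize them into connector options. All names here (the boundary classes, `to_reader_options`, and the option keys other than `startingVersion`/`endingVersion` from the SPIP examples) are illustrative assumptions, not part of the proposal:

```python
# Hypothetical sketch: model the three CDC boundary kinds and map them
# onto DataFrame reader options. Names are illustrative, not from the SPIP.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class TimeBoundary:       # 1. time-based: "changes between T1 and T2"
    timestamp: str


@dataclass(frozen=True)
class TokenBoundary:      # 2. token-based: e.g. an Oracle SCN or a Postgres LSN
    token: str


@dataclass(frozen=True)
class VersionBoundary:    # 3. format-version-based: Delta/Iceberg/Hudi snapshot ID
    version: int


ChangeBoundary = Union[TimeBoundary, TokenBoundary, VersionBoundary]


def to_reader_options(start: ChangeBoundary,
                      end: Optional[ChangeBoundary] = None) -> dict:
    """Normalize a boundary pair into reader options (illustrative only)."""
    def opts(b: ChangeBoundary, side: str) -> dict:
        if isinstance(b, TimeBoundary):
            return {f"{side}Timestamp": b.timestamp}
        if isinstance(b, TokenBoundary):
            return {f"{side}Token": b.token}
        return {f"{side}Version": str(b.version)}

    options = opts(start, "starting")
    if end is not None:
        options.update(opts(end, "ending"))
    return options
```

For instance, `to_reader_options(VersionBoundary(10), VersionBoundary(20))` yields the `startingVersion`/`endingVersion` options the SPIP's DataFrame example uses; a time- or token-based source would plug into the same shape.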



Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:

> @Holden Karau <[email protected]> Thanks for taking a look! I have
> actually synced with a few Delta Lake and Iceberg committers offline, and
> they were comfortable with the proposed SQL syntax and API. Because this
> introduces a new SQL syntax, it won't affect the functionality of the
> existing connectors.
>
> Many of the active Delta and Iceberg developers are also on this mailing
> list, so I'm hoping we can gather most of the initial feedback right here
> in this thread. However, if we need deeper connector-specific alignment as
> the discussion evolves, I'm definitely open to cross-posting it to their
> respective lists.
>
> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]>
> wrote:
>
>> This looks cool overall. Would it maybe make sense to share with the
>> Delta Lake devs & Iceberg devs for their input too? I have not had a chance
>> to dig into this closely yet though.
>>
>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:
>>
>>> Hi Spark devs,
>>>
>>> It looks like my original email might have landed in some spam folders,
>>> so I am just bumping this thread for visibility.
>>>
>>> For quick reference, here are the links to the proposal again:
>>>
>>>    - *SPIP Document:*
>>>      https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>    - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>
>>> Looking forward to your thoughts and feedback!
>>>
>>> Thanks,
>>>
>>> Gengliang
>>>
>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> +1 (non binding)
>>>>
>>>> This is a great idea; I look forward to a standard user experience for
>>>> CDC for DSv2 data sources, and to centralizing the complicated shared
>>>> logic.
>>>>
>>>> Also, this somehow showed up in my Spam folder :), hope this bump
>>>> brings it out.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>>>
>>>>>    - SPIP Document:
>>>>>      https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>    - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>
>>>>> Motivation
>>>>>
>>>>> Currently, querying row-level changes (inserts, updates, deletes) from
>>>>> a table requires connector-specific syntax. This fragmentation breaks
>>>>> query portability across different storage formats and forces each
>>>>> connector to reinvent complex post-processing logic:
>>>>>
>>>>>    - Delta Lake: Uses table_changes()
>>>>>    - Iceberg: Uses .changes virtual tables
>>>>>    - Hudi: Relies on custom incremental read options
>>>>>
>>>>> There is no universal, engine-level standard in Spark to ask "show me
>>>>> what changed."
>>>>>
>>>>> Proposal
>>>>>
>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>> connectors.
>>>>>
>>>>> 1. Standardized User API
>>>>>
>>>>> SQL:
>>>>>
>>>>> -- Batch: What changed between version 10 and 20?
>>>>>
>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>
>>>>> -- Streaming: Continuously process changes
>>>>>
>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>
>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>
>>>>> DataFrame API:
>>>>>
>>>>> spark.read
>>>>>
>>>>>   .option("startingVersion", "10")
>>>>>
>>>>>   .option("endingVersion", "20")
>>>>>
>>>>>   .changes("my_table")
>>>>>
>>>>> 2. Engine-Level Post-Processing
>>>>>
>>>>> Under the hood, this proposal introduces a minimal Changelog interface
>>>>> for DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>>>>> post-processing, including:
>>>>>
>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>    - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>>    - Computing net changes.
>>>>>
>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>>>> for users.
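The carry-over filtering and pre/post-image derivation that the quoted proposal assigns to Catalyst can be sketched in plain Python. This is a hedged illustration under an assumed changelog shape (rows as `(change_type, key, value)` tuples), not the actual SPIP implementation:

```python
# Hedged sketch of the CDC post-processing described above: given raw
# changelog rows keyed by primary key, drop copy-on-write carry-over rows
# (a delete+insert pair with identical values) and pair the remaining
# delete/insert rows for a key into update pre-/post-images.

def derive_updates(raw_changes):
    """raw_changes: list of (change_type, key, value), change_type in
    {'insert', 'delete'}. Returns the refined change rows."""
    deletes, inserts = {}, {}
    for change_type, key, value in raw_changes:
        (deletes if change_type == "delete" else inserts)[key] = value

    refined = []
    for key in sorted(set(deletes) | set(inserts)):
        if key in deletes and key in inserts:
            if deletes[key] == inserts[key]:
                continue  # carry-over row rewritten unchanged: filter it out
            refined.append(("update_preimage", key, deletes[key]))
            refined.append(("update_postimage", key, inserts[key]))
        elif key in deletes:
            refined.append(("delete", key, deletes[key]))
        else:
            refined.append(("insert", key, inserts[key]))
    return refined
```

So a raw delete/insert pair `("delete", 1, "a"), ("insert", 1, "b")` becomes an update pre-image/post-image pair, while a pair with identical values is filtered out as a copy-on-write carry-over. The real engine would do this relationally over DataFrames rather than per-row in a dict, but the semantics are the same.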
>>>>>
>>>>> Please review the full SPIP for comprehensive design details, the
>>>>> proposed connector API, and deduplication semantics.
>>>>>
>>>>> Feedback and discussion are highly appreciated!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gengliang
>>>>>
>>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>
