One has to clarify that this is not all-inclusive CDC. So a realistic unified interface for CDC should end up as one of:
1. Time-based: “changes between T1 and T2”
2. Token-based: “changes between SCN/LSN values” (Oracle)
3. Format-version-based: “changes between snapshot/version IDs” (Delta/Iceberg/Hudi)

This solution seems to aim for (3) only.

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:

> @Holden Karau <[email protected]> Thanks for taking a look! I have
> actually synced with a few Delta Lake and Iceberg committers offline, and
> they were comfortable with the proposed SQL syntax and API. Because this
> introduces a new SQL syntax, it won't affect the functionality of the
> existing connectors.
>
> Many of the active Delta and Iceberg developers are also on this mailing
> list, so I'm hoping we can gather most of the initial feedback right here
> in this thread. However, if we need deeper connector-specific alignment as
> the discussion evolves, I'm definitely open to cross-posting it to their
> respective lists.
>
> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]> wrote:
>
>> This looks cool overall, would it maybe make sense to share with the
>> delta lake devs & iceberg devs for their input too? I have not had a
>> chance to dig into this closely yet though.
>>
>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:
>>
>>> Hi Spark devs,
>>>
>>> It looks like my original email might have landed in some spam folders,
>>> so I am just bumping this thread for visibility.
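To ground the taxonomy above: the three range styles could in principle share one interface if the boundary itself is parameterized by kind. The sketch below is purely hypothetical; `ChangeBoundary` and `ChangeRange` are invented names for illustration and are not part of the SPIP.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union

# Hypothetical sketch only: one boundary type expressing all three CDC range
# styles (wall-clock time, a log token such as Oracle's SCN or an LSN, or a
# table-format snapshot/version ID). None of these names come from the SPIP.

@dataclass(frozen=True)
class ChangeBoundary:
    kind: str                             # "timestamp" | "token" | "version"
    value: Union[datetime, str, int]

@dataclass(frozen=True)
class ChangeRange:
    start: ChangeBoundary
    end: Optional[ChangeBoundary] = None  # None means "up to latest"

# Style 1: time-based ("changes between T1 and T2")
by_time = ChangeRange(ChangeBoundary("timestamp", datetime(2026, 2, 1)),
                      ChangeBoundary("timestamp", datetime(2026, 2, 27)))

# Style 2: token-based ("changes since this SCN/LSN")
by_token = ChangeRange(ChangeBoundary("token", "SCN:1234567"))

# Style 3: format-version-based, the style the SPIP appears to target
by_version = ChangeRange(ChangeBoundary("version", 10),
                         ChangeBoundary("version", 20))

print(by_version.start.value, by_version.end.value)  # 10 20
```

A design like this would let a connector advertise which boundary kinds it supports, rather than baking one kind into the syntax.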
>>>
>>> For quick reference, here are the links to the proposal again:
>>>
>>> - *SPIP Document:*
>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>> - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>
>>> Looking forward to your thoughts and feedback!
>>>
>>> Thanks,
>>>
>>> Gengliang
>>>
>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> This is a great idea; I look forward to a standard user experience for
>>>> CDC for DSv2 data sources, and to centralizing the complicated shared
>>>> logic.
>>>>
>>>> Also, this somehow showed up in my Spam folder :) hope this brings it
>>>> out.
>>>>
>>>> Thanks,
>>>> Szehon
>>>>
>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>>>
>>>>> - SPIP Document:
>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>
>>>>> Motivation
>>>>>
>>>>> Currently, querying row-level changes (inserts, updates, deletes) from
>>>>> a table requires connector-specific syntax. This fragmentation breaks
>>>>> query portability across different storage formats and forces each
>>>>> connector to reinvent complex post-processing logic:
>>>>>
>>>>> - Delta Lake: uses table_changes()
>>>>> - Iceberg: uses .changes virtual tables
>>>>> - Hudi: relies on custom incremental read options
>>>>>
>>>>> There is no universal, engine-level standard in Spark to ask "show me
>>>>> what changed."
>>>>> Proposal
>>>>>
>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>> connectors.
>>>>>
>>>>> 1. Standardized User API
>>>>>
>>>>> SQL:
>>>>>
>>>>> -- Batch: what changed between version 10 and 20?
>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>
>>>>> -- Streaming: continuously process changes
>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>
>>>>> DataFrame API:
>>>>>
>>>>> spark.read
>>>>>   .option("startingVersion", "10")
>>>>>   .option("endingVersion", "20")
>>>>>   .changes("my_table")
>>>>>
>>>>> 2. Engine-Level Post-Processing
>>>>>
>>>>> Under the hood, this proposal introduces a minimal Changelog interface
>>>>> for DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>>>>> post-processing, including:
>>>>>
>>>>> - Filtering out copy-on-write carry-over rows.
>>>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>> - Computing net changes.
>>>>>
>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>> duplicated effort across the ecosystem and ensuring consistent
>>>>> semantics for users.
>>>>>
>>>>> Please review the full SPIP for comprehensive design details, the
>>>>> proposed connector API, and deduplication semantics.
>>>>>
>>>>> Feedback and discussion are highly appreciated!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gengliang
>>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
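The three engine-level post-processing steps named in the proposal can be simulated without Spark. Below is a minimal Python sketch, assuming a record shape of (`op`, `version`, `key`, `value`) and simplified pairing rules; these are illustrative assumptions, not the SPIP's actual Changelog interface or deduplication semantics (see the linked document for those).

```python
# Spark-free simulation of the three engine-side steps the SPIP describes.
# The record shape (op/version/key/value) and the pairing rules below are
# illustrative assumptions, not the actual DSv2 Changelog contract.

def drop_carryover(rows):
    """Step 1: drop copy-on-write carry-over pairs: a delete plus an insert
    of an identical value for the same key in the same commit means the row
    was only rewritten into a new file, not actually changed."""
    keep, skip = [], set()
    for i, r in enumerate(rows):
        if i in skip:
            continue
        if r["op"] == "delete":
            for j in range(i + 1, len(rows)):
                s = rows[j]
                if (j not in skip and s["op"] == "insert"
                        and (s["version"], s["key"], s["value"])
                        == (r["version"], r["key"], r["value"])):
                    skip.update((i, j))
                    break
        if i not in skip:
            keep.append(r)
    return keep

def derive_updates(rows):
    """Step 2: a delete plus an insert for the same key in the same commit
    (with a different value) is really an update; emit pre-/post-image rows."""
    out, used = [], set()
    for i, r in enumerate(rows):
        if i in used:
            continue
        if r["op"] == "delete":
            for j in range(i + 1, len(rows)):
                s = rows[j]
                if (j not in used and s["op"] == "insert"
                        and (s["version"], s["key"]) == (r["version"], r["key"])):
                    out.append({**r, "op": "update_preimage"})
                    out.append({**s, "op": "update_postimage"})
                    used.update((i, j))
                    break
        if i not in used:
            out.append(r)
    return out

def net_changes(rows):
    """Step 3: collapse everything in the version range to the final outcome
    per key (a row inserted then deleted inside the range nets to nothing)."""
    first_op, last = {}, {}
    for r in sorted(rows, key=lambda r: r["version"]):
        first_op.setdefault(r["key"], r["op"])
        last[r["key"]] = r
    net = []
    for key, r in last.items():
        created_in_range = first_op[key] == "insert"
        if r["op"] in ("insert", "update_postimage"):
            net.append({"key": key, "value": r["value"],
                        "op": "insert" if created_in_range else "update_postimage"})
        elif r["op"] == "delete" and not created_in_range:
            net.append({"key": key, "value": r["value"], "op": "delete"})
    return net

raw = [  # delete rows are assumed to precede their paired inserts
    {"op": "delete", "version": 11, "key": "k1", "value": 1},   # carry-over
    {"op": "insert", "version": 11, "key": "k1", "value": 1},   # carry-over
    {"op": "delete", "version": 11, "key": "k2", "value": 2},   # update pre
    {"op": "insert", "version": 11, "key": "k2", "value": 20},  # update post
    {"op": "insert", "version": 12, "key": "k3", "value": 3},   # short-lived row
    {"op": "delete", "version": 13, "key": "k3", "value": 3},   # short-lived row
]
changes = derive_updates(drop_carryover(raw))
print([r["op"] for r in changes])
# -> ['update_preimage', 'update_postimage', 'insert', 'delete']
print(net_changes(changes))
# -> [{'key': 'k2', 'value': 20, 'op': 'update_postimage'}]
```

The example shows why ordering matters: the carry-over pair for k1 must be removed before update derivation, otherwise it would be misread as an update; and k3, created and deleted within the queried range, contributes nothing to the net view.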
