Great point — I fully agree with the distinction. This proposal unifies the surface contract — the syntax, the API, and the post-processing framework — but intentionally delegates the underlying change model to each connector. I just added a clarification in the proposal to make this explicit. Thanks for the suggestion!
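To make the delegation concrete, here is a tiny illustrative sketch in Python. The class and method names are invented for this example (they are not the proposed DSv2 API); the point is only that the engine handles one generic version token while each connector decides what that token means for its own change model:

```python
# Hypothetical sketch, NOT the proposed DSv2 interface: the engine parses a
# generic version token and delegates its interpretation to the connector.
from abc import ABC, abstractmethod


class ChangelogConnector(ABC):
    """Unified surface contract: resolve a generic version token."""

    @abstractmethod
    def resolve_version(self, token: str):
        ...


class DeltaLikeConnector(ChangelogConnector):
    # Snapshot-based storage: the token is a numeric log version.
    def resolve_version(self, token: str):
        return ("log_version", int(token))


class OracleLikeConnector(ChangelogConnector):
    # Token/log-based change tracking: the token is an SCN, whose range
    # semantics (ordering, completeness, retention) the connector defines.
    def resolve_version(self, token: str):
        return ("scn", int(token))


def plan_changes(connector: ChangelogConnector, start: str, end: str):
    # The engine only sees opaque, connector-resolved boundaries.
    return (connector.resolve_version(start), connector.resolve_version(end))


print(plan_changes(DeltaLikeConnector(), "5", "10"))
print(plan_changes(OracleLikeConnector(), "5", "10"))
```

The same `CHANGES FROM VERSION 5 TO VERSION 10` query would thus plan differently per connector, which is exactly the "unified interface, connector-defined semantics" split discussed below.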
On Mon, Mar 2, 2026 at 10:47 AM Mich Talebzadeh <[email protected]> wrote:

> Hi,
>
> Thanks for your comments.
>
> I agree that allowing connectors to define the meaning of "version" makes
> the API extensible. My original concern was less about syntax and more
> about semantic portability. In Delta and Iceberg, version identifiers map
> naturally to snapshot-based storage. In traditional RDBMS systems such as
> Oracle (SCN), change tracking is token- or log-based and may not align
> cleanly with snapshot semantics.
>
> Let us take the example given:
>
> SELECT * FROM table CHANGES FROM VERSION 5 TO VERSION 10;
>
> Syntactically, that works everywhere. But semantically in Oracle (if
> mapped), it might mean rows whose SCN falls between two SCNs.
>
> So while the API is generic at the surface, the CDC guarantees (ordering,
> completeness, retention, idempotency) will ultimately remain
> connector-specific.
>
> It may be helpful to clarify in the proposal that this is a
> "connector-delegated CDC interface" rather than storage-agnostic CDC. In
> that sense, the abstraction offers a unified CDC interface, but not
> necessarily semantic portability across heterogeneous storage engines.
> This distinction is subtle but important: unifying the surface contract
> does not automatically unify the underlying change model.
>
> HTH
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Mon, 2 Mar 2026 at 16:44, Anton Okolnychyi <[email protected]> wrote:
>
>> Mich, the proposal already comes with built-in support for timestamp
>> ranges and a generic meaning of version. In Delta, this is a log version
>> (number). In Iceberg, this is going to be a snapshot ID (string). Each
>> connector can treat the version differently.
>>
>> I had a chance to see this proposal before it landed, and I think it
>> will be a great addition to Spark. I like the approach of computing
>> updates and deduplication using window functions, but we will have to
>> benchmark the performance. If it ends up slower than what external
>> connectors like Delta and Iceberg do today, we will have to pivot. This
>> will only be known once the implementation is done. That said, it has no
>> impact on the proposed behavior and APIs.
>>
>> Great to see this.
>>
>> - Anton
>>
>> On Mon, 2 Mar 2026 at 03:54, Mich Talebzadeh <[email protected]> wrote:
>>
>>> One has to clarify that this is not all-inclusive CDC.
>>>
>>> A realistic unified CDC interface should support change ranges expressed
>>> as one of:
>>>
>>> 1. time-based: "changes between T1 and T2"
>>> 2. token-based: "changes between two SCNs/LSNs" (Oracle)
>>> 3. format-version-based: "changes between snapshot/version IDs"
>>>    (Delta/Iceberg/Hudi)
>>>
>>> This solution seems to aim for 3 only.
>>>
>>> Dr Mich Talebzadeh,
>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>> Analytics
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Fri, 27 Feb 2026 at 23:34, Gengliang Wang <[email protected]> wrote:
>>>
>>>> @Holden Karau <[email protected]> Thanks for taking a look! I have
>>>> actually synced with a few Delta Lake and Iceberg committers offline,
>>>> and they were comfortable with the proposed SQL syntax and API. Because
>>>> this introduces new SQL syntax, it won't affect the functionality of
>>>> the existing connectors.
>>>>
>>>> Many of the active Delta and Iceberg developers are also on this
>>>> mailing list, so I'm hoping we can gather most of the initial feedback
>>>> right here in this thread.
>>>> However, if we need deeper connector-specific alignment as the
>>>> discussion evolves, I'm definitely open to cross-posting it to their
>>>> respective lists.
>>>>
>>>> On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]> wrote:
>>>>
>>>>> This looks cool overall. Would it maybe make sense to share it with
>>>>> the Delta Lake devs & Iceberg devs for their input too? I have not had
>>>>> a chance to dig into this closely yet, though.
>>>>>
>>>>> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> It looks like my original email might have landed in some spam
>>>>>> folders, so I am just bumping this thread for visibility.
>>>>>>
>>>>>> For quick reference, here are the links to the proposal again:
>>>>>>
>>>>>> - SPIP Document:
>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>> Looking forward to your thoughts and feedback!
>>>>>>
>>>>>> Thanks,
>>>>>> Gengliang
>>>>>>
>>>>>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> This is a great idea. I look forward to a standard user experience
>>>>>>> for CDC for DSv2 data sources, and to centralizing the complicated
>>>>>>> shared logic.
>>>>>>>
>>>>>>> Also, this somehow showed up in my Spam folder :), so I hope this
>>>>>>> bump brings it out.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to open a discussion on a new SPIP to introduce Change
>>>>>>>> Data Capture (CDC) support to Apache Spark, targeting the Spark 4.2
>>>>>>>> release.
>>>>>>>>
>>>>>>>> - SPIP Document:
>>>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> Currently, querying row-level changes (inserts, updates, deletes)
>>>>>>>> from a table requires connector-specific syntax. This fragmentation
>>>>>>>> breaks query portability across different storage formats and
>>>>>>>> forces each connector to reinvent complex post-processing logic:
>>>>>>>>
>>>>>>>> - Delta Lake: uses table_changes()
>>>>>>>> - Iceberg: uses .changes virtual tables
>>>>>>>> - Hudi: relies on custom incremental read options
>>>>>>>>
>>>>>>>> There is no universal, engine-level standard in Spark to ask "show
>>>>>>>> me what changed."
>>>>>>>>
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across
>>>>>>>> DSv2 connectors.
>>>>>>>>
>>>>>>>> 1. Standardized User API
>>>>>>>>
>>>>>>>> SQL:
>>>>>>>>
>>>>>>>> -- Batch: what changed between version 10 and 20?
>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>
>>>>>>>> -- Streaming: continuously process changes
>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>
>>>>>>>> DataFrame API:
>>>>>>>>
>>>>>>>> spark.read
>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>   .changes("my_table")
>>>>>>>>
>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>
>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>>>> over the CDC post-processing, including:
>>>>>>>>
>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>>>>> - Computing net changes.
>>>>>>>>
>>>>>>>> This pushes complexity into the engine where it belongs, reducing
>>>>>>>> duplicated effort across the ecosystem and ensuring consistent
>>>>>>>> semantics for users.
>>>>>>>>
>>>>>>>> Please review the full SPIP for comprehensive design details, the
>>>>>>>> proposed connector API, and deduplication semantics.
>>>>>>>>
>>>>>>>> Feedback and discussion are highly appreciated!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gengliang
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
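P.S. For anyone skimming the quoted proposal above: here is a rough, self-contained Python sketch of the engine-level post-processing the SPIP describes (filtering carry-over rows and deriving pre-image/post-image updates from raw insert/delete pairs). The row shape and the function name are invented for illustration; the actual proposal implements this with window functions in Catalyst rather than plain Python:

```python
# Illustrative sketch only, not the Catalyst implementation.
# Input rows: dicts with keys op ('insert'/'delete'), id, value, version.
from collections import defaultdict


def post_process(raw_rows):
    """Turn raw changelog rows into user-facing CDC rows."""
    out = []
    by_key = defaultdict(list)
    for row in raw_rows:
        # Group by (row identity, commit version), mimicking a window
        # partitioned on the key and the commit.
        by_key[(row["id"], row["version"])].append(row)

    for (rid, version), rows in sorted(by_key.items()):
        deletes = [r for r in rows if r["op"] == "delete"]
        inserts = [r for r in rows if r["op"] == "insert"]
        if deletes and inserts:
            d, i = deletes[0], inserts[0]
            if d["value"] == i["value"]:
                # Copy-on-write carry-over: the row was rewritten
                # unchanged, so it is not a real change.
                continue
            # A delete/insert pair for the same key in the same commit
            # becomes a pre-image/post-image update pair.
            out.append({"op": "update_preimage", "id": rid,
                        "value": d["value"], "version": version})
            out.append({"op": "update_postimage", "id": rid,
                        "value": i["value"], "version": version})
        else:
            out.extend(rows)  # plain inserts or deletes pass through
    return out
```

For example, a delete of `(1, "a")` plus an insert of `(1, "b")` in the same commit comes out as one update pair, while a delete/insert of identical values is dropped as a carry-over row.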
