+1, really looking forward to this feature.

On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:

> Hi everyone,
>
> Sorry for the late response, I know the vote is actively underway, but
> reviewing the SPIP's Catalyst post-processing mechanics raised a few
> systemic design concerns we need to clarify to avoid severe performance
> regressions down the line.
>
> 1. Capability Pushdown: The proposal has Catalyst deriving pre/post-images
> from raw insert/delete pairs. Storage layers like Delta Lake already
> materialize these natively. If the Changelog interface lacks state
> pushdown, Catalyst will burn CPU and memory reconstructing what the storage
> layer already solved.
>
> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over" rows for
> CoW tables is highly problematic. Without strict connector-level row
> lineage, we will be dragging massive, unmodified Parquet files across the
> network, forcing Spark into heavy distributed joins just to discard
> unchanged data.
>
> 3. Audit Fidelity: The design explicitly targets computing "net changes."
> Collapsing intermediate states breaks enterprise audit and compliance
> workflows that require full transactional history. The SQL grammar needs an
> explicit ALL CHANGES execution path.
>
> I fully support unifying CDC and this SPIP is the right direction, but
> abstracting it at the cost of storage-native optimizations and audit
> fidelity is a dangerous trade-off. We need to clarify how physical planning
> will handle these bottlenecks before formally ratifying the proposal.
>
> Regards,
> Viquar Khan
>
> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cheng Pan
>>
>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks for the contribution!
>>
>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>
>>> +1!
>>>
>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> +1, look forward to it (non binding)
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>>> Analytics
>>>>>>
>>>>>> view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture (CDC)
>>>>>>> Support*
>>>>>>>
>>>>>>> *Summary:*
>>>>>>>
>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>>> connectors.
>>>>>>>
>>>>>>> 1. Standardized User API
>>>>>>>
>>>>>>> SQL:
>>>>>>>
>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>
>>>>>>> -- Streaming: Continuously process changes
>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>
>>>>>>> DataFrame API:
>>>>>>>
>>>>>>> spark.read
>>>>>>>   .option("startingVersion", "10")
>>>>>>>   .option("endingVersion", "20")
>>>>>>>   .changes("my_table")
>>>>>>>
>>>>>>> 2. Engine-Level Post Processing
>>>>>>>
>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>>> over the CDC post-processing, including:
>>>>>>>
>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>> pairs.
>>>>>>> - Computing net changes.
>>>>>>>
>>>>>>> *Relevant Links:*
>>>>>>>
>>>>>>> - *SPIP Doc:*
>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>> - *Discuss Thread:*
>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>> - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>
>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>
>>>>>>> [ ] +0
>>>>>>>
>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gengliang Wang
>>>>>>>
>>>>>>
>>
>> --
>> John Zhuge
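
For anyone following the thread, here is a rough sketch of what the three post-processing steps named in the proposal (carry-over filtering, pre/post-image derivation, net changes) could mean on a handful of in-memory rows. This is plain Python, not Spark code, and it is not taken from the SPIP doc: the `Change` record, the `post_process` and `net_changes` helpers, and the sample data are all hypothetical. It assumes every row carries a stable `key`, and that a key sees at most one delete/insert pair per version.

```python
from collections import defaultdict
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical raw change row; not the SPIP's Changelog schema."""
    version: int
    kind: str   # "insert" or "delete" in the raw feed
    key: str    # assumed stable row identity
    value: int


def post_process(raw):
    """Drop carry-over pairs and relabel real updates, per (version, key)."""
    groups = defaultdict(lambda: {"insert": [], "delete": []})
    for c in raw:
        groups[(c.version, c.key)][c.kind].append(c)

    out = []
    for version, key in sorted(groups):
        deletes = groups[(version, key)]["delete"]
        inserts = groups[(version, key)]["insert"]
        if len(deletes) == 1 and len(inserts) == 1:
            if deletes[0].value == inserts[0].value:
                continue  # identical delete/insert pair: treated as carry-over
            out.append(Change(version, "update_preimage", key, deletes[0].value))
            out.append(Change(version, "update_postimage", key, inserts[0].value))
        else:
            out.extend(deletes + inserts)
    return out


def net_changes(changes):
    """Collapse version-ordered changes to one net effect per key."""
    before, after = {}, {}  # None means the row does not exist
    for c in changes:
        if c.key not in before:
            before[c.key] = None if c.kind == "insert" else c.value
        after[c.key] = None if c.kind == "delete" else c.value

    out = []
    for key, old in before.items():
        new = after[key]
        if old == new:
            continue  # no net effect over the range
        if old is None:
            out.append(("insert", key, new))
        elif new is None:
            out.append(("delete", key, old))
        else:
            out.append(("update_preimage", key, old))
            out.append(("update_postimage", key, new))
    return out


raw = [
    Change(1, "insert", "a", 10),
    Change(2, "delete", "a", 10), Change(2, "insert", "a", 11),  # real update
    Change(2, "delete", "b", 5), Change(2, "insert", "b", 5),    # carry-over
    Change(3, "insert", "c", 7),
    Change(4, "delete", "c", 7),
]
processed = post_process(raw)
net = net_changes(processed)
```

On this sample, `processed` keeps five rows (the pair for key "b" is dropped and the pair for key "a" becomes an update), while `net` collapses to a single insert for key "a". The intermediate update of "a" and the short-lived row "c" no longer appear in `net`, which is the distinction the audit-fidelity point in the quoted reply is about.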
