Thanks for updating the SPIP (B.2 and A.1). +1
Regards,
Viquar Khan

On Wed, 4 Mar 2026 at 14:03, Gengliang Wang <[email protected]> wrote:

> Sure, I've updated the SPIP doc with the semantic guarantee note in
> Appendix B.2 and expanded the deduplicationMode descriptions in A.1 to
> clarify all three modes.
> Regarding the warning for incomplete change logs — that is
> connector-specific behavior, so it's best left to each connector's
> implementation rather than prescribed at the Spark level.
>
> On Wed, Mar 4, 2026 at 11:06 AM vaquar khan <[email protected]> wrote:
>
>> Thanks Gengliang for the continued engagement, and thank you Anton for
>> the important clarification on containsCarryoverRows().
>>
>> 1. Capability Naming — Documentation Clarification Ask
>> I respect the DSv2 naming consistency. My ask is narrower: please add a
>> note in Appendix B.2 explicitly stating that returning false carries the
>> semantic guarantee that pre/post-images are fully materialized by the
>> connector, not lazily computed at scan time. No new method is needed,
>> just documentation clarity to prevent incorrect connector
>> implementations.
>>
>> 2. CoW I/O — Fully Withdrawn
>> Anton's clarification changes my position here entirely. If TableCatalog
>> loads Changelog with awareness of the specific range being scanned, and
>> the connector can inspect the actual commit history for that range to
>> determine whether CoW operations occurred, then containsCarryoverRows()
>> is effectively a range-scoped, commit-aware signal, not a coarse
>> table-level binary. That fully addresses my concern. I'm withdrawing
>> Item 2 entirely, not just as a blocker.
>>
>> 3. Audit Discoverability — Revised Ask
>> You're right that ALL CHANGES could mislead compliance engineers when
>> change logs are partially vacuumed. I withdraw that request.
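[For readers following the thread, here is a minimal Python sketch of the "pre/post-image derivation" under discussion, i.e. the work Catalyst would do when a connector encodes updates as raw delete/insert pairs, and which is skipped when representsUpdateAsDeleteAndInsert() returns false. This is illustrative only, not Spark's or the SPIP's implementation; the `_change_type` column and the `update_preimage`/`update_postimage` labels follow Delta Lake's CDF convention and are assumptions here.]

```python
# Illustrative sketch only. Assumes at most one delete and one insert per
# key within the processed batch; column names are not the SPIP's schema.
def derive_update_images(rows, key):
    """Pair a delete and an insert sharing the same key into
    update_preimage/update_postimage rows; pass other changes through."""
    deletes = {r[key]: r for r in rows if r["_change_type"] == "delete"}
    inserts = {r[key]: r for r in rows if r["_change_type"] == "insert"}
    out = []
    for k in deletes.keys() & inserts.keys():   # delete+insert => update
        out.append({**deletes[k], "_change_type": "update_preimage"})
        out.append({**inserts[k], "_change_type": "update_postimage"})
    for k in deletes.keys() - inserts.keys():   # plain deletes
        out.append(deletes[k])
    for k in inserts.keys() - deletes.keys():   # plain inserts
        out.append(inserts[k])
    return out

changes = [
    {"id": 1, "val": "a", "_change_type": "delete"},
    {"id": 1, "val": "b", "_change_type": "insert"},
    {"id": 2, "val": "c", "_change_type": "insert"},
]
result = derive_update_images(changes, "id")
```

[A connector that already materializes these images natively would return rows in the paired form directly, so Spark has nothing to derive.]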
>> My revised ask:
>> - Clearly state in the SQL documentation that deduplicationMode='none'
>> is the right way to get a full audit trail
>> - Show a warning if a user queries a table where some of the old change
>> logs have already been deleted
>>
>> With items 1 and 3 addressed in the SPIP text, count my +1.
>>
>> Regards,
>> Viquar Khan
>>
>> On Wed, 4 Mar 2026 at 12:35, Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> To add to Gengliang's point, TableCatalog would load Changelog
>>> knowing the range that is being scanned. This allows the connector to
>>> traverse the commit history and detect whether it had any CoW operation
>>> or not. In other words, it is not a blind flag at the table level. It is
>>> specific to the changelog range that is being requested.
>>>
>>> On Wed, 4 Mar 2026 at 09:17, Gengliang Wang <[email protected]> wrote:
>>>
>>>> Thanks for the follow-up — appreciate the rigor.
>>>>
>>>> *1.* *Capability Naming*: The naming is intentional —
>>>> representsUpdateAsDeleteAndInsert() mirrors the existing
>>>> SupportsDelta.representUpdateAsDeleteAndInsert()
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsDelta.java#L45>
>>>> in the DSv2 API. When it returns false, it means the connector's change
>>>> data already distinguishes updates from raw delete/insert pairs, so
>>>> there is nothing for Catalyst to derive.
>>>>
>>>> *2.* *Partition-Level CoW Hints*: A table-level flag is sufficient for
>>>> the common case. If a connector has partitions with mixed CoW behavior
>>>> and needs finer-grained control, it can simply return
>>>> containsCarryoverRows() = false and handle carry-over removal
>>>> internally within its ScanBuilder — the interface already supports
>>>> this. There is no need to complicate the Spark-level API for an edge
>>>> case that connectors can solve themselves.
>>>>
>>>> *3. Audit Discoverability*: The SPIP proposes only two options in the
>>>> WITH clause (deduplicationMode and computeUpdates) — this is a small,
>>>> well-documented surface, not a hidden knob. Adding an ALL CHANGES
>>>> grammar modifier introduces its own discoverability problem: it implies
>>>> the table retains a complete history of all changes, which is not
>>>> guaranteed — most formats discard old change data after
>>>> vacuum/expiration. A SQL keyword that suggests completeness but
>>>> silently returns partial results is arguably worse for compliance
>>>> engineers than an explicit option with clear documentation.
>>>>
>>>> On Tue, Mar 3, 2026 at 11:17 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Gengliang for the detailed follow-up. While the mechanics you
>>>>> laid out make sense on paper, I'm looking at how this will actually
>>>>> play out in production.
>>>>>
>>>>> 1. Capability Pushdown vs. Format Flag
>>>>> Returning representsUpdateAsDeleteAndInsert() = false just signals
>>>>> that the connector doesn't use raw delete/insert pairs. It doesn't
>>>>> explicitly tell Catalyst, "I already computed the pre/post images
>>>>> natively, trust my output and skip the window function entirely."
>>>>> Those are semantically different. A dedicated
>>>>> supportsNativePrePostImages() capability method would close this gap
>>>>> much more cleanly than overloading the format flag.
>>>>>
>>>>> 2. CoW I/O is a Table-Level Binary
>>>>> The ScanBuilder delegation is a fair point, but
>>>>> containsCarryoverRows() is still a table-level binary flag. For
>>>>> massive, partitioned CoW tables that have carry-overs in some
>>>>> partitions but not others, this interface forces Spark to apply
>>>>> carry-over removal globally or not at all. A partition-level or
>>>>> scan-level hint is a necessary improvement for mixed-mode CoW tables.
>>>>>
>>>>> 3. Audit Discoverability
>>>>> I agree deduplicationMode='none' is functionally correct, but my
>>>>> concern is discoverability. A compliance engineer or DBA writing SQL
>>>>> shouldn't need institutional knowledge of a hidden WITH clause option
>>>>> string to get audit-safe output. Having an explicit ALL CHANGES
>>>>> modifier in the grammar is crucial for enterprise adoption and
>>>>> auditing.
>>>>>
>>>>> I am highly supportive of the core architecture, but these are real
>>>>> production concerns for enterprise workloads. Items 1 and 3 are
>>>>> blockers I'd like addressed in the SPIP document; Item 2 is a real
>>>>> limitation that could reasonably be tracked as a follow-on
>>>>> improvement. Happy to cast my +1 once 1 and 3 are clarified.
>>>>>
>>>>> Regards,
>>>>> Viquar Khan
>>>>>
>>>>> On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Viquar,
>>>>>>
>>>>>> Thanks for the detailed review — all three concerns are already
>>>>>> accounted for in the current SPIP design (Appendix B.2 and B.6).
>>>>>>
>>>>>> 1. Capability Pushdown: The Changelog interface already exposes
>>>>>> declarative capability methods — containsCarryoverRows(),
>>>>>> containsIntermediateChanges(), and
>>>>>> representsUpdateAsDeleteAndInsert(). The ResolveChangelogTable rule
>>>>>> only injects post-processing when the connector declares it is
>>>>>> needed. If Delta Lake already materializes pre/post-images natively,
>>>>>> it returns representsUpdateAsDeleteAndInsert() = false and Spark
>>>>>> skips that work entirely. Catalyst never reconstructs what the
>>>>>> storage layer already provides.
>>>>>>
>>>>>> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
>>>>>> containsCarryoverRows() = true. If a connector eliminates carry-over
>>>>>> rows at the scan level, it returns false and Spark does nothing.
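[To make the gating above concrete, here is a hedged Python sketch of what "carry-over removal" means, assuming a copy-on-write rewrite re-emits unchanged rows as matching delete/insert pairs with identical data columns. The row shape and `_change_type` column are illustrative assumptions, not the SPIP's schema; Spark would apply logic like this only when containsCarryoverRows() returns true.]

```python
from collections import Counter

def remove_carryovers(rows):
    """Drop delete/insert pairs whose non-metadata columns are identical:
    those rows were merely rewritten by copy-on-write, not changed."""
    def data(r):  # the row minus its change-type metadata column
        return tuple(sorted((k, v) for k, v in r.items() if k != "_change_type"))
    deletes = Counter(data(r) for r in rows if r["_change_type"] == "delete")
    inserts = Counter(data(r) for r in rows if r["_change_type"] == "insert")
    carry = deletes & inserts                  # multiset intersection
    budget = {"delete": dict(carry), "insert": dict(carry)}
    out = []
    for r in rows:
        ct, k = r["_change_type"], data(r)
        if ct in budget and budget[ct].get(k, 0) > 0:
            budget[ct][k] -= 1                 # carry-over row: skip it
            continue
        out.append(r)
    return out

# A CoW rewrite deleted a whole file: row id=2 was re-inserted unchanged
# (carry-over), while row id=1 was genuinely deleted.
changes = [
    {"id": 1, "val": "gone", "_change_type": "delete"},
    {"id": 2, "val": "x", "_change_type": "delete"},
    {"id": 2, "val": "x", "_change_type": "insert"},
]
result = remove_carryovers(changes)
```

[A connector that filters carry-overs during scan planning would simply never emit the id=2 pair, which is why returning false lets Spark skip this step.]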
>>>>>> The connector also retains full control over scan planning via its
>>>>>> ScanBuilder, so I/O optimization stays in the storage layer.
>>>>>>
>>>>>> 3. Audit Fidelity: The deduplicationMode option already supports
>>>>>> none, dropCarryovers, and netChanges. Setting deduplicationMode =
>>>>>> 'none' returns the raw, unmodified change stream with every
>>>>>> intermediate state preserved. Net change collapsing happens only
>>>>>> when explicitly requested by the user.
>>>>>>
>>>>>> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1, really looking forward to this feature.
>>>>>>>
>>>>>>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Sorry for the late response. I know the vote is actively underway,
>>>>>>>> but reviewing the SPIP's Catalyst post-processing mechanics raised
>>>>>>>> a few systemic design concerns we need to clarify to avoid severe
>>>>>>>> performance regressions down the line.
>>>>>>>>
>>>>>>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>>>>>>> pre/post-images from raw insert/delete pairs. Storage layers like
>>>>>>>> Delta Lake already materialize these natively. If the Changelog
>>>>>>>> interface lacks state pushdown, Catalyst will burn CPU and memory
>>>>>>>> reconstructing what the storage layer already solved.
>>>>>>>>
>>>>>>>> 2. CoW I/O Bottlenecks: Mandating that Catalyst filter "carry-over"
>>>>>>>> rows for CoW tables is highly problematic. Without strict
>>>>>>>> connector-level row lineage, we will be dragging massive,
>>>>>>>> unmodified Parquet files across the network, forcing Spark into
>>>>>>>> heavy distributed joins just to discard unchanged data.
>>>>>>>>
>>>>>>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>>>>>>> changes." Collapsing intermediate states breaks enterprise audit
>>>>>>>> and compliance workflows that require full transactional history.
>>>>>>>> The SQL grammar needs an explicit ALL CHANGES execution path.
>>>>>>>>
>>>>>>>> I fully support unifying CDC and this SPIP is the right direction,
>>>>>>>> but abstracting it at the cost of storage-native optimizations and
>>>>>>>> audit fidelity is a dangerous trade-off. We need to clarify how
>>>>>>>> physical planning will handle these bottlenecks before formally
>>>>>>>> ratifying the proposal.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Viquar Khan
>>>>>>>>
>>>>>>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Thanks for the contribution!
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1!
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1, looking forward to it (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial
>>>>>>>>>>>>> Forensics & Metadata Analytics | Transaction Reconstruction |
>>>>>>>>>>>>> Audit & Evidence-Based Analytics
>>>>>>>>>>>>>
>>>>>>>>>>>>> view my LinkedIn profile
>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture
>>>>>>>>>>>>>> (CDC) Support*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Summary:*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work
>>>>>>>>>>>>>> across DSv2 connectors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Standardized User API
>>>>>>>>>>>>>> SQL:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> DataFrame API:
>>>>>>>>>>>>>> spark.read
>>>>>>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>>>>>>   .changes("my_table")
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer
>>>>>>>>>>>>>> will take over the CDC post-processing, including:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>>>>>>>> - Deriving pre-image/post-image updates from raw
>>>>>>>>>>>>>>   insert/delete pairs.
>>>>>>>>>>>>>> - Computing net changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Relevant Links:*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - *SPIP Doc:*
>>>>>>>>>>>>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>>>>>>> - *Discuss Thread:*
>>>>>>>>>>>>>>   https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>>>>>>> - *JIRA:*
>>>>>>>>>>>>>>   https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Gengliang Wang
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
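[As an illustration of the "net changes" step listed in the proposal above, and of why deduplicationMode='none' matters for audit trails per the discussion in this thread, here is a minimal Python sketch that replays an ordered change stream and keeps only each key's net effect. The row shape, commit-ordering assumption, and `_change_type` labels are illustrative assumptions, not the SPIP's actual schema. Note how intermediate states vanish, which is exactly the audit-fidelity concern raised earlier.]

```python
def net_changes(rows, key):
    """Collapse an ordered change stream to net effects per key.
    Sketch only: assumes rows arrive in commit order and updates are
    encoded as delete/insert pairs."""
    state = {}            # key -> latest inserted row, or None if deleted
    pre_existing = set()  # keys that existed before the change range began
    for r in rows:
        k = r[key]
        if r["_change_type"] == "delete":
            if k not in state:       # first event for k is a delete,
                pre_existing.add(k)  # so k existed before the range
            state[k] = None
        else:  # "insert"
            state[k] = r
    out = []
    for k, row in state.items():
        if row is None and k in pre_existing:
            out.append({key: k, "_change_type": "delete"})  # net delete
        elif row is not None and k in pre_existing:
            out.append({**row, "_change_type": "update_postimage"})
        elif row is not None:
            out.append(row)          # net insert
        # inserted then deleted inside the range: no net change at all
    return out

changes = [
    {"id": 1, "val": "a", "_change_type": "insert"},
    {"id": 1, "val": "a", "_change_type": "delete"},
    {"id": 1, "val": "b", "_change_type": "insert"},   # update of id=1
    {"id": 2, "val": "tmp", "_change_type": "insert"},
    {"id": 2, "val": "tmp", "_change_type": "delete"}, # nets to nothing
    {"id": 3, "val": "old", "_change_type": "delete"},
]
net = net_changes(changes, "id")
```

[With deduplicationMode='none' all six raw rows above would be returned; the net-change view keeps only two, losing the id=2 row entirely, which is why the raw mode is the audit-safe choice.]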
