Re: SPIP: Auto CDC support for Apache Spark

陈小健 Sat, 28 Mar 2026 06:35:03 -0700

unsubscribe

Get Outlook for Android<https://aka.ms/AAb9ysg>
________________________________
From: Andreas Neumann <[email protected]>
Sent: Saturday, March 28, 2026 2:43:54 AM
To: [email protected] <[email protected]>
Subject: Re: SPIP: Auto CDC support for Apache Spark


Hi Vaibhav,

The goal of this proposal is not to replace MERGE but to provide a simple 
abstraction for the common use case of CDC.
MERGE itself is a very powerful operator and there will always be use cases 
outside of CDC that will require MERGE.

And thanks for spotting the typo in the SPIP. It is fixed now!

Cheers -Andreas


On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
Hi Andreas,

Thanks for sharing the SPIP. Does that mean the MERGE statement would be 
deprecated? Also, I think there is a small typo; I have suggested a fix in the doc.

Regards,
Vaibhav

On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
+1

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:


Hi all,

I’d like to start a discussion on a new SPIP to introduce Auto CDC support to 
Apache Spark.

  *   SPIP Document: 
https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
  *   JIRA: https://issues.apache.org/jira/browse/SPARK-55668

Motivation

With the upcoming introduction of standardized CDC support 
<https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon 
have a unified way to produce change data feeds. However, consuming these feeds 
and applying them to a target table remains a significant challenge.

Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2 
(tracking full change history) often require hand-crafted, complex MERGE logic. 
In distributed systems, these implementations are frequently error-prone when 
handling deletions or out-of-order data.

Proposal

This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates the 
complex logic for SCD types and out-of-order data, allowing data engineers to 
configure a declarative flow instead of writing manual MERGE statements. This 
feature will be available in both Python and SQL.
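
As a rough illustration of the kind of logic such a flow would encapsulate, here is a minimal pure-Python sketch of applying change events to an SCD Type 1 replica. It is not Spark code and not the SPIP's design: the function name `apply_scd1`, the `"UPSERT"` operation value, and the tombstone dictionary `last_seq` are assumptions made for this example. The column names mirror the SQL example in this mail.

```python
def apply_scd1(target: dict, last_seq: dict, event: dict) -> None:
    """Apply one change event to an in-memory SCD Type 1 replica.

    target:   key -> current row (hypothetical stand-in for the target table)
    last_seq: key -> highest sequence number applied so far; deletes are
              recorded too, so a late upsert cannot resurrect a deleted row
    """
    key = event["userId"]
    seq = event["sequenceNum"]
    # Drop events that are not newer than what was already applied.
    if key in last_seq and seq <= last_seq[key]:
        return
    last_seq[key] = seq
    if event["operation"] == "DELETE":
        target.pop(key, None)
    else:
        # Keep the data columns only, without the CDC metadata columns.
        target[key] = {k: v for k, v in event.items()
                       if k not in ("operation", "sequenceNum")}


target, last_seq = {}, {}
events = [
    {"userId": 1, "name": "Ann", "city": "Oslo", "operation": "UPSERT", "sequenceNum": 1},
    {"userId": 1, "operation": "DELETE", "sequenceNum": 3},
    # Arrives late: it is older than the delete, so it must be ignored.
    {"userId": 1, "name": "Ann", "city": "Rome", "operation": "UPSERT", "sequenceNum": 2},
    {"userId": 2, "name": "Bob", "city": "Lima", "operation": "UPSERT", "sequenceNum": 1},
]
for e in events:
    apply_scd1(target, last_seq, e)
```

Without the sequence check, the late upsert for userId 1 would bring back a row that had already been deleted, which is the typical failure mode of a hand-written MERGE.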

Example SQL:
-- Produce a change feed
CREATE STREAMING TABLE cdc.users AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;

-- Consume the change feed
CREATE FLOW flow
AS AUTO CDC INTO
  target
FROM stream(cdc.users)
  KEYS (userId)
  APPLY AS DELETE WHEN operation = "DELETE"
  SEQUENCE BY sequenceNum
  COLUMNS * EXCEPT (operation, sequenceNum)
  STORED AS SCD TYPE 2
  TRACK HISTORY ON * EXCEPT (city);
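
To give a feel for what SCD TYPE 2 with TRACK HISTORY ON * EXCEPT (city) could mean for the rows in the target, here is a small batch sketch in plain Python. The semantics are my assumptions for illustration, not taken from the SPIP: the `start_at` / `end_at` column names, the `"UPSERT"` operation value, and the rule that a change limited to untracked columns updates the open row in place are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def build_scd2(events, key="userId", seq="sequenceNum", op="operation",
               untracked=("city",)):
    """Build SCD Type 2 history rows from a batch of change events."""
    meta = {seq, op}
    history = []
    # Sorting by (key, sequence) repairs out-of-order arrival within the batch.
    ordered = sorted(events, key=itemgetter(key, seq))
    for _, group in groupby(ordered, key=itemgetter(key)):
        open_row = None  # the current (not yet closed) version for this key
        for ev in group:
            if ev[op] == "DELETE":
                if open_row is not None:
                    open_row["end_at"] = ev[seq]
                    open_row = None
                continue
            data = {k: v for k, v in ev.items() if k not in meta}
            tracked = {k: v for k, v in data.items() if k not in untracked}
            if open_row is not None and all(
                    open_row.get(k) == v for k, v in tracked.items()):
                open_row.update(data)  # only untracked columns changed
                continue
            if open_row is not None:
                open_row["end_at"] = ev[seq]  # close the previous version
            open_row = {**data, "start_at": ev[seq], "end_at": None}
            history.append(open_row)
    return history


events = [
    {"userId": 1, "name": "Ann", "city": "Oslo", "operation": "UPSERT", "sequenceNum": 1},
    {"userId": 1, "name": "Anna", "city": "Rome", "operation": "UPSERT", "sequenceNum": 3},
    # Late event that changes only the untracked column `city`.
    {"userId": 1, "name": "Ann", "city": "Rome", "operation": "UPSERT", "sequenceNum": 2},
    {"userId": 1, "operation": "DELETE", "sequenceNum": 4},
]
history = build_scd2(events)
```

In this sketch the city change at sequence 2 does not open a new history row, the name change at sequence 3 does, and the delete at sequence 4 closes the last open row. A real streaming implementation would also have to handle events that arrive after their batch has been committed, which this sketch leaves out.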

Please review the full SPIP for the technical details. Looking forward to your 
feedback and discussion!

Best regards,

Andreas
