Hi Spark devs,

I'd like to call a vote on the SPIP: *Change Data Capture (CDC) Support*

*Summary:*

This SPIP proposes a unified approach to reading change data: a CHANGES SQL
clause and corresponding DataFrame/DataStream APIs that work across DSv2
connectors.

1. Standardized User API

SQL:

-- Batch: What changed between versions 10 and 20?
SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;

-- Streaming: Continuously process changes
CREATE STREAMING TABLE cdc_sink AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;

DataFrame API:

spark.read
  .option("startingVersion", "10")
  .option("endingVersion", "20")
  .changes("my_table")

2. Engine-Level Post Processing

Under the hood, this proposal introduces a minimal Changelog interface for
DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
post-processing, including:

   - Filtering out copy-on-write carry-over rows.
   - Deriving pre-image/post-image updates from raw insert/delete pairs.
   - Computing net changes.
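
To make the three post-processing steps above concrete, here is a minimal
pure-Python sketch of one possible reading of them. It is not Spark code and
not the SPIP's implementation: the row layout (version, change_type, key,
value), the change-type labels, and all function names are assumptions made
for illustration, and it assumes at most one delete and one insert per key
per version.

```python
from collections import Counter, defaultdict

# Hypothetical raw changelog rows: (version, change_type, key, value).
# change_type is "insert" or "delete"; this layout is an assumption.

def drop_carry_over(rows):
    """Drop delete/insert pairs with identical content in the same version."""
    deletes = Counter((v, k, val) for v, t, k, val in rows if t == "delete")
    inserts = Counter((v, k, val) for v, t, k, val in rows if t == "insert")
    paired = deletes & inserts  # per-row minimum of the two counts
    budget = {"delete": Counter(paired), "insert": Counter(paired)}
    out = []
    for v, t, k, val in rows:
        if budget[t][(v, k, val)] > 0:
            budget[t][(v, k, val)] -= 1  # unchanged row rewritten by copy-on-write
            continue
        out.append((v, t, k, val))
    return out

def derive_updates(rows):
    """Relabel a delete plus an insert of one key in one version as an update."""
    kinds = defaultdict(set)
    for v, t, k, _ in rows:
        kinds[(v, k)].add(t)
    relabel = {"delete": "update_preimage", "insert": "update_postimage"}
    return [
        (v, relabel[t] if kinds[(v, k)] == {"delete", "insert"} else t, k, val)
        for v, t, k, val in rows
    ]

def net_changes(rows):
    """Collapse all raw changes per key into one net effect over the range."""
    by_key = defaultdict(list)
    # Order by version; within a version, deletes come before inserts.
    for row in sorted(rows, key=lambda r: (r[0], r[1] != "delete")):
        by_key[row[2]].append(row)
    out = []
    for k, events in by_key.items():
        first, last = events[0], events[-1]
        existed_before = first[1] == "delete"  # row was present before the range
        exists_after = last[1] == "insert"     # row is present after the range
        if exists_after and not existed_before:
            out.append(("insert", k, last[3]))
        elif existed_before and not exists_after:
            out.append(("delete", k, first[3]))
        elif existed_before and exists_after and first[3] != last[3]:
            out.append(("update_preimage", k, first[3]))
            out.append(("update_postimage", k, last[3]))
    return out

raw = [
    (11, "insert", 1, "a"),
    (12, "delete", 1, "a"), (12, "insert", 1, "b"),  # real update of key 1
    (12, "delete", 2, "x"), (12, "insert", 2, "x"),  # carry-over of key 2
    (13, "insert", 3, "t"),
    (14, "delete", 3, "t"),                          # key 3 added, then removed
]
cleaned = drop_carry_over(raw)
with_updates = derive_updates(cleaned)
net = net_changes(cleaned)
```

In this sample, key 2 disappears from the changelog entirely, key 1's
version-12 pair is relabeled as a pre-image/post-image update, and the net
result over versions 11 to 14 is a single insert of key 1 with value "b".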


*Relevant Links:*

   - *SPIP Doc:* https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
   - *Discuss Thread:* https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
   - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668


*The vote will be open for at least 72 hours.* Please vote:

[ ] +1: Accept the proposal as an official SPIP

[ ] +0

[ ] -1: I don't think this is a good idea because ...

Thanks,
Gengliang Wang
