Hi,

I am exploring implementing the Hybrid CDC Pattern explained at 29:26
<https://youtu.be/GM7EvRc7_is?si=mIQ5g2k1uEIMX5DT&t=1766> in Ryan Blue's
talk CDC patterns in Apache Iceberg
<https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg.html>.

The use case is:

   1. Stream CDC logs to an append-only Iceberg table named
   *table_changelog* using Flink
   2. Periodically MERGE the CDC logs from *table_changelog* into *table*
      1. The merge frequency depends on the table's requirements: for some
      tables it may be frequent (hourly), for others infrequent (daily).
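To make (2) concrete, the merge I have in mind looks roughly like this
(the table and column names, and the *op* flag, are illustrative):

```sql
-- Sketch of the periodic merge; `changes` stands for the slice of
-- table_changelog read incrementally since the last merged snapshot.
MERGE INTO db.table t
USING changes c
ON t.id = c.id
WHEN MATCHED AND c.op = 'D' THEN DELETE   -- rows deleted upstream
WHEN MATCHED THEN UPDATE SET *            -- rows updated upstream
WHEN NOT MATCHED THEN INSERT *            -- rows inserted upstream
```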

I am considering how to implement (2) using Iceberg's incremental read
<https://iceberg.apache.org/docs/latest/spark-queries/#incremental-read> and
would appreciate guidance on the following topics:

   1. What is the recommended way to store the latest *table_changelog*
   snapshot ID that has been successfully merged into *table*? Ideally this
   would be committed in the same transaction as the MERGE so that
   reprocessing is minimized. Does Iceberg support storing this as table
   metadata? I do not see any related information in the Iceberg Table Spec.
   2. Should the incremental read and MERGE use the DataFrame API or Spark
   SQL? In the docs, the incremental read examples use DataFrames, while
   MERGE uses Spark SQL
   <https://iceberg.apache.org/docs/latest/spark-writes/#merge-into>. Does
   either API support both use cases?
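For (1), in case it helps frame the question: the closest workaround I can
see today is recording progress as a table property after the MERGE, which
is a separate commit rather than the same transaction (the property key
below is one I made up):

```sql
-- Runs as its own commit after the MERGE, so a failure in between could
-- leave the recorded snapshot ID out of date and force reprocessing.
ALTER TABLE db.table
SET TBLPROPERTIES ('cdc.last-merged-snapshot-id' = '<snapshot-id>');
```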

Thanks,
Nick
