> What is the recommendation for storing the latest snapshot ID that is successfully merged into *table*? Ideally this is committed in the same transaction as the MERGE so that reprocessing is minimized. Does Iceberg support storing this as table metadata? I do not see any related information in the Iceberg Table Spec.
Tagging seems like a good option for this: https://iceberg.apache.org/docs/latest/branching/. So essentially, run the MERGE and then tag the resulting snapshot with a known name that users can then use to read data as of a snapshot that has the CDC data merged, as opposed to always reading the latest snapshot. (A rough sketch of this flow is included below the quoted message.)

MERGE isn't supported in the DataFrame API yet. See https://github.com/apache/iceberg/issues/3665

On Fri, Sep 22, 2023 at 2:00 PM Nick Del Nano <nickdeln...@gmail.com> wrote:

> Hi,
>
> I am exploring implementing the Hybrid CDC Pattern explained at 29:26
> <https://youtu.be/GM7EvRc7_is?si=mIQ5g2k1uEIMX5DT&t=1766> in Ryan Blue's
> talk CDC patterns in Apache Iceberg
> <https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg.html>.
>
> The use case is:
>
> 1. Stream CDC logs to an append-only Iceberg table named *table_changelog*
>    using Flink.
> 2. Periodically MERGE the CDC logs from *table_changelog* into *table*.
>    1. The rate of merging depends on the table's requirements: for some
>       tables it may be frequent (hourly), for others infrequent (daily).
>
> I am considering how to implement (2) using Iceberg's incremental read
> <https://iceberg.apache.org/docs/latest/spark-queries/#incremental-read>
> and would appreciate guidance on the following topics:
>
> 1. What is the recommendation for storing the latest snapshot ID that is
>    successfully merged into *table*? Ideally this is committed in the same
>    transaction as the MERGE so that reprocessing is minimized. Does Iceberg
>    support storing this as table metadata? I do not see any related
>    information in the Iceberg Table Spec.
> 2. Should I use the DataFrame API or Spark SQL for the incremental read
>    and the MERGE? From the docs, the incremental read examples use
>    DataFrames, and the MERGE uses Spark SQL
>    <https://iceberg.apache.org/docs/latest/spark-writes/#merge-into>.
>    Does either API support both use cases?
>
> Thanks,
> Nick
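
Roughly, one merge cycle could look like the sketch below (PySpark, assuming the Iceberg Spark runtime and SQL extensions are configured). The table names follow the thread; the id/op columns, the cdc_merged tag name, and the snapshot ID variables are made-up placeholders, and persisting the last-merged changelog snapshot ID between runs is outside this sketch since that is the open question in (1):

    # Sketch of one merge cycle; names are placeholders, not a recommendation.

    # 1. Incremental read of the changelog between two snapshot IDs
    #    (start-snapshot-id is exclusive, end-snapshot-id is inclusive).
    #    How last_merged_id / end_id are obtained and stored is left to the caller.
    changes = (
        spark.read.format("iceberg")
        .option("start-snapshot-id", str(last_merged_id))
        .option("end-snapshot-id", str(end_id))
        .load("db.table_changelog")
    )
    changes.createOrReplaceTempView("changes")

    # 2. MERGE has no DataFrame equivalent yet, so run it with Spark SQL;
    #    the join key and the op column are illustrative only.
    spark.sql("""
        MERGE INTO db.table AS t
        USING changes AS c
        ON t.id = c.id
        WHEN MATCHED AND c.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # 3. Tag the post-MERGE snapshot of the target table with a well-known
    #    name so readers can pin to the last snapshot that has CDC data merged.
    spark.sql("ALTER TABLE db.table CREATE OR REPLACE TAG cdc_merged")

    # Readers then query the tag instead of the latest snapshot, e.g.
    # spark.read.format("iceberg").option("tag", "cdc_merged").load("db.table")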