Hi Otto,

Flink CDC [1] now provides full-database synchronization and schema evolution 
as a pipeline job. Iceberg sink support was suggested before [2], and we’re 
trying to implement it in the next few releases. Does that cover the use cases 
you mentioned?

[1] https://nightlies.apache.org/flink/flink-cdc-docs-stable/
[2] https://issues.apache.org/jira/browse/FLINK-34840

Regards,
Xiqian


From: Andrew Otto <o...@wikimedia.org>
Date: Friday, 24 May 2024 at 23:06
To: Giannis Polyzos <ipolyzos...@gmail.com>
Cc: Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>, Oscar Perez 
via user <user@flink.apache.org>, Péter Váry <peter.vary.apa...@gmail.com>, 
mbala...@apache.org <mbala...@apache.org>
Subject: Re: "Self-service ingestion pipelines with evolving schema via Flink 
and Iceberg" presentation recording from Flink Forward Seattle 2023
Indeed, using Flink CDC to write to Flink sink tables, including Iceberg, is 
supported.

What is not supported is the automatic syncing of entire databases, with 
schema evolution and detection of new (and dropped?) tables.  :)




On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:
https://nightlies.apache.org/flink/flink-cdc-docs-stable/
All these features come from Flink CDC itself. Because Paimon and Flink CDC 
are projects native to Flink, there is strong integration between them.
(I believe it’s on the roadmap to support Iceberg as well.)

On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:
> I’m curious if there is any reason for choosing Iceberg instead of Paimon

No technical reason that I'm aware of.  We are using it mostly because of 
momentum.  We looked at Flink Table Store (before it was Paimon), but decided 
it was too early and the docs were too sparse at the time to really consider it.

> Especially for a use case like CDC that Iceberg struggles to support.

We aren't doing any CDC right now (for many reasons), but I have never seen a 
feature like Paimon's database sync before.  One job to sync and evolve an 
entire database?  That is amazing.

If we could do this with Iceberg, we might be able to make an argument to 
product managers to push for CDC.



On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:
I’m curious if there is any reason for choosing Iceberg instead of Paimon 
(other than Iceberg being more popular).
Especially for a use case like CDC that Iceberg struggles to support.

On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:
Interesting thank you!

I asked this in the Paimon users group:

How coupled to Paimon catalogs and tables is the CDC part of Paimon?
RichCdcMultiplexRecord<https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
and related code seem incredibly useful even outside of the context of the 
Paimon table format.

I'm asking because the database sync 
action<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
 feature is amazing.  At the Wikimedia Foundation, we are on an all-in journey 
with Iceberg.  I'm wondering how hard it would be to extract the CDC logic from 
Paimon and abstract the Sink bits.

Could the table/database sync with schema evolution (without Flink job 
restarts!) potentially work with the Iceberg sink?
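For reference, the database sync action discussed here is launched as a single 
Flink job from the CLI. The sketch below follows the shape of the documented 
MySQL example; the jar path, version, warehouse location, and connection 
values are placeholders, not a tested configuration:

```shell
# Sync every table of a MySQL database into Paimon as one Flink job,
# with schema evolution handled by the job itself (no restarts needed).
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-<version>.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database target_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=****** \
    --mysql_conf database-name=source_db \
    --table_conf bucket=4
```

Newly created source tables matching the configured filters are also picked up 
by the running job, which is the behavior described above.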




On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
If I understand correctly, Paimon sends `CdcRecord`s [1] on the wire, which 
contain not only the data but the schema as well.
With Iceberg we currently only send the row data and expect to receive the 
schema on job start. This is more performant than sending the schema all the 
time, but has the obvious issue that it cannot handle schema changes. Another 
part of dynamic schema synchronization is updating the Iceberg table schema: 
the schema should be updated for all of the writers and the committer, but 
only a single schema-change commit is needed (allowed) on the Iceberg table.

This is a very interesting, but non-trivial change.

[1] 
https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
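A toy sketch of the difference described above (these record types are 
invented for illustration; they are not Paimon's or Iceberg's actual classes):

```java
import java.util.List;
import java.util.Map;

public class SchemaOnWireSketch {

    // Paimon-style: every record carries its schema, so a downstream writer
    // can detect a column addition in-band, without a job restart.
    record SchemaCarryingRecord(Map<String, String> fieldTypes,
                                Map<String, Object> fields) {}

    // Iceberg-style today: records are bare rows; the schema is read once at
    // job start, so upstream schema drift cannot be observed mid-job.
    record RowOnlyRecord(List<Object> columns) {}

    public static void main(String[] args) {
        SchemaCarryingRecord before = new SchemaCarryingRecord(
                Map.of("id", "INT"), Map.of("id", 1));
        SchemaCarryingRecord after = new SchemaCarryingRecord(
                Map.of("id", "INT", "email", "STRING"),
                Map.of("id", 2, "email", "a@b.org"));

        // With the schema on the wire, drift is visible by comparing
        // consecutive records; with bare rows it is not.
        boolean drifted = !before.fieldTypes().keySet()
                .equals(after.fieldTypes().keySet());
        System.out.println(drifted); // prints "true"
    }
}
```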

Andrew Otto <o...@wikimedia.org> wrote (on Thu, 23 May 2024, at 21:59):
Ah I see, so just auto-restarting to pick up new stuff.

I'd love to understand how Paimon does this.  They have a database sync 
action<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
 which will sync entire databases, handle schema evolution, and I'm pretty sure 
(I think I saw this in my local test) also pick up new tables.

https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45

I'm sure that Paimon table format is great, but at Wikimedia Foundation we are 
on the Iceberg train.  Imagine if there was a flink-cdc full database sync to 
Flink IcebergSink!




On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
I will ask Marton about the slides.

The solution was something like this in a nutshell:
- Make sure that on job start the latest Iceberg schema is read from the 
Iceberg table
- Throw a SuppressRestartsException when data arrives with the wrong schema
- Use Flink Kubernetes Operator to restart your failed jobs by setting
kubernetes.operator.job.restart.failed
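For the last step, the restart behavior is a single operator setting. A 
minimal FlinkDeployment fragment might look like this (the deployment name is 
made up; check the operator docs for your version):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: iceberg-ingest            # hypothetical deployment name
spec:
  flinkConfiguration:
    # Have the Flink Kubernetes Operator resubmit the job after a failure,
    # e.g. the SuppressRestartsException thrown on a schema mismatch; the
    # restarted job then reads the latest Iceberg schema at startup.
    kubernetes.operator.job.restart.failed: "true"
```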

Thanks, Peter

On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
Wow, I would LOVE to see this talk.  If there is no recording, perhaps there 
are slides somewhere?

On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda 
<sanabria.miranda.car...@gmail.com> wrote:
Hi everyone!

I have found on the Flink Forward website the following presentation: 
"Self-service ingestion pipelines with evolving schema via Flink and 
Iceberg<https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
by Márton Balassi from the 2023 conference in Seattle, but I cannot find the 
recording anywhere. I have found the recordings of the other presentations on 
the Ververica Academy website<https://www.ververica.academy/app>, but not this 
one.

Does anyone know where I can find it? Or at least the slides?

We are using Flink with the Iceberg sink connector to write streaming events to 
Iceberg tables, and we are researching how to handle schema evolution properly. 
I saw that presentation and I thought it could be of great help to us.

Thanks in advance!

Carlos
