Interesting, thank you! I asked this in the Paimon users group:
How coupled to Paimon catalogs and tables is the CDC part of Paimon?
RichCdcMultiplexRecord
<https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
and related code seem incredibly useful even outside of the context of the
Paimon table format.

I'm asking because the database sync action
<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
feature is amazing. At the Wikimedia Foundation, we are on an all-in
journey with Iceberg. I'm wondering how hard it would be to extract the
CDC logic from Paimon and abstract the sink bits. Could the
table/database sync with schema evolution (without Flink job restarts!)
potentially work with the Iceberg sink?

On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> If I understand correctly, Paimon sends `CdcRecord`s [1] on the wire,
> which contain not only the data but the schema as well. With Iceberg we
> currently send only the row data and expect to receive the schema on job
> start. This is more performant than sending the schema all the time, but
> has the obvious limitation that it cannot handle schema changes. Another
> part of dynamic schema synchronization is updating the Iceberg table
> schema: the schema should be updated for all of the writers and the
> committer, but only a single schema-change commit is needed (allowed) on
> the Iceberg table.
>
> This is a very interesting, but non-trivial, change.
>
> [1]
> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>
> On Thu, May 23, 2024 at 21:59, Andrew Otto <o...@wikimedia.org> wrote:
>
>> Ah I see, so just auto-restarting to pick up new stuff.
>>
>> I'd love to understand how Paimon does this.
>> They have a database sync action
>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>> which will sync entire databases, handle schema evolution, and, I'm
>> pretty sure (I think I saw this in my local test), also pick up new
>> tables.
>>
>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>
>> I'm sure the Paimon table format is great, but at the Wikimedia
>> Foundation we are on the Iceberg train. Imagine if there were a
>> flink-cdc full-database sync to the Flink IcebergSink!
>>
>> On Thu, May 23, 2024 at 3:47 PM Péter Váry
>> <peter.vary.apa...@gmail.com> wrote:
>>
>>> I will ask Marton about the slides.
>>>
>>> The solution was something like this in a nutshell:
>>> - Make sure that on job start the latest Iceberg schema is read from
>>>   the Iceberg table
>>> - Throw a SuppressRestartsException when data arrives with the wrong
>>>   schema
>>> - Use the Flink Kubernetes Operator to restart your failed jobs by
>>>   setting kubernetes.operator.job.restart.failed
>>>
>>> Thanks, Peter
>>>
>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> Wow, I would LOVE to see this talk. If there is no recording, perhaps
>>>> there are slides somewhere?
>>>>
>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>>>> sanabria.miranda.car...@gmail.com> wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> I have found on the Flink Forward website the following
>>>>> presentation: "Self-service ingestion pipelines with evolving schema
>>>>> via Flink and Iceberg
>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>> by Márton Balassi, from the 2023 conference in Seattle, but I cannot
>>>>> find the recording anywhere.
>>>>> I have found the recordings of the other presentations on the
>>>>> Ververica Academy website <https://www.ververica.academy/app>, but
>>>>> not this one.
>>>>>
>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>
>>>>> We are using Flink with the Iceberg sink connector to write
>>>>> streaming events to Iceberg tables, and we are researching how to
>>>>> handle schema evolution properly. I saw that presentation and
>>>>> thought it could be of great help to us.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Carlos
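
[Editor's note] The core idea in the thread, that Paimon can evolve schemas in flight because each CDC record carries its schema alongside the data, can be sketched as below. This is a minimal illustrative sketch only: the class and field names are hypothetical stand-ins, not Paimon's actual `CdcRecord` API, and real schemas are typed objects rather than string maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a record that ships its schema with the row data,
// in the spirit of Paimon's CdcRecord. All names here are hypothetical.
public class SchemaCarryingRecord {
    // Column name -> type name (e.g. "id" -> "BIGINT"). Shipping this with
    // every record is what lets a sink notice schema changes mid-stream,
    // at the cost of more bytes on the wire.
    final Map<String, String> schema;
    // Column name -> serialized value for one row.
    final Map<String, String> fields;

    SchemaCarryingRecord(Map<String, String> schema, Map<String, String> fields) {
        this.schema = schema;
        this.fields = fields;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "BIGINT");
        schema.put("name", "STRING");
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("name", "wiki");
        SchemaCarryingRecord r = new SchemaCarryingRecord(schema, row);
        // A downstream writer can diff r.schema against the table's current
        // schema and evolve the table before writing, without a job restart.
        System.out.println(r.schema.containsKey("name"));
    }
}
```

This contrasts with the Iceberg sink described above, which reads the schema once at job start and sends only row data afterwards.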
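[Editor's note] Péter's three-step workaround (read the schema on job start, fail hard on mismatched data, let the Flink Kubernetes Operator restart the job) can be sketched as below. This is a hedged simulation under stated assumptions: `SchemaMismatchException` stands in for Flink's `SuppressRestartsException`, schemas are simplified to column-name lists, and none of the names are the actual Flink or Iceberg API.

```java
import java.util.List;

// Sketch of the fail-fast schema check described in the thread.
// All class and method names are illustrative, not real Flink/Iceberg API.
public class FailFastSchemaCheck {
    // Stand-in for Flink's SuppressRestartsException, which tells Flink not
    // to retry locally so an external operator can restart the whole job.
    static class SchemaMismatchException extends RuntimeException {
        SchemaMismatchException(String msg) { super(msg); }
    }

    // Schema read from the Iceberg table once, on job start.
    private final List<String> startupColumns;

    FailFastSchemaCheck(List<String> startupColumns) {
        this.startupColumns = startupColumns;
    }

    // Called per record: pass through while the schema still matches,
    // otherwise fail the job so it restarts with the new table schema.
    void check(List<String> recordColumns) {
        if (!startupColumns.equals(recordColumns)) {
            throw new SchemaMismatchException(
                "Schema changed from " + startupColumns + " to " + recordColumns
                + "; failing job so the operator can restart it");
        }
    }

    public static void main(String[] args) {
        FailFastSchemaCheck check = new FailFastSchemaCheck(List.of("id", "name"));
        check.check(List.of("id", "name")); // unchanged schema: passes
        try {
            check.check(List.of("id", "name", "email")); // new column arrives
            System.out.println("no exception");
        } catch (SchemaMismatchException e) {
            System.out.println("schema mismatch detected");
        }
    }
}
```

In the approach from the thread, the restart itself is delegated to the Flink Kubernetes Operator via `kubernetes.operator.job.restart.failed`, so the rebooted job re-reads the evolved Iceberg schema on startup.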