Hi all,

Flink CDC provides schema evolution support when syncing an entire database, which I think could satisfy your needs. The available Flink CDC pipeline sources and sinks are listed in [1]; an Iceberg pipeline connector is not provided yet.

> What is not is the automatic syncing of entire databases, with schema evolution and detection of new (and dropped?) tables. :)

Flink CDC is able to sync an entire database with schema evolution. If a new table is added to the database, the running pipeline job cannot sync it automatically, but we can enable 'scan.newly-added-table.enabled' and restart the job from a savepoint to catch the new tables. This feature is not yet released for the MySQL pipeline connector, but the PR [2] has been opened.
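For the DataStream-level MySQL source, the equivalent switch already exists on the source builder. A minimal sketch, assuming Flink CDC 3.x package names and placeholder connection settings:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.cdc.connectors.mysql.source.MySqlSource;
    import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class WholeDbSyncSketch {
        public static void main(String[] args) throws Exception {
            // Capture every table in the database. With the flag below enabled,
            // tables created later are picked up after a savepoint-based restart.
            MySqlSource<String> source = MySqlSource.<String>builder()
                    .hostname("mysql.example.org")        // placeholder
                    .port(3306)
                    .databaseList("app_db")
                    .tableList("app_db.*")                // regex: all tables in app_db
                    .username("flink")                    // placeholder
                    .password("******")                   // placeholder
                    .scanNewlyAddedTableEnabled(true)     // discover newly added tables
                    .deserializer(new JsonDebeziumDeserializationSchema())
                    .build();

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000L); // checkpoints are required for savepoint restarts
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
               .print();
            env.execute("whole-database sync sketch");
        }
    }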
Best,
Hang

[1] https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/overview/
[2] https://github.com/apache/flink-cdc/pull/3347

Xiqian YU <kono....@outlook.com> wrote on Mon, May 27, 2024 at 10:04:

Hi Otto,

Flink CDC [1] now provides full-database sync and schema evolution ability as a pipeline job. Iceberg sink support was suggested before [2], and we are trying to implement it in the next few releases. Does that cover the use cases you mentioned?

[1] https://nightlies.apache.org/flink/flink-cdc-docs-stable/
[2] https://issues.apache.org/jira/browse/FLINK-34840

Regards,
Xiqian

From: Andrew Otto <o...@wikimedia.org>
Date: Friday, May 24, 2024 at 23:06
To: Giannis Polyzos <ipolyzos...@gmail.com>
Cc: Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>, Oscar Perez via user <user@flink.apache.org>, Péter Váry <peter.vary.apa...@gmail.com>, mbala...@apache.org <mbala...@apache.org>
Subject: Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

Indeed, using Flink CDC to write to Flink sink tables, including Iceberg, is supported.

What is not is the automatic syncing of entire databases, with schema evolution and detection of new (and dropped?) tables. :)

On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:

https://nightlies.apache.org/flink/flink-cdc-docs-stable/

All these features come from Flink CDC itself. Because Paimon and Flink CDC are projects native to Flink, there is strong integration between them. (I believe it is on the roadmap to support Iceberg as well.)

On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:

> I'm curious if there is any reason for choosing Iceberg instead of Paimon

No technical reason that I'm aware of. We are using it mostly because of momentum. We looked at Flink Table Store (before it was Paimon), but decided it was too early and the docs were too sparse at the time to really consider it.

> Especially for a use case like CDC that Iceberg struggles to support.

We aren't doing any CDC right now (for many reasons), but I have never seen a feature like Paimon's database sync before. One job to sync and evolve an entire database? That is amazing.

If we could do this with Iceberg, we might be able to make an argument to product managers to push for CDC.
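For reference, Paimon's whole-database sync is submitted as a Flink action jar. A rough sketch following the Paimon documentation; the version, warehouse path, and credentials are placeholders:

    # Submit Paimon's mysql_sync_database action (all values are placeholders).
    <FLINK_HOME>/bin/flink run \
        /path/to/paimon-flink-action-<version>.jar \
        mysql_sync_database \
        --warehouse hdfs:///path/to/warehouse \
        --database target_db \
        --mysql_conf hostname=mysql.example.org \
        --mysql_conf username=flink \
        --mysql_conf password=****** \
        --mysql_conf database-name=source_db \
        --table_conf bucket=4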
On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:

I'm curious if there is any reason for choosing Iceberg instead of Paimon (other than Iceberg being more popular), especially for a use case like CDC that Iceberg struggles to support.

On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:

Interesting, thank you!

I asked this in the Paimon users group:

How coupled to Paimon catalogs and tables is the CDC part of Paimon? RichCdcMultiplexRecord <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java> and related code seem incredibly useful even outside of the context of the Paimon table format.

I'm asking because the database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> feature is amazing. At the Wikimedia Foundation we are on an all-in journey with Iceberg, so I'm wondering how hard it would be to extract the CDC logic from Paimon and abstract the sink bits.

Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?

On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

If I understand correctly, Paimon sends CdcRecord-s [1] on the wire, which contain not only the data but the schema as well.

With Iceberg we currently only send the row data and expect to receive the schema at job start. This is more performant than sending the schema all the time, but it has the obvious issue that it cannot handle schema changes. Another part of dynamic schema synchronization is updating the Iceberg table schema itself: the schema should be updated for all of the writers and the committer, but only a single schema-change commit is needed (allowed) on the Iceberg table.

This is a very interesting, but non-trivial, change.

[1] https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java

Andrew Otto <o...@wikimedia.org> wrote on Thu, May 23, 2024 at 21:59:

Ah I see, so just auto-restarting to pick up new stuff.

I'd love to understand how Paimon does this. They have a database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> which will sync entire databases, handle schema evolution, and, I'm pretty sure (I think I saw this in my local test), also pick up new tables.

https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45

I'm sure the Paimon table format is great, but at the Wikimedia Foundation we are on the Iceberg train. Imagine if there were a flink-cdc full-database sync to the Flink IcebergSink!

On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

I will ask Marton about the slides.

The solution was something like this, in a nutshell:
- Make sure that on job start the latest Iceberg schema is read from the Iceberg table.
- Throw a SuppressRestartsException when data arrives with the wrong schema (see the sketch below).
- Use the Flink Kubernetes Operator to restart failed jobs by setting kubernetes.operator.job.restart.failed.
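A minimal sketch of the failure step, assuming hypothetical helpers for loading the Iceberg schema and deriving a record's schema (both are illustrative, not a real API):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.execution.SuppressRestartsException;

    // Fails the job terminally on a schema mismatch, so Flink's restart
    // strategies do not retry in a loop; the Flink Kubernetes Operator
    // (kubernetes.operator.job.restart.failed: true) then resubmits the job,
    // which re-reads the latest Iceberg schema at startup.
    public class SchemaGuard<T> extends RichMapFunction<T, T> {
        private transient String expectedSchema;

        @Override
        public void open(Configuration parameters) {
            expectedSchema = loadIcebergSchemaJson(); // hypothetical helper
        }

        @Override
        public T map(T record) {
            if (!schemaOf(record).equals(expectedSchema)) { // hypothetical helper
                throw new SuppressRestartsException(
                        new IllegalStateException("Schema changed; failing terminally to force a fresh start"));
            }
            return record;
        }

        private String loadIcebergSchemaJson() { /* read the schema via the Iceberg catalog */ return ""; }
        private String schemaOf(T record)      { /* derive the schema of the incoming record */ return ""; }
    }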
Thanks,
Peter

On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:

Wow, I would LOVE to see this talk. If there is no recording, perhaps there are slides somewhere?

On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com> wrote:

Hi everyone!

I have found on the Flink Forward website the following presentation: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>" by Márton Balassi, from the 2023 conference in Seattle, but I cannot find the recording anywhere. I have found the recordings of the other presentations on the Ververica Academy website <https://www.ververica.academy/app>, but not this one.

Does anyone know where I can find it? Or at least the slides?

We are using Flink with the Iceberg sink connector to write streaming events to Iceberg tables, and we are researching how to handle schema evolution properly. I saw that presentation and thought it could be of great help to us.

Thanks in advance!

Carlos
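For context, the stock Iceberg Flink sink is wired roughly as below. The table (and therefore the write schema) is resolved through the TableLoader once, when the job starts, which is why the restart-based approaches discussed above are needed for schema evolution. A sketch, with the warehouse path and source stream assumed:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.data.RowData;
    import org.apache.iceberg.flink.TableLoader;
    import org.apache.iceberg.flink.sink.FlinkSink;

    public class IcebergSinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000L); // Iceberg commits its snapshots on checkpoints

            DataStream<RowData> events = buildEventStream(env); // hypothetical source

            // Resolves the table and captures its schema at job start.
            TableLoader tableLoader =
                    TableLoader.fromHadoopTable("hdfs:///warehouse/db/events"); // placeholder path

            FlinkSink.forRowData(events)
                     .tableLoader(tableLoader)
                     .append();

            env.execute("iceberg sink sketch");
        }

        private static DataStream<RowData> buildEventStream(StreamExecutionEnvironment env) {
            throw new UnsupportedOperationException("wire up a real source here");
        }
    }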