Thanks for kicking off this discussion, +1 for this proposal. Currently, we use Flink CDC as the ingestion tool to write to Iceberg, and we need to restart the job whenever the schema evolves. Looking forward to seeing this integrated with Flink CDC.
Best,
Congxian

Leonard Xu <xbjt...@gmail.com> wrote on Thu, Nov 14, 2024 at 09:37:

> Thanks Peter and Max for kicking off this discussion, +1 for this proposal.
>
> The Dynamic Sink is a key improvement for the Iceberg Sink, and it can be well
> integrated with Flink CDC’s pipeline sink in the future as well.
>
> Best,
> Leonard
>
>
> > On Nov 13, 2024, at 7:57 PM, Maximilian Michels <m...@apache.org> wrote:
> >
> > Thanks for driving this Peter! The Dynamic Iceberg Sink is meant to solve
> > several pain points of the current Flink Iceberg sink. Note that the
> > Iceberg sink lives in Apache Iceberg, not Apache Flink, although that may
> > change one day when the interfaces are stable enough.
> >
> > From the user perspective, there are three main advantages to using the
> > Dynamic Iceberg sink:
> >
> > 1. It supports writing to any number of tables (no more 1:1 sink/topic
> > relationship).
> > 2. It can dynamically create and update tables based on the user-supplied
> > routing.
> > 3. It can dynamically update the schema and partitioning of tables based
> > on the user-supplied specification.
> >
> > I'd say these features will have a big impact on Flink Iceberg users.
> >
> > Cheers,
> > Max
> >
> > On Wed, Nov 13, 2024 at 12:27 PM Péter Váry <peter.vary.apa...@gmail.com>
> > wrote:
> >
> >> Hi Team,
> >>
> >> With Max Michels, we started to work on enhancing the current Iceberg
> >> Sink to allow inserting evolving records into a changing table.
> >>
> >> Created the design doc [1]
> >> Created the Iceberg proposal [2]
> >> Started the conversation on the Iceberg dev list [3]
> >>
> >> From the abstract:
> >> ---------
> >> The Flink Iceberg connector sink is the tool to write data to an Iceberg
> >> table from a continuous Flink stream. The current Sink implementations
> >> emphasize throughput over flexibility. The main limiting factor is that
> >> the Iceberg Flink Sink requires a static table structure: the table, the
> >> schema, and the partitioning specification need to be constant. If any of
> >> these changes, the Flink job needs to be restarted. This allows optimal
> >> record serialization and good performance, but real-life use cases need
> >> to work around this limitation when the underlying table has changed. We
> >> need to provide a tool to accommodate these changes.
> >> [..]
> >> The following typical use cases are considered during this design:
> >> - The schema of the incoming Avro records changes (new columns are added,
> >> or other backward-compatible changes happen). The Flink job is expected
> >> to update the table schema dynamically and continue to ingest data with
> >> both the new and the old schema without a job restart.
> >> - Incoming records define the target Iceberg table dynamically. The Flink
> >> job is expected to create the new table(s) and continue writing to them
> >> without a job restart.
> >> - The partitioning scheme of the table changes. The Flink job is expected
> >> to update the specification and continue writing to the target table
> >> without a job restart.
> >> ---------
> >> If you have any questions, ideas, or suggestions, please let us know on
> >> any of the email threads, or in comments on the document.
> >>
> >> Thanks,
> >> Peter
> >>
> >> [1] - https://docs.google.com/document/d/1R3NZmi65S4lwnmNjH4gLCuXZbgvZV5GNrQKJ5NYdO9s
> >> [2] - https://github.com/orgs/apache/projects/429
> >> [3] - https://lists.apache.org/thread/0sbosd8rzbl5rhqs9dtt99xr44kld3qp
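For anyone new to the existing connector, here is a minimal sketch of how the current (static) Iceberg Flink sink is typically wired up today, to illustrate the single-table, fixed-schema binding that the Dynamic Sink proposal aims to remove. The warehouse path and the RowData source helper are placeholders I made up for the example, not part of the proposal.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class StaticIcebergSinkJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Any RowData stream will do; its schema is fixed when the job is submitted.
        DataStream<RowData> input = buildRowDataSource(env); // placeholder helper

        // The sink is bound to exactly one table, resolved once via the loader.
        // A schema or partition-spec change on this table requires a job restart --
        // the limitation the Dynamic Iceberg Sink proposal addresses.
        TableLoader tableLoader =
            TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/events"); // example path

        FlinkSink.forRowData(input)
            .tableLoader(tableLoader)
            .append();

        env.execute("Static Iceberg sink example");
    }

    // Placeholder: plug in your own source (Kafka, files, ...) producing RowData.
    private static DataStream<RowData> buildRowDataSource(StreamExecutionEnvironment env) {
        throw new UnsupportedOperationException("Provide a RowData source");
    }
}

With the Dynamic Sink, the table identifier, schema, and partition spec would instead be derived per record from the user-supplied routing and specification, as described in the design doc [1].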