Hi openinx,

With https://github.com/apache/iceberg/pull/2303 and a potential sequence-number-based fix for https://github.com/apache/iceberg/issues/2308, I don't see any other hard blocker to testing out row-level deletions. Please correct me if anything else in https://github.com/apache/iceberg/milestone/4 is a must-have.
Is it possible to separate the flink+iceberg CDC changes and row-level deletions into future releases so that the community can have V2 earlier?

Thanks,
Huadong

On 2021/03/24 02:34:23, OpenInx <open...@gmail.com> wrote:
> Hi Himanshu,
>
> Thanks for the email. Currently flink+iceberg supports writing CDC events into an Apache Iceberg table via the Flink DataStream API, and spark/presto/hive can read those events in batch jobs.
>
> But there are still some issues that we have not finished yet:
>
> 1. Expose Iceberg v2 to end users. The row-level delete feature is actually built on Iceberg format v2, and there are still some blockers that we need to fix (please see the document https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit); the Iceberg team will need some resources to resolve them.
> 2. As we know, the CDC events depend on Iceberg primary key identification (so that we can define a mysql_cdc SQL table using a primary key clause). I saw Jack Ye has published a PR for this, https://github.com/apache/iceberg/pull/2354; I will review it today.
> 3. The CDC writers will inevitably produce many small files as the periodic checkpoints go on, so for a real production environment we must provide the ability to rewrite small files into larger files (a compaction action). There are a few PRs that need to be reviewed:
>    a. https://github.com/apache/iceberg/pull/2303/files
>    b. https://github.com/apache/iceberg/pull/2294
>    c. https://github.com/apache/iceberg/pull/2216
>
> I think it's better to resolve all of these issues before we put production data into Iceberg (syncing the MySQL binlog via Debezium). I saw the last sync notes saying the next release, 0.12.0, would ideally be cut at the end of this month (https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E), and I think that deadline is too tight. In my mind, if the 0.12.0 release won't expose format v2 to end users, then what are the core features that we want to release? If the features we plan to release are not major ones, then how about releasing 0.11.2 instead?
>
> According to my understanding of the needs of community users, the vast majority of Iceberg users have high expectations for format v2. I think we may need to raise v2 exposure to a higher priority so that our users can run full PoC tests earlier.
>
>
>
> On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore <himanshu.rath...@zomato.com.invalid> wrote:
>
> > We are planning to use Flink + Iceberg for syncing MySQL binlogs via Debezium, and it seems most of this depends on the next release.
> >
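For anyone else trying the CDC write path OpenInx mentions above, below is a rough, untested sketch of what the Flink DataStream job looks like on our side. The table location, the "id" equality column, and the buildCdcStream() helper are placeholders for illustration only, and the exact FlinkSink builder method names may differ between Iceberg versions:

import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class CdcToIcebergSketch {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Checkpointing drives Iceberg commits; each checkpoint produces new data/delete files,
    // which is also why small-file compaction matters for this pipeline.
    env.enableCheckpointing(60_000L);

    // Placeholder source: a DataStream<RowData> carrying CDC rows (insert/update/delete)
    // decoded upstream from Debezium / MySQL binlog records.
    DataStream<RowData> cdcRows = buildCdcStream(env);

    // Placeholder location; a catalog-based TableLoader is more common in practice.
    TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/orders");

    FlinkSink.forRowData(cdcRows)
        .tableLoader(tableLoader)
        // Equality field columns identify rows for the v2 equality deletes; they should
        // match the table's primary key / identifier fields (here a hypothetical "id").
        .equalityFieldColumns(Arrays.asList("id"))
        .writeParallelism(2)
        .append();

    env.execute("mysql-binlog-to-iceberg");
  }

  private static DataStream<RowData> buildCdcStream(StreamExecutionEnvironment env) {
    // Source wiring (flink-cdc-connectors / Debezium deserialization) omitted in this sketch.
    throw new UnsupportedOperationException("source wiring omitted");
  }
}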