Hi dev,

As we know, we plan to cut the Iceberg 0.10.0 release candidate this week. I think it is time to start planning for Iceberg 0.11.0, so I created a Java 0.11.0 release milestone here [1].
I put the following issues into the newly created milestone:

1. Apache Flink rewrite actions in Apache Iceberg. We may encounter small-files issues when running the Iceberg Flink sink in production because of frequent checkpoints. We have two approaches to handle the small files:

  a. Like the current Spark rewrite actions, Flink will provide similar rewrite actions that run in a batch job. This is suitable for periodically triggering whole-table or whole-partition compactions, because these rewrites compact many large files and may consume lots of bandwidth. JunZheng and I are working on this issue, and we've extracted the base rewrite actions shared between the Spark and Flink modules. The next step is implementing the rewrite actions in the Flink module.

  b. Compact the small files within the Flink streaming job as it sinks into Iceberg tables. That means we would introduce a new rewrite operator chained after the current IcebergFilesCommitter. Once an Iceberg transaction has been committed, the newly introduced rewrite operator checks whether a small compaction is needed. It only selects a few tiny files (several KB or MB; I think we could provide a configurable threshold) to rewrite, which keeps the cost low and the compaction efficient. simonsssu from Tencent has provided a WIP PR here [2].

2. Allow Flink streaming jobs to write CDC or UPSERT records. We've almost implemented the row-level delete feature on the Iceberg master branch, but it still lacks integration with the compute engines (to be precise, Spark/Flink can read the expected records if someone has deleted rows correctly, but the write path is not available yet). I am preparing a patch for sinking CDC into Iceberg via a Flink streaming job here [3]; I think it will be ready in the next few weeks.

3. Apache Flink streaming reader.
We've prepared a POC version in our Alibaba internal branch, but have not contributed it to Apache Iceberg yet. I think it's worth accomplishing that in the coming days.

The above are the issues that I think are worth merging before Iceberg 0.11.0. But I'm not quite sure about the plans for the following:

1. I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on Spark SQL extensions for Iceberg; I guess there's a high probability we can get that in? [4]
2. @Steven Wu <steve...@netflix.com> from Netflix is working on a Flink source based on the new FLIP-27 interface. Thoughts? [5]
3. How about the Spark row-level delete integration work?

[1]. https://github.com/apache/iceberg/milestone/12
[2]. https://github.com/apache/iceberg/pull/1669/files
[3]. https://github.com/apache/iceberg/pull/1663
[4]. https://github.com/apache/iceberg/milestone/11
[5]. https://github.com/apache/iceberg/issues/1626
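
P.S. To make the threshold idea in 1.b concrete, here is a rough sketch of how the post-commit operator could pick candidate files. All names here (SmallFileSelector, FileInfo) are hypothetical placeholders, not existing Iceberg or Flink APIs; the real operator would of course work against Iceberg's DataFile metadata:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: after each Iceberg commit, keep only the "tiny" data
// files (below a configurable size threshold) as compaction candidates, so
// the streaming rewrite touches a few small files rather than whole partitions.
public class SmallFileSelector {

    // Minimal stand-in for a committed data file's metadata.
    public static final class FileInfo {
        final String path;
        final long sizeInBytes;

        public FileInfo(String path, long sizeInBytes) {
            this.path = path;
            this.sizeInBytes = sizeInBytes;
        }
    }

    private final long thresholdBytes;

    public SmallFileSelector(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Select only files strictly smaller than the threshold; larger files are
    // left for the periodic batch rewrite action described in 1.a.
    public List<FileInfo> selectCandidates(List<FileInfo> committedFiles) {
        List<FileInfo> candidates = new ArrayList<>();
        for (FileInfo f : committedFiles) {
            if (f.sizeInBytes < thresholdBytes) {
                candidates.add(f);
            }
        }
        return candidates;
    }
}
```

The operator would then only fire a rewrite when enough candidates accumulate, which is why this path stays cheap compared with whole-partition compaction.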