Thanks for the context about FLIP-27, Steven! I will take a look at the patches under issue 1626.
On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <stevenz...@gmail.com> wrote:

> OpenInx, thanks a lot for kicking off the discussion. Looks like my
> previous reply didn't reach the mailing list.
>
> > flink source based on the new FLIP-27 interface
>
> Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source. I
> have updated the issue [1] with the following scope:
>
> - Support both static/batch and continuous/streaming enumeration modes.
> - Support only the simple assigner with no ordering/locality guarantee
> when handing out split assignments, but make the interface flexible
> enough to plug in different assigners (like an event-time alignment
> assigner or a locality-aware assigner).
> - It will have @Experimental status, as nobody has run FLIP-27 sources in
> production today. The Flink 1.12.0 release (ETA end of Nov) will have the
> first set of sources (Kafka and file) implemented with the FLIP-27 source
> framework. We still need to gain more production experience.
>
> [1] https://github.com/apache/iceberg/issues/1626
>
> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <open...@gmail.com> wrote:
>
>> Hi dev,
>>
>> As we know, we plan to cut the Iceberg 0.10.0 release candidate this
>> week. I think it may be time to start planning for Iceberg 0.11.0 now,
>> so I created a Java 0.11.0 release milestone here [1].
>>
>> I put the following issues into the newly created milestone:
>>
>> 1. Apache Flink rewrite actions in Apache Iceberg.
>>
>> It's possible that we will encounter small-file issues when running the
>> Iceberg Flink sink in real production because of frequent checkpoints.
>> We have two approaches to handle the small files:
>>
>> a. Following the design of the current Spark rewrite actions, Flink
>> will provide similar rewrite actions running in a batch job. This is
>> suitable for triggering whole-table or whole-partition compactions
>> periodically, because this kind of rewrite will compact many large
>> files and may consume lots of bandwidth. Currently, JunZheng and I are
>> working on this issue, and we've extracted the base rewrite actions
>> shared between the Spark and Flink modules. The next step is
>> implementing the rewrite actions in the Flink module.
>>
>> b. Compact those small files in the Flink streaming job when sinking
>> into Iceberg tables. That means we will provide a new rewrite operator
>> chained to the current IcebergFilesCommitter. Once an Iceberg
>> transaction has been committed, the newly introduced rewrite operator
>> will check whether it needs to run a small compaction. These actions
>> only choose a few tiny files (maybe several KB or MB; I think we could
>> provide a configurable threshold) to rewrite, which can be done at
>> minimum cost and with higher compaction efficiency. Currently,
>> simonsssu from Tencent has provided a WIP PR here [2].
>>
>> 2. Allow writing CDC or UPSERT records from Flink streaming jobs.
>>
>> We've almost implemented the row-level delete feature in the Iceberg
>> master branch, but we still lack the integration with compute engines
>> (to be precise, Spark/Flink can read the expected records if someone
>> has deleted the rows correctly, but the write path is not available).
>> I am preparing the patch for sinking CDC into Iceberg from a Flink
>> streaming job here [3]; I think it will be ready in the next few weeks.
>>
>> 3. Apache Flink streaming reader.
>>
>> We've prepared a POC version in our Alibaba internal branch but have
>> not yet contributed it to Apache Iceberg. I think it's worth
>> accomplishing that in the following days.
>>
>> The above are the issues that I think are worth merging before Iceberg
>> 0.11.0. But I'm not quite sure about the plans for these things:
>>
>> 1. 
I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on Spark SQL
>> extensions for Iceberg; I guess there's a high probability of getting
>> that in? [4]
>>
>> 2. @Steven Wu <steve...@netflix.com> from Netflix is working on the
>> Flink source based on the new FLIP-27 interface; thoughts? [5]
>>
>> 3. How about the Spark row-delete integration work?
>>
>> [1]. https://github.com/apache/iceberg/milestone/12
>> [2]. https://github.com/apache/iceberg/pull/1669/files
>> [3]. https://github.com/apache/iceberg/pull/1663
>> [4]. https://github.com/apache/iceberg/milestone/11
>> [5]. https://github.com/apache/iceberg/issues/1626
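
For readers following approach (b) in the thread above (the rewrite operator chained to IcebergFilesCommitter), the "configurable threshold" idea amounts to filtering the just-committed data files by size and compacting only the tiny ones. Here is a minimal, self-contained sketch of that selection step; the class and method names are hypothetical illustrations, not the API of the WIP PR [2]:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: pick only tiny committed files for a cheap, fast compaction. */
public class SmallFileSelector {
    /** Minimal stand-in for an Iceberg DataFile: just a path and a size in bytes. */
    public record CommittedFile(String path, long sizeInBytes) {}

    private final long thresholdBytes;

    public SmallFileSelector(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Returns the files below the configurable threshold, i.e. the compaction candidates. */
    public List<CommittedFile> selectForCompaction(List<CommittedFile> committed) {
        return committed.stream()
                .filter(f -> f.sizeInBytes() < thresholdBytes)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 16 MB threshold: anything smaller is considered "tiny" and worth compacting now.
        SmallFileSelector selector = new SmallFileSelector(16L * 1024 * 1024);
        List<CommittedFile> committed = List.of(
                new CommittedFile("s3://bucket/t/data-0.parquet", 4_096),        // ~4 KB, tiny
                new CommittedFile("s3://bucket/t/data-1.parquet", 2_000_000),    // ~2 MB, tiny
                new CommittedFile("s3://bucket/t/data-2.parquet", 256_000_000)); // ~256 MB, large
        List<CommittedFile> candidates = selector.selectForCompaction(committed);
        System.out.println(candidates.size()); // prints 2
    }
}
```

Because only a few small files are rewritten per transaction, the cost per checkpoint stays bounded, which is what makes this feasible inside the streaming job rather than as a separate batch action.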
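
On the FLIP-27 scope Steven describes, the "simple assigner with no ordering/locality guarantee" can be pictured as a plain FIFO hand-out of splits, with the requesting reader's host accepted but ignored. This is only an illustrative sketch of that policy; the names below are hypothetical and are not the interfaces being built in issue 1626:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical sketch of a "simple" split assigner: FIFO, no ordering or locality awareness. */
public class SimpleSplitAssigner {
    /** Minimal stand-in for an Iceberg source split: just an identifier. */
    public record Split(String id) {}

    private final Queue<Split> pending = new ArrayDeque<>();

    /** The enumerator adds splits discovered by planning (batch) or monitoring (streaming). */
    public void addSplit(Split split) {
        pending.add(split);
    }

    /**
     * Hands out the next pending split to a requesting reader. The hostname is ignored
     * here; a locality-aware assigner plugged into the same interface could use it to
     * prefer splits whose data is local to that host.
     */
    public Optional<Split> getNext(String requestingHost) {
        return Optional.ofNullable(pending.poll());
    }

    public int pendingCount() {
        return pending.size();
    }
}
```

Keeping the assigner behind a pluggable interface, as the scope above proposes, means the event-time alignment or locality-aware variants can later replace this FIFO policy without changing the enumerator itself.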