Agree with OpenInx that the FLIP-27 Flink source is unlikely to make the November release schedule. Then we should postpone it to 0.12.0.
On Mon, Nov 2, 2020 at 5:23 PM OpenInx <open...@gmail.com> wrote:

> Hi Ryan
>
> Got your plan! If we tentatively plan to release 0.11.0 in November, then
> for Flink I think we could finish the rewrite actions and the Flink
> streaming reader first.
>
> The Flink CDC integration work and FLIP-27 need more work; it's fine not
> to block the 0.11.0 release on them. We can still make good progress in
> separate PRs.
>
> Thanks.
>
> On Tue, Nov 3, 2020 at 7:01 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Thanks for starting the 0.11.0 milestone! In the last sync, we talked
>> about having a release in November to make the new catalogs and possibly
>> S3FileIO available, so those should tentatively go on the 0.11.0 list as
>> well. I say tentatively because I'm in favor of releasing when features
>> are ready and trying not to block at this stage of the project.
>>
>> In addition, I think we can make some progress on the Hive integration.
>> There is a PR to create tables using Hive DDL without needing to pass a
>> JSON-serialized schema that would be good to get in, and I think it would
>> be good to get the basic write path committed as well.
>>
>> On Sun, Nov 1, 2020 at 5:57 PM OpenInx <open...@gmail.com> wrote:
>>
>>> Thanks for the context about FLIP-27, Steven!
>>>
>>> I will take a look at the patches under issue 1626.
>>>
>>> On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> OpenInx, thanks a lot for kicking off the discussion. Looks like my
>>>> previous reply didn't reach the mailing list.
>>>>
>>>> > flink source based on the new FLIP-27 interface
>>>>
>>>> Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source.
>>>> I have updated the issue [1] with the following scope:
>>>>
>>>> - Support both static/batch and continuous/streaming enumeration modes.
>>>> - Support only the simple assigner with no ordering/locality guarantee
>>>> when handing out split assignments.
>>>> But make the interface flexible enough to plug in different assigners
>>>> (like an event-time alignment assigner or a locality-aware assigner).
>>>> - It will have @Experimental status, as nobody has run FLIP-27 sources
>>>> in production today. The Flink 1.12.0 release (ETA end of November)
>>>> will have the first set of sources (Kafka and file) implemented with
>>>> the FLIP-27 source framework. We still need to gain more production
>>>> experience.
>>>>
>>>> [1] https://github.com/apache/iceberg/issues/1626
>>>>
>>>> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <open...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> As we know, we will be happy to cut the Iceberg 0.10.0 candidate
>>>>> release this week. I think it may be time to plan for the next
>>>>> release, Iceberg 0.11.0, so I created a Java 0.11.0 release milestone
>>>>> here [1].
>>>>>
>>>>> I put the following issues into the newly created milestone:
>>>>>
>>>>> 1. Apache Flink rewrite actions in Apache Iceberg.
>>>>>
>>>>> We may encounter small-files issues when running the Iceberg Flink
>>>>> sink in real production because of the frequent checkpoints. We have
>>>>> two approaches to handle the small files:
>>>>>
>>>>> a. As with the current Spark rewrite actions, Flink will provide
>>>>> similar rewrite actions that run in a batch job. They are suitable
>>>>> for triggering whole-table or whole-partition compactions
>>>>> periodically, because this kind of rewrite compacts many large files
>>>>> and may consume lots of bandwidth. Currently, JunZheng and I are
>>>>> working on this issue, and we've extracted the base rewrite actions
>>>>> shared between the Spark and Flink modules. The next step is to
>>>>> implement the rewrite actions in the Flink module.
>>>>>
>>>>> b. Compact the small files in the Flink streaming job when sinking
>>>>> into Iceberg tables. That means we will provide a new rewrite
>>>>> operator chained to the current IcebergFilesCommitter.
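A minimal sketch of the selection logic such a chained rewrite operator might run after each commit. All names and thresholds below are invented for illustration; only IcebergFilesCommitter is from the thread, and the real design is in the WIP PR [2]:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: after each Iceberg commit, decide whether the newly
// written data files are small and numerous enough to be worth rewriting.
class SmallFileCompactionPlanner {
    private final long smallFileThresholdBytes; // e.g. a few MB, configurable
    private final int minInputFiles;            // don't rewrite a single tiny file

    SmallFileCompactionPlanner(long smallFileThresholdBytes, int minInputFiles) {
        this.smallFileThresholdBytes = smallFileThresholdBytes;
        this.minInputFiles = minInputFiles;
    }

    // Pick only the tiny files; larger files are left for the periodic
    // batch rewrite action described in approach (a).
    List<Long> selectSmallFiles(List<Long> committedFileSizes) {
        List<Long> selected = new ArrayList<>();
        for (long size : committedFileSizes) {
            if (size < smallFileThresholdBytes) {
                selected.add(size);
            }
        }
        return selected;
    }

    // True when enough tiny files accumulated to justify a small compaction.
    boolean shouldCompact(List<Long> committedFileSizes) {
        return selectSmallFiles(committedFileSizes).size() >= minInputFiles;
    }
}
```

Keeping the threshold configurable lets each table trade compaction cost against read amplification from small files.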
>>>>> Once an Iceberg transaction has been committed, the newly introduced
>>>>> rewrite operator will check whether a small compaction is needed.
>>>>> These actions only choose a few tiny files (maybe several KB or MB;
>>>>> I think we could provide a configurable threshold) to rewrite, which
>>>>> can be done at minimal cost and with higher compaction efficiency.
>>>>> Currently, simonsssu from Tencent has provided a WIP PR here [2].
>>>>>
>>>>> 2. Allow writing CDC or UPSERT records from Flink streaming jobs.
>>>>>
>>>>> We've almost implemented the row-level delete feature in the Iceberg
>>>>> master branch, but we still lack the integration with compute engines
>>>>> (to be precise, Spark/Flink can read the expected records if someone
>>>>> has deleted the rows correctly, but the write path is not available).
>>>>> I am preparing the patch for sinking CDC into Iceberg from a Flink
>>>>> streaming job here [3]; I think it will be ready in the next few
>>>>> weeks.
>>>>>
>>>>> 3. Apache Flink streaming reader.
>>>>>
>>>>> We've prepared a POC version in our Alibaba internal branch, but have
>>>>> not contributed it to Apache Iceberg yet. I think it's worth
>>>>> finishing that in the following days.
>>>>>
>>>>> The above are the issues that I think are worth merging before
>>>>> Iceberg 0.11.0. But I'm not quite sure about the plans for these:
>>>>>
>>>>> 1. I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on
>>>>> Spark SQL extensions for Iceberg. I guess there's a high probability
>>>>> of getting that in? [4]
>>>>>
>>>>> 2. @Steven Wu <steve...@netflix.com> from Netflix is working on a
>>>>> Flink source based on the new FLIP-27 interface; thoughts? [5]
>>>>>
>>>>> 3. How about the Spark row-delete integration work?
>>>>>
>>>>> [1]. https://github.com/apache/iceberg/milestone/12
>>>>> [2]. https://github.com/apache/iceberg/pull/1669/files
>>>>> [3].
>>>>> https://github.com/apache/iceberg/pull/1663
>>>>> [4]. https://github.com/apache/iceberg/milestone/11
>>>>> [5]. https://github.com/apache/iceberg/issues/1626
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
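The pluggable-assigner idea Steven describes up-thread for the FLIP-27 source could look roughly like this. This is a minimal sketch under assumptions: the SplitAssigner interface and SimpleSplitAssigner class names are invented for illustration, not the actual API being developed in issue 1626:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical assigner abstraction: the enumerator hands discovered splits
// to an assigner and asks it for the next split when a reader requests work.
// Alternative implementations (event-time alignment, locality awareness)
// could be plugged in behind the same interface.
interface SplitAssigner<T> {
    void addSplit(T split);
    Optional<T> nextSplit(int subtaskId);
    int pendingSplits();
}

// The simple assigner from the thread: no ordering or locality guarantee,
// just FIFO hand-out of whatever splits have been discovered so far.
class SimpleSplitAssigner<T> implements SplitAssigner<T> {
    private final Queue<T> pending = new ArrayDeque<>();

    @Override
    public void addSplit(T split) {
        pending.add(split);
    }

    @Override
    public Optional<T> nextSplit(int subtaskId) {
        // subtaskId is ignored: any requesting reader gets the next split.
        return Optional.ofNullable(pending.poll());
    }

    @Override
    public int pendingSplits() {
        return pending.size();
    }
}
```

With this shape, both the static/batch and continuous/streaming enumeration modes can feed the same assigner: a batch enumerator adds all splits once, while a streaming enumerator keeps adding splits as new snapshots are discovered.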