Agree with OpenInx that the FLIP-27 Flink source is unlikely to make the November release schedule. Then we should postpone it to 0.12.0.
On Mon, Nov 2, 2020 at 5:23 PM OpenInx <open...@gmail.com> wrote:

> Hi Ryan
>
> Got your plan! If we tentatively plan to release 0.11.0 in November, then
> for Flink I think we could finish the rewrite actions and the Flink
> streaming reader first.
>
> The Flink CDC integration work and FLIP-27 need more work; it's fine not
> to block the 0.11.0 release on them. We can still make good progress in
> separate PRs.
>
> Thanks.
>
> On Tue, Nov 3, 2020 at 7:01 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Thanks for starting the 0.11.0 milestone! In the last sync, we talked
>> about having a release in November to make the new catalogs and possibly
>> S3FileIO available, so those should tentatively go on the 0.11.0 list as
>> well. I say tentatively because I'm in favor of releasing when features
>> are ready and trying not to block at this stage of the project.
>>
>> In addition, I think we can make some progress on the Hive integration.
>> There is a PR to create tables using Hive DDL without needing to pass a
>> JSON-serialized schema that would be good to get in, and I think it would
>> be good to get the basic write path committed as well.
>>
>> On Sun, Nov 1, 2020 at 5:57 PM OpenInx <open...@gmail.com> wrote:
>>
>>> Thanks for the context about FLIP-27, Steven!
>>>
>>> I will take a look at the patches under issue 1626.
>>>
>>> On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> OpenInx, thanks a lot for kicking off the discussion. Looks like my
>>>> previous reply didn't reach the mailing list.
>>>>
>>>> > flink source based on the new FLIP-27 interface
>>>>
>>>> Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source.
>>>> I have updated the issue [1] with the following scope:
>>>>
>>>> - Support both static/batch and continuous/streaming enumeration modes.
>>>> - Support only the simple assigner with no ordering/locality guarantee
>>>> when handing out split assignments.
>>>> But make the interface flexible enough to plug in different assigners
>>>> (like an event-time alignment assigner or a locality-aware assigner).
>>>> - It will have @Experimental status, as nobody has run FLIP-27 sources
>>>> in production today. The Flink 1.12.0 release (ETA end of November)
>>>> will have the first set of sources (Kafka and file) implemented with
>>>> the FLIP-27 source framework. We still need to gain more production
>>>> experience.
>>>>
>>>> [1] https://github.com/apache/iceberg/issues/1626
>>>>
>>>> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <open...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> As we know, we will be happy to cut the Iceberg 0.10.0 candidate
>>>>> release this week. I think it may be time to plan for the next
>>>>> release, Iceberg 0.11.0, so I created a Java 0.11.0 release milestone
>>>>> here [1].
>>>>>
>>>>> I put the following issues into the newly created milestone:
>>>>>
>>>>> 1. Apache Flink rewrite actions in Apache Iceberg.
>>>>>
>>>>> We may encounter small-files issues when running the Iceberg Flink
>>>>> sink in real production because of the frequent checkpoints. We have
>>>>> two approaches to handle the small files:
>>>>>
>>>>> a. As with the current Spark rewrite actions, Flink will provide
>>>>> similar rewrite actions that run in a batch job. They are suitable
>>>>> for triggering whole-table or whole-partition compactions
>>>>> periodically, because this kind of rewrite compacts many large files
>>>>> and may consume lots of bandwidth. Currently, JunZheng and I are
>>>>> working on this issue, and we've extracted the base rewrite actions
>>>>> shared between the Spark and Flink modules. The next step is to
>>>>> implement the rewrite actions in the Flink module.
>>>>>
>>>>> b. Compact the small files in the Flink streaming job when sinking
>>>>> into Iceberg tables. That means we will provide a new rewrite
>>>>> operator chained to the current IcebergFilesCommitter.
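A minimal sketch of the selection logic such a chained rewrite operator might run after each commit. All names and thresholds below are invented for illustration; only IcebergFilesCommitter is from the thread, and the real design is in the WIP PR [2]:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: after each Iceberg commit, decide whether the newly
// written data files are small and numerous enough to be worth rewriting.
class SmallFileCompactionPlanner {
    private final long smallFileThresholdBytes; // e.g. a few MB, configurable
    private final int minInputFiles;            // don't rewrite a single tiny file

    SmallFileCompactionPlanner(long smallFileThresholdBytes, int minInputFiles) {
        this.smallFileThresholdBytes = smallFileThresholdBytes;
        this.minInputFiles = minInputFiles;
    }

    // Pick only the tiny files; larger files are left for the periodic
    // batch rewrite action described in approach (a).
    List<Long> selectSmallFiles(List<Long> committedFileSizes) {
        List<Long> selected = new ArrayList<>();
        for (long size : committedFileSizes) {
            if (size < smallFileThresholdBytes) {
                selected.add(size);
            }
        }
        return selected;
    }

    // True when enough tiny files accumulated to justify a small compaction.
    boolean shouldCompact(List<Long> committedFileSizes) {
        return selectSmallFiles(committedFileSizes).size() >= minInputFiles;
    }
}
```

Keeping the threshold configurable lets each table trade compaction cost against read amplification from small files.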
>>>>> Once an Iceberg transaction has been committed, the newly introduced
>>>>> rewrite operator will check whether a small compaction is needed.
>>>>> These actions only choose a few tiny files (maybe several KB or MB;
>>>>> I think we could provide a configurable threshold) to rewrite, which
>>>>> can be done at minimal cost and with higher compaction efficiency.
>>>>> Currently, simonsssu from Tencent has provided a WIP PR here [2].
>>>>>
>>>>> 2. Allow writing CDC or UPSERT records from Flink streaming jobs.
>>>>>
>>>>> We've almost implemented the row-level delete feature in the Iceberg
>>>>> master branch, but we still lack the integration with compute engines
>>>>> (to be precise, Spark/Flink can read the expected records if someone
>>>>> has deleted the rows correctly, but the write path is not available).
>>>>> I am preparing the patch for sinking CDC into Iceberg from a Flink
>>>>> streaming job here [3]; I think it will be ready in the next few
>>>>> weeks.
>>>>>
>>>>> 3. Apache Flink streaming reader.
>>>>>
>>>>> We've prepared a POC version in our Alibaba internal branch, but have
>>>>> not contributed it to Apache Iceberg yet. I think it's worth
>>>>> finishing that in the following days.
>>>>>
>>>>> The above are the issues that I think are worth merging before
>>>>> Iceberg 0.11.0. But I'm not quite sure about the plans for these:
>>>>>
>>>>> 1. I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on
>>>>> Spark SQL extensions for Iceberg. I guess there's a high probability
>>>>> of getting that in? [4]
>>>>>
>>>>> 2. @Steven Wu <steve...@netflix.com> from Netflix is working on a
>>>>> Flink source based on the new FLIP-27 interface; thoughts? [5]
>>>>>
>>>>> 3. How about the Spark row-delete integration work?
>>>>>
>>>>> [1]. https://github.com/apache/iceberg/milestone/12
>>>>> [2]. https://github.com/apache/iceberg/pull/1669/files
>>>>> [3].
>>>>> https://github.com/apache/iceberg/pull/1663
>>>>> [4]. https://github.com/apache/iceberg/milestone/11
>>>>> [5]. https://github.com/apache/iceberg/issues/1626
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
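The pluggable-assigner idea Steven describes up-thread for the FLIP-27 source could look roughly like this. This is a minimal sketch under assumptions: the SplitAssigner interface and SimpleSplitAssigner class names are invented for illustration, not the actual API being developed in issue 1626:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical assigner abstraction: the enumerator hands discovered splits
// to an assigner and asks it for the next split when a reader requests work.
// Alternative implementations (event-time alignment, locality awareness)
// could be plugged in behind the same interface.
interface SplitAssigner<T> {
    void addSplit(T split);
    Optional<T> nextSplit(int subtaskId);
    int pendingSplits();
}

// The simple assigner from the thread: no ordering or locality guarantee,
// just FIFO hand-out of whatever splits have been discovered so far.
class SimpleSplitAssigner<T> implements SplitAssigner<T> {
    private final Queue<T> pending = new ArrayDeque<>();

    @Override
    public void addSplit(T split) {
        pending.add(split);
    }

    @Override
    public Optional<T> nextSplit(int subtaskId) {
        // subtaskId is ignored: any requesting reader gets the next split.
        return Optional.ofNullable(pending.poll());
    }

    @Override
    public int pendingSplits() {
        return pending.size();
    }
}
```

With this shape, both the static/batch and continuous/streaming enumeration modes can feed the same assigner: a batch enumerator adds all splits once, while a streaming enumerator keeps adding splits as new snapshots are discovered.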