Re: Plans for the future iceberg 0.11.0 release

Ryan Blue Mon, 02 Nov 2020 15:01:50 -0800

Thanks for starting the 0.11.0 milestone! In the last sync, we talked about
having a release in November to make the new catalogs and possibly S3FileIO
available, so those should tentatively go on the 0.11.0 list as well. I say
tentatively because I'm in favor of releasing when features are ready and
trying not to block at this stage in the project.


In addition, I think we can make some progress on Hive integration. There
is a PR to create tables using Hive DDL without needing to pass a
JSON-serialized schema that would be good to get in, and I think it would
be good to get the basic write path committed as well.

On Sun, Nov 1, 2020 at 5:57 PM OpenInx <[email protected]> wrote:

> Thanks for the context on FLIP-27, Steven!
>
> I will take a look at the patches under issue 1626.
>
> On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <[email protected]> wrote:
>
>> OpenInx, thanks a lot for kicking off the discussion. Looks like my
>> previous reply didn't reach the mailing list.
>>
>> > flink source based on the new FLIP-27 interface
>>
>> Yes, we shall target the 0.11.0 release for the FLIP-27 flink source. I have
>> updated the issue [1] with the following scope.
>>
>>    - Support both static/batch and continuous/streaming enumeration modes
>>    - Support only the simple assigner, with no ordering/locality
>>    guarantee when handing out split assignments. But make the interface
>>    flexible enough to plug in different assigners (like the event time
>>    alignment assigner or locality aware assigner)
>>    - It will have @Experimental status, as nobody has run FLIP-27 sources in
>>    production today. The Flink 1.12.0 release (ETA end of Nov) will have the
>>    first set of sources (Kafka and file) implemented with the FLIP-27 source
>>    framework. We still need to gain more production experience.
>>
>>
>> [1] https://github.com/apache/iceberg/issues/1626
>>
>> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <[email protected]> wrote:
>>
>>> Hi  dev
>>>
>>> As we know, we expect to cut the iceberg 0.10.0 release candidate
>>> this week.  I think it is time to plan for the future iceberg
>>> 0.11.0 release, so I created a Java 0.11.0 Release milestone here [1].
>>>
>>> I put the following issues into the newly created milestone:
>>>
>>> 1.   Apache Flink Rewrite Actions in Apache Iceberg.
>>>
>>> We may run into small-file problems when running the iceberg flink sink
>>> in production, because of frequent checkpoints.  We have two approaches
>>> to handle the small files:
>>>
>>> a.  Following the design of the current spark rewrite actions, flink will
>>> provide similar rewrite actions that run in a batch job.  This is
>>> suitable for periodically triggering compaction of the whole table or of
>>> whole partitions, because this kind of rewrite will compact many large
>>> files and may consume lots of bandwidth.  Currently, JunZheng and I are
>>> working on this issue, and we've extracted the base rewrite actions shared
>>> by the spark module and the flink module.  The next step would be
>>> implementing the rewrite actions in the flink module.
>>>
>>> b. Compact those small files in the flink streaming job when sinking
>>> into iceberg tables. That means we will provide a new rewrite operator
>>> chained to the current IcebergFilesCommitter.  Once an iceberg transaction
>>> has been committed, the newly introduced rewrite operator will check
>>> whether a small compaction is needed. These actions only choose a few tiny
>>> files (maybe several KB or MB; I think we could provide a configurable
>>> threshold) to rewrite, which can be done at minimal cost and with
>>> higher compaction efficiency.  Currently, simonsssu from
>>> Tencent has provided a WIP PR here [2]
>>>
>>>
>>> 2. Allow flink streaming jobs to write CDC or UPSERT records.
>>>
>>> We've almost implemented the row-level delete feature in the iceberg
>>> master branch, but still lack the ability to integrate with compute engines
>>> (to be precise, spark/flink can read the expected records if someone
>>> has deleted the rows correctly, but the write path is not available).  I am
>>> preparing the patch for sinking CDC into iceberg from a flink streaming job
>>> here [3]. I think it will be ready in the next few weeks.
>>>
>>> 3.  Apache flink streaming reader.
>>>
>>> We've prepared a POC version in our alibaba internal branch, but have
>>> not yet contributed it to apache iceberg.  I think it's worth
>>> accomplishing that in the coming days.
>>>
>>>
>>> The above are the issues that I think are worth merging before iceberg
>>> 0.11.0.  But I'm not quite sure what the plan is for the following things:
>>>
>>> 1.  I know @Anton Okolnychyi <[email protected]> is working on
>>> spark-sql extensions for iceberg. I guess there's a high probability of
>>> getting that in?  [4]
>>>
>>> 2.  @Steven Wu <[email protected]> from netflix is working on a flink
>>> source based on the new FLIP-27 interface.  Thoughts? [5]
>>>
>>> 3.  How about the Spark Row-Delete integration work?
>>>
>>>
>>>
>>> [1]. https://github.com/apache/iceberg/milestone/12
>>> [2]. https://github.com/apache/iceberg/pull/1669/files
>>> [3]. https://github.com/apache/iceberg/pull/1663
>>> [4]. https://github.com/apache/iceberg/milestone/11
>>> [5]. https://github.com/apache/iceberg/issues/1626
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
