Hi Ryan,

Got your plan! If we tentatively plan to release 0.11.0 in November, then
for Flink I think we could finish the rewrite actions and the Flink
streaming reader first.

The Flink CDC integration and the FLIP-27 work would need more time, so
it's fine not to block the 0.11.0 release on them. We can still make good
progress in separate PRs.

Thanks.


On Tue, Nov 3, 2020 at 7:01 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for starting the 0.11.0 milestone! In the last sync, we talked
> about having a release in November to make the new catalogs and possibly
> S3FileIO available, so those should tentatively go on the 0.11.0 list as
> well. I say tentatively because I'm in favor of releasing when features are
> ready and trying not to block at this stage in the project.
>
> In addition, I think we can make some progress on Hive integration. There
> is a PR to create tables using Hive DDL without needing to pass a
> JSON-serialized schema that would be good to get in, and I think it would
> be good to get the basic write path committed as well.
>
> On Sun, Nov 1, 2020 at 5:57 PM OpenInx <open...@gmail.com> wrote:
>
>> Thanks for the context about FLIP-27, Steven!
>>
>> I will take a look at the patches under issue 1626.
>>
>> On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> OpenInx, thanks a lot for kicking off the discussion. Looks like my
>>> previous reply didn't reach the mailing list.
>>>
>>> > flink source based on the new FLIP-27 interface
>>>
>>> Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source. I
>>> have updated the issue [1] with the following scope:
>>>
>>>    - Support both static/batch and continuous/streaming enumeration
>>>    modes
>>>    - Support only a simple assigner with no ordering/locality guarantee
>>>    when handing out splits, but keep the interface flexible enough to
>>>    plug in different assigners (like an event-time alignment assigner
>>>    or a locality-aware assigner)
>>>    - It will have @Experimental status, as nobody has run FLIP-27
>>>    sources in production yet. The Flink 1.12.0 release (ETA end of Nov)
>>>    will ship the first set of sources (Kafka and file) implemented with
>>>    the FLIP-27 source framework, so we still need to gain more
>>>    production experience. (A minimal wiring sketch follows this list.)
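>>>
>>> To make the scope concrete, here is a minimal sketch of how a FLIP-27
>>> source plugs into a Flink 1.12 job through the unified fromSource()
>>> entry point. IcebergSource, its builder options, and SimpleSplitAssigner
>>> are assumed names for illustration, not the final API:
>>>
>>>   // Java sketch; the IcebergSource builder below is an assumption.
>>>   StreamExecutionEnvironment env =
>>>       StreamExecutionEnvironment.getExecutionEnvironment();
>>>   Source<RowData, ?, ?> source = IcebergSource.builder()
>>>       .tableLoader(TableLoader.fromHadoopTable("hdfs://warehouse/db/tbl"))
>>>       .streaming(true)                           // continuous enumeration
>>>       .splitAssigner(new SimpleSplitAssigner())  // no ordering guarantee
>>>       .build();
>>>   env.fromSource(source, WatermarkStrategy.noWatermarks(), "iceberg");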
>>>
>>>
>>> [1] https://github.com/apache/iceberg/issues/1626
>>>
>>> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <open...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> As you know, we plan to cut the Iceberg 0.10.0 candidate release this
>>>> week. I think it's time to start planning for Iceberg 0.11.0, so I
>>>> created a Java 0.11.0 release milestone here [1].
>>>>
>>>> I put the following issues into the newly created milestone:
>>>>
>>>> 1. Apache Flink Rewrite Actions in Apache Iceberg.
>>>>
>>>> It's possible that we run into small-file issues when running the
>>>> Iceberg Flink sink in real production, because of the frequent
>>>> checkpoints. We have two approaches to handle the small files:
>>>>
>>>> a. Following the design of the current Spark rewrite actions, Flink
>>>> will provide similar rewrite actions that run in a batch job. This is
>>>> suitable for periodically compacting a whole table or whole partitions,
>>>> because this kind of rewrite compacts many large files and may consume
>>>> a lot of bandwidth. Currently, JunZheng and I are working on this
>>>> issue: we've extracted the base rewrite actions shared between the
>>>> Spark and Flink modules, and the next step is to implement the rewrite
>>>> actions in the Flink module (see the sketch below).
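>>>>
>>>> For reference, a minimal sketch of what the Flink rewrite action could
>>>> look like, mirroring the existing Spark rewriteDataFiles() action; the
>>>> Flink-side Actions entry point and its options are assumptions, not a
>>>> committed API:
>>>>
>>>>   // Mirrors Actions.forTable(table).rewriteDataFiles() from Spark.
>>>>   Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
>>>>   Actions.forTable(table)
>>>>       .rewriteDataFiles()
>>>>       .filter(Expressions.equal("date", "2020-11-01"))  // one partition
>>>>       .targetSizeInBytes(128L * 1024 * 1024)            // target size
>>>>       .execute();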
>>>>
>>>> b. Compact the small files inside the Flink streaming job while
>>>> sinking into Iceberg tables. That means we will provide a new rewrite
>>>> operator chained to the current IcebergFilesCommitter. Once an Iceberg
>>>> transaction has been committed, the newly introduced rewrite operator
>>>> will check whether a small compaction is needed. It only chooses a few
>>>> tiny files (maybe several KB or MB; I think we could provide a
>>>> configurable threshold) to rewrite, which keeps the cost minimal and
>>>> the compaction efficient (an illustrative sketch follows). Currently,
>>>> simonsssu from Tencent has provided a WIP PR here [2].
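>>>>
>>>> To illustrate the "few tiny files" selection, the post-commit check
>>>> could look roughly like the following; the threshold value and the
>>>> variable names are assumptions, not the shape of the WIP PR:
>>>>
>>>>   // Pick only data files below a configurable size threshold.
>>>>   long threshold = 32L * 1024 * 1024;  // e.g. 32 MB, configurable
>>>>   List<DataFile> smallFiles = new ArrayList<>();
>>>>   try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
>>>>     for (FileScanTask task : tasks) {
>>>>       if (task.file().fileSizeInBytes() < threshold) {
>>>>         smallFiles.add(task.file());
>>>>       }
>>>>     }
>>>>   }
>>>>   // Rewrite smallFiles in one compaction commit if enough accumulated.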
>>>>
>>>>
>>>> 2. Allow Flink streaming jobs to write CDC or UPSERT records.
>>>>
>>>> We've almost implemented the row-level delete feature in the Iceberg
>>>> master branch, but we still lack the integration with compute engines
>>>> (to be precise, Spark/Flink can read the expected records if someone
>>>> has deleted the rows correctly, but the write path is not available
>>>> yet). I am preparing a patch for sinking CDC into Iceberg from a Flink
>>>> streaming job here [3]; I think it will be ready in the next few weeks.
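>>>>
>>>> A hedged sketch of what that write path could look like once the patch
>>>> lands; equalityFieldColumns() is an assumed knob for declaring the
>>>> upsert key, and SomeCdcSource is a hypothetical CDC source:
>>>>
>>>>   // cdcStream carries +I/-U/+U/-D RowData changelog records.
>>>>   DataStream<RowData> cdcStream = env.addSource(new SomeCdcSource());
>>>>   FlinkSink.forRowData(cdcStream)
>>>>       .tableLoader(TableLoader.fromHadoopTable("hdfs://warehouse/db/tbl"))
>>>>       .equalityFieldColumns(Arrays.asList("id"))  // assumed upsert key
>>>>       .build();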
>>>>
>>>> 3. Apache Flink streaming reader.
>>>>
>>>> We've prepared a POC version in our Alibaba internal branch, but we
>>>> have not contributed it to Apache Iceberg yet. I think it's worth
>>>> accomplishing that in the coming days.
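>>>>
>>>> For context, a sketch of how the streaming reader could be exposed,
>>>> modeled on the batch FlinkSource builder; streaming(true) and
>>>> monitorInterval() are assumptions about the eventual API:
>>>>
>>>>   // Continuously discover newly committed snapshots and emit records.
>>>>   DataStream<RowData> stream = FlinkSource.forRowData()
>>>>       .env(env)
>>>>       .tableLoader(TableLoader.fromHadoopTable("hdfs://warehouse/db/tbl"))
>>>>       .streaming(true)                           // assumed flag
>>>>       .monitorInterval(Duration.ofSeconds(10))   // assumed poll interval
>>>>       .build();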
>>>>
>>>>
>>>> The above are the issues that I think are worth merging before Iceberg
>>>> 0.11.0. But I'm not quite sure about the plans for the following:
>>>>
>>>> 1. I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on
>>>> Spark SQL extensions for Iceberg; I guess there's a high probability of
>>>> getting that in? [4]
>>>>
>>>> 2. @Steven Wu <steve...@netflix.com> from Netflix is working on the
>>>> Flink source based on the new FLIP-27 interface. Thoughts? [5]
>>>>
>>>> 3. How about the Spark row-level delete integration work?
>>>>
>>>>
>>>>
>>>> [1] https://github.com/apache/iceberg/milestone/12
>>>> [2] https://github.com/apache/iceberg/pull/1669/files
>>>> [3] https://github.com/apache/iceberg/pull/1663
>>>> [4] https://github.com/apache/iceberg/milestone/11
>>>> [5] https://github.com/apache/iceberg/issues/1626
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
