Besides, I'd like to share some work from my Flink team; I hope it will be
helpful for you.

We have customers who want to try Flink + Iceberg to build their business
data lake. The classic scenarios are: 1. streaming click events into
Iceberg and analyzing them with other OLAP engines; 2. streaming CDC events
into Iceberg to reflect the freshest data.
Now we've built a public repo [1] to start our development and PoC. We are
developing in a separate repo because we want to do some technical
verification as soon as possible, not because we are trying to split the
Apache Iceberg community. On the contrary, we hope that some of our
exploration work will have the opportunity to be merged into the Apache
Iceberg repository :-)

We've finished a fully functional Flink sink connector [2] so far:
1.  Supports exactly-once streaming semantics. We've designed the sink
connector's state carefully so that it can fail over correctly. Compared to
the Netflix Flink connector, we can run multiple sink operators instead of
a single operator, which improves throughput a lot.
2.  Supports almost all Iceberg data types. We provide the conversion
between Iceberg and Flink table types so that different
engines (Flink/Spark/Hive/Presto) can share the same table (the Netflix
version is bound to the Avro format, which seems not a good choice).
3.  Supports both partitioned and unpartitioned tables now, similar to
the Iceberg Spark writer.
4.  Provides complete unit tests and end-to-end tests to verify the
features.
5.  Supports the Flink Table API so that we can write to the Iceberg table
with Flink SQL, such as INSERT INTO test SELECT xxxx.
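To make point 1 concrete, here is a minimal, simplified Python sketch of the checkpoint-aligned commit pattern such an exactly-once sink typically follows. This is an illustration only, not the connector's actual code: the class and method names (ExactlyOnceSink, snapshot_state, notify_checkpoint_complete) are stand-ins modeled loosely on Flink's checkpoint callbacks, and the Iceberg table is a plain stub.

```python
# Hypothetical, simplified model of an exactly-once Iceberg sink:
# data files are staged into checkpointed state on snapshot, and only
# committed to the table once the checkpoint is confirmed complete.

class IcebergTableStub:
    """Stands in for an Iceberg table; a commit appends data files atomically."""
    def __init__(self):
        self.committed_files = []

    def commit(self, files):
        self.committed_files.extend(files)


class ExactlyOnceSink:
    def __init__(self, table):
        self.table = table
        self.pending = []   # data files written since the last checkpoint
        self.state = {}     # checkpoint_id -> staged files; survives failover

    def write(self, record):
        # Each record lands in a (pretend) data file; readers cannot see it
        # until the files are committed to the table.
        self.pending.append(f"data-file-for-{record}")

    def snapshot_state(self, checkpoint_id):
        # Phase 1: stage the flushed files in checkpointed state.
        self.state[checkpoint_id] = list(self.pending)
        self.pending.clear()

    def notify_checkpoint_complete(self, checkpoint_id):
        # Phase 2: commit every staged checkpoint up to this one. After a
        # failover, replaying this call re-commits anything left in state.
        for cid in sorted(c for c in self.state if c <= checkpoint_id):
            self.table.commit(self.state.pop(cid))


table = IcebergTableStub()
sink = ExactlyOnceSink(table)
sink.write("a")
sink.write("b")
sink.snapshot_state(checkpoint_id=1)
assert table.committed_files == []   # staged but not yet visible
sink.notify_checkpoint_complete(checkpoint_id=1)
assert table.committed_files == ["data-file-for-a", "data-file-for-b"]
```

Because each parallel writer only stages files and a commit happens per completed checkpoint, multiple sink operators can write concurrently without breaking exactly-once, which is the throughput point made above.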

Next, we will provide the following features:
1.  Provide a Flink streaming reader to consume incremental events from an
upstream table, so that we can accomplish the data pipeline:
(flink)->(iceberg)->(flink)->(iceberg) ...
2.  Implement the upsert feature for primary-key cases; that is, each row
in the Iceberg table will have a primary key, and the CDC stream will
upsert the table by that key. The current design seems able to meet the
customers' requirements, so we plan to do the PoC.
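The intended upsert semantics of point 2 can be sketched in a few lines. This is a toy illustration of the behavior, not the planned implementation: the event shape (an op string plus a row dict) and the function name apply_cdc are hypothetical.

```python
# Hypothetical sketch of primary-key upsert semantics for a CDC stream.
# Each event is (op, row); "insert" and "update" both upsert by the pk,
# "delete" removes the row with that pk.

def apply_cdc(table, events, pk="id"):
    """Apply CDC events to a table modeled as a dict: pk -> row."""
    for op, row in events:
        key = row[pk]
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = row
    return table

events = [
    ("insert", {"id": 1, "name": "alice"}),
    ("insert", {"id": 2, "name": "bob"}),
    ("update", {"id": 1, "name": "alice-v2"}),
    ("delete", {"id": 2, "name": "bob"}),
]
assert apply_cdc({}, events) == {1: {"id": 1, "name": "alice-v2"}}
```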

[1]. https://github.com/generic-datalake/iceberg
[2]. https://github.com/generic-datalake/iceberg/tree/master/flink/src

On Wed, May 6, 2020 at 11:44 AM OpenInx <open...@gmail.com> wrote:

> The two-phase approach sounds good to me. The precondition is that we
> have a limited number of delete files so that memory can hold all of
> them; since we will have a compaction service to reduce the number of
> delete files, that seems not a problem.
>
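The quoted two-phase read can be sketched as: first materialize all delete files into memory (the precondition above), then stream the data rows and filter them against that in-memory set. This is an illustration of the idea only; the function name and data shapes are hypothetical, not Iceberg APIs.

```python
# Hypothetical sketch of a two-phase read with in-memory delete files.
# Each delete file is modeled as a set of deleted row keys.

def read_with_deletes(data_rows, delete_files):
    # Phase 1: load every delete file into one in-memory set of keys.
    deleted = set()
    for f in delete_files:
        deleted.update(f)
    # Phase 2: stream data rows, dropping any row whose key was deleted.
    return [row for row in data_rows if row["id"] not in deleted]

data = [{"id": 1}, {"id": 2}, {"id": 3}]
delete_files = [{2}, {3}]
assert read_with_deletes(data, delete_files) == [{"id": 1}]
```

Memory grows with the total size of the delete files, which is why the precondition (compaction keeps the number of delete files small) matters.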
