Besides, I'd like to share some work from my Flink team; I hope it will be helpful for you.
We have customers who want to try Flink + Iceberg to build their business data lake. The classic scenarios are:
1. Streaming click events into Iceberg and analyzing them with other OLAP engines;
2. Streaming CDC events into Iceberg to reflect the freshest data.

We've built a public repo [1] to start our development and PoC. We are developing in a separate repo because we want to do some technical verification as soon as possible, not because we're trying to split the Apache Iceberg community. On the contrary, we hope that some of our exploration work will have the opportunity to be merged into the Apache Iceberg repository :-)

We've finished a fully functional Flink sink connector [2] so far:
1. Supports exactly-once streaming semantics. We've designed the state of the sink connector carefully so that it can fail over correctly. Compared to the Netflix Flink connector, ours allows running multiple sink operators instead of a single one, which improves the throughput a lot.
2. Supports almost all data types in Iceberg. We provide the conversion between Iceberg and Flink tables so that different engines (Flink/Spark/Hive/Presto) can share the same table (the Netflix version is bound to an Avro format, which does not seem like a good choice).
3. Supports both partitioned and unpartitioned tables now, similar to the Iceberg Spark writer.
4. Provides complete unit tests and end-to-end tests to verify the features.
5. Supports the Flink Table API so that we can write to the Iceberg table with Flink SQL, such as INSERT INTO test SELECT xxxx (a rough sketch is included at the bottom of this mail).

As the next step we will provide the following features:
1. A Flink streaming reader to consume incremental events from upstream, so that we can build the data pipeline: (flink)->(iceberg)->(flink)->(iceberg) ...
2. The upsert feature for primary-key cases, i.e. each row in the Iceberg table will have a PK and the CDC stream will upsert the table by PK.

The current design seems to meet our customers' requirements, so we plan to do the PoC.

[1]. https://github.com/generic-datalake/iceberg
[2]. https://github.com/generic-datalake/iceberg/tree/master/flink/src

On Wed, May 6, 2020 at 11:44 AM OpenInx <open...@gmail.com> wrote:
> The two-phase approach sounds good to me. The precondition is that we have a
> limited number of delete files so that memory can hold all of them; we will
> have the compaction service to reduce the delete files, so it seems not a
> problem.
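
P.S. To make the Flink SQL path (feature 5 above) a bit more concrete, here is a minimal sketch of how writing into the table could look. It assumes Flink 1.11's TableEnvironment.executeSql(); the catalog options and the names iceberg_catalog, db.test and click_events are illustrative placeholders, not the exact DDL or API of the repo linked above.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class IcebergSinkSqlSketch {
      public static void main(String[] args) {
        // Streaming Table API environment.
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register an Iceberg catalog; the option keys and values here are
        // hypothetical placeholders.
        tEnv.executeSql(
            "CREATE CATALOG iceberg_catalog WITH ("
                + " 'type'='iceberg',"
                + " 'catalog-type'='hadoop',"
                + " 'warehouse'='hdfs://namenode:8020/warehouse')");

        // Assuming a source table click_events is already registered, stream
        // its rows into the Iceberg table, i.e. the INSERT INTO ... SELECT case.
        tEnv.executeSql(
            "INSERT INTO iceberg_catalog.db.test"
                + " SELECT user_id, event_time, url FROM click_events");
      }
    }

Writing through SQL like this keeps the table in Iceberg's own formats, so the same table stays readable from Spark/Hive/Presto as described above.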