Re: Flink table maintenance

2024-04-18 Thread Gen Luo
Hi Peter, Thanks for the reply and explanation! > Unaligned checkpoint & AsyncIO The issue only occurs when the inflight queue is full. Synchronous operators can not do checkpointing until the processing request is done, while async operators are always ready to do checkpointing, unless the infli

Spark metadata deletion reliability issues of high memory usage cause s3 client failures, potentially due to cached high volume of manifests

2024-04-18 Thread Pucheng Yang
Hi community, We are seeing Spark Iceberg table metadata deletion consume very high driver memory, which seems to cause s3 client failures. I would like to present my findings and seek comments from the community: - My table is a v1 table with 3k manifests, each manifest is around 20-30 MB so in to

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-18 Thread Benny Chow
+1 for separate view and table objects. Walaa's Spark implementation demonstrates how little change it takes on the Iceberg APIs to start sharing MVs between engines. Thanks Benny On Thu, Apr 18, 2024 at 9:52 AM Walaa Eldin Moustafa wrote: > Hi everyone, > > I would like to make a proposal for

[VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-18 Thread Amogh Jahagirdar
Hi Everyone, I propose that we release the following RC as the official Apache Iceberg 1.5.1 release. The commit ID is cbb853073e681b4075d7c8707610dceecbee3a82 * This corresponds to the tag: apache-iceberg-1.5.1-rc0 * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.1-rc0 * https://gi

Re: [VOTE] Release Apache PyIceberg 0.6.1rc3

2024-04-18 Thread Kevin Liu
+1 (non-binding) - Checked the signatures, checksums, and licenses. - Ran tests (`make test`, `make test-integration`). I also found this page very helpful in learning how to verify a release: https://py.iceberg.apache.org/verify-release/ Best, Kevin Liu On Thu, Apr 18, 2024 at 4:14 AM Fokko D

Re: Flink table maintenance

2024-04-18 Thread Péter Váry
Thanks Zhu and Gen for your reply! *Unaligned checkpoints* The maintenance tasks are big and complex. There are many parts which can be parallelized to multiple tasks, so I think we should implement them with chains of Flink Operators to exploit the advantages of Flink. A single maintenance task c

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-18 Thread Frank
UNSUBSCRIBE Frank Gilroy 484-868-7097 On Thu, Apr 18, 2024 at 12:52 PM Walaa Eldin Moustafa wrote: > Hi everyone, > > I would like to make a proposal for issue [1] to support materialized > views in Iceberg. The support leverages two separate objects, an Iceberg > view and an Iceberg table to

[Proposal] Add support for Materialized Views in Iceberg

2024-04-18 Thread Walaa Eldin Moustafa
Hi everyone, I would like to make a proposal for issue [1] to support materialized views in Iceberg. The support leverages two separate objects, an Iceberg view and an Iceberg table to implement materialized views. Each object retains relevant metadata to support the MV operations. An initial desi

Re: Query regarding UPSERT Mode in Flink

2024-04-18 Thread Péter Váry
Hi Aditya, The definition of UPSERT is that we have 2 types of messages: - DELETE - we need to remove the old record with the given id. - UPSERT - we need to remove the old version of the record based on the id, and should add a new version See: https://nightlies.apache.org/flink/flink-docs-maste

Iceberg active column bug during schema evolution on AWS

2024-04-18 Thread Sanket Chaure
Hi Team, We have used the Iceberg file format for our data pipeline setup on AWS. We are using Glue for data processing, and the output tables are then read in Quicksight for dashboards. Our pipeline was working fine and the Quicksight dashboards were seeing table columns which were the same as per the iceberg tab

Re: Flink table maintenance

2024-04-18 Thread Gen Luo
Hi, Thanks for the reply, Peter. And thanks Zhu for joining the discussion. > In the current design, using unaligned checkpoints, and splitting heavy tasks to smaller ones could avoid maintenance tasks blocking the checkpointing. > How do you suggest executing asynchronous tasks As far as I know

Re: Flink table maintenance

2024-04-18 Thread Zhu Zhu
Hi Peter, Thanks for starting this discussion. I'm a bit uncertain about the necessity of using an extended sink topology to run maintenance tasks. Creating a separate pipeline for the maintenance monitor, scheduler and tasks sounds like a better choice to me in most cases. Advantages I can think of include

Re: [VOTE] Release Apache PyIceberg 0.6.1rc3

2024-04-18 Thread Fokko Driesprong
Thanks Honah for the quick follow-up with RC3. +1 binding - Ran the signatures, checksums, and licenses. - Double-checked that it installs from a clean Python 3.10 doc

Re: Query regarding UPSERT Mode in Flink

2024-04-18 Thread Gabor Kaszab
Hey, I had the chance to explore this area of eq-deletes recently myself too. Apparently, this behavior is by design in Flink. The reason why it unconditionally writes an eq-delete too for each insert (only in upsert-mode, though) is to guarantee the uniqueness of the primary key. So it drops the p

[VOTE] Release Apache PyIceberg 0.6.1rc3

2024-04-18 Thread Honah J.
Hi Everyone, I propose that we release the following RC as the official PyIceberg 0.6.1 release. This is a patch release due to the following bugs: - Fail to create version 1 table with non-empty partition-spec and sort-order - Hive Ca