Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Renjie Liu
Hi, Russell:
> The main things I’m still interested are alternative approaches. I think
> that some of the work that Anton is working on have shown some different
> bottlenecks in applying delete files that I’m not sure are addressed by
> this proposal.
I'm also interested. Could you share some

Re: Scan column metrics

2023-10-11 Thread Péter Váry
Based on our discussion here, I have created a PR for the feature: https://github.com/apache/iceberg/pull/8803 I think this is not a big change, and the flexibility/reduced memory consumption would be worth the additional complexity. Please review the PR to see for yourselves :) Thanks, Peter M

Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Anton Okolnychyi
I tried to summarize notes from our previous discussions here: https://docs.google.com/document/d/1M4L6o-qnGRwGhbhkW8BnravoTwvCrJV8VvzVQDRJO5I/
I am going to iterate on the doc later today.
On 2023/10/11 07:06:07 Renjie Liu wrote:
> Hi, Russell:
>
> > The main things I’m still interested are

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
Hi All, As per the above proposal, I have worked on a POC (https://github.com/apache/iceberg/pull/8488). *But to move things forward, first we need to merge the spec PR (https://github.com/apache/iceberg/pull/7105).* I don't see any blocker for the s

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Anton Okolnychyi
I think the question is what we mean by doing this synchronously. For instance, I have doubts it would be a good idea to do this in each commit attempt unless we can prove the overhead is negligible with a benchmark. We risk failing a job for the sake of updating partition stats. I can see the n

Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Renjie Liu
Hi, Anton:
I've gone through the doc, and we are trying to solve the same problems of position deletes, but with different approaches. It's quite interesting.
On Thu, Oct 12, 2023 at 12:11 AM Anton Okolnychyi wrote:
> I tried to summarize notes from our previous discussions here:
>
> https://doc

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
> I think the question is what we mean by doing this synchronously. For
> instance, I have doubts it would be a good idea to do this in each commit
> attempt unless we can prove the overhead is negligible with a benchmark.
Yeah, I can share the benchmarks. As I also mentioned, synchronous writi