Re: Scan column metrics

2023-11-10 Thread Manish Malhotra
Thanks Peter! On Wed, Oct 11, 2023 at 5:36 AM Péter Váry wrote: > Based on our discussion here, I have created a PR for the feature: > https://github.com/apache/iceberg/pull/8803 > > I think this is not a big change, and the flexibility/reduced memory > consumption would be worth the additional

Re: Scan column metrics

2023-10-11 Thread Péter Váry
Based on our discussion here, I have created a PR for the feature: https://github.com/apache/iceberg/pull/8803 I think this is not a big change, and the flexibility/reduced memory consumption would be worth the additional complexity. Please review the PR to see for yourselves :) Thanks, Peter M

Re: Scan column metrics

2023-10-10 Thread Manish Malhotra
Thanks Ryan, Good point, that makes sense. Still, +1 for the feature. We could skip collecting the stats at ingestion time as well, but we might need them later, so having the option at read time will help. Thanks, Manish On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue wrote: > For that use case, it s

Re: Scan column metrics

2023-10-09 Thread Ryan Blue
For that use case, it sounds like you'd be much better off not storing all the stats rather than skipping them at read time. I understand the user wants to keep them, but it may still not be a great choice. I'm just worried that this is going to be a lot of effort for you that doesn't really genera

Re: Scan column metrics

2023-10-09 Thread Péter Váry
The owner of the table wanted to keep the column stats for all of the columns, claiming that other users might be, or already are, using the statistics of those columns. Even if I am not sure that their case was defensible, I think the reader of the table is often not in the position to optimize the table for their

Re: Scan column metrics

2023-10-07 Thread Steven Wu
I am sure dropping column stats can be helpful. It just has some challenges in practice. It requires table owners to know the query pattern and decide which column stats to keep and which to drop. While automation can help ease the decisions based on the query history, it can't predict future us

Re: Scan column metrics

2023-10-06 Thread Ryan Blue
I understand wanting to keep more in general, that's why we have the 100 column threshold set fairly high. But in the case you're describing those column stats are causing a problem. I'd expect you to be able to drop some of them on such a large table to solve the problem, rather than filter them o

Re: Scan column metrics

2023-10-05 Thread Steven Wu
It is definitely good to only track column stats that are used. Otherwise, we are just creating wasteful metadata that can increase manifest file size and slow down scan planning. If a table has 100 columns, it is very unlikely we need stats for all columns. But in practice, it is a bit difficult

Re: Scan column metrics

2023-10-05 Thread Ryan Blue
I can think of situations in which you may want something like this, but I'm curious what other options you've used to solve the problem. This seems like exactly what `write.metadata.metrics.*` was intended to solve and I'm a bit surprised that you need metrics for so many columns in the table. The
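A minimal sketch of the write-side approach referred to above, assuming the table owner wants metrics for only a handful of columns. It only builds the property map; the property keys `write.metadata.metrics.default` and `write.metadata.metrics.column.<name>` are Iceberg table properties, while the helper name and the column names `event_time` and `user_id` are hypothetical and used for illustration only.

```python
def metrics_properties(keep_columns, mode="truncate(16)"):
    """Build table properties that keep metrics only for keep_columns.

    Assumption: every other column should carry no metrics at all, so the
    default mode is set to "none" and overridden per column.
    """
    props = {"write.metadata.metrics.default": "none"}
    for column in keep_columns:
        props[f"write.metadata.metrics.column.{column}"] = mode
    return props


# Hypothetical column names, for illustration only.
props = metrics_properties(["event_time", "user_id"])
for key, value in sorted(props.items()):
    print(f"{key}={value}")
```

The resulting map would still have to be applied to the table by whatever engine or catalog tooling is in use; that step is not shown here.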

Re: Scan column metrics

2023-10-05 Thread Steven Wu
+1 for this feature of column stats projection. I will add some additional input. 1) In the previous discussion, there were comments on only enabling column stats that are needed. That is definitely a recommended best practice. But there are some practical challenges. By default, Iceberg enables

Scan column metrics

2023-10-05 Thread Péter Váry
Hi Team, TL;DR: I would like to make it possible to parameterize Iceberg table scans so that they include the metadata metrics only for specific columns. We discussed this previously on the mailing list [1], but we did not finalize the direction there. *To recap* Currently there are two ways t
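To make the idea in the TL;DR concrete, here is a rough, self-contained sketch of column stats projection at read time: per-file metrics are kept only for the requested field ids, so less metadata is held in memory during planning. This is not the Iceberg API and not the implementation in the linked PR; `FileStats` and `project_stats` are hypothetical names, and the assumption is only that per-file metrics are maps keyed by field id.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Iterable


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for the per-file metrics kept in a manifest."""
    path: str
    record_count: int
    value_counts: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    lower_bounds: Dict[int, bytes] = field(default_factory=dict)
    upper_bounds: Dict[int, bytes] = field(default_factory=dict)


def project_stats(stats: FileStats, field_ids: Iterable[int]) -> FileStats:
    """Return a copy of stats whose metric maps hold only the given field ids."""
    wanted = set(field_ids)

    def keep(metrics: dict) -> dict:
        return {fid: value for fid, value in metrics.items() if fid in wanted}

    return replace(
        stats,
        value_counts=keep(stats.value_counts),
        null_value_counts=keep(stats.null_value_counts),
        lower_bounds=keep(stats.lower_bounds),
        upper_bounds=keep(stats.upper_bounds),
    )


# Made-up file with stats for three columns (field ids 1, 2, 3).
full = FileStats(
    path="data/file-a.parquet",
    record_count=100,
    value_counts={1: 100, 2: 100, 3: 100},
    null_value_counts={1: 0, 2: 5, 3: 0},
    lower_bounds={1: b"a", 2: b"b", 3: b"c"},
    upper_bounds={1: b"x", 2: b"y", 3: b"z"},
)
projected = project_stats(full, [2])
print(sorted(projected.value_counts))
```

File-level values such as the path and record count are left untouched; only the per-column maps shrink.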