Thanks Peter!
On Wed, Oct 11, 2023 at 5:36 AM Péter Váry wrote:
> Based on our discussion here, I have created a PR for the feature:
> https://github.com/apache/iceberg/pull/8803
>
> I think this is not a big change, and the flexibility/reduced memory
> consumption would be worth the additional complexity.
Based on our discussion here, I have created a PR for the feature:
https://github.com/apache/iceberg/pull/8803
I think this is not a big change, and the flexibility/reduced memory
consumption would be worth the additional complexity.
Please review the PR to see for yourselves :)
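To make it concrete, here is a rough sketch of how a reader could ask for
column stats on specific columns only. The includeColumnStats(Collection<String>)
overload and the column names below are assumptions based on this thread; the
PR is the source of truth for the actual API.

    // Sketch only; the method name and the columns are assumed, not confirmed.
    import java.util.List;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.iceberg.io.CloseableIterable;

    public class StatsProjectionSketch {
      static void planWithProjectedStats(Table table) throws Exception {
        TableScan scan = table.newScan()
            // keep column stats only for the columns this reader actually uses
            .includeColumnStats(List.of("id", "event_time"));

        try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
          for (FileScanTask task : tasks) {
            // each task's data file should now carry bounds/counts only for
            // the requested columns, which is where the memory saving comes from
            System.out.println(task.file().path());
          }
        }
      }
    }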
Thanks,
Peter
Thanks Ryan,
Good point, that makes sense.
Still, +1 for the feature.
We can avoid writing the stats during ingestion as well, though we might need
them some time later, so having the option at read time will help.
Thanks,
Manish
On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue wrote:
> For that use case, it sounds like you'd be much better off not storing all
> the stats rather than skipping them at read time. [...]
For that use case, it sounds like you'd be much better off not storing all
the stats rather than skipping them at read time. I understand the user
wants to keep them, but it may still not be a great choice. I'm just
worried that this is going to be a lot of effort for you that doesn't
really generate [...]
The owner of the table wanted to keep the column stats for all of the
columns, claiming that other users might be, or already are, using the
statistics of those columns. Even if I am not sure that their case was
defensible, I think the reader of the table is often not in a position to
optimize the table for their [...]
I am sure dropping column stats can be helpful; it just has some
challenges in practice. It requires table owners to know the query patterns
and decide which column stats to keep and which to drop. While automation can
help ease the decisions based on the query history, it can't predict future
usage.
I understand wanting to keep more in general; that's why we have the 100
column threshold set fairly high. But in the case you're describing, those
column stats are causing a problem. I'd expect you to be able to drop some
of them on such a large table to solve the problem, rather than filter them
out [...]
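For reference, this 100 column threshold is, as far as I know, the
write.metadata.metrics.max-inferred-column-defaults table property; below is a
sketch of capping it and turning stats off for one column (the value 20 and the
column name are made up for illustration):

    // Sketch only: property names are from the Iceberg table-properties docs as
    // I understand them; double-check against the release you run.
    import org.apache.iceberg.Table;

    class LimitInferredStats {
      static void capDefaultMetrics(Table table) {
        table.updateProperties()
            // default is 100; only the first 20 columns get inferred metrics
            .set("write.metadata.metrics.max-inferred-column-defaults", "20")
            // drop stats entirely for a column nobody filters on
            .set("write.metadata.metrics.column.wide_debug_blob", "none")
            .commit();
      }
    }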
It is definitely good to only track column stats that are used. Otherwise,
we are just creating wasteful metadata that can increase manifest file size
and slow down scan planning. If a table has 100 columns, it is very
unlikely we need stats for all columns.
But in practice, it is a bit difficult [...]
I can think of situations in which you may want something like this, but
I'm curious what other options you've used to solve the problem. This seems
like exactly what `write.metadata.metrics.*` was intended to solve, and I'm
a bit surprised that you need metrics for so many columns in the table. [...]
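A sketch of what the `write.metadata.metrics.*` route can look like (the column
names and chosen modes are illustrative only):

    // Sketch only; "event_time" and "user_id" are made-up column names.
    import org.apache.iceberg.Table;

    class PerColumnMetrics {
      static void keepOnlyUsefulStats(Table table) {
        table.updateProperties()
            // no per-column stats unless explicitly requested
            .set("write.metadata.metrics.default", "none")
            // full bounds for the column used in filter pushdown
            .set("write.metadata.metrics.column.event_time", "full")
            // value/null counts only for a column that is aggregated, not filtered
            .set("write.metadata.metrics.column.user_id", "counts")
            .commit();
      }
    }

As far as I understand, these settings only affect newly written data files;
existing manifests keep whatever stats they already have until they are
rewritten, which is part of why a read-time projection is still attractive.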
+1 for this feature of column stats projection.
I will add some additional inputs.
1) In the previous discussion, there are comments about only enabling column
stats that are needed. That is definitely a recommended best practice. But
there are some practical challenges. By default, Iceberg enables [...]
Hi Team,
TL;DR: I would like to introduce the option to parameterize Iceberg table
scans so that they include the metadata metrics only for specific columns.
We discussed this previously on the mailing list [1], but we did not
finalize the direction there.
*To recap*
Currently there are two ways to [...]