Hi Community,
I did a proposal write-up for the partition stats in Iceberg.
Please have a look and let me know what you think. I would like to work on
it.

https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing

Requirement background snippet from the above document.

> For some query engines that use cost-based-optimizer instead or along with
> rule-based-optimizer (like Dremio, Trino, etc), at the planning time,
> it is good to know the partition level stats like total rows per partition
> and total files per partition to take decisions for CBO (
> like deciding on the join reordering and join type, identifying the
> parallelism).
> Currently, the only way to do this is to read the partition info from 
> data_file
> in manifest_entry of the manifest file and compute partition-level
> statistics (the same thing that ‘partitions’ metadata table is doing [see 
> Appendix
> A
> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
> ]).
> Doing this on each query is expensive. Hence, this is a proposal for
> computing and storing partition-level stats for Iceberg tables and using
> them during queries.



Thanks,
Ajantha

Reply via email to