Hi Community, I did a proposal write-up for the partition stats in Iceberg. Please have a look and let me know what you think. I would like to work on it.
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing Requirement background snippet from the above document. > For some query engines that use cost-based-optimizer instead or along with > rule-based-optimizer (like Dremio, Trino, etc), at the planning time, > it is good to know the partition level stats like total rows per partition > and total files per partition to take decisions for CBO ( > like deciding on the join reordering and join type, identifying the > parallelism). > Currently, the only way to do this is to read the partition info from > data_file > in manifest_entry of the manifest file and compute partition-level > statistics (the same thing that ‘partitions’ metadata table is doing [see > Appendix > A > <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> > ]). > Doing this on each query is expensive. Hence, this is a proposal for > computing and storing partition-level stats for Iceberg tables and using > them during queries. Thanks, Ajantha