Thanks for writing this up, Ajantha! I think that we have all the upstream pieces in place to work on this so it's great to have a proposal.
The proposal does a good job of summarizing the choices for how to store the data, but I think there is also a decision that we need to make before that: Should Iceberg require writers to maintain the partition stats? If we do want writers to participate, then we may want to make choices that are easier for writers. But I think that is going to be a challenge. Adding requirements for writers would mean that we need to bump the spec version. Otherwise, we aren't guaranteed that writers will update the files correctly. I think I would prefer to take a lazy approach and not assume that writers will keep the partition stats up to date, in which case we need a way to know which parts of a table are newer than the most recent stats. Ryan On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> wrote: > Thanks Piotr for taking a look at it. > I have replied to all the comments in the document. > I might need your support in standardising the existing `StatisticsFile` > interface to adopt partition stats as mentioned in the design. > > > > *We do need more eyes on the design. Once I get approval for the design, I > can start the implementation. * > Thanks, > Ajantha > > On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com> > wrote: > >> Hi Ajantha, >> >> this is very interesting document, thank you for your work on this! >> I've added a few comments there. >> >> I have one high-level design comment so I thought it would be nicer to >> everyone if I re-post it here >> >> is "partition" the right level of keeping the stats? >>> We do this in Hive, but was it an accidental choice? or just the only >>> thing that was possible to be implemented many years ago? >> >> >>> Iceberg allows to have higher number of partitions compared to Hive, >>> because it scales better. But that means partition-level may or may not be >>> the right granularity. >> >> >>> A self-optimizing system would gather stats on "per query unit" basis -- >>> for example if i partition by [ day x country ], but usually query by day, >>> the days are the "query unit" and from stats perspective country can be >>> ignored. >>> Having more fine-grained partitions may lead to expensive planning time, >>> so it's not theoretical problem. >> >> >>> I am not saying we should implement all this logic right now, but I >>> think we should decouple partitioning scheme from stats partitions, to >>> allow query engine to become smarter. >> >> >> >> cc @Alexander Jo <alex...@starburstdata.com> >> >> Best >> PF >> >> >> >> >> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >>> Hi Community, >>> I did a proposal write-up for the partition stats in Iceberg. >>> Please have a look and let me know what you think. I would like to work >>> on it. >>> >>> >>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing >>> >>> Requirement background snippet from the above document. >>> >>>> For some query engines that use cost-based-optimizer instead or along >>>> with rule-based-optimizer (like Dremio, Trino, etc), at the planning time, >>>> it is good to know the partition level stats like total rows per >>>> partition and total files per partition to take decisions for CBO ( >>>> like deciding on the join reordering and join type, identifying the >>>> parallelism). >>>> Currently, the only way to do this is to read the partition info from >>>> data_file >>>> in manifest_entry of the manifest file and compute partition-level >>>> statistics (the same thing that ‘partitions’ metadata table is doing [see >>>> Appendix A >>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> >>>> ]). >>>> Doing this on each query is expensive. Hence, this is a proposal for >>>> computing and storing partition-level stats for Iceberg tables and using >>>> them during queries. >>> >>> >>> >>> Thanks, >>> Ajantha >>> >> -- Ryan Blue Tabular