Hi Ryan, Thanks a lot for the review and suggestions. but I think there is also a decision that we need to make before that: > Should Iceberg require writers to maintain the partition stats?
I think I would prefer to take a lazy approach and not assume that writers > will keep the partition stats up to date, in which case we need a way to know which parts of a table are newer than > the most recent stats. This is a common problem for existing table-level puffin stats too. Not just for partition stats. As mentioned in the "integration with the current code" section point 8), I was planning to introduce a table property "write.stats.enabled" with a default value set to false. And as per point 7), I was planning to introduce an "ANALYZE table" or "CALL procedure" SQL (maybe table-level API too) to asynchronously compute the stats on demand from the previous checkpoints. But currently, `TableMetadata` doesn't have a clean Interface to provide the statistics file for the current snapshot. If stats are not present, we need another interface to provide a last successful snapshot id for which stats was computed. Also, there is some confusion around reusing the statistics file (because the spec only has a computed snapshot id, not the referenced snapshot id). I am planning to open up a PR to handle these interface updates this week. (same things as you suggested in the last Iceberg sync). This should serve as a good foundation to get insights for lazy & incremental stats computing. Thanks, Ajantha On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote: > Thanks for writing this up, Ajantha! I think that we have all the upstream > pieces in place to work on this so it's great to have a proposal. > > The proposal does a good job of summarizing the choices for how to store > the data, but I think there is also a decision that we need to make before > that: Should Iceberg require writers to maintain the partition stats? > > If we do want writers to participate, then we may want to make choices > that are easier for writers. But I think that is going to be a challenge. > Adding requirements for writers would mean that we need to bump the spec > version. Otherwise, we aren't guaranteed that writers will update the files > correctly. I think I would prefer to take a lazy approach and not assume > that writers will keep the partition stats up to date, in which case we > need a way to know which parts of a table are newer than the most recent > stats. > > Ryan > > On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > >> Thanks Piotr for taking a look at it. >> I have replied to all the comments in the document. >> I might need your support in standardising the existing `StatisticsFile` >> interface to adopt partition stats as mentioned in the design. >> >> >> >> *We do need more eyes on the design. Once I get approval for the design, >> I can start the implementation. * >> Thanks, >> Ajantha >> >> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com> >> wrote: >> >>> Hi Ajantha, >>> >>> this is very interesting document, thank you for your work on this! >>> I've added a few comments there. >>> >>> I have one high-level design comment so I thought it would be nicer to >>> everyone if I re-post it here >>> >>> is "partition" the right level of keeping the stats? >>>> We do this in Hive, but was it an accidental choice? or just the only >>>> thing that was possible to be implemented many years ago? >>> >>> >>>> Iceberg allows to have higher number of partitions compared to Hive, >>>> because it scales better. But that means partition-level may or may not be >>>> the right granularity. >>> >>> >>>> A self-optimizing system would gather stats on "per query unit" basis >>>> -- for example if i partition by [ day x country ], but usually query by >>>> day, the days are the "query unit" and from stats perspective country can >>>> be ignored. >>>> Having more fine-grained partitions may lead to expensive planning >>>> time, so it's not theoretical problem. >>> >>> >>>> I am not saying we should implement all this logic right now, but I >>>> think we should decouple partitioning scheme from stats partitions, to >>>> allow query engine to become smarter. >>> >>> >>> >>> cc @Alexander Jo <alex...@starburstdata.com> >>> >>> Best >>> PF >>> >>> >>> >>> >>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>>> Hi Community, >>>> I did a proposal write-up for the partition stats in Iceberg. >>>> Please have a look and let me know what you think. I would like to work >>>> on it. >>>> >>>> >>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing >>>> >>>> Requirement background snippet from the above document. >>>> >>>>> For some query engines that use cost-based-optimizer instead or along >>>>> with rule-based-optimizer (like Dremio, Trino, etc), at the planning time, >>>>> it is good to know the partition level stats like total rows per >>>>> partition and total files per partition to take decisions for CBO ( >>>>> like deciding on the join reordering and join type, identifying the >>>>> parallelism). >>>>> Currently, the only way to do this is to read the partition info from >>>>> data_file >>>>> in manifest_entry of the manifest file and compute partition-level >>>>> statistics (the same thing that ‘partitions’ metadata table is doing [see >>>>> Appendix A >>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> >>>>> ]). >>>>> Doing this on each query is expensive. Hence, this is a proposal for >>>>> computing and storing partition-level stats for Iceberg tables and using >>>>> them during queries. >>>> >>>> >>>> >>>> Thanks, >>>> Ajantha >>>> >>> > > -- > Ryan Blue > Tabular >