Ajantha, are you saying that the partition-level stats should not be required? I think that would be best.
I’m all for improving the interface for retrieving stats. It’s a separate issue, but I think that Iceberg should provide both access to the Puffin files and metadata, as well as a higher-level interface for retrieving information like a column’s NDV. Something like this:

    int ndv = table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");

On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi Ryan,
> Thanks a lot for the review and suggestions.
>
>> but I think there is also a decision that we need to make before that:
>> Should Iceberg require writers to maintain the partition stats?
>
>> I think I would prefer to take a lazy approach and not assume that
>> writers will keep the partition stats up to date,
>
>> in which case we need a way to know which parts of a table are newer
>> than the most recent stats.
>
> This is a common problem for the existing table-level Puffin stats too,
> not just for partition stats.
> As mentioned in point 8) of the "integration with the current code"
> section, I was planning to introduce a table property
> "write.stats.enabled" with a default value set to false.
> And as per point 7), I was planning to introduce an "ANALYZE table" or
> "CALL procedure" SQL (and maybe a table-level API too) to asynchronously
> compute the stats on demand from the previous checkpoints.
>
> But currently, `TableMetadata` doesn't have a clean interface to provide
> the statistics file for the current snapshot.
> If stats are not present, we need another interface to provide the last
> successful snapshot id for which stats were computed.
> Also, there is some confusion around reusing the statistics file
> (because the spec only has a computed snapshot id, not the referenced
> snapshot id).
> I am planning to open up a PR to handle these interface updates this
> week (the same things you suggested in the last Iceberg sync).
> This should serve as a good foundation for gaining insights into lazy &
> incremental stats computation.
>
> Thanks,
> Ajantha
>
> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for writing this up, Ajantha! I think that we have all the
>> upstream pieces in place to work on this, so it's great to have a
>> proposal.
>>
>> The proposal does a good job of summarizing the choices for how to
>> store the data, but I think there is also a decision that we need to
>> make before that: Should Iceberg require writers to maintain the
>> partition stats?
>>
>> If we do want writers to participate, then we may want to make choices
>> that are easier for writers. But I think that is going to be a
>> challenge. Adding requirements for writers would mean that we need to
>> bump the spec version. Otherwise, we aren't guaranteed that writers
>> will update the files correctly. I think I would prefer to take a lazy
>> approach and not assume that writers will keep the partition stats up
>> to date, in which case we need a way to know which parts of a table are
>> newer than the most recent stats.
>>
>> Ryan
>>
>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Thanks, Piotr, for taking a look at it.
>>> I have replied to all the comments in the document.
>>> I might need your support in standardising the existing
>>> `StatisticsFile` interface to adopt partition stats as mentioned in
>>> the design.
>>>
>>> We do need more eyes on the design. Once I get approval for the
>>> design, I can start the implementation.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen
>>> <pi...@starburstdata.com> wrote:
>>>
>>>> Hi Ajantha,
>>>>
>>>> this is a very interesting document, thank you for your work on
>>>> this! I've added a few comments there.
>>>>
>>>> I have one high-level design comment, so I thought it would be nicer
>>>> to everyone if I re-post it here:
>>>>
>>>>> is "partition" the right level of keeping the stats?
>>>>> We do this in Hive, but was it an accidental choice, or just the
>>>>> only thing that was possible to implement many years ago?
>>>>>
>>>>> Iceberg allows a higher number of partitions compared to Hive,
>>>>> because it scales better. But that means partition level may or may
>>>>> not be the right granularity.
>>>>>
>>>>> A self-optimizing system would gather stats on a "per query unit"
>>>>> basis -- for example, if I partition by [ day x country ] but
>>>>> usually query by day, the days are the "query unit" and, from a
>>>>> stats perspective, country can be ignored.
>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>> time, so it's not a theoretical problem.
>>>>>
>>>>> I am not saying we should implement all this logic right now, but I
>>>>> think we should decouple the partitioning scheme from stats
>>>>> partitions, to allow the query engine to become smarter.
>>>>
>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>
>>>> Best
>>>> PF
>>>>
>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat
>>>> <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Hi Community,
>>>>> I did a proposal write-up for partition stats in Iceberg.
>>>>> Please have a look and let me know what you think. I would like to
>>>>> work on it.
>>>>>
>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>
>>>>> Requirement background snippet from the above document:
>>>>>
>>>>>> For some query engines that use a cost-based optimizer instead of,
>>>>>> or along with, a rule-based optimizer (like Dremio, Trino, etc.),
>>>>>> it is good to know at planning time the partition-level stats,
>>>>>> like total rows per partition and total files per partition, to
>>>>>> make decisions for the CBO (like deciding on the join reordering
>>>>>> and join type, and identifying the parallelism).
>>>>>> Currently, the only way to do this is to read the partition info
>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>> partition-level statistics (the same thing that the ‘partitions’
>>>>>> metadata table is doing [see Appendix A
>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>> ]).
>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>> for computing and storing partition-level stats for Iceberg tables
>>>>>> and using them during queries.
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular
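The aggregation the quoted proposal snippet describes (walking the data_file entries of the manifest files and rolling them up into total rows and total files per partition, as the ‘partitions’ metadata table does on every query) can be sketched as a minimal self-contained model. This is an illustration only: the DataFileEntry and PartitionStats types and the aggregate method below are hypothetical stand-ins, not Iceberg API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-partition stats aggregation. DataFileEntry and
// PartitionStats are hypothetical stand-ins for the real manifest_entry /
// data_file structures, not actual Iceberg classes.
public class PartitionStatsSketch {

    // One data_file entry: its partition tuple (a string here for
    // simplicity) and its record count.
    record DataFileEntry(String partition, long recordCount) {}

    static class PartitionStats {
        long totalRows;
        long fileCount;
    }

    // One pass over the file entries, grouping by partition value.
    static Map<String, PartitionStats> aggregate(List<DataFileEntry> entries) {
        Map<String, PartitionStats> stats = new LinkedHashMap<>();
        for (DataFileEntry e : entries) {
            PartitionStats s =
                stats.computeIfAbsent(e.partition(), k -> new PartitionStats());
            s.totalRows += e.recordCount();
            s.fileCount += 1;
        }
        return stats;
    }

    public static void main(String[] args) {
        Map<String, PartitionStats> stats = aggregate(List.of(
            new DataFileEntry("day=2022-11-23", 1000),
            new DataFileEntry("day=2022-11-23", 500),
            new DataFileEntry("day=2022-11-24", 2000)));
        // prints:
        // day=2022-11-23: rows=1500, files=2
        // day=2022-11-24: rows=2000, files=1
        stats.forEach((p, s) -> System.out.println(
            p + ": rows=" + s.totalRows + ", files=" + s.fileCount));
    }
}
```

A stored partition-stats file would persist the output of one such pass, instead of the engine recomputing it during planning for every query.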