Hi Ryan,

> are you saying that you think the partition-level stats should not be
> required? I think that would be best.

I think there is some confusion here. Partition-level stats are required
(hence the proposal). The point of discussion was whether the writer always
writes them (with every append/delete/replace operation), or whether the
writer skips them and the user generates them later using DML like
"ANALYZE TABLE". I think we can have both options, with writer-side stats
writing controlled by a table property such as "write.stats.enabled".
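For example (just a sketch -- the property name here is the proposed one,
not an existing Iceberg property), a user could opt in per table through the
existing properties API:

    table.updateProperties()
        .set("write.stats.enabled", "true")  // proposed property, not in Iceberg yet
        .commit();

and writers would consult it before producing partition stats as part of a
commit.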
> I'm all for improving the interface for retrieving stats. It's a separate
> issue

Agree. Let us discuss it in a separate thread.

Thanks,
Ajantha

On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:

> Ajantha, are you saying that you think the partition-level stats should
> not be required? I think that would be best.
>
> I'm all for improving the interface for retrieving stats. It's a separate
> issue, but I think that Iceberg should provide both access to the Puffin
> files and metadata as well as a higher-level interface for retrieving
> information like a column's NDV. Something like this:
>
> int ndv =
> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>
> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Hi Ryan,
>> Thanks a lot for the review and suggestions.
>>
>>> but I think there is also a decision that we need to make before that:
>>> Should Iceberg require writers to maintain the partition stats?
>>
>>> I think I would prefer to take a lazy approach and not assume that
>>> writers will keep the partition stats up to date,
>>
>>> in which case we need a way to know which parts of a table are newer
>>> than the most recent stats.
>>
>> This is a common problem for the existing table-level Puffin stats too,
>> not just for partition stats.
>> As mentioned in point 8) of the "integration with the current code"
>> section, I was planning to introduce a table property
>> "write.stats.enabled" with a default value of false.
>> And as per point 7), I was planning to introduce an "ANALYZE TABLE" or
>> "CALL procedure" SQL (maybe a table-level API too) to asynchronously
>> compute the stats on demand from the previous checkpoints.
>>
>> But currently, `TableMetadata` doesn't have a clean interface to provide
>> the statistics file for the current snapshot.
>> If stats are not present, we need another interface to provide the last
>> successful snapshot id for which stats were computed.
>> Also, there is some confusion around reusing the statistics file
>> (because the spec only has a computed snapshot id, not the referenced
>> snapshot id).
>> I am planning to open a PR for these interface updates this week (the
>> same things you suggested in the last Iceberg sync).
>> This should serve as a good foundation for lazy & incremental stats
>> computation.
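>>
>> To make the idea concrete, the kind of accessors I have in mind look
>> roughly like this (the names are only placeholders, nothing is final):
>>
>> // hypothetical additions, these do not exist in TableMetadata today
>> StatisticsFile stats = metadata.statisticsFileForSnapshot(currentSnapshotId);
>> // fallback when the current snapshot has no stats: the latest snapshot
>> // that does have stats, so engines can compute incrementally from there
>> long lastAnalyzedSnapshotId = metadata.lastSnapshotWithStatistics();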
>>
>> Thanks,
>> Ajantha
>>
>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for writing this up, Ajantha! I think that we have all the
>>> upstream pieces in place to work on this, so it's great to have a
>>> proposal.
>>>
>>> The proposal does a good job of summarizing the choices for how to
>>> store the data, but I think there is also a decision that we need to
>>> make before that: Should Iceberg require writers to maintain the
>>> partition stats?
>>>
>>> If we do want writers to participate, then we may want to make choices
>>> that are easier for writers. But I think that is going to be a
>>> challenge. Adding requirements for writers would mean that we need to
>>> bump the spec version. Otherwise, we aren't guaranteed that writers
>>> will update the files correctly. I think I would prefer to take a lazy
>>> approach and not assume that writers will keep the partition stats up
>>> to date, in which case we need a way to know which parts of a table
>>> are newer than the most recent stats.
>>>
>>> Ryan
>>>
>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Piotr for taking a look at it.
>>>> I have replied to all the comments in the document.
>>>> I might need your support in standardising the existing
>>>> `StatisticsFile` interface to adopt partition stats as mentioned in
>>>> the design.
>>>>
>>>> We do need more eyes on the design. Once I get approval for the
>>>> design, I can start the implementation.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <
>>>> pi...@starburstdata.com> wrote:
>>>>
>>>>> Hi Ajantha,
>>>>>
>>>>> this is a very interesting document, thank you for your work on this!
>>>>> I've added a few comments there.
>>>>>
>>>>> I have one high-level design comment, so I thought it would be nicer
>>>>> to everyone if I re-post it here:
>>>>>
>>>>>> is "partition" the right level for keeping the stats?
>>>>>> We do this in Hive, but was it an accidental choice, or just the
>>>>>> only thing that was possible to implement many years ago?
>>>>>
>>>>>> Iceberg allows a higher number of partitions than Hive because it
>>>>>> scales better. But that means the partition level may or may not be
>>>>>> the right granularity.
>>>>>
>>>>>> A self-optimizing system would gather stats on a "per query unit"
>>>>>> basis -- for example, if I partition by [ day x country ] but
>>>>>> usually query by day, the days are the "query unit" and, from a
>>>>>> stats perspective, country can be ignored.
>>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>>> time, so it's not a theoretical problem.
>>>>>
>>>>>> I am not saying we should implement all this logic right now, but I
>>>>>> think we should decouple the partitioning scheme from the stats
>>>>>> partitions, to allow the query engine to become smarter.
>>>>>
>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Community,
>>>>>> I did a proposal write-up for partition stats in Iceberg.
>>>>>> Please have a look and let me know what you think. I would like to
>>>>>> work on it.
>>>>>>
>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>
>>>>>> Requirement background snippet from the above document:
>>>>>>
>>>>>>> For some query engines that use a cost-based optimizer instead of
>>>>>>> or along with a rule-based optimizer (like Dremio, Trino, etc.),
>>>>>>> it is useful at planning time to know partition-level stats like
>>>>>>> total rows per partition and total files per partition, to make
>>>>>>> decisions in the CBO (like deciding on join reordering and join
>>>>>>> type, or identifying the parallelism).
>>>>>>> Currently, the only way to do this is to read the partition info
>>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>>> partition-level statistics (the same thing that the 'partitions'
>>>>>>> metadata table does [see Appendix A
>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>>> ]).
>>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>>> for computing and storing partition-level stats for Iceberg tables
>>>>>>> and using them during queries.
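>>>>>>
>>>>>> To make that cost concrete, this is roughly what an engine has to
>>>>>> do today with the Java API (only a sketch: it assumes a recent 1.x
>>>>>> API, skips error handling, and lets the caller deal with
>>>>>> IOException):
>>>>>>
>>>>>> // aggregate row count and file count per partition by scanning manifests
>>>>>> Map<String, long[]> stats = new HashMap<>();  // partition -> {rows, files}
>>>>>> for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
>>>>>>   try (CloseableIterable<DataFile> files = ManifestFiles.read(manifest, table.io())) {
>>>>>>     for (DataFile file : files) {
>>>>>>       // keyed by the partition's string form only for brevity here
>>>>>>       long[] agg = stats.computeIfAbsent(file.partition().toString(), p -> new long[2]);
>>>>>>       agg[0] += file.recordCount();  // total rows in the partition
>>>>>>       agg[1] += 1;                   // total data files in the partition
>>>>>>     }
>>>>>>   }
>>>>>> }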
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular