A big thanks to everyone who was involved in the review and the discussions so far.
Please find the meeting minutes from the last Iceberg sync about partition stats:

a. Writers should not write partition stats (or any stats) for now, because requiring
   that would mean bumping the spec to V3. (We can make it part of the V3 spec later on,
   but not anytime soon.)
b. So, stats can be generated asynchronously, for example via ANALYZE TABLE or a CALL
   procedure, which computes the stats up to the current snapshot and stores them as a
   partition stats file.
c. In phase 1, partition stats will store only the row_count and file_count per partition
   value, as mentioned in the design document (see the sketch below). Later this can be
   enhanced to store the Puffin file location and other metrics per partition value.
d. These tuples are stored in a single sorted Avro/Parquet file (we still need to
   finalize this).
e. Each ANALYZE TABLE run will rewrite the whole stats file, since keeping multiple delta
   files would just make the read path messy. Also, even with a million rows, the file
   should only be a few MB in size. Once writers start writing the stats (V3 spec), we
   can revisit storing delta files if there are performance issues.

The next immediate plan is to:

a. Get these PRs merged (open points on the existing StatisticsFile interface added
   during the Puffin work): #6267 <https://github.com/apache/iceberg/pull/6267>,
   #6090 <https://github.com/apache/iceberg/pull/6090>,
   #6091 <https://github.com/apache/iceberg/pull/6091>
b. Figure out how to give accurate stats with row-level deletes and how to mask dropped
   partition values from stats: https://github.com/apache/iceberg/issues/6042
c. Standardize the `StatisticsFile` interface to hold a Parquet/Avro stats file (instead
   of always assuming a Puffin file) and introduce a `StatisticsType` enum.
d. Conclude the storage format and get approval for the design.

I will wait another week or two for some more people to take a look at the document
before jumping into the implementation.
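To make point (c) concrete, here is a rough Java sketch of what a single entry could
look like. The class and accessor names are illustrative only and not part of the design
document; the storage schema itself is still open (plan item d above).

    import org.apache.iceberg.StructLike;

    // Illustrative phase-1 shape only (not the finalized design): one entry per
    // partition value, kept sorted and written as a single Avro or Parquet file.
    // Puffin file locations and other metrics could be added in later phases.
    public class PartitionStatsEntry {
      private final StructLike partition; // the partition tuple
      private final long rowCount;        // total rows across data files in this partition
      private final long fileCount;       // number of data files in this partition

      public PartitionStatsEntry(StructLike partition, long rowCount, long fileCount) {
        this.partition = partition;
        this.rowCount = rowCount;
        this.fileCount = fileCount;
      }

      public StructLike partition() { return partition; }
      public long rowCount() { return rowCount; }
      public long fileCount() { return fileCount; }
    }

Keeping all such entries sorted in one file (point e) keeps the read path simple: a
reader loads a single file and walks it in partition order instead of merging delta
files.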
Thanks,
Ajantha

On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi Ryan,
>
>> are you saying that you think the partition-level stats should not be
>> required? I think that would be best.
>
> I think there is some confusion here. Partition-level stats are
> required (hence the proposal).
> But does the writer always write them (with the append/delete/replace
> operation), or does the writer skip writing them and the user generates
> them with DML like ANALYZE TABLE? That was the point of discussion.
> I think we can have both options, with the writer's stats writing
> controlled by a table property, "write.stats.enabled".
>
>> I'm all for improving the interface for retrieving stats. It's a separate
>> issue
>
> Agree. Let us discuss it in a separate thread.
>
> Thanks,
> Ajantha
>
> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Ajantha, are you saying that you think the partition-level stats should
>> not be required? I think that would be best.
>>
>> I'm all for improving the interface for retrieving stats. It's a separate
>> issue, but I think that Iceberg should provide both access to the Puffin
>> files and metadata as well as a higher-level interface for retrieving
>> information like a column's NDV. Something like this:
>>
>> int ndv = table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>>
>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>
>>> Hi Ryan,
>>> Thanks a lot for the review and suggestions.
>>>
>>>> but I think there is also a decision that we need to make before that:
>>>> Should Iceberg require writers to maintain the partition stats?
>>>
>>>> I think I would prefer to take a lazy approach and not assume that
>>>> writers will keep the partition stats up to date,
>>>
>>>> in which case we need a way to know which parts of a table are newer
>>>> than the most recent stats.
>>>
>>> This is a common problem for the existing table-level Puffin stats too,
>>> not just for partition stats.
>>> As mentioned in point 8) of the "integration with the current code"
>>> section, I was planning to introduce a table property,
>>> "write.stats.enabled", with a default value of false.
>>> And as per point 7), I was planning to introduce an ANALYZE TABLE or
>>> CALL procedure SQL (maybe a table-level API too) to asynchronously
>>> compute the stats on demand from the previous checkpoints.
>>>
>>> But currently, `TableMetadata` doesn't have a clean interface to provide
>>> the statistics file for the current snapshot.
>>> If stats are not present, we need another interface to provide the last
>>> successful snapshot id for which stats were computed.
>>> Also, there is some confusion around reusing the statistics file
>>> (because the spec only has a computed snapshot id, not the referenced
>>> snapshot id).
>>> I am planning to open a PR to handle these interface updates this week
>>> (the same things you suggested in the last Iceberg sync).
>>> This should serve as a good foundation for lazy & incremental stats
>>> computing.
>>>
>>> Thanks,
>>> Ajantha
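To make the lazy approach described above concrete: until those interface updates land,
a reader can already walk the snapshot ancestry to find the most recent snapshot that
has a statistics file attached. Below is a minimal sketch, assuming only the existing
Table#statisticsFiles() API from the table-level Puffin stats work (the class name is
illustrative; error handling is omitted):

    import java.util.Optional;
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.StatisticsFile;
    import org.apache.iceberg.Table;

    // Sketch only: walk back from the current snapshot and return the newest
    // statistics file, so a reader knows which part of the table is newer than
    // the most recent stats.
    public class LatestStatsFinder {

      private LatestStatsFinder() {
      }

      public static Optional<StatisticsFile> findLatest(Table table) {
        Snapshot snapshot = table.currentSnapshot();
        while (snapshot != null) {
          long snapshotId = snapshot.snapshotId();
          Optional<StatisticsFile> stats =
              table.statisticsFiles().stream()
                  .filter(file -> file.snapshotId() == snapshotId)
                  .findFirst();
          if (stats.isPresent()) {
            return stats;
          }
          Long parentId = snapshot.parentId();
          snapshot = parentId == null ? null : table.snapshot(parentId);
        }
        return Optional.empty();
      }
    }

The same ancestry walk also tells an engine how far behind the current snapshot the
stats are, which is exactly the staleness signal the lazy approach needs.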
>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Thanks for writing this up, Ajantha! I think that we have all the
>>>> upstream pieces in place to work on this, so it's great to have a
>>>> proposal.
>>>>
>>>> The proposal does a good job of summarizing the choices for how to
>>>> store the data, but I think there is also a decision that we need to
>>>> make before that: Should Iceberg require writers to maintain the
>>>> partition stats?
>>>>
>>>> If we do want writers to participate, then we may want to make choices
>>>> that are easier for writers. But I think that is going to be a
>>>> challenge. Adding requirements for writers would mean that we need to
>>>> bump the spec version. Otherwise, we aren't guaranteed that writers
>>>> will update the files correctly. I think I would prefer to take a lazy
>>>> approach and not assume that writers will keep the partition stats up
>>>> to date, in which case we need a way to know which parts of a table
>>>> are newer than the most recent stats.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Thanks Piotr for taking a look at it.
>>>>> I have replied to all the comments in the document.
>>>>> I might need your support in standardising the existing
>>>>> `StatisticsFile` interface to adopt partition stats as mentioned in
>>>>> the design.
>>>>>
>>>>> *We do need more eyes on the design. Once I get approval for the
>>>>> design, I can start the implementation.*
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>>>>
>>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> This is a very interesting document, thank you for your work on this!
>>>>>> I've added a few comments there.
>>>>>>
>>>>>> I have one high-level design comment, so I thought it would be nicer
>>>>>> to everyone if I re-post it here:
>>>>>>
>>>>>>> Is "partition" the right level for keeping the stats?
>>>>>>> We do this in Hive, but was it an accidental choice, or just the
>>>>>>> only thing that was possible to implement many years ago?
>>>>>>>
>>>>>>> Iceberg allows a higher number of partitions compared to Hive,
>>>>>>> because it scales better. But that means partition level may or may
>>>>>>> not be the right granularity.
>>>>>>>
>>>>>>> A self-optimizing system would gather stats on a "per query unit"
>>>>>>> basis -- for example, if I partition by [ day x country ] but
>>>>>>> usually query by day, the days are the "query unit" and, from a
>>>>>>> stats perspective, country can be ignored.
>>>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>>>> time, so it's not a theoretical problem.
>>>>>>>
>>>>>>> I am not saying we should implement all this logic right now, but I
>>>>>>> think we should decouple the partitioning scheme from the stats
>>>>>>> partitions, to allow the query engine to become smarter.
>>>>>>
>>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>>
>>>>>> Best,
>>>>>> PF
>>>>>>
>>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Community,
>>>>>>> I did a proposal write-up for partition stats in Iceberg.
>>>>>>> Please have a look and let me know what you think. I would like to
>>>>>>> work on it.
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>>
>>>>>>> Requirement background snippet from the above document:
>>>>>>>
>>>>>>>> For some query engines that use a cost-based optimizer instead of
>>>>>>>> or along with a rule-based optimizer (like Dremio, Trino, etc.), it
>>>>>>>> is good to know partition-level stats, such as total rows per
>>>>>>>> partition and total files per partition, at planning time to make
>>>>>>>> CBO decisions (like deciding on join reordering and join type, or
>>>>>>>> identifying the parallelism).
>>>>>>>> Currently, the only way to do this is to read the partition info
>>>>>>>> from the data_file in each manifest_entry of the manifest file and
>>>>>>>> compute partition-level statistics (the same thing the 'partitions'
>>>>>>>> metadata table is doing [see Appendix A
>>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>]).
>>>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>>>> for computing and storing partition-level stats for Iceberg tables
>>>>>>>> and using them during queries.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
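As a reference point for the requirement snippet above: with the existing scan APIs,
computing these counts per query looks roughly like the sketch below. It is simplified
(the class name is mine, only the current partition spec is handled, and delete files
are ignored), and it is essentially the work the 'partitions' metadata table repeats
today.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.StructLike;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.util.StructLikeMap;

    // Simplified sketch: aggregate row and file counts per partition value by
    // planning all data files, i.e. reading partition info out of the manifests.
    public class PartitionRowCounts {

      private PartitionRowCounts() {
      }

      // Returns {row count, file count} per partition tuple of the current spec.
      public static Map<StructLike, long[]> compute(Table table) throws IOException {
        Map<StructLike, long[]> counts = StructLikeMap.create(table.spec().partitionType());
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();
            long[] entry = counts.get(file.partition());
            if (entry == null) {
              entry = new long[] {0L, 0L};
              counts.put(file.partition(), entry);
            }
            entry[0] += file.recordCount(); // rows in this partition
            entry[1] += 1;                  // data files in this partition
          }
        }
        return counts;
      }
    }

Doing this scan on every query is the planning-time cost the proposal avoids by
materializing the same counts once per ANALYZE run into a partition stats file.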