Hi All, In the same design document ( https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing ), I have added a section called *"Design for approval". *It also contains a potential PR breakdown for the phase 1 implementation and future development scope. Please take a look and please vote if you think the design is ok.
Thanks, Ajantha On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote: > A big thanks to everyone who was involved in the review and the > discussions so far. > > Please find the meeting minutes from the last iceberg sync about the > partition stats. > a. Writers should not write the partition stats or any stats as of > now. > Because it requires bumping the spec to V3. (We can have it as > part of the v3 spec later on. But not anytime soon). > b. So, there can be an async way of generating the stats like ANALYZE > table or call procedure. > Which will compute the stats till the current snapshot and store > it as a partition stats file. > c. In phase 1, partition stats will just store the row_count and > file_count per partition value as mentioned in the design document. > Later it can be enhanced to store puffin file location and other > metrics per partition value. > d. These tuples are stored in a single sorted Avro/parquet file (we > need to finalize this). > e. Each time "analyze table" will rewrite the whole stats file as > keeping multiple delta files will just make the read path messy. > Also, even with million rows, it can be of a few MB size. > Once the writers start writing the stats (V3 spec), we can revisit > storing as the delta files if there are any performance issues. > > The next immediate plan is to > a. Get these PRs merged (open points in existing StatictisFile > interface added during Puffin) > #6267 <https://github.com/apache/iceberg/pull/6267>, #6090 > <https://github.com/apache/iceberg/pull/6090>, #6091 > <https://github.com/apache/iceberg/pull/6091> > b. Figure out how to give accurate stats with row-level deletes and > how to mask dropped partition values from stats. > https://github.com/apache/iceberg/issues/6042 > c. Standardize the `StatictisFile` interface to hold the parquet/Avro > stats file (instead of always assuming it as a Puffin file) > and introduce a `StatisticsType` enum. > d. Conclude the storage format and get approval for the design. > > I will wait another week or two for some more people to take a look at the > document > before jumping into the implementation. > > Thanks, > Ajantha. > > On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > >> Hi Ryan, >> >> are you saying that you think the partition-level stats should not be >>> required? I think that would be best. >> >> I think there is some confusion here. Partition-level stats are >> required (hence the proposal). >> But does the writer always write it? (with the append/delete/replace >> operation) >> or writer skips writing it and then the user generates it using DML like >> "Analyze table" was the point of discussion. >> I think we can have both options with the writer stats writing controlled >> by a table property "write.stats.enabled" >> >> I’m all for improving the interface for retrieving stats. It’s a separate >>> issue >> >> Agree. Let us discuss it in a separate thread. >> >> Thanks, >> Ajantha >> >> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote: >> >>> Ajantha, are you saying that you think the partition-level stats should >>> not be required? I think that would be best. >>> >>> I’m all for improving the interface for retrieving stats. It’s a >>> separate issue, but I think that Iceberg should provide both access to the >>> Puffin files and metadata as well as a higher-level interface for >>> retrieving information like a column’s NDV. Something like this: >>> >>> int ndv = >>> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x"); >>> >>> >>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>>> Hi Ryan, >>>> Thanks a lot for the review and suggestions. >>>> >>>> but I think there is also a decision that we need to make before that: >>>>> Should Iceberg require writers to maintain the partition stats? >>>> >>>> I think I would prefer to take a lazy approach and not assume that >>>>> writers will keep the partition stats up to date, >>>> >>>> in which case we need a way to know which parts of a table are newer >>>>> than the most recent stats. >>>> >>>> >>>> This is a common problem for existing table-level puffin stats too. >>>> Not just for partition stats. >>>> As mentioned in the "integration with the current code" section point >>>> 8), >>>> I was planning to introduce a table property "write.stats.enabled" with >>>> a default value set to false. >>>> And as per point 7), I was planning to introduce an "ANALYZE table" or >>>> "CALL procedure" SQL (maybe table-level API too) to asynchronously >>>> compute the stats on demand from the previous checkpoints. >>>> >>>> But currently, `TableMetadata` doesn't have a clean Interface to >>>> provide the statistics file for the current snapshot. >>>> If stats are not present, we need another interface to provide a last >>>> successful snapshot id for which stats was computed. >>>> Also, there is some confusion around reusing the statistics file >>>> (because the spec only has a computed snapshot id, not the referenced >>>> snapshot id). >>>> I am planning to open up a PR to handle these interface updates >>>> this week. (same things as you suggested in the last Iceberg sync). >>>> This should serve as a good foundation to get insights for lazy & >>>> incremental stats computing. >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote: >>>> >>>>> Thanks for writing this up, Ajantha! I think that we have all the >>>>> upstream pieces in place to work on this so it's great to have a proposal. >>>>> >>>>> The proposal does a good job of summarizing the choices for how to >>>>> store the data, but I think there is also a decision that we need to make >>>>> before that: Should Iceberg require writers to maintain the partition >>>>> stats? >>>>> >>>>> If we do want writers to participate, then we may want to make choices >>>>> that are easier for writers. But I think that is going to be a challenge. >>>>> Adding requirements for writers would mean that we need to bump the spec >>>>> version. Otherwise, we aren't guaranteed that writers will update the >>>>> files >>>>> correctly. I think I would prefer to take a lazy approach and not assume >>>>> that writers will keep the partition stats up to date, in which case we >>>>> need a way to know which parts of a table are newer than the most recent >>>>> stats. >>>>> >>>>> Ryan >>>>> >>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>> wrote: >>>>> >>>>>> Thanks Piotr for taking a look at it. >>>>>> I have replied to all the comments in the document. >>>>>> I might need your support in standardising the existing >>>>>> `StatisticsFile` interface to adopt partition stats as mentioned in the >>>>>> design. >>>>>> >>>>>> >>>>>> >>>>>> *We do need more eyes on the design. Once I get approval for the >>>>>> design, I can start the implementation. * >>>>>> Thanks, >>>>>> Ajantha >>>>>> >>>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen < >>>>>> pi...@starburstdata.com> wrote: >>>>>> >>>>>>> Hi Ajantha, >>>>>>> >>>>>>> this is very interesting document, thank you for your work on this! >>>>>>> I've added a few comments there. >>>>>>> >>>>>>> I have one high-level design comment so I thought it would be nicer >>>>>>> to everyone if I re-post it here >>>>>>> >>>>>>> is "partition" the right level of keeping the stats? >>>>>>>> We do this in Hive, but was it an accidental choice? or just the >>>>>>>> only thing that was possible to be implemented many years ago? >>>>>>> >>>>>>> >>>>>>>> Iceberg allows to have higher number of partitions compared to >>>>>>>> Hive, because it scales better. But that means partition-level may or >>>>>>>> may >>>>>>>> not be the right granularity. >>>>>>> >>>>>>> >>>>>>>> A self-optimizing system would gather stats on "per query unit" >>>>>>>> basis -- for example if i partition by [ day x country ], but usually >>>>>>>> query >>>>>>>> by day, the days are the "query unit" and from stats perspective >>>>>>>> country >>>>>>>> can be ignored. >>>>>>>> Having more fine-grained partitions may lead to expensive planning >>>>>>>> time, so it's not theoretical problem. >>>>>>> >>>>>>> >>>>>>>> I am not saying we should implement all this logic right now, but I >>>>>>>> think we should decouple partitioning scheme from stats partitions, to >>>>>>>> allow query engine to become smarter. >>>>>>> >>>>>>> >>>>>>> >>>>>>> cc @Alexander Jo <alex...@starburstdata.com> >>>>>>> >>>>>>> Best >>>>>>> PF >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Community, >>>>>>>> I did a proposal write-up for the partition stats in Iceberg. >>>>>>>> Please have a look and let me know what you think. I would like to >>>>>>>> work on it. >>>>>>>> >>>>>>>> >>>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing >>>>>>>> >>>>>>>> Requirement background snippet from the above document. >>>>>>>> >>>>>>>>> For some query engines that use cost-based-optimizer instead or >>>>>>>>> along with rule-based-optimizer (like Dremio, Trino, etc), at the >>>>>>>>> planning >>>>>>>>> time, >>>>>>>>> it is good to know the partition level stats like total rows per >>>>>>>>> partition and total files per partition to take decisions for CBO ( >>>>>>>>> like deciding on the join reordering and join type, identifying >>>>>>>>> the parallelism). >>>>>>>>> Currently, the only way to do this is to read the partition info >>>>>>>>> from data_file in manifest_entry of the manifest file and compute >>>>>>>>> partition-level statistics (the same thing that ‘partitions’ metadata >>>>>>>>> table >>>>>>>>> is doing [see Appendix A >>>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> >>>>>>>>> ]). >>>>>>>>> Doing this on each query is expensive. Hence, this is a proposal >>>>>>>>> for computing and storing partition-level stats for Iceberg tables and >>>>>>>>> using them during queries. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Ajantha >>>>>>>> >>>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>> >>> >>> -- >>> Ryan Blue >>> Tabular >>> >>