Hi Ajantha and the community, I am interested and I am wondering where we can see the latest progress of this feature?
Regarding the partition stats in Iceberg, I am specifically curious if we can consider a new field called "last modified time" to be included for the partitions stats (or have a plugable way to allow users to configure partition stats they need). My use case is to find out if a partition is changed or not given two snapshots (old and new) with a quick and light way process. I previously was suggested by the community to use the change log (CDC) but I think that is too heavy (I guess, since it requires to run SparkSQL procedure) and it is over do the work (I don't need what rows are changed, I just need true or false for whether a partition is changed). Thanks On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava < mayur.srivast...@twosigma.com> wrote: > Thanks Ajantha. > > > > > It should be very easy to add a few more fields to it like the latest > sequence number or last modified time per partition. > > > > Among sequence number and modified time, which one do you think is more > likely to be available in Iceberg partition stats? Note that we would like > to avoid compaction change the sequence number or modified time stats. > > > > Thanks, > > Mayur > > > > *From:* Ajantha Bhat <ajanthab...@gmail.com> > *Sent:* Tuesday, February 7, 2023 10:02 AM > *To:* dev@iceberg.apache.org > *Subject:* Re: [Proposal] Partition stats in Iceberg > > > > Hi Hrishi and Mayur, thanks for the inputs. > > To get things moving I have frozen the scope of phase 1 implementation. > (Recently added the delete file stats to phase 1 too). You can find the > scope in the "Design for approval" section of the design doc. > > That said, once we have phase 1 implemented, It should be very easy to add > a few more fields to it like the latest sequence number or last modified > time per partition. > I will be opening up the discussion about phase 2 schema again once phase > 1 implementation is done. > > Thanks, > Ajantha > > > > On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava < > mayur.srivast...@twosigma.com> wrote: > > +1 for the initiative. > > > > We’ve been exploring options for storing last-modified-time per partition. > It an important building block for data pipelines – especially if there is > a dependency between jobs with strong consistency requirements. > > > > Is partition stats a good place for storing last-modified-time per > partition? > > > > Thanks, > > Mayur > > > > *From:* Ajantha Bhat <ajanthab...@gmail.com> > *Sent:* Monday, January 23, 2023 11:56 AM > *To:* dev@iceberg.apache.org > *Subject:* Re: [Proposal] Partition stats in Iceberg > > > > Hi All, > > In the same design document ( > https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing > ), > I have added a section called > *"Design for approval". *It also contains a potential PR breakdown for > the phase 1 implementation and future development scope. > Please take a look and please vote if you think the design is ok. > > Thanks, > Ajantha > > > > On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote: > > A big thanks to everyone who was involved in the review and the > discussions so far. > > Please find the meeting minutes from the last iceberg sync about the > partition stats. > a. Writers should not write the partition stats or any stats as of > now. > Because it requires bumping the spec to V3. (We can have it as > part of the v3 spec later on. But not anytime soon). > b. So, there can be an async way of generating the stats like ANALYZE > table or call procedure. > Which will compute the stats till the current snapshot and store > it as a partition stats file. > c. In phase 1, partition stats will just store the row_count and > file_count per partition value as mentioned in the design document. > Later it can be enhanced to store puffin file location and other > metrics per partition value. > d. These tuples are stored in a single sorted Avro/parquet file (we > need to finalize this). > e. Each time "analyze table" will rewrite the whole stats file as > keeping multiple delta files will just make the read path messy. > Also, even with million rows, it can be of a few MB size. > Once the writers start writing the stats (V3 spec), we can revisit > storing as the delta files if there are any performance issues. > > The next immediate plan is to > a. Get these PRs merged (open points in existing StatictisFile > interface added during Puffin) > #6267 <https://github.com/apache/iceberg/pull/6267>, #6090 > <https://github.com/apache/iceberg/pull/6090>, #6091 > <https://github.com/apache/iceberg/pull/6091> > b. Figure out how to give accurate stats with row-level deletes and > how to mask dropped partition values from stats. > https://github.com/apache/iceberg/issues/6042 > c. Standardize the `StatictisFile` interface to hold the parquet/Avro > stats file (instead of always assuming it as a Puffin file) > and introduce a `StatisticsType` enum. > d. Conclude the storage format and get approval for the design. > > I will wait another week or two for some more people to take a look at the > document > > before jumping into the implementation. > > Thanks, > Ajantha. > > > > On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > > Hi Ryan, > > are you saying that you think the partition-level stats should not be > required? I think that would be best. > > I think there is some confusion here. Partition-level stats are > required (hence the proposal). > But does the writer always write it? (with the append/delete/replace > operation) > or writer skips writing it and then the user generates it using DML like > "Analyze table" was the point of discussion. > I think we can have both options with the writer stats writing controlled > by a table property "write.stats.enabled" > > > > I’m all for improving the interface for retrieving stats. It’s a separate > issue > > Agree. Let us discuss it in a separate thread. > > Thanks, > Ajantha > > > > On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote: > > Ajantha, are you saying that you think the partition-level stats should > not be required? I think that would be best. > > I’m all for improving the interface for retrieving stats. It’s a separate > issue, but I think that Iceberg should provide both access to the Puffin > files and metadata as well as a higher-level interface for retrieving > information like a column’s NDV. Something like this: > > int ndv = > table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x"); > > > > On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > > Hi Ryan, > Thanks a lot for the review and suggestions. > > but I think there is also a decision that we need to make before that: > Should Iceberg require writers to maintain the partition stats? > > I think I would prefer to take a lazy approach and not assume that writers > will keep the partition stats up to date, > > in which case we need a way to know which parts of a table are newer than > the most recent stats. > > > > This is a common problem for existing table-level puffin stats too. Not > just for partition stats. > As mentioned in the "integration with the current code" section point 8), > I was planning to introduce a table property "write.stats.enabled" with a > default value set to false. > > And as per point 7), I was planning to introduce an "ANALYZE table" or > "CALL procedure" SQL (maybe table-level API too) to asynchronously > compute the stats on demand from the previous checkpoints. > > But currently, `TableMetadata` doesn't have a clean Interface to provide > the statistics file for the current snapshot. > If stats are not present, we need another interface to provide a last > successful snapshot id for which stats was computed. > Also, there is some confusion around reusing the statistics file (because > the spec only has a computed snapshot id, not the referenced snapshot id). > I am planning to open up a PR to handle these interface updates this week. > (same things as you suggested in the last Iceberg sync). > This should serve as a good foundation to get insights for lazy & > incremental stats computing. > > > Thanks, > Ajantha > > > > On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote: > > Thanks for writing this up, Ajantha! I think that we have all the upstream > pieces in place to work on this so it's great to have a proposal. > > > > The proposal does a good job of summarizing the choices for how to store > the data, but I think there is also a decision that we need to make before > that: Should Iceberg require writers to maintain the partition stats? > > > > If we do want writers to participate, then we may want to make choices > that are easier for writers. But I think that is going to be a challenge. > Adding requirements for writers would mean that we need to bump the spec > version. Otherwise, we aren't guaranteed that writers will update the files > correctly. I think I would prefer to take a lazy approach and not assume > that writers will keep the partition stats up to date, in which case we > need a way to know which parts of a table are newer than the most recent > stats. > > > > Ryan > > > > On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > > Thanks Piotr for taking a look at it. > I have replied to all the comments in the document. > I might need your support in standardising the existing `StatisticsFile` > interface to adopt partition stats as mentioned in the design. > > > > *We do need more eyes on the design. Once I get approval for the design, I > can start the implementation. * > Thanks, > Ajantha > > > > On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com> > wrote: > > Hi Ajantha, > > > > this is very interesting document, thank you for your work on this! > > I've added a few comments there. > > > > I have one high-level design comment so I thought it would be nicer to > everyone if I re-post it here > > > > is "partition" the right level of keeping the stats? > We do this in Hive, but was it an accidental choice? or just the only > thing that was possible to be implemented many years ago? > > > > Iceberg allows to have higher number of partitions compared to Hive, > because it scales better. But that means partition-level may or may not be > the right granularity. > > > A self-optimizing system would gather stats on "per query unit" basis -- > for example if i partition by [ day x country ], but usually query by day, > the days are the "query unit" and from stats perspective country can be > ignored. > Having more fine-grained partitions may lead to expensive planning time, > so it's not theoretical problem. > > > I am not saying we should implement all this logic right now, but I think > we should decouple partitioning scheme from stats partitions, to allow > query engine to become smarter. > > > > > > cc @Alexander Jo <alex...@starburstdata.com> > > > > Best > > PF > > > > > > > > > > On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > > Hi Community, > I did a proposal write-up for the partition stats in Iceberg. > Please have a look and let me know what you think. I would like to work on > it. > > > https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing > > Requirement background snippet from the above document. > > For some query engines that use cost-based-optimizer instead or along with > rule-based-optimizer (like Dremio, Trino, etc), at the planning time, > it is good to know the partition level stats like total rows per partition > and total files per partition to take decisions for CBO ( > like deciding on the join reordering and join type, identifying the > parallelism). > Currently, the only way to do this is to read the partition info from > data_file > in manifest_entry of the manifest file and compute partition-level > statistics (the same thing that ‘partitions’ metadata table is doing *[see > **Appendix A* > <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> > *]*). > Doing this on each query is expensive. *Hence, this is a proposal for > computing and storing partition-level stats for Iceberg tables and using > them during queries.* > > > > Thanks, > Ajantha > > > > > -- > > Ryan Blue > > Tabular > > > > > -- > > Ryan Blue > > Tabular > >