Thanks for the response. One of the use cases we have is where one business day of data is added at a time to a DAY-partitioned table. With 25 years of this data, there will be ~6250 partitions and as many snapshots. Is it recommended to keep that many snapshots around?
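For reference, snapshot retention is normally driven by the expire_snapshots procedure; below is a minimal Spark SQL sketch, where the table name, timestamp, retention count, and the catalog name spark_catalog are illustrative assumptions. Once a snapshot is expired, it can no longer serve as the "snapshot that last updated the partition" marker discussed further down the thread.

    CALL spark_catalog.system.expire_snapshots(
        table       => 'db.sample',                            -- illustrative table name
        older_than  => TIMESTAMP '2023-01-01 00:00:00.000',    -- expire snapshots committed before this
        retain_last => 100                                     -- but always retain the most recent 100
    )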
On Tue, May 2, 2023 at 7:45 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

>> Does snapshot expiration need to be disabled for this to work?
>> Thanks,
>> Mayur
>
> Yes, the snapshot that last updated the partition needs to be around for this to work.
>
>> Szehon, the query you shared requires a SparkSQL job to be run, which means latency will be high. However, I am glad you are also thinking of adding these directly to the partitions table, and it seems we share the same interests.
>
> Yea, the partitions table currently still goes through SparkSQL, so it will be the same. Maybe you mean adding this to partition stats? We do need to reconcile the partitions table and partition stats at some point, though. Not sure if it was designed/discussed yet; I think there were some thoughts on short-circuiting the partitions table to read from partition stats, if stats exist for the current snapshot.
>
> Thanks
> Szehon
>
> On Tue, May 2, 2023 at 4:34 PM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>
>> Thanks Ryan and Szehon!
>>
>> Szehon, the query you shared requires a SparkSQL job to be run, which means latency will be high. However, I am glad you are also thinking of adding these directly to the partitions table, and it seems we share the same interests. I am looking forward to the work in the phase 2 implementation. Let me know if I can help, thanks.
>>
>> On Tue, May 2, 2023 at 4:28 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>>> Yea I agree, I had a handy query for the last update time of a partition:
>>>
>>>     SELECT
>>>         e.data_file.partition,
>>>         MAX(s.committed_at) AS last_modified_time
>>>     FROM db.table.snapshots s
>>>     JOIN db.table.entries e
>>>         ON s.snapshot_id = e.snapshot_id
>>>     GROUP BY e.data_file.partition
>>>
>>> It's a bit lengthy currently.
>>>
>>> I have indeed been thinking of adding these fields to the Partitions table directly, after Ajantha's pending changes to add delete files to this table.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Tue, May 2, 2023 at 4:08 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Pucheng,
>>>>
>>>> Rather than using the changelog, I'd just look at the metadata tables. You should be able to query the all_entries metadata table to see file additions or deletions for a given snapshot. Then from there you can join to the snapshots table for timestamps and aggregate to the partition level.
>>>>
>>>> Ryan
>>>>
>>>> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>
>>>>> Hi Ajantha and the community,
>>>>>
>>>>> I am interested, and I am wondering where we can see the latest progress of this feature?
>>>>>
>>>>> Regarding the partition stats in Iceberg, I am specifically curious whether we can consider a new field called "last modified time" to be included in the partition stats (or have a pluggable way to allow users to configure the partition stats they need). My use case is to find out whether a partition has changed or not, given two snapshots (old and new), in a quick and lightweight way. The community previously suggested that I use the changelog (CDC), but I think that is too heavy (I guess, since it requires running a SparkSQL procedure) and it does more than I need (I don't need to know which rows changed, I just need true or false for whether a partition changed).
>>>>>
>>>>> Thanks
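A rough sketch of the metadata-table approach Ryan describes above, which also covers the "has this partition changed?" check Pucheng is after: join all_entries to snapshots and keep only added/deleted entries. The status codes and the exact shape of the join are assumptions, and the query is unverified:

    SELECT
        e.data_file.partition,
        MAX(s.committed_at) AS last_modified_time
    FROM db.table.all_entries e
    JOIN db.table.snapshots s
        ON s.snapshot_id = e.snapshot_id
    WHERE e.status IN (1, 2)          -- 1 = ADDED, 2 = DELETED manifest entries
    GROUP BY e.data_file.partition

For the true/false question, the same query can be restricted to commits between the old and new snapshots (for example by filtering on s.committed_at or snapshot ids) and to the partition of interest; a non-empty result means the partition changed.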
>>>>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>>>
>>>>>> Thanks Ajantha.
>>>>>>
>>>>>> > It should be very easy to add a few more fields to it like the latest sequence number or last modified time per partition.
>>>>>>
>>>>>> Among sequence number and modified time, which one do you think is more likely to be available in Iceberg partition stats? Note that we would like to avoid compaction changing the sequence number or modified time stats.
>>>>>>
>>>>>> Thanks,
>>>>>> Mayur
>>>>>>
>>>>>> From: Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> Sent: Tuesday, February 7, 2023 10:02 AM
>>>>>> To: dev@iceberg.apache.org
>>>>>> Subject: Re: [Proposal] Partition stats in Iceberg
>>>>>>
>>>>>> Hi Hrishi and Mayur, thanks for the inputs.
>>>>>>
>>>>>> To get things moving, I have frozen the scope of the phase 1 implementation. (I recently added the delete file stats to phase 1 too.) You can find the scope in the "Design for approval" section of the design doc.
>>>>>>
>>>>>> That said, once we have phase 1 implemented, it should be very easy to add a few more fields to it, like the latest sequence number or last modified time per partition. I will open up the discussion about the phase 2 schema again once the phase 1 implementation is done.
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>>>>
>>>>>> On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>>>>
>>>>>> +1 for the initiative.
>>>>>>
>>>>>> We’ve been exploring options for storing last-modified-time per partition. It’s an important building block for data pipelines – especially if there is a dependency between jobs with strong consistency requirements.
>>>>>>
>>>>>> Is partition stats a good place for storing last-modified-time per partition?
>>>>>>
>>>>>> Thanks,
>>>>>> Mayur
>>>>>>
>>>>>> From: Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> Sent: Monday, January 23, 2023 11:56 AM
>>>>>> To: dev@iceberg.apache.org
>>>>>> Subject: Re: [Proposal] Partition stats in Iceberg
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> In the same design document (https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing), I have added a section called "Design for approval". It also contains a potential PR breakdown for the phase 1 implementation and the future development scope. Please take a look and please vote if you think the design is ok.
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>>>>
>>>>>> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>> A big thanks to everyone who was involved in the review and the discussions so far.
>>>>>>
>>>>>> Please find the meeting minutes from the last Iceberg sync about the partition stats:
>>>>>>   a. Writers should not write the partition stats, or any stats, as of now, because that requires bumping the spec to v3. (We can have it as part of the v3 spec later on, but not anytime soon.)
>>>>>>   b. So, there can be an async way of generating the stats, like an ANALYZE TABLE statement or a CALL procedure,
>>>>>>      which will compute the stats up to the current snapshot and store them as a partition stats file.
>>>>>>   c. In phase 1, partition stats will just store the row_count and file_count per partition value, as mentioned in the design document. Later, it can be enhanced to store the Puffin file location and other metrics per partition value.
>>>>>>   d. These tuples are stored in a single sorted Avro/Parquet file (we need to finalize this).
>>>>>>   e. Each time, "analyze table" will rewrite the whole stats file, as keeping multiple delta files will just make the read path messy. Also, even with a million rows, it can be a few MB in size. Once the writers start writing the stats (v3 spec), we can revisit storing delta files if there are any performance issues.
>>>>>>
>>>>>> The next immediate plan is to:
>>>>>>   a. Get these PRs merged (open points in the existing StatisticsFile interface added during Puffin): #6267 <https://github.com/apache/iceberg/pull/6267>, #6090 <https://github.com/apache/iceberg/pull/6090>, #6091 <https://github.com/apache/iceberg/pull/6091>
>>>>>>   b. Figure out how to give accurate stats with row-level deletes and how to mask dropped partition values from stats: https://github.com/apache/iceberg/issues/6042
>>>>>>   c. Standardize the `StatisticsFile` interface to hold the Parquet/Avro stats file (instead of always assuming it is a Puffin file) and introduce a `StatisticsType` enum.
>>>>>>   d. Conclude the storage format and get approval for the design.
>>>>>>
>>>>>> I will wait another week or two for some more people to take a look at the document before jumping into the implementation.
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha.
>>>>>>
>>>>>> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> > are you saying that you think the partition-level stats should not be required? I think that would be best.
>>>>>>
>>>>>> I think there is some confusion here. Partition-level stats are required (hence the proposal). The point of discussion was whether the writer always writes them (with the append/delete/replace operation), or whether the writer skips writing them and the user then generates them on demand using something like "ANALYZE TABLE". I think we can have both options, with the writer stats writing controlled by a table property, "write.stats.enabled".
>>>>>>
>>>>>> > I’m all for improving the interface for retrieving stats. It’s a separate issue
>>>>>>
>>>>>> Agree. Let us discuss it in a separate thread.
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>>>>
>>>>>> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> Ajantha, are you saying that you think the partition-level stats should not be required? I think that would be best.
>>>>>>
>>>>>> I’m all for improving the interface for retrieving stats. It’s a separate issue, but I think that Iceberg should provide both access to the Puffin files and metadata as well as a higher-level interface for retrieving information like a column’s NDV.
>>>>>> Something like this:
>>>>>>
>>>>>>     int ndv = table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>>>>>>
>>>>>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Ryan,
>>>>>> Thanks a lot for the review and suggestions.
>>>>>>
>>>>>> > but I think there is also a decision that we need to make before that: Should Iceberg require writers to maintain the partition stats?
>>>>>> > I think I would prefer to take a lazy approach and not assume that writers will keep the partition stats up to date, in which case we need a way to know which parts of a table are newer than the most recent stats.
>>>>>>
>>>>>> This is a common problem for the existing table-level Puffin stats too, not just for partition stats.
>>>>>> As mentioned in point 8) of the "integration with the current code" section, I was planning to introduce a table property, "write.stats.enabled", with a default value of false.
>>>>>>
>>>>>> And as per point 7), I was planning to introduce an "ANALYZE table" or "CALL procedure" SQL (maybe a table-level API too) to asynchronously compute the stats on demand from the previous checkpoints.
>>>>>>
>>>>>> But currently, `TableMetadata` doesn't have a clean interface to provide the statistics file for the current snapshot. If stats are not present, we need another interface to provide the last successful snapshot id for which stats were computed. Also, there is some confusion around reusing the statistics file (because the spec only has a computed snapshot id, not the referenced snapshot id). I am planning to open a PR to handle these interface updates this week (the same things you suggested in the last Iceberg sync). This should serve as a good foundation for lazy & incremental stats computation.
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>>>>
>>>>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> Thanks for writing this up, Ajantha! I think that we have all the upstream pieces in place to work on this, so it's great to have a proposal.
>>>>>>
>>>>>> The proposal does a good job of summarizing the choices for how to store the data, but I think there is also a decision that we need to make before that: Should Iceberg require writers to maintain the partition stats?
>>>>>>
>>>>>> If we do want writers to participate, then we may want to make choices that are easier for writers. But I think that is going to be a challenge. Adding requirements for writers would mean that we need to bump the spec version. Otherwise, we aren't guaranteed that writers will update the files correctly. I think I would prefer to take a lazy approach and not assume that writers will keep the partition stats up to date, in which case we need a way to know which parts of a table are newer than the most recent stats.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Piotr for taking a look at it.
>>>>>> I have replied to all the comments in the document.
>>>>>> I might need your support in standardising the existing `StatisticsFile` interface to adopt partition stats as mentioned in the design.
>>>>>>
>>>>>> *We do need more eyes on the design. Once I get approval for the design, I can start the implementation.*
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
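To make the knobs from Ajantha's note above concrete, a small Spark SQL sketch of how they might be used. The property name "write.stats.enabled" is only the one proposed in this thread (not an existing Iceberg property), and the compute_partition_stats procedure name is purely hypothetical:

    -- proposed (not yet existing) property to let writers maintain partition stats
    ALTER TABLE db.sample SET TBLPROPERTIES ('write.stats.enabled' = 'false');

    -- hypothetical on-demand computation, along the lines of the proposed ANALYZE/CALL flow
    CALL spark_catalog.system.compute_partition_stats(table => 'db.sample');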
>>>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> This is a very interesting document, thank you for your work on this!
>>>>>> I've added a few comments there.
>>>>>>
>>>>>> I have one high-level design comment, so I thought it would be nicer to everyone if I re-posted it here:
>>>>>>
>>>>>> Is "partition" the right level for keeping the stats? We do this in Hive, but was it an accidental choice, or just the only thing that was possible to implement many years ago?
>>>>>>
>>>>>> Iceberg allows a higher number of partitions compared to Hive, because it scales better. But that means the partition level may or may not be the right granularity.
>>>>>>
>>>>>> A self-optimizing system would gather stats on a "per query unit" basis -- for example, if I partition by [ day x country ] but usually query by day, the days are the "query unit" and, from a stats perspective, country can be ignored. Having more fine-grained partitions may lead to expensive planning time, so it's not a theoretical problem.
>>>>>>
>>>>>> I am not saying we should implement all this logic right now, but I think we should decouple the partitioning scheme from the stats partitions, to allow the query engine to become smarter.
>>>>>>
>>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>>
>>>>>> Best
>>>>>> PF
>>>>>>
>>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Community,
>>>>>> I did a proposal write-up for the partition stats in Iceberg. Please have a look and let me know what you think. I would like to work on it.
>>>>>>
>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>
>>>>>> Requirement background snippet from the above document:
>>>>>>
>>>>>> For some query engines that use a cost-based optimizer instead of, or along with, a rule-based optimizer (like Dremio, Trino, etc.), it is good to know partition-level stats such as total rows per partition and total files per partition at planning time, in order to make CBO decisions (like deciding on the join reordering and join type, and identifying the parallelism). Currently, the only way to do this is to read the partition info from the data_file in the manifest_entry of the manifest file and compute partition-level statistics (the same thing that the ‘partitions’ metadata table is doing [see Appendix A <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>]). Doing this on each query is expensive. *Hence, this is a proposal for computing and storing partition-level stats for Iceberg tables and using them during queries.*
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
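As a point of reference for the paragraph above, the 'partitions' metadata table can already be queried directly today; a minimal Spark SQL example (the table name is illustrative), showing exactly the per-query computation the proposal aims to avoid repeating:

    SELECT partition, record_count, file_count
    FROM db.sample.partitions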
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular