Does snapshot expiration needs to be disabled for this to work? Thanks, Mayur
On Tue, May 2, 2023 at 7:28 PM Szehon Ho <szehon.apa...@gmail.com> wrote: > Yea I agree, I had a handy query for the last update time of partition. > > SELECT > > e.data_file.partition, > > MAX(s.committed_at) AS last_modified_time > > FROM db.table.snapshots s > > JOIN db.table.entries e > > WHERE s.snapshot_id = e.snapshot_id > > GROUP BY by e.data_file.partition > > It's a bit lengthy currently. > > I have been indeed thinking to look at adding these fields to the > Partitions table directly, after Ajantha's pending changes to add delete > files to this table. > > Thanks > Szehon > > On Tue, May 2, 2023 at 4:08 PM Ryan Blue <b...@tabular.io> wrote: > >> Pucheng, >> >> Rather than using the changelog, I'd just look at the metadata tables. >> You should be able to query the all_entries metadata table to see file >> additions or deletions for a given snapshot. Then from there you can join >> to the snapshots table for timestamps and aggregate to the partition level. >> >> Ryan >> >> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang <py...@pinterest.com.invalid> >> wrote: >> >>> Hi Ajantha and the community, >>> >>> I am interested and I am wondering where we can see the latest progress >>> of this feature? >>> >>> Regarding the partition stats in Iceberg, I am specifically curious if >>> we can consider a new field called "last modified time" to be included for >>> the partitions stats (or have a plugable way to allow users to >>> configure partition stats they need). My use case is to find out if a >>> partition is changed or not given two snapshots (old and new) with a >>> quick and light way process. I previously was suggested by the community to >>> use the change log (CDC) but I think that is too heavy (I guess, since it >>> requires to run SparkSQL procedure) and it is over do the work (I don't >>> need what rows are changed, I just need true or false for whether a >>> partition is changed). >>> >>> Thanks >>> >>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava < >>> mayur.srivast...@twosigma.com> wrote: >>> >>>> Thanks Ajantha. >>>> >>>> >>>> >>>> > It should be very easy to add a few more fields to it like the latest >>>> sequence number or last modified time per partition. >>>> >>>> >>>> >>>> Among sequence number and modified time, which one do you think is more >>>> likely to be available in Iceberg partition stats? Note that we would like >>>> to avoid compaction change the sequence number or modified time stats. >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Mayur >>>> >>>> >>>> >>>> *From:* Ajantha Bhat <ajanthab...@gmail.com> >>>> *Sent:* Tuesday, February 7, 2023 10:02 AM >>>> *To:* dev@iceberg.apache.org >>>> *Subject:* Re: [Proposal] Partition stats in Iceberg >>>> >>>> >>>> >>>> Hi Hrishi and Mayur, thanks for the inputs. >>>> >>>> To get things moving I have frozen the scope of phase 1 implementation. >>>> (Recently added the delete file stats to phase 1 too). You can find the >>>> scope in the "Design for approval" section of the design doc. >>>> >>>> That said, once we have phase 1 implemented, It should be very easy to >>>> add a few more fields to it like the latest sequence number or last >>>> modified time per partition. >>>> I will be opening up the discussion about phase 2 schema again once >>>> phase 1 implementation is done. >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava < >>>> mayur.srivast...@twosigma.com> wrote: >>>> >>>> +1 for the initiative. >>>> >>>> >>>> >>>> We’ve been exploring options for storing last-modified-time per >>>> partition. It an important building block for data pipelines – especially >>>> if there is a dependency between jobs with strong consistency requirements. >>>> >>>> >>>> >>>> Is partition stats a good place for storing last-modified-time per >>>> partition? >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Mayur >>>> >>>> >>>> >>>> *From:* Ajantha Bhat <ajanthab...@gmail.com> >>>> *Sent:* Monday, January 23, 2023 11:56 AM >>>> *To:* dev@iceberg.apache.org >>>> *Subject:* Re: [Proposal] Partition stats in Iceberg >>>> >>>> >>>> >>>> Hi All, >>>> >>>> In the same design document ( >>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing >>>> ), >>>> I have added a section called >>>> *"Design for approval". *It also contains a potential PR breakdown >>>> for the phase 1 implementation and future development scope. >>>> Please take a look and please vote if you think the design is ok. >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>> A big thanks to everyone who was involved in the review and the >>>> discussions so far. >>>> >>>> Please find the meeting minutes from the last iceberg sync about the >>>> partition stats. >>>> a. Writers should not write the partition stats or any stats as of >>>> now. >>>> Because it requires bumping the spec to V3. (We can have it as >>>> part of the v3 spec later on. But not anytime soon). >>>> b. So, there can be an async way of generating the stats like >>>> ANALYZE table or call procedure. >>>> Which will compute the stats till the current snapshot and >>>> store it as a partition stats file. >>>> c. In phase 1, partition stats will just store the row_count and >>>> file_count per partition value as mentioned in the design document. >>>> Later it can be enhanced to store puffin file location and >>>> other metrics per partition value. >>>> d. These tuples are stored in a single sorted Avro/parquet file (we >>>> need to finalize this). >>>> e. Each time "analyze table" will rewrite the whole stats file as >>>> keeping multiple delta files will just make the read path messy. >>>> Also, even with million rows, it can be of a few MB size. >>>> Once the writers start writing the stats (V3 spec), we can >>>> revisit storing as the delta files if there are any performance issues. >>>> >>>> The next immediate plan is to >>>> a. Get these PRs merged (open points in existing StatictisFile >>>> interface added during Puffin) >>>> #6267 <https://github.com/apache/iceberg/pull/6267>, #6090 >>>> <https://github.com/apache/iceberg/pull/6090>, #6091 >>>> <https://github.com/apache/iceberg/pull/6091> >>>> b. Figure out how to give accurate stats with row-level deletes and >>>> how to mask dropped partition values from stats. >>>> https://github.com/apache/iceberg/issues/6042 >>>> c. Standardize the `StatictisFile` interface to hold the >>>> parquet/Avro stats file (instead of always assuming it as a Puffin file) >>>> and introduce a `StatisticsType` enum. >>>> d. Conclude the storage format and get approval for the design. >>>> >>>> I will wait another week or two for some more people to take a look at >>>> the document >>>> >>>> before jumping into the implementation. >>>> >>>> Thanks, >>>> Ajantha. >>>> >>>> >>>> >>>> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>> Hi Ryan, >>>> >>>> are you saying that you think the partition-level stats should not be >>>> required? I think that would be best. >>>> >>>> I think there is some confusion here. Partition-level stats are >>>> required (hence the proposal). >>>> But does the writer always write it? (with the append/delete/replace >>>> operation) >>>> or writer skips writing it and then the user generates it using DML >>>> like "Analyze table" was the point of discussion. >>>> I think we can have both options with the writer stats writing >>>> controlled by a table property "write.stats.enabled" >>>> >>>> >>>> >>>> I’m all for improving the interface for retrieving stats. It’s a >>>> separate issue >>>> >>>> Agree. Let us discuss it in a separate thread. >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote: >>>> >>>> Ajantha, are you saying that you think the partition-level stats should >>>> not be required? I think that would be best. >>>> >>>> I’m all for improving the interface for retrieving stats. It’s a >>>> separate issue, but I think that Iceberg should provide both access to the >>>> Puffin files and metadata as well as a higher-level interface for >>>> retrieving information like a column’s NDV. Something like this: >>>> >>>> int ndv = >>>> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x"); >>>> >>>> >>>> >>>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>> Hi Ryan, >>>> Thanks a lot for the review and suggestions. >>>> >>>> but I think there is also a decision that we need to make before that: >>>> Should Iceberg require writers to maintain the partition stats? >>>> >>>> I think I would prefer to take a lazy approach and not assume that >>>> writers will keep the partition stats up to date, >>>> >>>> in which case we need a way to know which parts of a table are newer >>>> than the most recent stats. >>>> >>>> >>>> >>>> This is a common problem for existing table-level puffin stats too. >>>> Not just for partition stats. >>>> As mentioned in the "integration with the current code" section point >>>> 8), >>>> I was planning to introduce a table property "write.stats.enabled" with >>>> a default value set to false. >>>> >>>> And as per point 7), I was planning to introduce an "ANALYZE table" or >>>> "CALL procedure" SQL (maybe table-level API too) to asynchronously >>>> compute the stats on demand from the previous checkpoints. >>>> >>>> But currently, `TableMetadata` doesn't have a clean Interface to >>>> provide the statistics file for the current snapshot. >>>> If stats are not present, we need another interface to provide a last >>>> successful snapshot id for which stats was computed. >>>> Also, there is some confusion around reusing the statistics file >>>> (because the spec only has a computed snapshot id, not the referenced >>>> snapshot id). >>>> I am planning to open up a PR to handle these interface updates >>>> this week. (same things as you suggested in the last Iceberg sync). >>>> This should serve as a good foundation to get insights for lazy & >>>> incremental stats computing. >>>> >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote: >>>> >>>> Thanks for writing this up, Ajantha! I think that we have all the >>>> upstream pieces in place to work on this so it's great to have a proposal. >>>> >>>> >>>> >>>> The proposal does a good job of summarizing the choices for how to >>>> store the data, but I think there is also a decision that we need to make >>>> before that: Should Iceberg require writers to maintain the partition >>>> stats? >>>> >>>> >>>> >>>> If we do want writers to participate, then we may want to make choices >>>> that are easier for writers. But I think that is going to be a challenge. >>>> Adding requirements for writers would mean that we need to bump the spec >>>> version. Otherwise, we aren't guaranteed that writers will update the files >>>> correctly. I think I would prefer to take a lazy approach and not assume >>>> that writers will keep the partition stats up to date, in which case we >>>> need a way to know which parts of a table are newer than the most recent >>>> stats. >>>> >>>> >>>> >>>> Ryan >>>> >>>> >>>> >>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>> Thanks Piotr for taking a look at it. >>>> I have replied to all the comments in the document. >>>> I might need your support in standardising the existing >>>> `StatisticsFile` interface to adopt partition stats as mentioned in the >>>> design. >>>> >>>> >>>> >>>> *We do need more eyes on the design. Once I get approval for the >>>> design, I can start the implementation. * >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen < >>>> pi...@starburstdata.com> wrote: >>>> >>>> Hi Ajantha, >>>> >>>> >>>> >>>> this is very interesting document, thank you for your work on this! >>>> >>>> I've added a few comments there. >>>> >>>> >>>> >>>> I have one high-level design comment so I thought it would be nicer to >>>> everyone if I re-post it here >>>> >>>> >>>> >>>> is "partition" the right level of keeping the stats? >>>> We do this in Hive, but was it an accidental choice? or just the only >>>> thing that was possible to be implemented many years ago? >>>> >>>> >>>> >>>> Iceberg allows to have higher number of partitions compared to Hive, >>>> because it scales better. But that means partition-level may or may not be >>>> the right granularity. >>>> >>>> >>>> A self-optimizing system would gather stats on "per query unit" basis >>>> -- for example if i partition by [ day x country ], but usually query by >>>> day, the days are the "query unit" and from stats perspective country can >>>> be ignored. >>>> Having more fine-grained partitions may lead to expensive planning >>>> time, so it's not theoretical problem. >>>> >>>> >>>> I am not saying we should implement all this logic right now, but I >>>> think we should decouple partitioning scheme from stats partitions, to >>>> allow query engine to become smarter. >>>> >>>> >>>> >>>> >>>> >>>> cc @Alexander Jo <alex...@starburstdata.com> >>>> >>>> >>>> >>>> Best >>>> >>>> PF >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>> Hi Community, >>>> I did a proposal write-up for the partition stats in Iceberg. >>>> Please have a look and let me know what you think. I would like to work >>>> on it. >>>> >>>> >>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing >>>> >>>> Requirement background snippet from the above document. >>>> >>>> For some query engines that use cost-based-optimizer instead or along >>>> with rule-based-optimizer (like Dremio, Trino, etc), at the planning time, >>>> it is good to know the partition level stats like total rows per >>>> partition and total files per partition to take decisions for CBO ( >>>> like deciding on the join reordering and join type, identifying the >>>> parallelism). >>>> Currently, the only way to do this is to read the partition info from >>>> data_file >>>> in manifest_entry of the manifest file and compute partition-level >>>> statistics (the same thing that ‘partitions’ metadata table is doing *[see >>>> **Appendix A* >>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6> >>>> *]*). >>>> Doing this on each query is expensive. *Hence, this is a proposal for >>>> computing and storing partition-level stats for Iceberg tables and using >>>> them during queries.* >>>> >>>> >>>> >>>> Thanks, >>>> Ajantha >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Ryan Blue >>>> >>>> Tabular >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Ryan Blue >>>> >>>> Tabular >>>> >>>> >> >> -- >> Ryan Blue >> Tabular >> >