Re: [Proposal] Partition stats in Iceberg

Ryan Blue Wed, 03 May 2023 11:00:02 -0700

Mayur, your use case may require a lot of snapshots, but we generally
recommend expiring them after a few days. You can tag snapshots to keep
them around longer than that.


On Tue, May 2, 2023 at 4:52 PM Mayur Srivastava <
mayur.p.srivast...@gmail.com> wrote:

> Thanks for the response.
> One of the use cases that we have is where one business day of data is
> added at a time to a DAY partitioned table. With 25 years of this data,
> there will be ~6250 partitions and as many snapshots. Are these many
> snapshots recommended to be kept around?
>
> On Tue, May 2, 2023 at 7:45 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>>
>> Does snapshot expiration needs to be disabled for this to work? Thanks,
>>> Mayur
>>>
>>
>> Yes, the snapshot that last updated the partition needs to be around for
>> this to work.
>>
>>  Szehon, the query you shared requires a SparkSQL job to be run which
>>> means latency will be high. However, I am glad you are also thinking of
>>> adding these directly to the partition table and it seems we share the same
>>> interests.
>>
>>
>> Yea the partitions table currently still goes through SparkSQL, so it
>> will be the same.  Maybe you mean add this to partition stats?  We do need
>> to reconcile partition table and partition stats at some point though.  Not
>> sure if it was designed/discussed yet, I think there was some thoughts on
>> short-circuiting Partitions table to read from Partition stats, if stats
>> exist for the current snapshot.
>>
>> Thanks
>> Szehon
>>
>> On Tue, May 2, 2023 at 4:34 PM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> Thanks Ryan and Szehon!
>>>
>>> Szehon, the query you shared requires a SparkSQL job to be run which
>>> means latency will be high. However, I am glad you are also thinking of
>>> adding these directly to the partition table and it seems we share the same
>>> interests. I am looking forward to the work in the phase 2 implementation.
>>> Let me know if I can help, thanks.
>>>
>>> On Tue, May 2, 2023 at 4:28 PM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Yea I agree, I had a handy query for the last update time of partition.
>>>>
>>>> SELECT
>>>>
>>>> e.data_file.partition,
>>>>
>>>> MAX(s.committed_at) AS last_modified_time
>>>>
>>>> FROM db.table.snapshots s
>>>>
>>>> JOIN db.table.entries e
>>>>
>>>> WHERE s.snapshot_id = e.snapshot_id
>>>>
>>>> GROUP BY by e.data_file.partition
>>>>
>>>> It's a bit lengthy currently.
>>>>
>>>> I have been indeed thinking to look at adding these fields to the
>>>> Partitions table directly, after Ajantha's pending changes to add delete
>>>> files to this table.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Tue, May 2, 2023 at 4:08 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Pucheng,
>>>>>
>>>>> Rather than using the changelog, I'd just look at the metadata tables.
>>>>> You should be able to query the all_entries metadata table to see file
>>>>> additions or deletions for a given snapshot. Then from there you can join
>>>>> to the snapshots table for timestamps and aggregate to the partition 
>>>>> level.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang
>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>
>>>>>> Hi Ajantha and the community,
>>>>>>
>>>>>> I am interested and I am wondering where we can see the latest
>>>>>> progress of this feature?
>>>>>>
>>>>>> Regarding the partition stats in Iceberg, I am specifically curious
>>>>>> if we can consider a new field called "last modified time" to be included
>>>>>> for the partitions stats (or have a plugable way to allow users to
>>>>>> configure partition stats they need). My use case is to find out if a
>>>>>> partition is changed or not given two snapshots (old and new) with a
>>>>>> quick and light way process. I previously was suggested by the community 
>>>>>> to
>>>>>> use the change log (CDC) but I think that is too heavy (I guess, since it
>>>>>> requires to run SparkSQL procedure) and it is over do the work (I don't
>>>>>> need what rows are changed, I just need true or false for whether a
>>>>>> partition is changed).
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava <
>>>>>> mayur.srivast...@twosigma.com> wrote:
>>>>>>
>>>>>>> Thanks Ajantha.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > It should be very easy to add a few more fields to it like the
>>>>>>> latest sequence number or last modified time per partition.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Among sequence number and modified time, which one do you think is
>>>>>>> more likely to be available in Iceberg partition stats? Note that we 
>>>>>>> would
>>>>>>> like to avoid compaction change the sequence number or modified time 
>>>>>>> stats.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mayur
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> *Sent:* Tuesday, February 7, 2023 10:02 AM
>>>>>>> *To:* dev@iceberg.apache.org
>>>>>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Hrishi and Mayur, thanks for the inputs.
>>>>>>>
>>>>>>> To get things moving I have frozen the scope of phase 1
>>>>>>> implementation. (Recently added the delete file stats to phase 1 too). 
>>>>>>> You
>>>>>>> can find the scope in the "Design for approval" section of the design 
>>>>>>> doc.
>>>>>>>
>>>>>>> That said, once we have phase 1 implemented, It should be very easy
>>>>>>> to add a few more fields to it like the latest sequence number or last
>>>>>>> modified time per partition.
>>>>>>> I will be opening up the discussion about phase 2 schema again once
>>>>>>> phase 1 implementation is done.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <
>>>>>>> mayur.srivast...@twosigma.com> wrote:
>>>>>>>
>>>>>>> +1 for the initiative.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We’ve been exploring options for storing last-modified-time per
>>>>>>> partition. It an important building block for data pipelines – 
>>>>>>> especially
>>>>>>> if there is a dependency between jobs with strong consistency 
>>>>>>> requirements.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Is partition stats a good place for storing last-modified-time per
>>>>>>> partition?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mayur
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> *Sent:* Monday, January 23, 2023 11:56 AM
>>>>>>> *To:* dev@iceberg.apache.org
>>>>>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> In the same design document (
>>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>> ),
>>>>>>> I have added a section called
>>>>>>> *"Design for approval".  *It also contains a potential PR breakdown
>>>>>>> for the phase 1 implementation and future development scope.
>>>>>>> Please take a look and please vote if you think the design is ok.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> A big thanks to everyone who was involved in the review and the
>>>>>>> discussions so far.
>>>>>>>
>>>>>>> Please find the meeting minutes from the last iceberg sync about the
>>>>>>> partition stats.
>>>>>>>     a. Writers should not write the partition stats or any stats as
>>>>>>> of now.
>>>>>>>         Because it requires bumping the spec to V3. (We can have it
>>>>>>> as part of the v3 spec later on. But not anytime soon).
>>>>>>>     b. So, there can be an async way of generating the stats like
>>>>>>> ANALYZE table or call procedure.
>>>>>>>         Which will compute the stats till the current snapshot and
>>>>>>> store it as a partition stats file.
>>>>>>>     c. In phase 1, partition stats will just store the row_count and
>>>>>>> file_count per partition value as mentioned in the design document.
>>>>>>>         Later it can be enhanced to store puffin file location and
>>>>>>> other metrics per partition value.
>>>>>>>     d. These tuples are stored in a single sorted Avro/parquet file
>>>>>>> (we need to finalize this).
>>>>>>>     e. Each time "analyze table" will rewrite the whole stats file
>>>>>>> as keeping multiple delta files will just make the read path messy.
>>>>>>>         Also, even with million rows, it can be of a few MB size.
>>>>>>>         Once the writers start writing the stats (V3 spec), we can
>>>>>>> revisit storing as the delta files if there are any performance issues.
>>>>>>>
>>>>>>> The next immediate plan is to
>>>>>>>     a. Get these PRs merged (open points in existing StatictisFile
>>>>>>> interface added during Puffin)
>>>>>>>         #6267 <https://github.com/apache/iceberg/pull/6267>, #6090
>>>>>>> <https://github.com/apache/iceberg/pull/6090>, #6091
>>>>>>> <https://github.com/apache/iceberg/pull/6091>
>>>>>>>     b. Figure out how to give accurate stats with row-level deletes
>>>>>>> and how to mask dropped partition values from stats.
>>>>>>>         https://github.com/apache/iceberg/issues/6042
>>>>>>>     c. Standardize the `StatictisFile` interface to hold the
>>>>>>> parquet/Avro stats file (instead of always assuming it as a Puffin file)
>>>>>>>         and introduce a `StatisticsType` enum.
>>>>>>>     d. Conclude the storage format and get approval for the design.
>>>>>>>
>>>>>>> I will wait another week or two for some more people to take a look
>>>>>>> at the document
>>>>>>>
>>>>>>> before jumping into the implementation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> are you saying that you think the partition-level stats should not
>>>>>>> be required? I think that would be best.
>>>>>>>
>>>>>>> I think there is some confusion here. Partition-level stats are
>>>>>>> required (hence the proposal).
>>>>>>> But does the writer always write it? (with the append/delete/replace
>>>>>>> operation)
>>>>>>> or writer skips writing it and then the user generates it using DML
>>>>>>> like "Analyze table" was the point of discussion.
>>>>>>> I think we can have both options with the writer stats writing
>>>>>>> controlled by a table property "write.stats.enabled"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I’m all for improving the interface for retrieving stats. It’s a
>>>>>>> separate issue
>>>>>>>
>>>>>>> Agree. Let us discuss it in a separate thread.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>> Ajantha, are you saying that you think the partition-level stats
>>>>>>> should not be required? I think that would be best.
>>>>>>>
>>>>>>> I’m all for improving the interface for retrieving stats. It’s a
>>>>>>> separate issue, but I think that Iceberg should provide both access to 
>>>>>>> the
>>>>>>> Puffin files and metadata as well as a higher-level interface for
>>>>>>> retrieving information like a column’s NDV. Something like this:
>>>>>>>
>>>>>>> int ndv = 
>>>>>>> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>> Thanks a lot for the review and suggestions.
>>>>>>>
>>>>>>> but I think there is also a decision that we need to make before
>>>>>>> that: Should Iceberg require writers to maintain the partition stats?
>>>>>>>
>>>>>>> I think I would prefer to take a lazy approach and not assume that
>>>>>>> writers will keep the partition stats up to date,
>>>>>>>
>>>>>>> in which case we need a way to know which parts of a table are newer
>>>>>>> than the most recent stats.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is a common problem for existing table-level puffin stats too.
>>>>>>> Not just for partition stats.
>>>>>>> As mentioned in the "integration with the current code" section
>>>>>>> point 8),
>>>>>>> I was planning to introduce a table property "write.stats.enabled"
>>>>>>> with a default value set to false.
>>>>>>>
>>>>>>> And as per point 7), I was planning to introduce an "ANALYZE table"
>>>>>>> or "CALL procedure" SQL (maybe table-level API too) to asynchronously
>>>>>>> compute the stats on demand from the previous checkpoints.
>>>>>>>
>>>>>>> But currently, `TableMetadata` doesn't have a clean Interface to
>>>>>>> provide the statistics file for the current snapshot.
>>>>>>> If stats are not present, we need another interface to provide a
>>>>>>> last successful snapshot id for which stats was computed.
>>>>>>> Also, there is some confusion around reusing the statistics file
>>>>>>> (because the spec only has a computed snapshot id, not the referenced
>>>>>>> snapshot id).
>>>>>>> I am planning to open up a PR to handle these interface updates
>>>>>>> this week. (same things as you suggested in the last Iceberg sync).
>>>>>>> This should serve as a good foundation to get insights for lazy &
>>>>>>> incremental stats computing.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>> Thanks for writing this up, Ajantha! I think that we have all the
>>>>>>> upstream pieces in place to work on this so it's great to have a 
>>>>>>> proposal.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The proposal does a good job of summarizing the choices for how to
>>>>>>> store the data, but I think there is also a decision that we need to 
>>>>>>> make
>>>>>>> before that: Should Iceberg require writers to maintain the partition 
>>>>>>> stats?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If we do want writers to participate, then we may want to make
>>>>>>> choices that are easier for writers. But I think that is going to be a
>>>>>>> challenge. Adding requirements for writers would mean that we need to 
>>>>>>> bump
>>>>>>> the spec version. Otherwise, we aren't guaranteed that writers will 
>>>>>>> update
>>>>>>> the files correctly. I think I would prefer to take a lazy approach and 
>>>>>>> not
>>>>>>> assume that writers will keep the partition stats up to date, in which 
>>>>>>> case
>>>>>>> we need a way to know which parts of a table are newer than the most 
>>>>>>> recent
>>>>>>> stats.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks Piotr for taking a look at it.
>>>>>>> I have replied to all the comments in the document.
>>>>>>> I might need your support in standardising the existing
>>>>>>> `StatisticsFile` interface to adopt partition stats as mentioned in the
>>>>>>> design.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *We do need more eyes on the design. Once I get approval for the
>>>>>>> design, I can start the implementation.  *
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <
>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>
>>>>>>> Hi Ajantha,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> this is very interesting document, thank you for your work on this!
>>>>>>>
>>>>>>> I've added a few comments there.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have one high-level design comment so I thought it would be nicer
>>>>>>> to everyone if I re-post it here
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> is "partition" the right level of keeping the stats?
>>>>>>> We do this in Hive, but was it an accidental choice? or just the
>>>>>>> only thing that was possible to be implemented many years ago?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Iceberg allows to have higher number of partitions compared to Hive,
>>>>>>> because it scales better. But that means partition-level may or may not 
>>>>>>> be
>>>>>>> the right granularity.
>>>>>>>
>>>>>>>
>>>>>>> A self-optimizing system would gather stats on "per query unit"
>>>>>>> basis -- for example if i partition by [ day x country ], but usually 
>>>>>>> query
>>>>>>> by day, the days are the "query unit" and from stats perspective country
>>>>>>> can be ignored.
>>>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>>>> time, so it's not theoretical problem.
>>>>>>>
>>>>>>>
>>>>>>> I am not saying we should implement all this logic right now, but I
>>>>>>> think we should decouple partitioning scheme from stats partitions, to
>>>>>>> allow  query engine to become smarter.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Community,
>>>>>>> I did a proposal write-up for the partition stats in Iceberg.
>>>>>>> Please have a look and let me know what you think. I would like to
>>>>>>> work on it.
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>>
>>>>>>> Requirement background snippet from the above document.
>>>>>>>
>>>>>>> For some query engines that use cost-based-optimizer instead or
>>>>>>> along with rule-based-optimizer (like Dremio, Trino, etc), at the 
>>>>>>> planning
>>>>>>> time,
>>>>>>> it is good to know the partition level stats like total rows per
>>>>>>> partition and total files per partition to take decisions for CBO (
>>>>>>> like deciding on the join reordering and join type, identifying the
>>>>>>> parallelism).
>>>>>>> Currently, the only way to do this is to read the partition info
>>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>>> partition-level statistics (the same thing that ‘partitions’ metadata 
>>>>>>> table
>>>>>>> is doing *[see **Appendix A*
>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>>> *]*).
>>>>>>> Doing this on each query is expensive. *Hence, this is a proposal
>>>>>>> for computing and storing partition-level stats for Iceberg tables and
>>>>>>> using them during queries.*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>> Tabular
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>> Tabular
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>

-- 
Ryan Blue
Tabular

Re: [Proposal] Partition stats in Iceberg

Reply via email to