Thanks Ryan and Szehon!

Szehon, the query you shared requires running a Spark SQL job, which means
latency will be high. However, I am glad you are also thinking of adding
these fields directly to the partitions table; it seems we share the same
interest. I am looking forward to the phase 2 implementation. Let me know
if I can help, thanks.
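
For reference, here is a rough, untested sketch of the all_entries variant
Ryan described. The db.table name and the entry status codes are my own
assumptions, so please double-check before relying on it:

SELECT
  e.data_file.partition,
  MAX(s.committed_at) AS last_modified_time
FROM db.table.all_entries e          -- covers all snapshots, not just the current one
JOIN db.table.snapshots s
  ON e.snapshot_id = s.snapshot_id
WHERE e.status IN (1, 2)             -- keep ADDED (1) and DELETED (2) entries only (assumption)
GROUP BY e.data_file.partition

A partition could then be treated as "changed" between two snapshots if its
last_modified_time is later than the commit time of the older snapshot.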

On Tue, May 2, 2023 at 4:28 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Yea I agree, I had a handy query for the last update time of each partition:
>
> SELECT
>   e.data_file.partition,
>   MAX(s.committed_at) AS last_modified_time
> FROM db.table.snapshots s
> JOIN db.table.entries e
>   ON s.snapshot_id = e.snapshot_id
> GROUP BY e.data_file.partition
>
> It's a bit lengthy currently.
>
> I have indeed been thinking about adding these fields to the Partitions
> table directly, after Ajantha's pending changes to add delete files to
> this table.
>
> Thanks
> Szehon
>
> On Tue, May 2, 2023 at 4:08 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Pucheng,
>>
>> Rather than using the changelog, I'd just look at the metadata tables.
>> You should be able to query the all_entries metadata table to see file
>> additions or deletions for a given snapshot. Then from there you can join
>> to the snapshots table for timestamps and aggregate to the partition level.
>>
>> Ryan
>>
>> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> Hi Ajantha and the community,
>>>
>>> I am interested, and I am wondering where we can see the latest progress
>>> on this feature.
>>>
>>> Regarding the partition stats in Iceberg, I am specifically curious
>>> whether we can add a new field called "last modified time" to the
>>> partition stats (or have a pluggable way to allow users to configure the
>>> partition stats they need). My use case is to find out whether a
>>> partition has changed between two snapshots (old and new) in a quick and
>>> lightweight way. The community previously suggested that I use the
>>> changelog (CDC), but I think that is too heavy (since it requires running
>>> a Spark SQL procedure) and it does more work than I need (I don't need to
>>> know which rows changed, just true or false for whether a partition
>>> changed).
>>>
>>> Thanks
>>>
>>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava <
>>> mayur.srivast...@twosigma.com> wrote:
>>>
>>>> Thanks Ajantha.
>>>>
>>>>
>>>>
>>>> > It should be very easy to add a few more fields to it like the latest
>>>> sequence number or last modified time per partition.
>>>>
>>>>
>>>>
>>>> Among sequence number and modified time, which one do you think is more
>>>> likely to be available in the Iceberg partition stats? Note that we would
>>>> like to avoid compaction changing the sequence number or modified time stats.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mayur
>>>>
>>>>
>>>>
>>>> *From:* Ajantha Bhat <ajanthab...@gmail.com>
>>>> *Sent:* Tuesday, February 7, 2023 10:02 AM
>>>> *To:* dev@iceberg.apache.org
>>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>>
>>>>
>>>>
>>>> Hi Hrishi and Mayur, thanks for the inputs.
>>>>
>>>> To get things moving, I have frozen the scope of the phase 1
>>>> implementation (I recently added the delete file stats to phase 1 too). You
>>>> can find the scope in the "Design for approval" section of the design doc.
>>>>
>>>> That said, once we have phase 1 implemented, it should be very easy to
>>>> add a few more fields, like the latest sequence number or last modified
>>>> time per partition.
>>>> I will open up the discussion about the phase 2 schema again once the
>>>> phase 1 implementation is done.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>> On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <
>>>> mayur.srivast...@twosigma.com> wrote:
>>>>
>>>> +1 for the initiative.
>>>>
>>>>
>>>>
>>>> We’ve been exploring options for storing last-modified-time per
>>>> partition. It is an important building block for data pipelines – especially
>>>> if there is a dependency between jobs with strong consistency requirements.
>>>>
>>>>
>>>>
>>>> Is partition stats a good place for storing last-modified-time per
>>>> partition?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mayur
>>>>
>>>>
>>>>
>>>> *From:* Ajantha Bhat <ajanthab...@gmail.com>
>>>> *Sent:* Monday, January 23, 2023 11:56 AM
>>>> *To:* dev@iceberg.apache.org
>>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>>
>>>>
>>>>
>>>> Hi All,
>>>>
>>>> In the same design document (
>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>> ),
>>>> I have added a section called *"Design for approval"*. It also contains a
>>>> potential PR breakdown for the phase 1 implementation and the future
>>>> development scope.
>>>> Please take a look, and please vote if you think the design is OK.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>> A big thanks to everyone who was involved in the review and the
>>>> discussions so far.
>>>>
>>>> Please find the meeting minutes from the last iceberg sync about the
>>>> partition stats.
>>>>     a. Writers should not write the partition stats (or any stats) as of
>>>> now, because that would require bumping the spec to V3. (We can have it as
>>>> part of the V3 spec later on, but not anytime soon.)
>>>>     b. So, there can be an async way of generating the stats, like an
>>>> ANALYZE TABLE statement or a CALL procedure, which will compute the stats
>>>> up to the current snapshot and store them as a partition stats file.
>>>>     c. In phase 1, partition stats will just store the row_count and
>>>> file_count per partition value, as mentioned in the design document.
>>>>         Later it can be enhanced to store the Puffin file location and
>>>> other metrics per partition value.
>>>>     d. These tuples are stored in a single sorted Avro/Parquet file (we
>>>> need to finalize this).
>>>>     e. Each "ANALYZE TABLE" run will rewrite the whole stats file, as
>>>> keeping multiple delta files would just make the read path messy.
>>>>         Also, even with a million rows, the file should only be a few MB
>>>> in size.
>>>>         Once the writers start writing the stats (V3 spec), we can
>>>> revisit storing delta files if there are any performance issues.
>>>>
>>>> The next immediate plan is to
>>>>     a. Get these PRs merged (open points in the existing `StatisticsFile`
>>>> interface added during the Puffin work):
>>>>         #6267 <https://github.com/apache/iceberg/pull/6267>, #6090
>>>> <https://github.com/apache/iceberg/pull/6090>, #6091
>>>> <https://github.com/apache/iceberg/pull/6091>
>>>>     b. Figure out how to give accurate stats with row-level deletes and
>>>> how to mask dropped partition values from stats.
>>>>         https://github.com/apache/iceberg/issues/6042
>>>>     c. Standardize the `StatisticsFile` interface to hold the
>>>> Parquet/Avro stats file (instead of always assuming it is a Puffin file)
>>>>         and introduce a `StatisticsType` enum.
>>>>     d. Conclude the storage format and get approval for the design.
>>>>
>>>> I will wait another week or two for some more people to take a look at
>>>> the document before jumping into the implementation.
>>>>
>>>> Thanks,
>>>> Ajantha.
>>>>
>>>>
>>>>
>>>> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Ryan,
>>>>
>>>> are you saying that you think the partition-level stats should not be
>>>> required? I think that would be best.
>>>>
>>>> I think there is some confusion here. Partition-level stats are
>>>> required (hence the proposal).
>>>> The point of discussion was whether the writer always writes them (with
>>>> each append/delete/replace operation), or whether the writer skips writing
>>>> them and the user generates them on demand using DML like "ANALYZE TABLE".
>>>> I think we can have both options, with the writer's stats writing
>>>> controlled by a table property "write.stats.enabled".
>>>>
>>>>
>>>>
>>>> I’m all for improving the interface for retrieving stats. It’s a
>>>> separate issue
>>>>
>>>> Agree. Let us discuss it in a separate thread.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>> Ajantha, are you saying that you think the partition-level stats should
>>>> not be required? I think that would be best.
>>>>
>>>> I’m all for improving the interface for retrieving stats. It’s a
>>>> separate issue, but I think that Iceberg should provide both access to the
>>>> Puffin files and metadata as well as a higher-level interface for
>>>> retrieving information like a column’s NDV. Something like this:
>>>>
>>>> int ndv = 
>>>> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>>>>
>>>>
>>>>
>>>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Ryan,
>>>> Thanks a lot for the review and suggestions.
>>>>
>>>> but I think there is also a decision that we need to make before that:
>>>> Should Iceberg require writers to maintain the partition stats?
>>>>
>>>> I think I would prefer to take a lazy approach and not assume that
>>>> writers will keep the partition stats up to date,
>>>>
>>>> in which case we need a way to know which parts of a table are newer
>>>> than the most recent stats.
>>>>
>>>>
>>>>
>>>> This is a common problem for the existing table-level Puffin stats too,
>>>> not just for partition stats.
>>>> As mentioned in point 8 of the "integration with the current code"
>>>> section, I was planning to introduce a table property "write.stats.enabled"
>>>> with a default value set to false.
>>>>
>>>> And as per point 7, I was planning to introduce an "ANALYZE TABLE" or
>>>> "CALL procedure" SQL (maybe a table-level API too) to asynchronously
>>>> compute the stats on demand from the previous checkpoints.
>>>>
>>>> But currently, `TableMetadata` doesn't have a clean interface to
>>>> provide the statistics file for the current snapshot.
>>>> If stats are not present, we need another interface to provide the last
>>>> successful snapshot id for which stats were computed.
>>>> Also, there is some confusion around reusing the statistics file
>>>> (because the spec only has a computed snapshot id, not the referenced
>>>> snapshot id).
>>>> I am planning to open a PR this week to handle these interface updates
>>>> (the same things you suggested in the last Iceberg sync).
>>>> This should serve as a good foundation for lazy and incremental stats
>>>> computation.
>>>>
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>> Thanks for writing this up, Ajantha! I think that we have all the
>>>> upstream pieces in place to work on this so it's great to have a proposal.
>>>>
>>>>
>>>>
>>>> The proposal does a good job of summarizing the choices for how to
>>>> store the data, but I think there is also a decision that we need to make
>>>> before that: Should Iceberg require writers to maintain the partition 
>>>> stats?
>>>>
>>>>
>>>>
>>>> If we do want writers to participate, then we may want to make choices
>>>> that are easier for writers. But I think that is going to be a challenge.
>>>> Adding requirements for writers would mean that we need to bump the spec
>>>> version. Otherwise, we aren't guaranteed that writers will update the files
>>>> correctly. I think I would prefer to take a lazy approach and not assume
>>>> that writers will keep the partition stats up to date, in which case we
>>>> need a way to know which parts of a table are newer than the most recent
>>>> stats.
>>>>
>>>>
>>>>
>>>> Ryan
>>>>
>>>>
>>>>
>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Piotr for taking a look at it.
>>>> I have replied to all the comments in the document.
>>>> I might need your support in standardising the existing
>>>> `StatisticsFile` interface to adopt partition stats as mentioned in the
>>>> design.
>>>>
>>>>
>>>>
>>>> *We do need more eyes on the design. Once I get approval for the
>>>> design, I can start the implementation.*
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <
>>>> pi...@starburstdata.com> wrote:
>>>>
>>>> Hi Ajantha,
>>>>
>>>>
>>>>
>>>> this is very interesting document, thank you for your work on this!
>>>>
>>>> I've added a few comments there.
>>>>
>>>>
>>>>
>>>> I have one high-level design comment, so I thought it would be nicer to
>>>> everyone if I re-posted it here:
>>>>
>>>> Is "partition" the right level at which to keep the stats?
>>>> We do this in Hive, but was it an accidental choice, or just the only
>>>> thing that could be implemented many years ago?
>>>>
>>>>
>>>>
>>>> Iceberg allows a higher number of partitions than Hive because it
>>>> scales better. But that means the partition level may or may not be the
>>>> right granularity.
>>>>
>>>> A self-optimizing system would gather stats on a "per query unit" basis
>>>> -- for example, if I partition by [ day x country ] but usually query by
>>>> day, the days are the "query unit" and, from a stats perspective, country
>>>> can be ignored.
>>>> Having more fine-grained partitions may lead to expensive planning
>>>> time, so it's not a theoretical problem.
>>>>
>>>> I am not saying we should implement all this logic right now, but I
>>>> think we should decouple the partitioning scheme from the stats
>>>> partitions, to allow the query engine to become smarter.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>
>>>>
>>>>
>>>> Best
>>>>
>>>> PF
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Community,
>>>> I did a proposal write-up for the partition stats in Iceberg.
>>>> Please have a look and let me know what you think. I would like to work
>>>> on it.
>>>>
>>>>
>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>
>>>> Requirement background snippet from the above document.
>>>>
>>>> For some query engines that use a cost-based optimizer instead of, or
>>>> along with, a rule-based optimizer (like Dremio, Trino, etc.), it is useful
>>>> at planning time to know partition-level stats, such as total rows per
>>>> partition and total files per partition, to make CBO decisions (like
>>>> deciding on join reordering and join type, or identifying the parallelism).
>>>> Currently, the only way to do this is to read the partition info from the data_file
>>>> in each manifest_entry of the manifest files and compute partition-level
>>>> statistics (the same thing that the 'partitions' metadata table is doing *[see
>>>> Appendix A
>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>> ]*).
>>>> Doing this on each query is expensive. *Hence, this is a proposal for
>>>> computing and storing partition-level stats for Iceberg tables and using
>>>> them during queries.*
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Blue
>>>>
>>>> Tabular
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Blue
>>>>
>>>> Tabular
>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
