RE: [Proposal] Partition stats in Iceberg

Mayur Srivastava Mon, 15 May 2023 07:55:14 -0700

Thanks Ryan.
For most partition stats, I’m ok with compaction and keeping fewer snapshots. 
My concern was for supporting last modified time. I guess, if we need to keep 
all snapshots to support last modified time, it could have impact on metadata 
access performance.

Thanks,
Mayur

From: Ryan Blue <b...@tabular.io>
Sent: Wednesday, May 3, 2023 2:00 PM
To: dev@iceberg.apache.org
Subject: Re: [Proposal] Partition stats in Iceberg

Mayur, your use case may require a lot of snapshots, but we generally recommend 
expiring them after a few days. You can tag snapshots to keep them around 
longer than that.

On Tue, May 2, 2023 at 4:52 PM Mayur Srivastava 
<mayur.p.srivast...@gmail.com<mailto:mayur.p.srivast...@gmail.com>> wrote:
Thanks for the response.
One of the use cases that we have is where one business day of data is added at 
a time to a DAY partitioned table. With 25 years of this data, there will be 
~6250 partitions and as many snapshots. Are these many snapshots recommended to 
be kept around?

On Tue, May 2, 2023 at 7:45 PM Szehon Ho 
<szehon.apa...@gmail.com<mailto:szehon.apa...@gmail.com>> wrote:

Does snapshot expiration needs to be disabled for this to work? Thanks, Mayur

Yes, the snapshot that last updated the partition needs to be around for this 
to work.

 Szehon, the query you shared requires a SparkSQL job to be run which means 
latency will be high. However, I am glad you are also thinking of adding these 
directly to the partition table and it seems we share the same interests.

Yea the partitions table currently still goes through SparkSQL, so it will be 
the same.  Maybe you mean add this to partition stats?  We do need to reconcile 
partition table and partition stats at some point though.  Not sure if it was 
designed/discussed yet, I think there was some thoughts on short-circuiting 
Partitions table to read from Partition stats, if stats exist for the current 
snapshot.

Thanks
Szehon

On Tue, May 2, 2023 at 4:34 PM Pucheng Yang 
<py...@pinterest.com.invalid<mailto:py...@pinterest.com.invalid>> wrote:
Thanks Ryan and Szehon!

Szehon, the query you shared requires a SparkSQL job to be run which means 
latency will be high. However, I am glad you are also thinking of adding these 
directly to the partition table and it seems we share the same interests. I am 
looking forward to the work in the phase 2 implementation. Let me know if I can 
help, thanks.

On Tue, May 2, 2023 at 4:28 PM Szehon Ho 
<szehon.apa...@gmail.com<mailto:szehon.apa...@gmail.com>> wrote:
Yea I agree, I had a handy query for the last update time of partition.

SELECT

e.data_file.partition,

MAX(s.committed_at) AS last_modified_time

FROM db.table.snapshots s

JOIN db.table.entries e

WHERE s.snapshot_id = e.snapshot_id

GROUP BY by e.data_file.partition

It's a bit lengthy currently.

I have been indeed thinking to look at adding these fields to the Partitions 
table directly, after Ajantha's pending changes to add delete files to this 
table.

Thanks
Szehon

On Tue, May 2, 2023 at 4:08 PM Ryan Blue 
<b...@tabular.io<mailto:b...@tabular.io>> wrote:
Pucheng,

Rather than using the changelog, I'd just look at the metadata tables. You 
should be able to query the all_entries metadata table to see file additions or 
deletions for a given snapshot. Then from there you can join to the snapshots 
table for timestamps and aggregate to the partition level.

Ryan

On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang 
<py...@pinterest.com.invalid<mailto:py...@pinterest.com.invalid>> wrote:
Hi Ajantha and the community,

I am interested and I am wondering where we can see the latest progress of this 
feature?

Regarding the partition stats in Iceberg, I am specifically curious if we can 
consider a new field called "last modified time" to be included for the 
partitions stats (or have a plugable way to allow users to configure partition 
stats they need). My use case is to find out if a partition is changed or not 
given two snapshots (old and new) with a quick and light way process. I 
previously was suggested by the community to use the change log (CDC) but I 
think that is too heavy (I guess, since it requires to run SparkSQL procedure) 
and it is over do the work (I don't need what rows are changed, I just need 
true or false for whether a partition is changed).

Thanks

On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava 
<mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote:
Thanks Ajantha.

> It should be very easy to add a few more fields to it like the latest 
> sequence number or last modified time per partition.

Among sequence number and modified time, which one do you think is more likely 
to be available in Iceberg partition stats? Note that we would like to avoid 
compaction change the sequence number or modified time stats.

Thanks,
Mayur

From: Ajantha Bhat <ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>>
Sent: Tuesday, February 7, 2023 10:02 AM
To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>
Subject: Re: [Proposal] Partition stats in Iceberg

Hi Hrishi and Mayur, thanks for the inputs.

To get things moving I have frozen the scope of phase 1 implementation. 
(Recently added the delete file stats to phase 1 too). You can find the scope 
in the "Design for approval" section of the design doc.

That said, once we have phase 1 implemented, It should be very easy to add a 
few more fields to it like the latest sequence number or last modified time per 
partition.
I will be opening up the discussion about phase 2 schema again once phase 1 
implementation is done.

Thanks,
Ajantha

On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava 
<mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote:
+1 for the initiative.

We’ve been exploring options for storing last-modified-time per partition. It 
an important building block for data pipelines – especially if there is a 
dependency between jobs with strong consistency requirements.

Is partition stats a good place for storing last-modified-time per partition?

Thanks,
Mayur

From: Ajantha Bhat <ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>>
Sent: Monday, January 23, 2023 11:56 AM
To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>
Subject: Re: [Proposal] Partition stats in Iceberg

Hi All,

In the same design document 
(https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing),
I have added a section called "Design for approval".
It also contains a potential PR breakdown for the phase 1 implementation and 
future development scope.
Please take a look and please vote if you think the design is ok.

Thanks,
Ajantha

On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat 
<ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>> wrote:
A big thanks to everyone who was involved in the review and the discussions so 
far.

Please find the meeting minutes from the last iceberg sync about the partition 
stats.
    a. Writers should not write the partition stats or any stats as of now.
        Because it requires bumping the spec to V3. (We can have it as part of 
the v3 spec later on. But not anytime soon).
    b. So, there can be an async way of generating the stats like ANALYZE table 
or call procedure.
        Which will compute the stats till the current snapshot and store it as 
a partition stats file.
    c. In phase 1, partition stats will just store the row_count and file_count 
per partition value as mentioned in the design document.
        Later it can be enhanced to store puffin file location and other 
metrics per partition value.
    d. These tuples are stored in a single sorted Avro/parquet file (we need to 
finalize this).
    e. Each time "analyze table" will rewrite the whole stats file as keeping 
multiple delta files will just make the read path messy.
        Also, even with million rows, it can be of a few MB size.
        Once the writers start writing the stats (V3 spec), we can revisit 
storing as the delta files if there are any performance issues.

The next immediate plan is to
    a. Get these PRs merged (open points in existing StatictisFile interface 
added during Puffin)
        #6267<https://github.com/apache/iceberg/pull/6267>, 
#6090<https://github.com/apache/iceberg/pull/6090>, 
#6091<https://github.com/apache/iceberg/pull/6091>
    b. Figure out how to give accurate stats with row-level deletes and how to 
mask dropped partition values from stats.
        https://github.com/apache/iceberg/issues/6042
    c. Standardize the `StatictisFile` interface to hold the parquet/Avro stats 
file (instead of always assuming it as a Puffin file)
        and introduce a `StatisticsType` enum.
    d. Conclude the storage format and get approval for the design.

I will wait another week or two for some more people to take a look at the 
document
before jumping into the implementation.

Thanks,
Ajantha.

On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat 
<ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>> wrote:
Hi Ryan,
are you saying that you think the partition-level stats should not be required? 
I think that would be best.
I think there is some confusion here. Partition-level stats are required (hence 
the proposal).
But does the writer always write it? (with the append/delete/replace operation)
or writer skips writing it and then the user generates it using DML like 
"Analyze table" was the point of discussion.
I think we can have both options with the writer stats writing controlled by a 
table property "write.stats.enabled"

I’m all for improving the interface for retrieving stats. It’s a separate issue
Agree. Let us discuss it in a separate thread.

Thanks,
Ajantha

On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue 
<b...@tabular.io<mailto:b...@tabular.io>> wrote:

Ajantha, are you saying that you think the partition-level stats should not be 
required? I think that would be best.

I’m all for improving the interface for retrieving stats. It’s a separate 
issue, but I think that Iceberg should provide both access to the Puffin files 
and metadata as well as a higher-level interface for retrieving information 
like a column’s NDV. Something like this:

int ndv = 
table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");

On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat 
<ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>> wrote:
Hi Ryan,
Thanks a lot for the review and suggestions.
but I think there is also a decision that we need to make before that: Should 
Iceberg require writers to maintain the partition stats?
I think I would prefer to take a lazy approach and not assume that writers will 
keep the partition stats up to date,
in which case we need a way to know which parts of a table are newer than the 
most recent stats.

This is a common problem for existing table-level puffin stats too.  Not just 
for partition stats.
As mentioned in the "integration with the current code" section point 8),
I was planning to introduce a table property "write.stats.enabled" with a 
default value set to false.
And as per point 7), I was planning to introduce an "ANALYZE table" or "CALL 
procedure" SQL (maybe table-level API too) to asynchronously
compute the stats on demand from the previous checkpoints.

But currently, `TableMetadata` doesn't have a clean Interface to provide the 
statistics file for the current snapshot.
If stats are not present, we need another interface to provide a last 
successful snapshot id for which stats was computed.
Also, there is some confusion around reusing the statistics file (because the 
spec only has a computed snapshot id, not the referenced snapshot id).
I am planning to open up a PR to handle these interface updates this week. 
(same things as you suggested in the last Iceberg sync).
This should serve as a good foundation to get insights for lazy & incremental 
stats computing.

Thanks,
Ajantha

On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue 
<b...@tabular.io<mailto:b...@tabular.io>> wrote:
Thanks for writing this up, Ajantha! I think that we have all the upstream 
pieces in place to work on this so it's great to have a proposal.

The proposal does a good job of summarizing the choices for how to store the 
data, but I think there is also a decision that we need to make before that: 
Should Iceberg require writers to maintain the partition stats?

If we do want writers to participate, then we may want to make choices that are 
easier for writers. But I think that is going to be a challenge. Adding 
requirements for writers would mean that we need to bump the spec version. 
Otherwise, we aren't guaranteed that writers will update the files correctly. I 
think I would prefer to take a lazy approach and not assume that writers will 
keep the partition stats up to date, in which case we need a way to know which 
parts of a table are newer than the most recent stats.

Ryan

On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat 
<ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>> wrote:
Thanks Piotr for taking a look at it.
I have replied to all the comments in the document.
I might need your support in standardising the existing `StatisticsFile` 
interface to adopt partition stats as mentioned in the design.

We do need more eyes on the design.
Once I get approval for the design, I can start the implementation.

Thanks,
Ajantha

On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen 
<pi...@starburstdata.com<mailto:pi...@starburstdata.com>> wrote:
Hi Ajantha,

this is very interesting document, thank you for your work on this!
I've added a few comments there.

I have one high-level design comment so I thought it would be nicer to everyone 
if I re-post it here

is "partition" the right level of keeping the stats?
We do this in Hive, but was it an accidental choice? or just the only thing 
that was possible to be implemented many years ago?

Iceberg allows to have higher number of partitions compared to Hive, because it 
scales better. But that means partition-level may or may not be the right 
granularity.

A self-optimizing system would gather stats on "per query unit" basis -- for 
example if i partition by [ day x country ], but usually query by day, the days 
are the "query unit" and from stats perspective country can be ignored.
Having more fine-grained partitions may lead to expensive planning time, so 
it's not theoretical problem.

I am not saying we should implement all this logic right now, but I think we 
should decouple partitioning scheme from stats partitions, to allow  query 
engine to become smarter.

cc @Alexander Jo<mailto:alex...@starburstdata.com>

Best
PF

On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat 
<ajanthab...@gmail.com<mailto:ajanthab...@gmail.com>> wrote:
Hi Community,
I did a proposal write-up for the partition stats in Iceberg.
Please have a look and let me know what you think. I would like to work on it.

https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing

Requirement background snippet from the above document.
For some query engines that use cost-based-optimizer instead or along with 
rule-based-optimizer (like Dremio, Trino, etc), at the planning time,
it is good to know the partition level stats like total rows per partition and 
total files per partition to take decisions for CBO (
like deciding on the join reordering and join type, identifying the 
parallelism).
Currently, the only way to do this is to read the partition info from data_file 
in manifest_entry of the manifest file and compute partition-level statistics 
(the same thing that ‘partitions’ metadata table is doing [see Appendix 
A<https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>]).
Doing this on each query is expensive. Hence, this is a proposal for computing 
and storing partition-level stats for Iceberg tables and using them during 
queries.

Thanks,
Ajantha

--
Ryan Blue
Tabular

--
Ryan Blue
Tabular

--
Ryan Blue
Tabular

--
Ryan Blue
Tabular

RE: [Proposal] Partition stats in Iceberg

Reply via email to