RE: Getting last modified timestamp/other stats per partition

Mayur Srivastava Tue, 08 Mar 2022 06:57:16 -0800

Thanks for the responses. We can try filter snapshots that are added by 
compaction. Tagging should help reduce the number of snapshots that we have to 
maintain. But, how will tagging help with partition level last modified times 
without maintain the history of snapshots (assuming each partition has a 
different commit time). Will we be able to store additional stats, e.g. commit 
times, per data file or partition in the tagged snapshot?

From: Szehon Ho <szehon.apa...@gmail.com>
Sent: Monday, March 7, 2022 1:40 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Getting last modified timestamp/other stats per partition

2.       How can we distinguish between snapshots where new data was added vs 
snapshots where compaction was done?

Yea, to answer the second question, I forgot to mention there is a field on 
Manifest Entries table called 'status' that you can filter on.  It might not be 
documented as it's a bit more advanced/internal, but the values are listed here:
https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FManifestEntry.java#L30

So you would want to add to the query the filter (e.status =1) , which is 
Added, if you only care about when the file is first added.

Thanks
Szehon

On Mon, Mar 7, 2022 at 9:33 AM Ryan Blue 
<b...@tabular.io<mailto:b...@tabular.io>> wrote:
Mayur,

This is one of the reasons why we want to introduce tagging in the format. That 
will allow you to tag snapshots that you want to keep and expire intermediate 
versions.

In general, there is some cost to keeping thousands of snapshots. Those are 
held in the metadata file that gets written each commit, so you end up writing 
a fairly large file. If your commits are infrequent it doens't generally make a 
difference. But if you have commits every minute or so it can get in the way.

Tagging will reduce the problem, and moving to change-based commits with the 
REST catalog should also help in the long term.

Ryan

On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava 
<mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote:
A few follow-up questions for getting last modified time for each partition:

1.       If we want to use snapshots, does this mean we will have to maintain 
full history of snapshots? E.g. if we partition by method=‘day’ and write once 
a day for a few years, we will end up in maintaining 1000s of snapshots. How 
does a long history of snapshots affect metadata size, commit performance, 
etc.? We intend to experiment with this but I’m curious to know if there’s 
already some recommendation on the amount of history for snapshots.

2.       How can we distinguish between snapshots where new data was added vs 
snapshots where compaction was done?

Thanks,
Mayur

From: Mayur Srivastava 
<mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>>
Sent: Thursday, February 24, 2022 7:27 AM
To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>
Subject: RE: Getting last modified timestamp/other stats per partition

Thanks Szehon. I’ll give this a try.

From: Szehon Ho <szehon.apa...@gmail.com<mailto:szehon.apa...@gmail.com>>
Sent: Wednesday, February 23, 2022 1:38 PM
To: Iceberg Dev List <dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>>
Subject: Re: Getting last modified timestamp/other stats per partition

Hi

Probably the metadata tables can help with this.

For the size/num_rows of partitions, you can query the files table, 
https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Because Iceberg 
keeps stats for files, and not necessary partitions).

SELECT partition, sum(file_size_in_bytes), sum(record_count) from 
$my_table.files f GROUP BY f.partition

This will be compressed size (again Iceberg keeps file-level stats and so not 
sure if there are any stats for uncompressed sizes.)

For the last modified time, it will be slightly harder.  The file's physical 
modified time is not good enough because it's not exactly when it is 
'committed' into Iceberg.   You may have to try a more advanced query on the 
snapshots table and manifest-entries table: 
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

SELECT MAX(s.committed_at),e.data_file.partition FROM $my_table.snapshots s 
JOIN $my_table.entries e WHERE s.snapshot_id = e.snapshot_id GROUP_BY by 
e.data_file.partition

Hope that helps,
Szehon

On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava 
<mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote:
Hi,

In Iceberg, is there a way to get the last modified timestamp and other stats 
(e.g. num rows, uncompressed size, compressed size) of the data per partition?

Thanks,
Mayur

--
Ryan Blue
Tabular

RE: Getting last modified timestamp/other stats per partition

Reply via email to