Thanks Szehon. I’ll give this a try.

From: Szehon Ho <szehon.apa...@gmail.com>
Sent: Wednesday, February 23, 2022 1:38 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Getting last modified timestamp/other stats per partition

Hi

Probably the metadata tables can help with this.

For the size/num_rows of partitions, you can query the files table, 
https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Iceberg 
keeps stats for files, and not necessarily for partitions.)

SELECT f.partition, SUM(f.file_size_in_bytes), SUM(f.record_count)
FROM $my_table.files f
GROUP BY f.partition

This will be the compressed size (again, Iceberg keeps file-level stats, so 
I am not sure whether there are any stats for uncompressed sizes).
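
To illustrate what that files-table aggregation computes, here is a minimal 
Python sketch over made-up rows. The sample rows and the partition_stats 
helper are hypothetical and only for illustration; the partition value is 
simplified to a string, and the field names file_size_in_bytes and 
record_count are the ones used in the query above.

```python
from collections import defaultdict

# Hypothetical rows shaped like the `files` metadata table (illustration only).
files = [
    {"partition": "day=2022-02-22", "file_size_in_bytes": 1000, "record_count": 10},
    {"partition": "day=2022-02-22", "file_size_in_bytes": 500, "record_count": 5},
    {"partition": "day=2022-02-23", "file_size_in_bytes": 2000, "record_count": 20},
]


def partition_stats(rows):
    """Sum file sizes and record counts per partition (the GROUP BY step)."""
    totals = defaultdict(lambda: {"size_bytes": 0, "num_rows": 0})
    for row in rows:
        bucket = totals[row["partition"]]
        bucket["size_bytes"] += row["file_size_in_bytes"]
        bucket["num_rows"] += row["record_count"]
    return dict(totals)


stats = partition_stats(files)
for partition, totals in sorted(stats.items()):
    print(partition, totals["size_bytes"], totals["num_rows"])
```

Since the sums are taken over per-file stats, the size is the on-disk 
(compressed) size of the files in each partition.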

For the last modified time, it will be slightly harder.  The file's physical 
modified time is not good enough because it is not exactly when the file was 
'committed' into Iceberg.  You may have to try a more advanced query that 
joins the snapshots table and the manifest entries table: 
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

SELECT MAX(s.committed_at), e.data_file.partition
FROM $my_table.snapshots s
JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id
GROUP BY e.data_file.partition
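
As a rough sketch of the join logic (not Iceberg's API), the Python below 
matches entries to snapshots on snapshot_id and keeps the latest 
committed_at per partition. The sample rows and the last_modified helper are 
hypothetical, and the partition is simplified to a string.

```python
from datetime import datetime

# Hypothetical rows shaped like the `snapshots` and `entries` metadata tables.
snapshots = [
    {"snapshot_id": 1, "committed_at": datetime(2022, 2, 22, 10, 0)},
    {"snapshot_id": 2, "committed_at": datetime(2022, 2, 23, 9, 30)},
]
entries = [
    {"snapshot_id": 1, "partition": "day=2022-02-21"},
    {"snapshot_id": 2, "partition": "day=2022-02-21"},
    {"snapshot_id": 1, "partition": "day=2022-02-22"},
    # No matching snapshot row: dropped, as in an inner join.
    {"snapshot_id": 99, "partition": "day=2022-02-23"},
]


def last_modified(snapshot_rows, entry_rows):
    """Return the latest commit time per partition (inner join + MAX)."""
    committed = {s["snapshot_id"]: s["committed_at"] for s in snapshot_rows}
    latest = {}
    for entry in entry_rows:
        ts = committed.get(entry["snapshot_id"])
        if ts is None:
            continue  # entry's snapshot is not in the snapshots rows
        partition = entry["partition"]
        if partition not in latest or ts > latest[partition]:
            latest[partition] = ts
    return latest


result = last_modified(snapshots, entries)
for partition, ts in sorted(result.items()):
    print(partition, ts.isoformat())
```

Note that an entry whose snapshot is no longer listed in the snapshots rows 
contributes nothing, so a partition touched only by such snapshots would not 
appear in the result.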

Hope that helps,
Szehon

On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava 
<mayur.srivast...@twosigma.com> wrote:
Hi,

In Iceberg, is there a way to get the last modified timestamp and other stats 
(e.g. num rows, uncompressed size, compressed size) of the data per partition?

Thanks,
Mayur
