Re: Getting last modified timestamp/other stats per partition

Ryan Blue Mon, 07 Mar 2022 09:33:10 -0800

Mayur,

This is one of the reasons why we want to introduce tagging in the format.
That will allow you to tag snapshots that you want to keep and expire
intermediate versions.


In general, there is some cost to keeping thousands of snapshots. Those are
held in the metadata file that gets written each commit, so you end up
writing a fairly large file. If your commits are infrequent it doens't
generally make a difference. But if you have commits every minute or so it
can get in the way.

Tagging will reduce the problem, and moving to change-based commits with
the REST catalog should also help in the long term.

Ryan

On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> A few follow-up questions for getting last modified time for each
> partition:
>
>
>
> 1.       If we want to use snapshots, does this mean we will have to
> maintain full history of snapshots? E.g. if we partition by method=‘day’
> and write once a day for a few years, we will end up in maintaining 1000s
> of snapshots. How does a long history of snapshots affect metadata size,
> commit performance, etc.? We intend to experiment with this but I’m curious
> to know if there’s already some recommendation on the amount of history for
> snapshots.
>
> 2.       How can we distinguish between snapshots where new data was
> added vs snapshots where compaction was done?
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Mayur Srivastava <mayur.srivast...@twosigma.com>
> *Sent:* Thursday, February 24, 2022 7:27 AM
> *To:* dev@iceberg.apache.org
> *Subject:* RE: Getting last modified timestamp/other stats per partition
>
>
>
> Thanks Szehon. I’ll give this a try.
>
>
>
> *From:* Szehon Ho <szehon.apa...@gmail.com>
> *Sent:* Wednesday, February 23, 2022 1:38 PM
> *To:* Iceberg Dev List <dev@iceberg.apache.org>
> *Subject:* Re: Getting last modified timestamp/other stats per partition
>
>
>
> Hi
>
>
>
> Probably the metadata tables can help with this.
>
>
>
> For the size/num_rows of partitions, you can query the files table,
> https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Because
> Iceberg keeps stats for files, and not necessary partitions).
>
>
>
> SELECT partition, sum(file_size_in_bytes), sum(record_count) from
> $my_table.files f GROUP BY f.partition
>
>
>
> This will be compressed size (again Iceberg keeps file-level stats and so
> not sure if there are any stats for uncompressed sizes.)
>
>
>
> For the last modified time, it will be slightly harder.  The file's
> physical modified time is not good enough because it's not exactly when it
> is 'committed' into Iceberg.   You may have to try a more advanced query on
> the snapshots table and manifest-entries table:
> https://iceberg.apache.org/docs/latest/spark-queries/#snapshots
>
>
>
> SELECT MAX(s.committed_at),e.data_file.partition FROM $my_table.snapshots
> s JOIN $my_table.entries e WHERE s.snapshot_id = e.snapshot_id GROUP_BY by
> e.data_file.partition
>
>
>
> Hope that helps,
>
> Szehon
>
>
>
> On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Hi,
>
>
>
> In Iceberg, is there a way to get the last modified timestamp and other
> stats (e.g. num rows, uncompressed size, compressed size) of the data per
> partition?
>
>
>
> Thanks,
>
> Mayur
>
>
>
>

-- 
Ryan Blue
Tabular

Re: Getting last modified timestamp/other stats per partition

Reply via email to