>
> 2.       How can we distinguish between snapshots where new data was
> added vs snapshots where compaction was done?
>

Yea, to answer the second question, I forgot to mention there is a field on
Manifest Entries table called 'status' that you can filter on.  It might
not be documented as it's a bit more advanced/internal, but the values are
listed here:
https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FManifestEntry.java#L30

So you would want to add to the query the filter (e.status =1) , which is
Added, if you only care about when the file is first added.

Thanks
Szehon

On Mon, Mar 7, 2022 at 9:33 AM Ryan Blue <b...@tabular.io> wrote:

> Mayur,
>
> This is one of the reasons why we want to introduce tagging in the format.
> That will allow you to tag snapshots that you want to keep and expire
> intermediate versions.
>
> In general, there is some cost to keeping thousands of snapshots. Those
> are held in the metadata file that gets written each commit, so you end up
> writing a fairly large file. If your commits are infrequent it doens't
> generally make a difference. But if you have commits every minute or so it
> can get in the way.
>
> Tagging will reduce the problem, and moving to change-based commits with
> the REST catalog should also help in the long term.
>
> Ryan
>
> On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> A few follow-up questions for getting last modified time for each
>> partition:
>>
>>
>>
>> 1.       If we want to use snapshots, does this mean we will have to
>> maintain full history of snapshots? E.g. if we partition by method=‘day’
>> and write once a day for a few years, we will end up in maintaining 1000s
>> of snapshots. How does a long history of snapshots affect metadata size,
>> commit performance, etc.? We intend to experiment with this but I’m curious
>> to know if there’s already some recommendation on the amount of history for
>> snapshots.
>>
>> 2.       How can we distinguish between snapshots where new data was
>> added vs snapshots where compaction was done?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Mayur Srivastava <mayur.srivast...@twosigma.com>
>> *Sent:* Thursday, February 24, 2022 7:27 AM
>> *To:* dev@iceberg.apache.org
>> *Subject:* RE: Getting last modified timestamp/other stats per partition
>>
>>
>>
>> Thanks Szehon. I’ll give this a try.
>>
>>
>>
>> *From:* Szehon Ho <szehon.apa...@gmail.com>
>> *Sent:* Wednesday, February 23, 2022 1:38 PM
>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>> *Subject:* Re: Getting last modified timestamp/other stats per partition
>>
>>
>>
>> Hi
>>
>>
>>
>> Probably the metadata tables can help with this.
>>
>>
>>
>> For the size/num_rows of partitions, you can query the files table,
>> https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Because
>> Iceberg keeps stats for files, and not necessary partitions).
>>
>>
>>
>> SELECT partition, sum(file_size_in_bytes), sum(record_count) from
>> $my_table.files f GROUP BY f.partition
>>
>>
>>
>> This will be compressed size (again Iceberg keeps file-level stats and so
>> not sure if there are any stats for uncompressed sizes.)
>>
>>
>>
>> For the last modified time, it will be slightly harder.  The file's
>> physical modified time is not good enough because it's not exactly when it
>> is 'committed' into Iceberg.   You may have to try a more advanced query on
>> the snapshots table and manifest-entries table:
>> https://iceberg.apache.org/docs/latest/spark-queries/#snapshots
>>
>>
>>
>> SELECT MAX(s.committed_at),e.data_file.partition FROM $my_table.snapshots
>> s JOIN $my_table.entries e WHERE s.snapshot_id = e.snapshot_id GROUP_BY by
>> e.data_file.partition
>>
>>
>>
>> Hope that helps,
>>
>> Szehon
>>
>>
>>
>> On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>> Hi,
>>
>>
>>
>> In Iceberg, is there a way to get the last modified timestamp and other
>> stats (e.g. num rows, uncompressed size, compressed size) of the data per
>> partition?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to