Thanks for the responses. We can try filter snapshots that are added by compaction. Tagging should help reduce the number of snapshots that we have to maintain. But, how will tagging help with partition level last modified times without maintain the history of snapshots (assuming each partition has a different commit time). Will we be able to store additional stats, e.g. commit times, per data file or partition in the tagged snapshot?
From: Szehon Ho <szehon.apa...@gmail.com> Sent: Monday, March 7, 2022 1:40 PM To: Iceberg Dev List <dev@iceberg.apache.org> Subject: Re: Getting last modified timestamp/other stats per partition 2. How can we distinguish between snapshots where new data was added vs snapshots where compaction was done? Yea, to answer the second question, I forgot to mention there is a field on Manifest Entries table called 'status' that you can filter on. It might not be documented as it's a bit more advanced/internal, but the values are listed here: https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FManifestEntry.java#L30 So you would want to add to the query the filter (e.status =1) , which is Added, if you only care about when the file is first added. Thanks Szehon On Mon, Mar 7, 2022 at 9:33 AM Ryan Blue <b...@tabular.io<mailto:b...@tabular.io>> wrote: Mayur, This is one of the reasons why we want to introduce tagging in the format. That will allow you to tag snapshots that you want to keep and expire intermediate versions. In general, there is some cost to keeping thousands of snapshots. Those are held in the metadata file that gets written each commit, so you end up writing a fairly large file. If your commits are infrequent it doens't generally make a difference. But if you have commits every minute or so it can get in the way. Tagging will reduce the problem, and moving to change-based commits with the REST catalog should also help in the long term. Ryan On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote: A few follow-up questions for getting last modified time for each partition: 1. If we want to use snapshots, does this mean we will have to maintain full history of snapshots? E.g. if we partition by method=‘day’ and write once a day for a few years, we will end up in maintaining 1000s of snapshots. How does a long history of snapshots affect metadata size, commit performance, etc.? We intend to experiment with this but I’m curious to know if there’s already some recommendation on the amount of history for snapshots. 2. How can we distinguish between snapshots where new data was added vs snapshots where compaction was done? Thanks, Mayur From: Mayur Srivastava <mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> Sent: Thursday, February 24, 2022 7:27 AM To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org> Subject: RE: Getting last modified timestamp/other stats per partition Thanks Szehon. I’ll give this a try. From: Szehon Ho <szehon.apa...@gmail.com<mailto:szehon.apa...@gmail.com>> Sent: Wednesday, February 23, 2022 1:38 PM To: Iceberg Dev List <dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>> Subject: Re: Getting last modified timestamp/other stats per partition Hi Probably the metadata tables can help with this. For the size/num_rows of partitions, you can query the files table, https://iceberg.apache.org/docs/latest/spark-queries/#files. (Because Iceberg keeps stats for files, and not necessary partitions). SELECT partition, sum(file_size_in_bytes), sum(record_count) from $my_table.files f GROUP BY f.partition This will be compressed size (again Iceberg keeps file-level stats and so not sure if there are any stats for uncompressed sizes.) For the last modified time, it will be slightly harder. The file's physical modified time is not good enough because it's not exactly when it is 'committed' into Iceberg. You may have to try a more advanced query on the snapshots table and manifest-entries table: https://iceberg.apache.org/docs/latest/spark-queries/#snapshots SELECT MAX(s.committed_at),e.data_file.partition FROM $my_table.snapshots s JOIN $my_table.entries e WHERE s.snapshot_id = e.snapshot_id GROUP_BY by e.data_file.partition Hope that helps, Szehon On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <mayur.srivast...@twosigma.com<mailto:mayur.srivast...@twosigma.com>> wrote: Hi, In Iceberg, is there a way to get the last modified timestamp and other stats (e.g. num rows, uncompressed size, compressed size) of the data per partition? Thanks, Mayur -- Ryan Blue Tabular