I just think this is a bit more complicated than I want to take into the main 
library, just because we have to make decisions about

1. Retries
2. Concurrency
3. Results/Error Reporting

But if we have a good proposal for how we will handle all of those, I think we 
could do it?
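To make the three concerns above concrete, here is a minimal sketch of how a bulk runner might handle per-table retries, bounded concurrency, and a per-table result/error report. This is purely illustrative: `expire_one_table` is a hypothetical placeholder for the real per-table procedure, not an Iceberg API, and the retry/worker counts are arbitrary assumptions.

```python
# Hedged sketch: bulk-run a per-table operation with retries, bounded
# concurrency, and per-table result reporting. `fn` stands in for a
# hypothetical per-table expire call; it is NOT a real Iceberg API.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(fn, table, max_retries=3):
    """Call fn(table), retrying up to max_retries times; never raise,
    so one bad table cannot abort the whole bulk run."""
    last_error = None
    for _ in range(max_retries):
        try:
            return {"table": table, "status": "ok", "result": fn(table)}
        except Exception as exc:
            last_error = exc
    return {"table": table, "status": "failed", "error": str(last_error)}


def bulk_expire(tables, fn, max_workers=4):
    """Run fn against many tables with a bounded thread pool and
    collect a per-table success/failure report."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_retries, fn, t) for t in tables]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

The point of the sketch is that the three decisions are intertwined: once failures are retried and swallowed per table, the procedure has to surface a per-table result set rather than a single success/failure.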

> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com> wrote:
> 
> I think that if you call an expire snapshots function this is exactly what 
> you want 
> 
> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io 
> <mailto:b...@tabular.io>> wrote:
>> My concern with the per-catalog approach is that people might accidentally 
>> run it. Do you think it's clear enough that these invocations will drop 
>> older snapshots?
>> 
>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <acampolon...@gmail.com 
>> <mailto:acampolon...@gmail.com>> wrote:
>>> I like this approach. + 1
>>> 
>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com 
>>>> <mailto:nk1...@gmail.com>> wrote:
>>>> 
>>>> Hi Everyone,
>>>> 
>>>> Currently, Spark procedures support expire_snapshots/remove_orphan_files 
>>>> per table.
>>>> 
>>>> Today, if someone has to run GC on an entire catalog, they have to 
>>>> manually run these procedures for every table.
>>>> 
>>>> Is it a good idea to do this in bulk, per catalog or for multiple tables?
>>>> 
>>>> Current syntax:
>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>> Proposed syntax, something similar to:
>>>> 
>>>> Per Namespace/Database
>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>> Per Catalog
>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>> Multiple Tables
>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1', 
>>>> 'db2.table2'), <Options>)
>>>> PS: There could be exceptions for individual catalogs. For example, Nessie 
>>>> doesn't support GC other than via the Nessie CLI, and Hadoop catalogs can't 
>>>> list all namespaces.
>>>> 
>>>> 
>>>> Regards,
>>>> Naveen Kumar
>>>> 
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> Tabular
