My concern with the per-catalog approach is that people might accidentally run it. Do you think it's clear enough that these invocations will drop older snapshots?
On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <acampolon...@gmail.com> wrote:

> I like this approach. +1
>
> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>
> Hi Everyone,
>
> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
> per table.
>
> Today, if someone has to run GC on an entire catalog, they have to run
> these procedures manually for every table.
>
> Would it be a good idea to do this in bulk, per catalog or for multiple
> tables?
>
> Current syntax:
>
> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>
> Proposed syntax, something similar to:
>
> Per Namespace/Database
>
> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>
> Per Catalog
>
> CALL hive_prod.system.expire_snapshots(<Options>)
>
> Multiple Tables
>
> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
> 'db2.table2'), <Options>)
>
> PS: There could be exceptions for individual catalogs. For example, Nessie
> doesn't support GC other than through the Nessie CLI, and Hadoop can't list
> all the namespaces.
>
> Regards,
> Naveen Kumar

--
Ryan Blue
Tabular
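For context, the per-table workaround described above can be sketched as a small loop that builds one CALL statement per table. This is only an illustration: the `expire_snapshots_calls` helper, the table list, and the `older_than` value are assumptions for the sketch, not an existing API; in practice each statement would be passed to `spark.sql(...)`.

```python
# Sketch (assumed helper, not a real API): build one expire_snapshots CALL
# per table, mirroring today's per-table invocation from the thread above.
def expire_snapshots_calls(catalog, tables, older_than):
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{t}', older_than => TIMESTAMP '{older_than}')"
        for t in tables
    ]

# Illustrative catalog/table names taken from the examples in the thread.
stmts = expire_snapshots_calls(
    "hive_prod", ["db1.table1", "db2.table2"], "2023-12-01 00:00:00")
for s in stmts:
    print(s)  # in a real job: spark.sql(s)
```

A bulk `database =>` or `tables =>` parameter would collapse this loop into a single CALL, which is the convenience the proposal targets.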