Hey Gabor,

Thanks for raising this. My first thought on reading it was to
leverage the `tableExists` operation:
https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160

This doesn't return anything today, but we could have it return a payload
that points to the latest metadata.json.
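
To make the idea concrete, here is a minimal sketch of how an engine-side cache could use such a payload to decide whether to reload. Everything in it is an assumption for illustration: the `MetadataLocationSource` interface, the `needsReload` helper, and the use of plain strings as table identifiers are hypothetical and not part of the REST spec or the Iceberg API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MetadataCacheSketch {
    /** Hypothetical catalog lookup that returns the latest metadata.json location. */
    interface MetadataLocationSource {
        String metadataLocation(String tableIdent);
    }

    // Engine-side record of the location each cached table was loaded from.
    private final Map<String, String> cachedLocations = new ConcurrentHashMap<>();
    private final MetadataLocationSource catalog;

    MetadataCacheSketch(MetadataLocationSource catalog) {
        this.catalog = catalog;
    }

    /** Records the metadata location used when the table was (re)loaded. */
    void remember(String tableIdent, String metadataLocation) {
        cachedLocations.put(tableIdent, metadataLocation);
    }

    /** True when nothing is cached or the catalog reports a different location. */
    boolean needsReload(String tableIdent) {
        String cached = cachedLocations.get(tableIdent);
        String latest = catalog.metadataLocation(tableIdent);
        return cached == null || !cached.equals(latest);
    }
}
```

The engine only pays for a full table load when the two locations differ; otherwise the check is a single small request.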

Looking forward to what others think.

Kind regards,
Fokko

On Tue, 12 Nov 2024 at 14:33, Shani Elharrar
<sh...@upsolver.com.invalid> wrote:

> I recommend option (b), provided there is no partial metadata loading. We
> implemented option (b) internally to facilitate partial metadata loading,
> as we have tables with hundreds of thousands of snapshots. This results in
> metadata that occupies approximately 500 MB in memory (excluding the
> JsonNodes), which is a significant load for some of our services.
>
> Shani.
>
> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
>
> Hey Iceberg Community,
>
> *Background:*
> Impala is designed to cache Iceberg table metadata (BaseTable objects in
> practice) for faster access. Currently, Impala is tightly coupled with HMS
> and, in turn, with the HiveCatalog. To keep the cached table objects up to
> date, HMS drives a notification mechanism that tells Impala about any
> changes in the table metadata.
> The Impala community is actively looking for ways to decouple HMS from
> Impala, so that Impala can be used without HMS and can get Iceberg table
> metadata from other catalog implementations, with the current focus on
> REST catalogs.
>
> *Problem to solve:*
> We identified a piece of functionality that is missing from the current
> REST spec: for engines that cache table metadata, there is currently no way
> to check whether that metadata is up to date, and so whether the engine
> should reload it, without fetching the whole table object from the catalog.
> I think the REST catalog (and in fact this could apply to any other
> catalog) should be able to answer a question like:
> "Hi Catalog, I have this version of this table, is it up-to-date?"
>
> *Proposal:*
> I've been following the discussion about partial metadata loading
> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9>, which
> could also be used to answer the above question, but my impression is
> that the conversation has stalled.
> So instead of waiting for partial metadata loading, I propose adding one of
> the following to the REST spec now to answer the question I raised above:
>
> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> b) String metadataLocation(TableIdentifier ident);
>
> Either of the two approaches above could help engines decide whether they
> have to invalidate/reload particular table metadata in the cache. I
> personally would go for option a) but am open to other opinions.
>
> I'd like to know whether the community would support extending the REST
> spec with either of the two options.
>
> Regards,
> Gabor
>
>
>