Hi Gabor,

I'm going to propose something that does not quite align with your idea,
so please bear with me.

Both options in your email (A and B) assume that the engine is going to make
freshness decisions based on the location of metadata. I see some
conceptual rough edges here.

A change in metadata location does not necessarily mean a change in
metadata content. What do you think about using a hash of metadata instead
of location as an indicator of change?
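
To make the hash idea concrete, here is a minimal sketch, not a proposal for
the actual spec. It assumes the fingerprint is a SHA-256 digest of the
serialized metadata JSON. The names MetadataFingerprint, fingerprint and
isFresh are hypothetical and are not part of Iceberg or the REST spec.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: none of these names exist in Iceberg or the REST spec.
final class MetadataFingerprint {

    // Assumption: the fingerprint is SHA-256 over the metadata JSON bytes.
    static String fingerprint(String metadataJson) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(metadataJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // The cached copy is fresh when both fingerprints match,
    // regardless of where the metadata file is stored.
    static boolean isFresh(String cachedFingerprint, String catalogFingerprint) {
        return cachedFingerprint.equals(catalogFingerprint);
    }

    public static void main(String[] args) {
        String metadata = "{\"format-version\":2,\"current-snapshot-id\":1}";
        String cached = fingerprint(metadata);

        // Same content rewritten to a new location: the fingerprint is unchanged.
        System.out.println(isFresh(cached, fingerprint(metadata)));

        // Content actually changed: the fingerprint differs.
        String updated = "{\"format-version\":2,\"current-snapshot-id\":2}";
        System.out.println(isFresh(cached, fingerprint(updated)));
    }
}
```

One caveat with this sketch: hashing raw bytes treats two semantically equal
but differently serialized files as different, so a real design would probably
need a canonical form or a catalog-assigned token.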

Not all changes in metadata content may be relevant to all engines. Does
Impala cache all snapshots and schemas? Even if so, it may still be
beneficial to permit the client to ask for changes to specific areas of
metadata. For example, "Is snapshot X the latest?" I believe that approach is
more generalizable, especially with the discussion of partial metadata
loading in mind. WDYT?
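
As a rough illustration of such a targeted question, the sketch below models
a catalog that only answers whether a given snapshot is current for a table.
Everything in it is an assumption made for illustration: FreshnessCatalog,
isLatestSnapshot and the in-memory implementation are hypothetical and do not
exist in Iceberg or the REST spec.

```java
import java.util.HashMap;
import java.util.Map;

final class SnapshotFreshnessSketch {

    // Hypothetical narrow API: asks about one area of metadata only.
    interface FreshnessCatalog {
        boolean isLatestSnapshot(String tableName, long snapshotId);
    }

    // Toy in-memory catalog that tracks only the current snapshot ID per table.
    static final class InMemoryCatalog implements FreshnessCatalog {
        private final Map<String, Long> currentSnapshotIds = new HashMap<>();

        // Records a new current snapshot for the table.
        void commit(String tableName, long snapshotId) {
            currentSnapshotIds.put(tableName, snapshotId);
        }

        @Override
        public boolean isLatestSnapshot(String tableName, long snapshotId) {
            Long current = currentSnapshotIds.get(tableName);
            // Unknown tables are never reported as up to date.
            return current != null && current == snapshotId;
        }
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.commit("db.events", 1L);

        // The engine cached snapshot 1 and asks whether it is still current.
        System.out.println(catalog.isLatestSnapshot("db.events", 1L));

        // Another writer commits snapshot 2, so the cached answer changes.
        catalog.commit("db.events", 2L);
        System.out.println(catalog.isLatestSnapshot("db.events", 1L));
    }
}
```

An engine that does not cache schemas or older snapshots would only ever need
a check of this shape, which is why I think it generalizes better.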

Thanks,
Dmitri.



On Tue, Nov 12, 2024 at 7:12 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hey Iceberg Community,
>
> *Background:*
> Impala is designed in a way to cache the Iceberg table metadata (BaseTable
> objects in practice) for faster access. Currently, Impala is tightly
> coupled with HMS and in turn with the HiveCatalog, and in order to keep the
> cached table objects up-to-date there is a notification mechanism driven by
> HMS to notify Impala about any changes in the table metadata.
> The Impala community is actively looking for ways to decouple HMS from
> Impala and provide a way to use Impala without the need for HMS, and get
> the Iceberg table metadata from other catalog implementations, mainly
> focusing now on REST catalogs.
>
> *Problem to solve:*
> We identified a particular missing functionality in the current REST spec:
> For engines that cache table metadata, there is currently no way to check if
> that table metadata is up-to-date or not, and whether the engine should
> reload the metadata for that table or not without getting a whole table
> object from the catalog. For this I think the REST catalog (but in fact I
> think this could apply to any other catalogs) should be able to answer a
> question like:
> "Hi Catalog, I have this version of this table, is it up-to-date?"
>
> *Proposal:*
> I've been following the discussion about partial metadata loading
> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9> that
> could be also used to answer the above question, but I have the impression
> now that the conversation stopped making any progress.
> So instead of waiting for partial metadata loading I propose to have an
> addition to the REST spec now to answer the question I raised above:
>
> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> b) String metadataLocation(TableIdentifier ident);
>
> Either of the two approaches above could help engines decide if they have to
> invalidate/reload particular table metadata in the cache. I personally
> would go for option a) but would be open to hear other opinions.
>
> I'd like to know if the community could support me extending the REST spec
> with either of the two options.
>
> Regards,
> Gabor
>
