Hi Gabor,

I'm going to propose something that does not quite align with your idea, so please bear with me.

Both options in your email (A and B) assume that the engine is going to make freshness decisions based on the location of metadata. I see some conceptual rough edges here.

A change in metadata location does not necessarily mean a change in metadata content. What do you think about using a hash of metadata instead of location as an indicator of change?

Not all changes in metadata content may be relevant to all engines. Does Impala cache all snapshots and schemas? Even if so, it may still be beneficial to permit the client to ask for changes to specific areas of metadata. For example, "Is snapshot X latest?" I believe that approach is more generalizable, especially with the discussion of partial metadata loading in mind.

WDYT?

Thanks,
Dmitri.

On Tue, Nov 12, 2024 at 7:12 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hey Iceberg Community,
>
> *Background:*
> Impala is designed in a way to cache the Iceberg table metadata (BaseTable
> objects in practice) for faster access. Currently, Impala is tightly
> coupled with HMS and in turn with the HiveCatalog, and in order to keep the
> cached table objects up-to-date there is a notification mechanism driven by
> HMS to notify Impala about any changes in the table metadata.
> The Impala community is actively looking for ways to decouple HMS from
> Impala and provide a way to use Impala without the need for HMS, and get
> the Iceberg table metadata from other catalog Implementations mainly
> focusing now on REST catalogs.
>
> *Problem to solve:*
> We identified a particular missing functionality in the current REST spec:
> For engines that cache table metadata currently there is no way to check if
> that table metadata is up-to-date or not, and whether the engine should
> reload the metadata for that table or not without getting a whole table
> object from the catalog.
> For this I think the REST catalog (but in fact I
> think this could apply to any other catalogs) should be able to answer a
> question like:
> "Hi Catalog, I have this version of this table, is it up-to-date?"
>
> *Proposal:*
> I've been following the discussion about partial metadata loading
> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9> that
> could be also used to answer the above question, but I have the impression
> now that the conversation stopped making any progress.
> So instead of waiting for partial metadata loading I propose to have an
> addition to the REST spec now to answer the question I raised above:
>
> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> b) String metadataLocation(TableIdentifier ident);
>
> Any of the above 2 approaches could help engines to decide if they have to
> invalidate/reload particular table metadata in the cache. I personally
> would go for option a) but would be open to hear other opinions.
>
> I'd like to know if the community could support me extending the REST spec
> with any of the 2 options.
>
> Regards,
> Gabor
>
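
P.S. To make the comparison concrete, here is a rough sketch of the quoted options a) and b) next to the hash-based check suggested in this reply. Everything in it is an assumption made for illustration: `ToyCatalog`, `commit`, `metadataHash`, the table name and the storage paths are hypothetical, tables are identified by a plain String instead of TableIdentifier, and SHA-256 over the raw metadata text is just one possible choice of hash. None of this is part of the REST spec or of the Iceberg API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class FreshnessSketch {

    // Hypothetical in-memory stand-in for a catalog; not the Iceberg REST API.
    static class ToyCatalog {
        private final Map<String, String> locations = new HashMap<>();
        private final Map<String, String> contents = new HashMap<>();

        // Records the current metadata location and content for a table.
        void commit(String table, String metadataLocation, String metadataJson) {
            locations.put(table, metadataLocation);
            contents.put(table, metadataJson);
        }

        // Option a) from the quoted email: compare the caller's location with the current one.
        boolean isLatest(String table, String metadataLocation) {
            return metadataLocation.equals(locations.get(table));
        }

        // Option b) from the quoted email: return the current location and let the caller compare.
        String metadataLocation(String table) {
            return locations.get(table);
        }

        // Assumed alternative: expose a hash of the metadata content instead of its location.
        String metadataHash(String table) {
            String json = contents.get(table);
            if (json == null) {
                throw new IllegalArgumentException("Unknown table: " + table);
            }
            return sha256(json);
        }
    }

    // Hex-encoded SHA-256 of the given text.
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        String table = "db.events";                          // hypothetical table name
        String json = "{\"current-snapshot-id\": 1}";        // toy metadata content

        catalog.commit(table, "s3://bucket/old/v1.metadata.json", json);
        String cachedLocation = catalog.metadataLocation(table);
        String cachedHash = catalog.metadataHash(table);

        // Same content written to a new location: location check fails, hash check passes.
        catalog.commit(table, "s3://bucket/new/v1.metadata.json", json);
        System.out.println("isLatest by location: " + catalog.isLatest(table, cachedLocation));
        System.out.println("same hash:            " + cachedHash.equals(catalog.metadataHash(table)));

        // Content actually changes: the hash check now reports a difference.
        catalog.commit(table, "s3://bucket/new/v2.metadata.json", "{\"current-snapshot-id\": 2}");
        System.out.println("same hash after edit: " + cachedHash.equals(catalog.metadataHash(table)));
    }
}
```

The first commit pair models the "location changed, content did not" case: a location-based check would make the engine reload, while the hash stays equal. A hash over the whole file would still not answer narrower questions such as "Is snapshot X latest?"; that would need hashes or versions per area of metadata, which this sketch does not attempt.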