I recommend option (b), provided there is no partial metadata loading. We implemented option (b) internally to facilitate partial metadata loading, as we have tables with hundreds of thousands of snapshots. This results in metadata that occupies approximately 500 MB in memory (excluding the JsonNodes), which is a significant load for some of our services.
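To illustrate how an engine-side cache could use option (b), here is a minimal sketch. Everything in it is hypothetical: `LocationCatalog`, `TableCache`, and the loader function are illustration-only names, not part of the REST spec or the Iceberg API, and a plain `String` stands in for `TableIdentifier` to keep the example self-contained. The idea is simply to compare the location the catalog reports with the location the cached entry was loaded from, and to do the expensive full load only when they differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class MetadataLocationCacheSketch {

    /** Hypothetical catalog call shaped like option (b): returns the current metadata location. */
    interface LocationCatalog {
        String metadataLocation(String tableIdent);
    }

    /** Cache entry: the loaded metadata plus the location it was loaded from. */
    static final class Entry<T> {
        final String metadataLocation;
        final T metadata;

        Entry(String metadataLocation, T metadata) {
            this.metadataLocation = metadataLocation;
            this.metadata = metadata;
        }
    }

    static final class TableCache<T> {
        private final LocationCatalog catalog;
        // Assumed expensive operation: parse the metadata stored at a given location.
        private final Function<String, T> loadFromLocation;
        private final Map<String, Entry<T>> entries = new ConcurrentHashMap<>();
        private int fullLoads = 0;

        TableCache(LocationCatalog catalog, Function<String, T> loadFromLocation) {
            this.catalog = catalog;
            this.loadFromLocation = loadFromLocation;
        }

        T get(String tableIdent) {
            // Cheap round trip: ask only for the current metadata location.
            String current = catalog.metadataLocation(tableIdent);
            Entry<T> cached = entries.get(tableIdent);
            if (cached != null && cached.metadataLocation.equals(current)) {
                return cached.metadata; // cache is up to date, no reload needed
            }
            // Location changed (or first access): do the expensive load once.
            T fresh = loadFromLocation.apply(current);
            fullLoads++;
            entries.put(tableIdent, new Entry<>(current, fresh));
            return fresh;
        }

        int fullLoads() {
            return fullLoads;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a catalog; the paths are made up for the demo.
        Map<String, String> locations = new HashMap<>();
        locations.put("db.tbl", "s3://bucket/db/tbl/metadata/v1.metadata.json");

        TableCache<String> cache =
            new TableCache<>(locations::get, location -> "metadata@" + location);

        cache.get("db.tbl"); // first access: full load
        cache.get("db.tbl"); // location unchanged: served from cache
        locations.put("db.tbl", "s3://bucket/db/tbl/metadata/v2.metadata.json"); // simulated commit
        cache.get("db.tbl"); // location changed: reload

        System.out.println("full loads: " + cache.fullLoads());
    }
}
```

With option (a) the comparison would move to the catalog side, but the cache logic on the engine side would look much the same.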
Shani.

> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
>
> Hey Iceberg Community,
>
> Background:
> Impala is designed in a way to cache the Iceberg table metadata (BaseTable
> objects in practice) for faster access. Currently, Impala is tightly coupled
> with HMS and in turn with the HiveCatalog, and in order to keep the cached
> table objects up-to-date there is a notification mechanism driven by HMS to
> notify Impala about any changes in the table metadata.
> The Impala community is actively looking for ways to decouple HMS from Impala
> and provide a way to use Impala without the need for HMS, and get the Iceberg
> table metadata from other catalog Implementations mainly focusing now on REST
> catalogs.
>
> Problem to solve:
> We identified a particular missing functionality in the current REST spec:
> For engines that cache table metadata currently there is no way to check if
> that table metadata is up-to-date or not, and whether the engine should
> reload the metadata for that table or not without getting a whole table
> object from the catalog. For this I think the REST catalog (but in fact I
> think this could apply to any other catalogs) should be able to answer a
> question like:
> "Hi Catalog, I have this version of this table, is it up-to-date?"
>
> Proposal:
> I've been following the discussion about partial metadata loading
> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9> that could
> be also used to answer the above question, but I have the impression now that
> the conversation stopped making any progress.
> So instead of waiting for partial metadata loading I propose to have an
> addition to the REST spec now to answer the question I raised above:
>
> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> b) String metadataLocation(TableIdentifier ident);
>
> Any of the above 2 approaches could help engines to decide if they have to
> invalidate/reload particular table metadata in the cache. I personally would
> go for option a) but would be open to hear other opinions.
>
> I'd like to know if the community could support me extending the REST spec
> with any of the 2 options.
>
> Regards,
> Gabor