Thanks for the answers so far! Fokko, I think your suggestion makes sense. However, I feel that having a 'tableExists' call return the metadata path is a kind of side effect of the operation, and not something users would expect. A dedicated 'isLatest' or 'metadataLocation' operation seems cleaner and more intuitive. Just curious: doesn't changing an existing operation in the API count as a breaking change? Wouldn't it need a new major release?
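To make the comparison concrete, here is a rough sketch of how an engine-side cache could use either operation to decide whether to reload. Everything in it is hypothetical and only for illustration: the `MetadataPointerCatalog` interface, the in-memory catalog, the `EngineTableCache` class and the example paths are not part of the Iceberg API or the REST spec, and the table identifier is simplified to a plain String instead of a TableIdentifier.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical catalog-side view of the two proposed operations.
interface MetadataPointerCatalog {
    // Option (a): the catalog compares the caller's pointer with the current one.
    boolean isLatest(String tableIdent, String metadataLocation);

    // Option (b): the catalog returns the current pointer and the caller compares.
    String metadataLocation(String tableIdent);
}

// Toy in-memory catalog, present only so the sketch can run.
class InMemoryCatalog implements MetadataPointerCatalog {
    private final Map<String, String> current = new ConcurrentHashMap<>();

    // Simulates a commit that moves the table to a new metadata file.
    void commit(String tableIdent, String newMetadataLocation) {
        current.put(tableIdent, newMetadataLocation);
    }

    @Override
    public boolean isLatest(String tableIdent, String metadataLocation) {
        // An unknown table yields null here, so equals() returns false.
        return metadataLocation.equals(current.get(tableIdent));
    }

    @Override
    public String metadataLocation(String tableIdent) {
        return current.get(tableIdent);
    }
}

// Engine-side cache that remembers which metadata location it last loaded per table.
class EngineTableCache {
    private final MetadataPointerCatalog catalog;
    private final Map<String, String> cachedLocations = new ConcurrentHashMap<>();

    EngineTableCache(MetadataPointerCatalog catalog) {
        this.catalog = catalog;
    }

    // Records the metadata location of the table object the engine has cached.
    void remember(String tableIdent, String metadataLocation) {
        cachedLocations.put(tableIdent, metadataLocation);
    }

    // Staleness check using option (a).
    boolean needsReloadViaIsLatest(String tableIdent) {
        String cached = cachedLocations.get(tableIdent);
        return cached == null || !catalog.isLatest(tableIdent, cached);
    }

    // Staleness check using option (b).
    boolean needsReloadViaLocation(String tableIdent) {
        String cached = cachedLocations.get(tableIdent);
        return cached == null || !cached.equals(catalog.metadataLocation(tableIdent));
    }
}

public class StalenessCheckSketch {
    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        EngineTableCache cache = new EngineTableCache(catalog);

        // Made-up table name and paths.
        catalog.commit("db.tbl", "s3://bucket/tbl/metadata/v1.metadata.json");
        cache.remember("db.tbl", "s3://bucket/tbl/metadata/v1.metadata.json");
        System.out.println("stale before new commit: " + cache.needsReloadViaIsLatest("db.tbl"));

        catalog.commit("db.tbl", "s3://bucket/tbl/metadata/v2.metadata.json");
        System.out.println("stale after new commit: " + cache.needsReloadViaLocation("db.tbl"));
    }
}
```

In both variants the engine only exchanges a short string with the catalog, so the check stays cheap even for tables with very large metadata. The difference is just where the comparison happens: in the catalog for option (a), in the engine for option (b).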
Regards,
Gabor

On Tue, Nov 12, 2024 at 2:55 PM Fokko Driesprong <fo...@apache.org> wrote:
> Hey Gabor,
>
> Thanks for raising this. While reading this, my first thought is to
> leverage the `tableExists` operation:
>
> https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
>
> This doesn't return anything today, but we could return a payload to the
> latest metadata.json.
>
> Looking forward to what others think.
>
> Kind regards,
> Fokko
>
> On Tue, 12 Nov 2024 at 14:33, Shani Elharrar
> <sh...@upsolver.com.invalid> wrote:
>
>> I recommend option (b), provided there is no partial metadata loading. We
>> implemented option (b) internally to facilitate partial metadata loading,
>> as we have tables with hundreds of thousands of snapshots. This results in
>> metadata that occupies approximately 500 MB in memory (excluding the
>> JsonNodes), which is a significant load for some of our services.
>>
>> Shani.
>>
>> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
>>
>> Hey Iceberg Community,
>>
>> *Background:*
>> Impala is designed in a way to cache the Iceberg table metadata
>> (BaseTable objects in practice) for faster access. Currently, Impala is
>> tightly coupled with HMS and in turn with the HiveCatalog, and in order to
>> keep the cached table objects up-to-date there is a notification mechanism
>> driven by HMS to notify Impala about any changes in the table metadata.
>> The Impala community is actively looking for ways to decouple HMS from
>> Impala and provide a way to use Impala without the need for HMS, and get
>> the Iceberg table metadata from other catalog Implementations mainly
>> focusing now on REST catalogs.
>>
>> *Problem to solve:*
>> We identified a particular missing functionality in the current REST
>> spec: For engines that cache table metadata currently there is no way to
>> check if that table metadata is up-to-date or not, and whether the engine
>> should reload the metadata for that table or not without getting a whole
>> table object from the catalog. For this I think the REST catalog (but in
>> fact I think this could apply to any other catalogs) should be able to
>> answer a question like:
>> "Hi Catalog, I have this version of this table, is it up-to-date?"
>>
>> *Proposal:*
>> I've been following the discussion about partial metadata loading
>> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9> that
>> could be also used to answer the above question, but I have the impression
>> now that the conversation stopped making any progress.
>> So instead of waiting for partial metadata loading I propose to have an
>> addition to the REST spec now to answer the question I raised above:
>>
>> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
>> b) String metadataLocation(TableIdentifier ident);
>>
>> Any of the above 2 approaches could help engines to decide if they have
>> to invalidate/reload particular table metadata in the cache. I personally
>> would go for option a) but would be open to hear other opinions.
>>
>> I'd like to know if the community could support me extending the REST
>> spec with any of the 2 options.
>>
>> Regards,
>> Gabor