Thanks for the answers so far!

Fokko, I think your suggestion makes sense. However, I feel that having a
'tableExists' call return the metadata path is a kind of side effect of the
operation and not something users would expect. A dedicated 'isLatest' or
'metadataLocation' operation seems cleaner and more intuitive.
Just curious: doesn't changing an existing operation in the API count as a
breaking change? Wouldn't it need a new major release?

Regards,
Gabor


On Tue, Nov 12, 2024 at 2:55 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Gabor,
>
> Thanks for raising this. While reading this, my first thought is to
> leverage the `tableExists` operation:
>
> https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
>
> This doesn't return anything today, but we could return a payload that
> points to the latest metadata.json.
>
> Looking forward to what others think.
>
> Kind regards,
> Fokko
>
>
>
>
> On Tue, Nov 12, 2024 at 14:33, Shani Elharrar
> <sh...@upsolver.com.invalid> wrote:
>
>> I recommend option (b), provided there is no partial metadata loading. We
>> implemented option (b) internally to facilitate partial metadata loading,
>> as we have tables with hundreds of thousands of snapshots. This results in
>> metadata that occupies approximately 500 MB in memory (excluding the
>> JsonNodes), which is a significant load for some of our services.
>>
>> Shani.
>>
>> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
>>
>> Hey Iceberg Community,
>>
>> *Background:*
>> Impala is designed in a way to cache the Iceberg table metadata
>> (BaseTable objects in practice) for faster access. Currently, Impala is
>> tightly coupled with HMS and in turn with the HiveCatalog, and in order to
>> keep the cached table objects up-to-date there is a notification mechanism
>> driven by HMS to notify Impala about any changes in the table metadata.
>> The Impala community is actively looking for ways to decouple HMS from
>> Impala, so that Impala can be used without HMS and can get the Iceberg
>> table metadata from other catalog implementations, focusing mainly on
>> REST catalogs for now.
>>
>> *Problem to solve:*
>> We identified a piece of functionality missing from the current REST
>> spec: an engine that caches table metadata currently has no way to check
>> whether that metadata is up-to-date, and therefore whether it should
>> reload it, without fetching the whole table object from the catalog. For
>> this, I think the REST catalog (and in fact this could apply to any other
>> catalog) should be able to answer a question like:
>> "Hi Catalog, I have this version of this table, is it up-to-date?"
>>
>> *Proposal:*
>> I've been following the discussion about partial metadata loading
>> <https://lists.apache.org/thread/ll3q30410gfrr89lynojj7b2kyh1xgh9>, which
>> could also be used to answer the above question, but my impression is
>> that the conversation has stalled.
>> So instead of waiting for partial metadata loading, I propose an
>> addition to the REST spec now to answer the question I raised above:
>>
>> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
>> b) String metadataLocation(TableIdentifier ident);
>>
>> Either of the two approaches above could help engines decide whether
>> they have to invalidate/reload particular table metadata in the cache. I
>> personally would go for option a) but would be open to hearing other
>> opinions.
>>
>> I'd like to know if the community would support me in extending the REST
>> spec with either of the two options.
>>
>> Regards,
>> Gabor
>>
>>
>>
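
For readers comparing the two options proposed in the thread, here is a
minimal sketch of how a caching engine might use each of them. It is not part
of the Iceberg API or the REST spec: the FreshnessCatalog interface, the
InMemoryCatalog stand-in, the helper methods and the example paths are all
hypothetical, and a plain String replaces TableIdentifier so the example
compiles without any Iceberg dependency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CacheFreshnessSketch {

    /** Hypothetical catalog surface mirroring options a) and b) from the thread. */
    public interface FreshnessCatalog {
        // Option a): the catalog compares the caller's location with its own.
        boolean isLatest(String tableIdent, String metadataLocation);

        // Option b): the catalog returns its current location; the caller compares.
        String metadataLocation(String tableIdent);
    }

    /** In-memory stand-in for a catalog (assumption: one current location per table). */
    public static final class InMemoryCatalog implements FreshnessCatalog {
        private final Map<String, String> current = new HashMap<>();

        // Simulates a commit that moves the table to a new metadata file.
        public void commit(String tableIdent, String newMetadataLocation) {
            current.put(tableIdent, newMetadataLocation);
        }

        @Override
        public boolean isLatest(String tableIdent, String metadataLocation) {
            return Objects.equals(current.get(tableIdent), metadataLocation);
        }

        @Override
        public String metadataLocation(String tableIdent) {
            return current.get(tableIdent);
        }
    }

    /** Engine-side staleness check using option a). */
    public static boolean needsReloadWithIsLatest(
            FreshnessCatalog catalog, String ident, String cachedLocation) {
        return !catalog.isLatest(ident, cachedLocation);
    }

    /** Engine-side staleness check using option b). */
    public static boolean needsReloadWithMetadataLocation(
            FreshnessCatalog catalog, String ident, String cachedLocation) {
        return !Objects.equals(catalog.metadataLocation(ident), cachedLocation);
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.commit("db.tbl", "s3://bucket/db/tbl/metadata/v1.metadata.json");

        // The engine caches the table and remembers which metadata file it loaded.
        String cached = catalog.metadataLocation("db.tbl");
        System.out.println(needsReloadWithIsLatest(catalog, "db.tbl", cached));

        // Another writer commits; the cached location no longer matches.
        catalog.commit("db.tbl", "s3://bucket/db/tbl/metadata/v2.metadata.json");
        System.out.println(needsReloadWithIsLatest(catalog, "db.tbl", cached));
        System.out.println(needsReloadWithMetadataLocation(catalog, "db.tbl", cached));
    }
}
```

In this sketch both options give the engine the same answer. The difference
is where the comparison happens: with a) the catalog compares and returns
only a boolean, while with b) the engine compares and also learns the new
location.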
