Hi,

(Apologies if this email is a duplicate. This is my third attempt.)

I also need a way to ensure that my table data is up-to-date. For now, I’m 
handling this by setting an expiration period after which I fetch the data 
again, regardless of its freshness.
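
For reference, my current expiration-based workaround looks roughly like the 
sketch below. It is a minimal illustration only: ExpiringTableCache, loader, and 
the injectable clock are hypothetical names, and loader stands in for whatever 
performs the full load-table call.

```python
import time
from typing import Callable, Dict, Tuple


class ExpiringTableCache:
    """Caches table metadata and refetches once an entry is older than ttl_seconds."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader   # hypothetical: performs the full load-table call
        self._ttl = ttl_seconds
        self._clock = clock     # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, table: str) -> dict:
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]     # still within the expiration period
        # Expired or missing: refetch, even if nothing changed in the catalog.
        metadata = self._loader(table)
        self._entries[table] = (now, metadata)
        return metadata
```

The drawback is visible in get(): an expired entry always triggers a full 
reload, whether or not the table changed.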

Here are my thoughts on the current suggestions. Please correct me if I've 
misunderstood any of the points.

- isLatest(): This function could be inefficient since it would require an 
additional round-trip to fetch the metadata if it’s not up-to-date. This would 
result in two round-trips overall, which seems suboptimal.
- metadataLocation(): This has the same issue as isLatest(). BTW, the REST 
catalog API documentation for the LoadTableResult schema states, "Clients can 
check whether metadata has changed by comparing metadata locations after the 
table has been created." 
(https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
This suggests that if the metadata location has changed, the metadata can be 
considered updated.
- tableExists(): Based on the name, this function seems to serve a different 
purpose.
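
The round-trip concern with isLatest() can be sketched as follows. FakeCatalog 
is a hypothetical in-memory stand-in (not an Iceberg class) in which every 
method call counts as one request to the catalog.

```python
class FakeCatalog:
    """In-memory stand-in for a catalog; each method call is one round-trip."""

    def __init__(self, metadata_location: str) -> None:
        self.metadata_location = metadata_location
        self.round_trips = 0

    def is_latest(self, cached_location: str) -> bool:   # option a)
        self.round_trips += 1
        return cached_location == self.metadata_location

    def load_table(self) -> dict:                        # full metadata load
        self.round_trips += 1
        return {"metadata-location": self.metadata_location}


def refresh_with_is_latest(catalog: FakeCatalog, cached: dict) -> dict:
    """Returns the cached table if it is current, otherwise reloads it."""
    if catalog.is_latest(cached["metadata-location"]):
        return cached              # one round-trip when nothing changed
    return catalog.load_table()    # a second round-trip when the cache is stale
```

A fresh cache costs one request, but a stale one costs two: the check plus the 
reload. The same holds if the check is a metadataLocation() call instead.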

Here is my suggestion:

Since HTTP has built-in caching features 
(https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST catalogs 
operate over HTTP, it seems natural to leverage HTTP caching mechanisms. For 
example, HTTP includes the If-Modified-Since header and the 304 Not Modified 
status code. Using this approach, we could achieve data freshness with a single 
round-trip, fetching updated data only if there are modifications.
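
A minimal sketch of that flow, assuming the load table endpoint honored 
If-Modified-Since (the REST spec does not define this today). TableState, 
handle_load_table, refresh, and send are hypothetical names; send stands in for 
the actual HTTP call so the logic can be shown without a network.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], Optional[dict]]


@dataclass
class TableState:
    metadata: dict
    last_modified: datetime   # timezone-aware UTC, whole seconds


def handle_load_table(state: TableState, request_headers: Dict[str, str]) -> Response:
    """Hypothetical server side: reply 304 if the client's copy is still current."""
    since = request_headers.get("If-Modified-Since")
    if since is not None and state.last_modified <= parsedate_to_datetime(since):
        return 304, {}, None
    headers = {"Last-Modified": format_datetime(state.last_modified, usegmt=True)}
    return 200, headers, state.metadata


def refresh(cached: Optional[dict], cached_last_modified: Optional[str],
            send: Callable[[Dict[str, str]], Response]) -> Tuple[Optional[dict], Optional[str]]:
    """Client side: one request that either confirms the cache or replaces it."""
    headers = {}
    if cached_last_modified is not None:
        headers["If-Modified-Since"] = cached_last_modified
    status, response_headers, body = send(headers)
    if status == 304:
        return cached, cached_last_modified   # cache is fresh, nothing transferred
    return body, response_headers.get("Last-Modified")
```

Either outcome takes exactly one call to send: a 304 with no body when the 
cache is current, or a 200 with the new metadata when it is not.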

What do you think about defining the spec in this direction?

Thank you.




-----Original Message-----
From: "Yufei Gu" <flyrain...@gmail.com>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest



Hi Gabor,

Thanks for the proposal! Impala isn’t unique in needing this—I've seen similar 
requirements from other engines.

As others pointed out, using the “tableExists” endpoint seems like a 
workaround. I don't consider it a permanent way forward. We could address this 
by either modifying the current load table endpoint or introducing a new one, 
but ideally, we should avoid adding endpoints for every specific need. With 
that in mind, partial metadata loading seems like a strong approach here, 
though we will need some agreement on it. I'd suggest the community consider 
the use cases seriously. We need a way forward. 

I’m also not too concerned about using metadata file paths to verify the latest 
table version; clients can simply extract metadata filenames, which include the 
UUID.
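
As a rough illustration of that extraction, here is a sketch that assumes 
metadata files are named <version>-<uuid>.metadata.json, optionally with one 
extra suffix (such as a compression extension) before ".metadata.json". The 
pattern and the function name metadata_file_uuid are assumptions made for this 
example, not part of the spec.

```python
import re
from typing import Optional

# Assumed naming pattern: <version>-<uuid>[.<suffix>].metadata.json
_METADATA_FILE = re.compile(
    r"^(?P<version>\d+)-"
    r"(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
    r"(?:\.\w+)?\.metadata\.json$"
)


def metadata_file_uuid(metadata_location: str) -> Optional[str]:
    """Returns the UUID embedded in a metadata file name, or None if it does not match."""
    filename = metadata_location.rstrip("/").rsplit("/", 1)[-1]
    match = _METADATA_FILE.match(filename)
    return match.group("uuid").lower() if match else None
```

Two locations that yield different UUIDs point at different metadata files, so 
a cached copy can be compared without relying on the full path.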

Yufei




On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Fokko

I like the idea, but I think it's more of a workaround and could be
confusing for users :)

Regards
JB

On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey Gabor,
>
> Thanks for raising this. While reading this, my first thought is to leverage 
> the `tableExists` operation:
> https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
>
> This doesn't return anything today, but we could return a payload pointing 
> to the latest metadata.json.
>
> Looking forward to what others think.
>
> Kind regards,
> Fokko
>
>
>
>
> On Tue, 12 Nov 2024 at 14:33, Shani Elharrar 
> <sh...@upsolver.com.invalid> wrote:
>>
>> I recommend option (b), provided there is no partial metadata loading. We 
>> implemented option (b) internally to facilitate partial metadata loading, as 
>> we have tables with hundreds of thousands of snapshots. This results in 
>> metadata that occupies approximately 500 MB in memory (excluding the 
>> JsonNodes), which is a significant load for some of our services.
>>
>> Shani.
>>
>> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
>>
>> Hey Iceberg Community,
>>
>> Background:
>> Impala is designed to cache Iceberg table metadata (BaseTable objects in 
>> practice) for faster access. Currently, Impala is tightly coupled with HMS 
>> and, in turn, with the HiveCatalog. To keep the cached table objects 
>> up-to-date, HMS drives a notification mechanism that informs Impala about 
>> any changes in the table metadata.
>> The Impala community is actively looking for ways to decouple Impala from 
>> HMS, so that Impala can be used without HMS and can get Iceberg table 
>> metadata from other catalog implementations, focusing mainly on REST 
>> catalogs for now.
>>
>> Problem to solve:
>> We identified a piece of functionality missing from the current REST spec: 
>> engines that cache table metadata currently have no way to check whether 
>> that metadata is up-to-date, and therefore whether they should reload it, 
>> without getting a whole table object from the catalog. For this I think the 
>> REST catalog (and in fact I think this could apply to any other catalog) 
>> should be able to answer a question like:
>> "Hi Catalog, I have this version of this table, is it up-to-date?"
>>
>> Proposal:
>> I've been following the discussion about partial metadata loading, which 
>> could also be used to answer the above question, but my impression is that 
>> the conversation has stalled.
>> So instead of waiting for partial metadata loading, I propose adding 
>> something to the REST spec now to answer the question I raised above:
>>
>> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
>> b) String metadataLocation(TableIdentifier ident);
>>
>> Either of the two approaches above could help engines decide whether they 
>> have to invalidate/reload particular table metadata in the cache. I 
>> personally would go for option a) but am open to other opinions.
>>
>> I'd like to know if the community would support me in extending the REST 
>> spec with either of the two options.
>>
>> Regards,
>> Gabor
>>
>>
