Hi,

I like the idea and it makes sense. As long as it's clearly stated in
the spec (using the If-Modified-Since header and the 304 status code),
it looks good to me.
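
For illustration, here is a minimal Python sketch of the single round-trip
check described in the quoted message. It is not part of the spec or of any
Iceberg client: FakeCatalog and CachingClient are hypothetical names, the
catalog is simulated in-process instead of over real HTTP, and the metadata
locations used with it are made-up values.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


class FakeCatalog:
    """Hypothetical stand-in for a REST catalog's load-table endpoint."""

    def __init__(self, metadata, last_modified):
        self.metadata = metadata
        self.last_modified = last_modified  # timezone-aware UTC datetime

    def load_table(self, headers):
        """Return (status, response_headers, body) for a GET-like request."""
        since = headers.get("If-Modified-Since")
        if since is not None:
            # HTTP dates have one-second resolution, so compare whole seconds.
            current = self.last_modified.replace(microsecond=0)
            if current <= parsedate_to_datetime(since):
                return 304, {}, None  # Not Modified: no body is sent
        response_headers = {
            "Last-Modified": format_datetime(self.last_modified, usegmt=True)
        }
        return 200, response_headers, self.metadata


class CachingClient:
    """Hypothetical engine-side cache that revalidates in one round-trip."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.cached_metadata = None
        self.cached_last_modified = None  # raw Last-Modified header value

    def refresh(self):
        headers = {}
        if self.cached_last_modified is not None:
            headers["If-Modified-Since"] = self.cached_last_modified
        status, response_headers, body = self.catalog.load_table(headers)
        if status == 200:
            # Fresh copy received: replace the cache and remember its date.
            self.cached_metadata = body
            self.cached_last_modified = response_headers["Last-Modified"]
        # On 304 the cached copy is still current and is returned unchanged.
        return status, self.cached_metadata
```

The first refresh() call returns 200 with the full metadata. Later calls
return 304 without a body until the catalog's last-modified time moves
forward. One caveat of this sketch: because HTTP dates are only precise to
the second, two commits within the same second would not be told apart.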

Thanks!
Regards
JB

On Fri, Nov 15, 2024 at 1:58 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> Hi,
>
> (Apologies if this email is a duplicate. This is my third attempt.)
>
> I also need a way to ensure that my table data is up-to-date. For now, I’m 
> handling this by setting an expiration period after which I fetch the data 
> again, regardless of its freshness.
>
> Here are my thoughts on the current suggestions. Please correct me if I've 
> misunderstood any of the points.
>
> - isLatest(): This function could be inefficient since it would require an 
> additional round-trip to fetch the metadata if it’s not up-to-date. This 
> would result in two round-trips overall, which seems suboptimal.
> - metadataLocation(): This has the same issue as isLatest(). BTW, according 
> to the REST catalog API documentation for the LoadTableResult schema, 
> "Clients can check whether metadata has changed by comparing metadata 
> locations after the table has been created." 
> (https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
> This suggests that if the metadata location has changed, the metadata can be 
> considered updated.
> - tableExists(): Based on the name, this function seems to serve a different 
> purpose.
>
> Here is my suggestion:
>
> Since HTTP has built-in caching features 
> (https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST 
> catalogs operate over HTTP, it seems natural to leverage HTTP caching 
> mechanisms. For example, HTTP includes the If-Modified-Since header and the 
> 304 Not Modified status code. Using this approach, we could achieve data 
> freshness with a single round-trip, fetching updated data only if there are 
> modifications.
>
> What do you think about defining the spec in this direction?
>
> Thank you.
>
>
>
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
>
>
> Hi Gabor,
>
> Thanks for the proposal! Impala isn’t unique in needing this—I've seen 
> similar requirements from other engines.
>
> As others pointed out, using the “tableExists” endpoint seems like a 
> workaround. I don't consider it a permanent way forward. We could address 
> this by either modifying the current load table endpoint or introducing a new 
> one, but ideally, we should avoid adding endpoints for every specific need. 
> With that in mind, partial metadata loading seems like a strong approach here, 
> though we will need some agreement on it. I'd suggest the community consider 
> the use cases seriously. We need a way forward.
>
> I’m also not too concerned about using metadata file paths to verify the 
> latest table version; clients can simply extract metadata filenames, which 
> include the UUID.
>
> Yufei
>
>
>
>
> On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net> 
> wrote:
>
> Hi Fokko
>
> I like the idea, but I think it's more a workaround and could be
> confusing for users :)
>
> Regards
> JB
>
> On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey Gabor,
> >
> > Thanks for raising this. While reading this, my first thought is to 
> > leverage the `tableExists` operation:
> > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> >
> > This doesn't return anything today, but we could return a payload pointing 
> > to the latest metadata.json.
> >
> > Looking forward to what others think.
> >
> > Kind regards,
> > Fokko
> >
> >
> >
> >
> > On Tue, 12 Nov 2024 at 14:33, Shani Elharrar 
> > <sh...@upsolver.com.invalid> wrote:
> >>
> >> I recommend option (b), provided there is no partial metadata loading. We 
> >> implemented option (b) internally to facilitate partial metadata loading, 
> >> as we have tables with hundreds of thousands of snapshots. This results in 
> >> metadata that occupies approximately 500 MB in memory (excluding the 
> >> JsonNodes), which is a significant load for some of our services.
> >>
> >> Shani.
> >>
> >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>
> >> Hey Iceberg Community,
> >>
> >> Background:
> >> Impala is designed to cache the Iceberg table metadata (BaseTable 
> >> objects in practice) for faster access. Currently, Impala is tightly 
> >> coupled with HMS and in turn with the HiveCatalog, and to keep the 
> >> cached table objects up-to-date there is a notification mechanism 
> >> driven by HMS that notifies Impala about any changes in the table metadata.
> >> The Impala community is actively looking for ways to decouple Impala from 
> >> HMS, so that Impala can be used without HMS and can get the Iceberg table 
> >> metadata from other catalog implementations, focusing mainly on REST 
> >> catalogs for now.
> >>
> >> Problem to solve:
> >> We identified a missing piece of functionality in the current REST spec: 
> >> engines that cache table metadata currently have no way to check whether 
> >> that metadata is up-to-date, and hence whether they should reload it, 
> >> without fetching the whole table object from the catalog. I think the REST 
> >> catalog (and in fact any other catalog) should be able to answer a 
> >> question like:
> >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> >>
> >> Proposal:
> >> I've been following the discussion about partial metadata loading, which 
> >> could also be used to answer the above question, but I have the impression 
> >> that the conversation has stalled.
> >> So instead of waiting for partial metadata loading I propose to have an 
> >> addition to the REST spec now to answer the question I raised above:
> >>
> >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> >> b) String metadataLocation(TableIdentifier ident);
> >>
> >> Either of the two approaches above could help engines decide whether they 
> >> have to invalidate/reload particular table metadata in the cache. I 
> >> personally would go for option a) but am open to hearing other opinions.
> >>
> >> I'd like to know whether the community would support extending the REST 
> >> spec with either of the two options.
> >>
> >> Regards,
> >> Gabor
> >>
> >>
