Hey everyone,

The suggestion of re-using that endpoint was just thinking out loud. My underlying concern is what Yufei mentioned: a proliferation of endpoints that each address a specific need. I like the approach suggested by Taeyun, and would love to know what others think.

> I'd like to know if the community could support me extending the REST spec
> with any of the 2 options.

To get back to your original question: this change would go through an improvement proposal <https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals>, since it is a change to a specification.

Kind regards,
Fokko

On Fri, 15 Nov 2024 at 02:00, Taeyun Kim <taeyun....@innowireless.com> wrote:

> Hi,
>
> (Apologies if this email is a duplicate. This is my third attempt.)
>
> I also need a way to ensure that my table data is up-to-date. For now, I’m
> handling this by setting an expiration period after which I fetch the data
> again, regardless of its freshness.
>
> Here are my thoughts on the current suggestions. Please correct me if I've
> misunderstood any of the points.
>
> - isLatest(): This function could be inefficient since it would require an
>   additional round-trip to fetch the metadata if it’s not up-to-date. This
>   would result in two round-trips overall, which seems suboptimal.
> - metadataLocation(): This has a similar issue to isLatest(). BTW, the
>   REST catalog API documentation for the LoadTableResult schema states,
>   "Clients can check whether metadata has changed by comparing metadata
>   locations after the table has been created."
>   (https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
>   This suggests that if the metadata location has changed, the metadata can
>   be considered updated.
> - tableExists(): Based on the name, this function seems to serve a
>   different purpose.
>
> Here is my suggestion:
>
> Since HTTP has built-in caching features
> (https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST
> catalogs operate over HTTP, it seems natural to leverage HTTP caching
> mechanisms. For example, HTTP includes the If-Modified-Since header and the
> 304 Not Modified status code.
> Using this approach, we could achieve data freshness with a single
> round-trip, fetching updated data only if there are modifications.
>
> What do you think about defining the spec in this direction?
>
> Thank you.
>
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
> Hi Gabor,
>
> Thanks for the proposal! Impala isn’t unique in needing this; I've seen
> similar requirements from other engines.
>
> As others pointed out, using the “tableExists” endpoint seems like a
> workaround. I don't consider it a permanent way forward. We could address
> this by either modifying the current load table endpoint or introducing a
> new one, but ideally, we should avoid adding endpoints for every specific
> need. With that, partial metadata loading seems like a strong approach
> here, though we will need some agreement. I'd suggest the community
> consider the use cases seriously. We need a way forward.
>
> I’m also not too concerned about using metadata file paths to verify the
> latest table version; clients can simply extract metadata filenames, which
> include the UUID.
>
> Yufei
>
>
> On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> Hi Fokko
>
> I like the idea, but I think it's more a workaround and could be
> confusing for users :)
>
> Regards
> JB
>
> On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey Gabor,
> >
> > Thanks for raising this. While reading this, my first thought is to
> > leverage the `tableExists` operation:
> >
> > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> >
> > This doesn't return anything today, but we could return a payload
> > pointing to the latest metadata.json.
> >
> > Looking forward to what others think.
> >
> > Kind regards,
> > Fokko
> >
> > On Tue, 12 Nov 2024 at 14:33, Shani Elharrar <sh...@upsolver.com.invalid>
> > wrote:
> >>
> >> I recommend option (b), provided there is no partial metadata loading.
> >> We implemented option (b) internally to facilitate partial metadata
> >> loading, as we have tables with hundreds of thousands of snapshots. This
> >> results in metadata that occupies approximately 500 MB in memory
> >> (excluding the JsonNodes), which is a significant load for some of our
> >> services.
> >>
> >> Shani.
> >>
> >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>
> >> Hey Iceberg Community,
> >>
> >> Background:
> >> Impala is designed to cache Iceberg table metadata (BaseTable objects in
> >> practice) for faster access. Currently, Impala is tightly coupled with
> >> HMS and in turn with the HiveCatalog, and to keep the cached table
> >> objects up-to-date there is a notification mechanism driven by HMS that
> >> notifies Impala about any changes in the table metadata.
> >> The Impala community is actively looking for ways to decouple HMS from
> >> Impala, so that Impala can be used without HMS and can get Iceberg table
> >> metadata from other catalog implementations, currently focusing mainly
> >> on REST catalogs.
> >>
> >> Problem to solve:
> >> We identified a particular missing functionality in the current REST
> >> spec: for engines that cache table metadata, there is currently no way
> >> to check whether that table metadata is up-to-date, and hence whether
> >> the engine should reload the metadata for that table, without getting a
> >> whole table object from the catalog. For this, I think the REST catalog
> >> (and in fact this could apply to any other catalog) should be able to
> >> answer a question like:
> >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> >>
> >> Proposal:
> >> I've been following the discussion about partial metadata loading that
> >> could be also used to answer the above question, but I have the
> >> impression now that the conversation stopped making any progress.
> >> So instead of waiting for partial metadata loading I propose to have an
> >> addition to the REST spec now to answer the question I raised above:
> >>
> >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> >> b) String metadataLocation(TableIdentifier ident);
> >>
> >> Any of the above 2 approaches could help engines to decide if they have
> >> to invalidate/reload particular table metadata in the cache. I
> >> personally would go for option a) but would be open to hear other
> >> opinions.
> >>
> >> I'd like to know if the community could support me extending the REST
> >> spec with any of the 2 options.
> >>
> >> Regards,
> >> Gabor
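
To make the round-trip argument in this thread concrete, here is a rough Python sketch comparing option (a) with a conditional load in the spirit of the HTTP caching suggestion. Everything in it is hypothetical: FakeCatalog, CachedTable, and the refresh_* helpers are illustration names only, not part of the Iceberg REST spec or any client library, and no real HTTP is involved. The sketch also uses the metadata location as an ETag-style validator instead of the If-Modified-Since header mentioned above, purely to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeCatalog:
    """Hypothetical in-memory stand-in for a REST catalog holding one table."""
    metadata_location: str
    metadata: dict
    round_trips: int = 0  # number of simulated requests served

    def is_latest(self, cached_location: str) -> bool:
        """Option (a): answer only whether the caller's pointer is current."""
        self.round_trips += 1
        return cached_location == self.metadata_location

    def load_table(
        self, if_none_match: Optional[str] = None
    ) -> Tuple[int, str, Optional[dict]]:
        """Load the table; reply 304 with no body if the validator matches."""
        self.round_trips += 1
        if if_none_match == self.metadata_location:
            return 304, self.metadata_location, None
        return 200, self.metadata_location, dict(self.metadata)

    def commit(self, new_location: str, new_metadata: dict) -> None:
        """Simulate another writer committing a new metadata file."""
        self.metadata_location = new_location
        self.metadata = new_metadata


@dataclass
class CachedTable:
    """What an engine would keep in its metadata cache (assumed shape)."""
    location: str
    metadata: dict


def refresh_with_is_latest(catalog: FakeCatalog, cached: CachedTable) -> CachedTable:
    """Check first, then reload if stale: two requests when the table changed."""
    if catalog.is_latest(cached.location):
        return cached
    _, location, metadata = catalog.load_table()
    return CachedTable(location, metadata)


def refresh_conditional(catalog: FakeCatalog, cached: CachedTable) -> CachedTable:
    """Conditional load: always one request, body only when the table changed."""
    status, location, metadata = catalog.load_table(if_none_match=cached.location)
    if status == 304:
        return cached
    return CachedTable(location, metadata)


if __name__ == "__main__":
    catalog = FakeCatalog("s3://bucket/tbl/metadata/v1.metadata.json", {"snapshot": 1})
    _, loc, meta = catalog.load_table()
    cached = CachedTable(loc, meta)
    catalog.commit("s3://bucket/tbl/metadata/v2.metadata.json", {"snapshot": 2})

    catalog.round_trips = 0
    refresh_with_is_latest(catalog, cached)
    print("isLatest + reload round trips:", catalog.round_trips)

    catalog.round_trips = 0
    refresh_conditional(catalog, cached)
    print("conditional load round trips:", catalog.round_trips)
```

In this toy model both approaches cost one request when the cached metadata is still current; they differ only when the table has changed, where the check-then-reload path needs a second request. Option (b) would behave like option (a) here, with the comparison done on the client side.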