Hi Taeyun,

Thank you for the clear explanation.

I agree that the ETag solution is more suitable. If we were going that way, I'd propose a customized version number as an ETag, for instance by leveraging the metadata.json file name as the identifier. To summarize, HTTP caching relies on headers (e.g., ETag or Last-Modified) to validate whether a version is up-to-date, whereas the alternative approach proposed above uses an additional parameter for verification. From my perspective, there isn’t a fundamental difference between the two, so I’m OK with either.

A few points to note:

1. Both approaches would require changes to the "loadTable" endpoint.
2. A minor advantage of HTTP caching is that it integrates seamlessly with browsers, but since most clients of the Iceberg REST catalog aren’t browsers, this may not be a significant factor.
3. I’d also recommend considering the requirement to retrieve multiple tables (e.g., all tables under a namespace, or a list of table names) from the catalog. This requires a new endpoint and may not work with HTTP caching.

Let me know your thoughts or if there’s anything else to consider.

Yufei

On Sun, Nov 17, 2024 at 6:43 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
> Hi,
>
> To Gabor:
> It doesn’t seem necessary to interpret HTTP caching literally in this context. Simply using the HTTP headers defined by HTTP caching to check the freshness of metadata should be sufficient. There’s no requirement for the client to duplicate or store cached HTTP responses.
>
> To Yufei:
> As I understand it, the client doesn’t send its own timestamp but instead uses the timestamp originally provided by the server in the Last-Modified header. Therefore, clock synchronization issues should not be a concern.
>
> Here’s the general flow of HTTP cache validation based on If-Modified-Since:
>
> - Client: initial request:
>
>     GET (url) HTTP/1.1
>
> - Server response:
>
>     HTTP/1.1 200 OK
>     Last-Modified: (date1)
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (with response body)
>
> - Client: validation request:
>
>     GET (url) HTTP/1.1
>     If-Modified-Since: (date1)
>
> - Server response (if unchanged):
>
>     HTTP/1.1 304 Not Modified
>     Last-Modified: (date1)
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (without response body)
>
> - Server response (if updated):
>
>     HTTP/1.1 200 OK
>     Last-Modified: (date2)
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (with response body)
>
> However, using time-based freshness checks can present challenges, such as parsing time formats or synchronizing file update times across servers. To address these issues, HTTP cache validation based on ETag is also defined in the specification.
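As a rough illustration of the If-Modified-Since flow above, here is a minimal Python sketch of the server-side decision. The function name `handle_load_table` and its return shape are hypothetical and not part of the Iceberg REST spec; the point is only that the server compares the date echoed back by the client with its own last-commit time, so the client's clock never enters the comparison.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def handle_load_table(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET with an optional If-Modified-Since.

    `last_modified` is the server's own commit time for the table (aware, UTC).
    Hypothetical helper for illustration only.
    """
    # HTTP dates have one-second resolution, so drop sub-second precision.
    last_modified = last_modified.replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": "no-cache",
    }
    if if_modified_since is not None:
        try:
            client_copy = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            client_copy = None  # unparsable header: ignore it, send the full body
        # Only compare timezone-aware values; anything else falls through to 200.
        if (
            client_copy is not None
            and client_copy.tzinfo is not None
            and last_modified <= client_copy
        ):
            return 304, headers  # unchanged: no response body
    return 200, headers  # first request, or table updated: full body follows
```

The client only has to store the Last-Modified string it received and send it back verbatim on the next request.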
>
> Here’s the flow for ETag-based validation:
>
> - Client: initial request:
>
>     GET (url) HTTP/1.1
>
> - Server response:
>
>     HTTP/1.1 200 OK
>     ETag: "(arbitrary string 1 generated by the server)"
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (with response body)
>
> - Client: validation request:
>
>     GET (url) HTTP/1.1
>     If-None-Match: "(arbitrary string 1 generated by the server)"
>
> - Server response (if unchanged):
>
>     HTTP/1.1 304 Not Modified
>     ETag: "(arbitrary string 1 generated by the server)"
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (without response body)
>
> - Server response (if updated):
>
>     HTTP/1.1 200 OK
>     ETag: "(arbitrary string 2 generated by the server)"
>     Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
>     (with response body)
>
> The server can choose to use either If-Modified-Since or ETag for freshness validation.
> Alternatively, to simplify the implementation related to the Iceberg REST catalog, it might make sense to define only the more accurate ETag-based validation in the spec.
> For reference, RFC 9110 recommends specifying both ETag and Last-Modified. When both are provided, ETag takes precedence.
>
> Note on Cache-Control Headers:
> The Cache-Control values in the examples above are intended to ensure that the client validates freshness with the server on every request. Writing the header in this extended format is primarily to accommodate outdated HTTP/1.1 implementations. However, under the HTTP/1.1 specification, the following is sufficient:
>
>     Cache-Control: no-cache
>
> That’s all for now.
> Thank you.
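To make the ETag flow above concrete, here is a small Python sketch with an in-memory stand-in for the catalog. Everything named here (`FakeCatalog`, `CachingClient`, `load_table`, the example paths) is hypothetical and only for illustration; it assumes, as suggested elsewhere in this thread, that the metadata.json file name is used as the ETag value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeCatalog:
    """In-memory stand-in for a REST catalog (not a real Iceberg API)."""
    metadata_location: str
    metadata: dict

    def load_table(self, if_none_match: Optional[str] = None):
        # Assumption: the ETag is the quoted metadata file name.
        etag = '"' + self.metadata_location.rsplit("/", 1)[-1] + '"'
        if if_none_match == etag:
            return 304, {"ETag": etag}, None  # unchanged: no body
        return 200, {"ETag": etag}, self.metadata  # new or updated: full body


@dataclass
class CachingClient:
    """Engine-side cache that revalidates on every access."""
    catalog: FakeCatalog
    etag: Optional[str] = None
    cached: Optional[dict] = None
    full_loads: int = 0

    def table_metadata(self) -> dict:
        status, headers, body = self.catalog.load_table(if_none_match=self.etag)
        if status == 200:
            # Replace the cached copy and remember the validator for next time.
            self.etag, self.cached = headers["ETag"], body
            self.full_loads += 1
        return self.cached


catalog = FakeCatalog("s3://bucket/db/t/metadata/00001-aaaa.metadata.json",
                      {"current-snapshot-id": 1})
client = CachingClient(catalog)
client.table_metadata()  # 200: full load
client.table_metadata()  # 304: served from the engine's own cache
```

Each freshness check is a single round-trip, and the metadata body is transferred only when the ETag no longer matches.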
>
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-16 (Sat) 02:51:05 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
> How does HTTP caching handle desynchronized clocks between clients and the server?
>
> At t0, the client gets the latest table version.
> At t1, the server makes a new commit.
> At t2, the client sends a request with a timestamp t2, but due to desynchronization, it refers to t0.
>
> The server may reply with 304 Not Modified, causing the client to think its cache is up-to-date and miss the commit at t1.
>
> Yufei
>
> On Fri, Nov 15, 2024 at 6:37 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
> Hi All,
>
> First of all, it's great to see that there are others who could benefit from a solution to this problem. I appreciate all the comments and feedback so far.
> There were a number of different opinions, so let me start by summarizing the different topics that came up:
>
> New endpoint vs using an existing endpoint:
> Based on the answers (Fokko, Yufei) I had the impression that we should be careful when adding new REST endpoints, and we should examine the re-use of existing endpoints first. Let's do that then, and in case we don't find it feasible we can still fall back to any of my initial proposals (isLatest() or metadataLocation()).
>
> Granularity of freshness checks:
> It was brought up (Dmitri) that we might not want to do the metadata freshness checks solely based on metadata location, but we should consider doing more granular freshness checks. I personally don't see much benefit in designing this solution like that, TBH, but seeing some use-cases could help us understand the motivation here.
> Let me share my opinion on some of the arguments:
>
> "A change in metadata location does not necessarily mean a change in metadata content"
>
> AFAIK whenever Iceberg creates a new metadata file there is some change in the metadata itself. There might not be a new snapshot, though, e.g. in the case of a schema/partition evolution. But even in these cases triggering a table reload could make sense to me (e.g. for answering SHOW CREATE TABLE and similar queries). Additionally, I'd assume the number of metadata location changes that don't create a new snapshot is too small to optimize for.
> Dmitri, let me know if I misunderstood something.
>
> "it may still be beneficial to permit the client to ask for changes to specific areas of metadata"
>
> This seems like a use-case that the partial metadata loading proposal could solve. Identifying which specific part of the metadata needs to be loaded seems like overkill to design into my proposal, if this is what you have in mind. Also I found that the partial metadata loading proposal faces serious headwinds, so I wouldn't rely on it at the moment.
>
> Re-using tableExists
> I think there is a consensus here that tableExists returning a metadata location could work but seems more like a workaround and could be misleading for the users.
>
> Partial metadata loading could solve this:
> (Yufei) I agree, it would be perfect for my use-case and I'm following the discussion on the proposal. However, as I wrote above, it seems to me that the proposal faces serious headwinds now and I honestly wouldn't expect a solution in the short term. But solving the freshness problem is more urgent, not just for myself and Impala but apparently for many other stakeholders in the community, judging by the interest in this thread.
> Hence, I propose to come up with a separate solution for freshness checks, and we can still move to using partial metadata loading once that's out.
>
> Use HTTP caching and If-Modified-Since with loadTable
> This solution seems to do the trick for us. Let me do some research myself to see if there are any difficulties implementing this. Currently, I have more questions than answers wrt this approach :)
> - The initial problem is to answer freshness questions for the cached tables on the client side. If we introduce HTTP caching, wouldn't we introduce the same problem but on a different level of representation? We'd then need to decide the freshness/staleness of the cached data in the HTTP layer.
> - If we cache the HTTP responses for a loadTable then we essentially cache the content of the metadata.jsons including the snapshot and metadata log and everything, plus the snapshot list (and I think the manifests for the latest snapshot). I believe that the size of this can easily reach the low megabytes range in memory, so in total keeping them in the HTTP cache for all the tables we have queried can easily mean that we keep a couple of GBs in memory just for this purpose.
> For engines that already cache table metadata, wouldn't this mean that we will cache some parts of the metadata redundantly?
> - How would we decide the max-age of cached table metadata in the HTTP cache? Would it be configurable so that each engine could use whatever it prefers?
>
> Sorry if any of the questions don't make sense, I just want to make sure I understand all the aspects of this approach.
>
> An additional topic I have in mind:
> REST catalog vs other catalogs:
> Now we are focusing our discussion on the REST spec, but I think it would be beneficial to extend our focus and cover other catalog implementations too.
> I don't think that this problem of data freshness is specific to the REST catalog; it could affect any table in any other catalog too.
>
> I'll continue my investigation wrt the proposals, I just wanted to flush out and sum up what we have now before the weekend.
>
> Regards,
> Gabor
>
> On Fri, Nov 15, 2024 at 10:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi,
>
> I like the idea and it makes sense. As long as it's clearly stated in the spec (using If-Modified-Since header and 304 status code), it looks good to me.
>
> Thanks !
> Regards
> JB
>
> On Fri, Nov 15, 2024 at 1:58 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
> > Hi,
> >
> > (Apologies if this email is a duplicate. This is my third attempt.)
> >
> > I also need a way to ensure that my table data is up-to-date. For now, I’m handling this by setting an expiration period after which I fetch the data again, regardless of its freshness.
> >
> > Here are my thoughts on the current suggestions. Please correct me if I've misunderstood any of the points.
> >
> > - isLatest(): This function could be inefficient since it would require an additional round-trip to fetch the metadata if it’s not up-to-date. This would result in two round-trips overall, which seems suboptimal.
> > - metadataLocation(): This has an issue similar to isLatest(). BTW, the REST catalog API documentation for the LoadTableResult schema states, "Clients can check whether metadata has changed by comparing metadata locations after the table has been created." (https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175) This suggests that if the metadata location has changed, the metadata can be considered updated.
> > - tableExists(): Based on the name, this function seems to serve a different purpose.
> >
> > Here is my suggestion:
> >
> > Since HTTP has built-in caching features (https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST catalogs operate over HTTP, it seems natural to leverage HTTP caching mechanisms. For example, HTTP includes the If-Modified-Since header and the 304 Not Modified status code. Using this approach, we could achieve data freshness with a single round-trip, fetching updated data only if there are modifications.
> >
> > What do you think about defining the spec in this direction?
> >
> > Thank you.
> >
> >
> > -----Original Message-----
> > From: "Yufei Gu" <flyrain...@gmail.com>
> > To: <dev@iceberg.apache.org>;
> > Cc:
> > Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
> > Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
> >
> > Hi Gabor,
> >
> > Thanks for the proposal! Impala isn’t unique in needing this; I've seen similar requirements from other engines.
> >
> > As others pointed out, using the “tableExists” endpoint seems like a workaround. I don't consider it a permanent way forward. We could address this by either modifying the current load table endpoint or introducing a new one, but ideally, we should avoid adding endpoints for every specific need. With that, partial metadata loading seems like a strong approach here, though we will need some agreement on it. I'd suggest the community consider the use cases seriously. We need a way forward.
> >
> > I’m also not too concerned about using metadata file paths to verify the latest table version; clients can simply extract metadata filenames, which include the UUID.
> >
> > Yufei
> >
> > On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > Hi Fokko
> >
> > I like the idea, but I think it's more of a workaround and could be confusing for users :)
> >
> > Regards
> > JB
> >
> > On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> > >
> > > Hey Gabor,
> > >
> > > Thanks for raising this. While reading this, my first thought is to leverage the `tableExists` operation:
> > > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> > >
> > > This doesn't return anything today, but we could return a payload pointing to the latest metadata.json.
> > >
> > > Looking forward to what others think.
> > >
> > > Kind regards,
> > > Fokko
> > >
> > > On Tue, 12 Nov 2024 at 14:33, Shani Elharrar <sh...@upsolver.com.invalid> wrote:
> > >>
> > >> I recommend option (b), provided there is no partial metadata loading. We implemented option (b) internally to facilitate partial metadata loading, as we have tables with hundreds of thousands of snapshots. This results in metadata that occupies approximately 500 MB in memory (excluding the JsonNodes), which is a significant load for some of our services.
> > >>
> > >> Shani.
> > >>
> > >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> > >>
> > >> Hey Iceberg Community,
> > >>
> > >> Background:
> > >> Impala is designed to cache the Iceberg table metadata (BaseTable objects in practice) for faster access. Currently, Impala is tightly coupled with HMS and in turn with the HiveCatalog, and in order to keep the cached table objects up-to-date there is a notification mechanism driven by HMS to notify Impala about any changes in the table metadata.
> > >> The Impala community is actively looking for ways to decouple HMS from Impala, to provide a way to use Impala without the need for HMS, and to get the Iceberg table metadata from other catalog implementations, mainly focusing now on REST catalogs.
> > >>
> > >> Problem to solve:
> > >> We identified a particular missing functionality in the current REST spec: for engines that cache table metadata, there is currently no way to check whether that table metadata is up-to-date, and whether the engine should reload the metadata for that table, without getting a whole table object from the catalog. For this I think the REST catalog (but in fact I think this could apply to any other catalog) should be able to answer a question like:
> > >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> > >>
> > >> Proposal:
> > >> I've been following the discussion about partial metadata loading that could also be used to answer the above question, but I have the impression now that the conversation has stopped making progress.
> > >> So instead of waiting for partial metadata loading I propose to have an addition to the REST spec now to answer the question I raised above:
> > >>
> > >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> > >> b) String metadataLocation(TableIdentifier ident);
> > >>
> > >> Either of the two approaches above could help engines decide whether they have to invalidate/reload particular table metadata in the cache. I personally would go for option a) but would be open to hearing other opinions.
> > >>
> > >> I'd like to know if the community could support me extending the REST spec with either of the two options.
> > >>
> > >> Regards,
> > >> Gabor
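For readers comparing options a) and b) from the original proposal, the following Python sketch shows how an engine-side cache might use either one. `InMemoryCatalog`, `commit`, and the two `needs_reload_*` helpers are hypothetical names invented for this illustration; only `isLatest` and `metadataLocation` come from the proposal, rendered here in Python naming style.

```python
class InMemoryCatalog:
    """Toy catalog that only tracks the current metadata location per table."""

    def __init__(self):
        self._locations = {}

    def commit(self, ident, metadata_location):
        # Hypothetical helper: record a new metadata file for the table.
        self._locations[ident] = metadata_location

    def is_latest(self, ident, metadata_location):
        # Option a): the catalog does the comparison and returns a boolean.
        return self._locations[ident] == metadata_location

    def metadata_location(self, ident):
        # Option b): the catalog returns the location, the client compares.
        return self._locations[ident]


def needs_reload_a(catalog, ident, cached_location):
    return not catalog.is_latest(ident, cached_location)


def needs_reload_b(catalog, ident, cached_location):
    return catalog.metadata_location(ident) != cached_location


catalog = InMemoryCatalog()
catalog.commit("db.t", "s3://bucket/db/t/metadata/00001-aaaa.metadata.json")
cached = catalog.metadata_location("db.t")
catalog.commit("db.t", "s3://bucket/db/t/metadata/00002-bbbb.metadata.json")
stale = needs_reload_a(catalog, "db.t", cached)  # True after the second commit
```

In both variants a stale result still requires a second call to load the table, which is the extra round-trip that the later replies in this thread try to avoid with conditional requests on loadTable.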