Hi Taeyun,

Thank you for the clear explanation.

I agree that the ETag solution is more suitable. If we go that way, I'd
propose using a custom version identifier as the ETag, for instance the
metadata.json file name.
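
A minimal sketch of that idea on the server side, assuming the metadata file
name changes on every commit and contains no double quotes. The helper names
etag_for and respond_to_load_table are hypothetical and not part of the REST
spec; the If-None-Match handling is deliberately simplified.

```python
import posixpath


def etag_for(metadata_location: str) -> str:
    """Derive an ETag from the metadata file name (hypothetical helper).

    Assumes the file name changes on every commit and contains no double
    quotes, so it can be wrapped in quotes as an opaque entity tag.
    """
    return '"' + posixpath.basename(metadata_location) + '"'


def respond_to_load_table(metadata_location: str, if_none_match=None):
    """Return (status, headers) for a loadTable request.

    Simplified: If-None-Match is treated as "*" or a comma-separated list of
    tags, and a weak "W/" prefix is ignored for the comparison.
    """
    current = etag_for(metadata_location)
    headers = {"ETag": current}
    if if_none_match is not None:
        candidates = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):
                tag = tag[2:]
            candidates.append(tag)
        if "*" in candidates or current in candidates:
            return 304, headers  # client copy is current; no body needed
    return 200, headers  # caller attaches the full table metadata as the body
```

The server only needs the current metadata location to answer, so an
unchanged table costs no metadata read or serialization.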

To summarize, HTTP caching relies on headers (e.g., ETag or Last-Modified)
to validate whether a version is up-to-date, whereas the alternative
approach proposed above uses an additional parameter for verification. From
my perspective, there isn’t a fundamental difference between the two, so
I’m OK with either.
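
To make the comparison concrete, here is a rough sketch of the header-based
variant from the client's point of view. CachedTableLoader and the send
callable are hypothetical stand-ins for a real client and its HTTP call. The
parameter-based alternative would look the same except that the token would
travel as a request parameter instead of in If-None-Match.

```python
class CachedTableLoader:
    """Client-side cache that revalidates with If-None-Match (illustrative)."""

    def __init__(self, send):
        # send(table, headers) -> (status, response_headers, body) is a
        # hypothetical transport hook standing in for the real HTTP call.
        self._send = send
        self._cache = {}  # table name -> (etag, metadata)

    def load(self, table):
        cached = self._cache.get(table)
        request_headers = {}
        if cached is not None and cached[0] is not None:
            request_headers["If-None-Match"] = cached[0]
        status, response_headers, body = self._send(table, request_headers)
        if status == 304 and cached is not None:
            return cached[1]  # unchanged: reuse the metadata we already hold
        if status == 200:
            self._cache[table] = (response_headers.get("ETag"), body)
            return body
        raise RuntimeError(f"unexpected status {status} for {table}")
```

Either way the freshness check and the reload happen in a single round-trip.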

A few points to note:

   1. Both approaches would require changes to the "loadTable" endpoint.
   2. A minor advantage of HTTP caching is that it integrates seamlessly
   with browsers, but since most clients of the Iceberg REST catalog aren’t
   browsers, this may not be a significant factor.
   3. I’d also recommend considering the requirement to retrieve multiple
   tables (e.g., all tables under a namespace, or a list of table names) from
   the catalog. That would require a new endpoint and may not work with HTTP
   caching.
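
For point 3, one possible shape of a multi-table check, purely as an
illustration: the client sends the version token it holds for each table and
gets back only the entries that differ. The function stale_tables and both of
its inputs are hypothetical; nothing like this exists in the REST spec today.

```python
def stale_tables(client_versions, catalog_versions):
    """Return {table: latest_token} for tables whose token differs.

    client_versions:  table name -> token the client currently holds
    catalog_versions: table name -> token currently stored in the catalog
    A value of None in the result means the table no longer exists.
    """
    stale = {}
    for table, known in client_versions.items():
        latest = catalog_versions.get(table)  # None if the table is gone
        if latest != known:
            stale[table] = latest
    return stale
```

A single response covers many resources here, which is why per-resource
validators such as ETag do not map onto it directly.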

Let me know your thoughts or if there’s anything else to consider.
Yufei


On Sun, Nov 17, 2024 at 6:43 PM Taeyun Kim <taeyun....@innowireless.com>
wrote:

> Hi,
>
> To Gabor:
> It doesn’t seem necessary to interpret HTTP caching literally in this
> context.
> Simply using the HTTP headers defined by HTTP caching to check the
> freshness of metadata should be sufficient.
> There’s no requirement for the client to duplicate or store cached HTTP
> responses.
>
> To Yufei:
> As I understand it, the client doesn’t send its own timestamp but instead
> uses the timestamp originally provided by the server in the Last-Modified
> header.
> Therefore, clock synchronization issues should not be a concern.
>
> Here’s the general flow of HTTP cache validation based on
> If-Modified-Since:
>
> - Client: initial request:
>
> GET (url) HTTP/1.1
>
> - Server response:
>
> HTTP/1.1 200 OK
> Last-Modified: (date1)
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (with response body)
>
> - Client: validation request:
>
> GET (url) HTTP/1.1
> If-Modified-Since: (date1)
>
> - Server response (if unchanged):
>
> HTTP/1.1 304 Not Modified
> Last-Modified: (date1)
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (without response body)
>
> - Server response (if updated):
>
> HTTP/1.1 200 OK
> Last-Modified: (date2)
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (with response body)
>
> However, using time-based freshness checks can present challenges, such as
> parsing time formats or synchronizing file update times across servers.
> To address these issues, HTTP cache validation based on ETag is also
> defined in the specification.
>
> Here’s the flow for ETag-based validation:
>
> - Client: initial request:
>
> GET (url) HTTP/1.1
>
> - Server response:
>
> HTTP/1.1 200 OK
> ETag: "(arbitrary string 1 generated by the server)"
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (with response body)
>
> - Client: validation request:
>
> GET (url) HTTP/1.1
> If-None-Match: "(arbitrary string 1 generated by the server)"
>
> - Server response (if unchanged):
>
> HTTP/1.1 304 Not Modified
> ETag: "(arbitrary string 1 generated by the server)"
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (without response body)
>
> - Server response (if updated):
>
> HTTP/1.1 200 OK
> ETag: "(arbitrary string 2 generated by the server)"
> Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
> (with response body)
>
> The server can choose to use either If-Modified-Since or ETag for
> freshness validation.
> Alternatively, to simplify the implementation related to the Iceberg REST
> catalog, it might make sense to define only the more accurate ETag-based
> validation in the spec.
> For reference, RFC 9110 recommends specifying both ETag and Last-Modified.
> When both are provided, ETag takes precedence.
>
> Note on Cache-Control Headers:
> The Cache-Control values in the examples above are intended to ensure that
> the client validates freshness with the server on every request. Writing
> the header in this extended format is primarily to accommodate outdated
> HTTP/1.1 implementations. However, under the HTTP/1.1 specification, the
> following is sufficient:
>
> Cache-Control: no-cache
>
> That’s all for now.
> Thank you.
>
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-16 (Sat) 02:51:05 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
>
>
> How does HTTP caching handle desynchronized clocks between clients and the
> server?
>
> At t0, the client gets the latest table version.
> At t1, the server makes a new commit.
> At t2, the client sends a request with a timestamp t2, but due to
> desynchronization, it refers to t0.
>
> The server may reply with 304 Not Modified, causing the client to think
> its cache is up-to-date and miss the commit at t1.
>
>
>
> Yufei
>
>
>
>
> On Fri, Nov 15, 2024 at 6:37 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
> Hi All,
>
>
> First of all, it's great to see that there are others who could benefit
> from a solution to this problem. I appreciate all the comments and
> feedback so far.
> There were a number of different opinions, so let me start with
> summarizing the different topics that came up:
>
>
> New endpoint vs using an existing endpoint:
> Based on the answers (Fokko, Yufei) I had the impression that we should be
> careful when adding new REST endpoints, and we should examine the re-use of
> existing endpoints first. Let's do that then, and in case we don't find it
> feasible then we can still fall back to any of my initial proposals
> (isLatest() or metadataLocation()).
>
>
> Granularity of freshness checks:
> It was brought up (Dmitri) that we might not want to do the metadata
> freshness checks solely based on metadata location, but we should consider
> doing more granular freshness checks. I personally don't see much benefit
> of designing this solution like that, TBH, but seeing some use-cases could
> help us understand the motivation here.
> Let me share my opinion on some of the arguments:
>
>
> "A change in metadata location does not necessarily mean a change in
> metadata content"
>
>
> AFAIK whenever Iceberg creates a new metadata file there is some change in
> the metadata itself. There might not be a new snapshot, though, e.g. in the
> case of schema/partition evolution. But even in these cases triggering a
> table reload could make sense to me (e.g. answering SHOW CREATE TABLE and
> similar queries). Additionally, I'd assume the number of metadata location
> changes that don't create a new snapshot is too negligible to optimize for.
> Dmitri, let me know if I misunderstood something.
>
>
> "it may still be beneficial to permit the client to ask for changes to
> specific areas of metadata"
>
> This seems like a use-case that the partial metadata loading proposal
> could solve. Designing my proposal to also identify which specific part of
> the metadata needs to be reloaded seems like overkill, if this is what you
> have in mind. Also, I found that the partial metadata loading proposal
> faces serious headwinds, so I wouldn't rely on it at the moment.
>
>
> Re-using tableExists
> I think there is a consensus here that tableExists returning a metadata
> location could work but seems more like a workaround and could be
> misleading for the users.
>
> Partial metadata loading could solve this:
> (Yufei) I agree, it would be perfect for my use-case and I'm following the
> discussion on the proposal. However, for me it seems, as I wrote above,
> that the proposal faces serious headwinds now and I honestly wouldn't
> expect a solution in the short term. But the freshness problem is more
> urgent to solve, not just for me and Impala but apparently for many other
> stakeholders in the community, judging by the interest in this thread.
> Hence, I propose to come up with a separate solution for freshness checks,
> and we can still move to using partial metadata loading once that's out.
>
>
> Use HTTPCache and If-Modified-Since with loadTable
> This solution seems to do the trick for us. Let me do some research myself
> to see if there are any difficulties implementing this. Currently, I have
> more questions than answers wrt this approach :)
> - The initial problem is to answer freshness questions for the cached
> tables on the client side. If we introduce HTTP caching, wouldn't we
> introduce the same problem at a different level? We'd then need to decide
> the freshness/staleness of the cached data in the HTTP layer.
> - If we cache the HTTP responses for a loadTable then we essentially cache
> the content of the metadata.jsons including the snapshot and metadata log
> and everything, plus the snapshot list (and I think the manifests for the
> latest snapshot). I believe that the size of this can easily reach the low
> megabytes range in memory, so in total keeping them in the HTTP Cache for
> all the tables we have queried can easily mean that we keep a couple of GBs
> in memory just for this purpose.
> For engines that already cache table metadata wouldn't this mean that we
> will cache some parts of the metadata redundantly?
> - How would we decide what is the max-age of a cached table metadata in
> the HTTP Cache? Would it be configurable so that each engine could use
> whatever it prefers?
>
>
> Sorry if any of the questions don't make sense, I just want to make sure
> I understand all the aspects of this approach.
>
>
> An additional topic I have in mind:
> REST catalog vs other catalogs:
> Now we are focusing our discussion on the REST spec, but I think it would
> be beneficial to extend our focus and cover other catalog implementations
> too. I don't think that this problem of data freshness is specific to REST
> catalog, it could affect any table in any other catalog too.
>
>
> I'll continue my investigation wrt the proposals, I just wanted to flush
> out and sum up what we have now before the weekend.
>
>
> Regards,
> Gabor
>
>
>
>
> On Fri, Nov 15, 2024 at 10:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> Hi,
>
> I like the idea and it makes sense. As long as it's clearly stated in
> the spec (using If-Modified-Since header and 304 status code), it
> looks good to me.
>
> Thanks !
> Regards
> JB
>
> On Fri, Nov 15, 2024 at 1:58 AM Taeyun Kim <taeyun....@innowireless.com>
> wrote:
> >
> > Hi,
> >
> > (Apologies if this email is a duplicate. This is my third attempt.)
> >
> > I also need a way to ensure that my table data is up-to-date. For now,
> I’m handling this by setting an expiration period after which I fetch the
> data again, regardless of its freshness.
> >
> > Here are my thoughts on the current suggestions. Please correct me if
> I've misunderstood any of the points.
> >
> > - isLatest(): This function could be inefficient since it would require
> an additional round-trip to fetch the metadata if it’s not up-to-date. This
> would result in two round-trips overall, which seems suboptimal.
> > - metadataLocation(): This has a similar issue as isLatest(). BTW,
> according to the REST catalog API documentation for LoadTableResult schema,
> it states, "Clients can check whether metadata has changed by comparing
> metadata locations after the table has been created." (
> https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
> This suggests that if the metadata location has changed, the metadata can
> be considered updated.
> > - tableExists(): Based on the name, this function seems to serve a
> different purpose.
> >
> > Here is my suggestion:
> >
> > Since HTTP has built-in caching features (
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST
> catalogs operate over HTTP, it seems natural to leverage HTTP caching
> mechanisms. For example, HTTP includes the If-Modified-Since header and the
> 304 Not Modified status code. Using this approach, we could achieve data
> freshness with a single round-trip, fetching updated data only if there are
> modifications.
> >
> > What do you think about defining the spec in this direction?
> >
> > Thank you.
> >
> >
> >
> >
> > -----Original Message-----
> > From: "Yufei Gu" <flyrain...@gmail.com>
> > To: <dev@iceberg.apache.org>;
> > Cc:
> > Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
> > Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the
> latest
> >
> >
> >
> > Hi Gabor,
> >
> > Thanks for the proposal! Impala isn’t unique in needing this—I've seen
> similar requirements from other engines.
> >
> > As others pointed out, using the “tableExists” endpoint seems like a
> workaround. I don't consider it a permanent way forward. We could address
> this by either modifying the current load table endpoint or introducing a
> new one, but ideally, we should avoid adding endpoints for every specific
need. With that, partial metadata loading seems like a strong approach
here, though we will need some agreement on it first. I'd suggest the community
> consider the use cases seriously. We need a way forward.
> >
> > I’m also not too concerned about using metadata file paths to verify the
> latest table version; clients can simply extract metadata filenames, which
> include the UUID.
> >
> > Yufei
> >
> >
> >
> >
> > On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >
> > Hi Fokko
> >
> > I like the idea, but I think it's more a workaround and could be
> > confusing for users :)
> >
> > Regards
> > JB
> >
> > On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org>
> wrote:
> > >
> > > Hey Gabor,
> > >
> > > Thanks for raising this. While reading this, my first thought is to
> leverage the `tableExists` operation:
> > >
> https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> > >
> > > This doesn't return anything today, but we could return a payload to
> the latest metadata.json.
> > >
> > > Looking forward to what others think.
> > >
> > > Kind regards,
> > > Fokko
> > >
> > >
> > >
> > >
> > > > On Tue, Nov 12, 2024 at 2:33 PM Shani Elharrar
<sh...@upsolver.com.invalid> wrote:
> > >>
> > >> I recommend option (b), provided there is no partial metadata
> loading. We implemented option (b) internally to facilitate partial
> metadata loading, as we have tables with hundreds of thousands of
> snapshots. This results in metadata that occupies approximately 500 MB in
> memory (excluding the JsonNodes), which is a significant load for some of
> our services.
> > >>
> > >> Shani.
> > >>
> > >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org>
> wrote:
> > >>
> > >> Hey Iceberg Community,
> > >>
> > >> Background:
> > >> Impala is designed in a way to cache the Iceberg table metadata
> (BaseTable objects in practice) for faster access. Currently, Impala is
> tightly coupled with HMS and in turn with the HiveCatalog, and in order to
> keep the cached table objects up-to-date there is a notification mechanism
> driven by HMS to notify Impala about any changes in the table metadata.
> > >> The Impala community is actively looking for ways to decouple HMS
> from Impala and provide a way to use Impala without the need for HMS, and
get the Iceberg table metadata from other catalog implementations, mainly
focusing now on REST catalogs.
> > >>
> > >> Problem to solve:
> > >> We identified a particular missing functionality in the current REST
> spec: For engines that cache table metadata currently there is no way to
> check if that table metadata is up-to-date or not, and whether the engine
> should reload the metadata for that table or not without getting a whole
> table object from the catalog. For this I think the REST catalog (but in
> fact I think this could apply to any other catalogs) should be able to
> answer a question like:
> > >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> > >>
> > >> Proposal:
> > >> I've been following the discussion about partial metadata loading
> that could be also used to answer the above question, but I have the
> impression now that the conversation stopped making any progress.
> > >> So instead of waiting for partial metadata loading I propose to have
> an addition to the REST spec now to answer the question I raised above:
> > >>
> > >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> > >> b) String metadataLocation(TableIdentifier ident);
> > >>
> > >> Any of the above 2 approaches could help engines to decide if they
> have to invalidate/reload particular table metadata in the cache. I
> personally would go for option a) but would be open to hear other opinions.
> > >>
> > >> I'd like to know if the community could support me extending the REST
> spec with any of the 2 options.
> > >>
> > >> Regards,
> > >> Gabor
> > >>
> > >>
