Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Taeyun Kim Wed, 20 Nov 2024 17:01:21 -0800

Hi Gabor,

On HTTP Caching:

If an HTTP client library performs caching by default and doesn’t allow 
disabling it, I believe that library shouldn’t be used - at least in the 
context of this discussion.
The kind of HTTP client library I have in mind is one that handles encoding and 
decoding of HTTP headers and body, as well as connection pooling. The 
responsibility for interpreting headers, status, and body content should remain 
with the application. While caching support can be provided, it should be 
optional.
When using a library that behaves as I described, the issues you mentioned in 
points 4) and 5) shouldn’t arise, as the library wouldn’t interfere with 
caching.
For reference, the Rust reqwest crate (which Iceberg-Rust appears to use) seems 
to operate as expected in this regard.

On Programming Languages and APIs:

One of my points was that there doesn’t seem to be a reason to center the 
discussion around Java (and its libraries).
BTW, I don’t think it’s necessary for the functions in the iceberg-rust library 
to be identical to those in the Java library. Optimal solutions may vary by 
language, and library developers may have different design goals.
Personally, my primary focus is on the REST catalog API specification, rather 
than language-specific library APIs. (To avoid confusion, I’ll refer to the 
REST catalog API as the "specification" from here on.)
Library APIs are (merely) implementations designed to make the specification 
easier to use. Since libraries are open-source, I can modify them as needed for 
my use case (and, in fact, I’ve made modifications to iceberg-rust for my 
purposes). However, the specification defines the interface between different 
applications or servers, making it immutable for practical purposes.

On ETags:

The decision to use ETags is not just an implementation detail - it is part of 
the specification itself. In my view, it is far more significant than the 
signature of a library API function. I’ve outlined the reasons for this above.

On the Proposal:

I agree that the current function (loadTable(TableIdentifier)) cannot be 
freshness-aware. This is expected, as the caller doesn’t provide the version it 
holds, leaving the callee with no basis for comparison.
On the other hand, the proposed new function signature doesn’t seem to provide 
a way for the caller to supply ETags (or equivalent identifiers representing 
specific table versions for other catalog types). Is such information intended 
to be embedded within the Table structure?
To me, it seems clearer to explicitly provide such information (like ETags) 
rather than embedding it in the Table structure. That said, I might be 
misunderstanding the intention here.

Thank you.

-----Original Message-----
From: "Gabor Kaszab" <gaborkas...@apache.org>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2024-11-19 (화) 21:26:01 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Hi,

Thanks for sharing your view, Taeyun! I think there are many levels of 
representation here and we might not mean the same with our points. I think in 
general an interaction between a query engine and an Iceberg REST catalog has 
these different layers:
1) The engine (Impala, Spark, Trino, etc.).
2) Catalog API of the Iceberg lib offers loadTable(TableIdentifier) that 
returns a Table object. Different language implementations seem to have the 
same API (Java, Rust, etc.).
3) The particular implementation of a catalog that implements the above 
loadTable(TableIdentifier) function. In this example the RESTCatalog / 
RESTSessionCatalog.
4) RESTClient implemented by HTTPClient (used by the REST catalog) to 
communicate with the REST server (still implemented within Iceberg)
5) The external HTTPClient 
(org.apache.hc.client5.http.impl.classic.CloseableHttpClient) that orchestrates 
the HTTP traffic between the client and the server

Let me reflect on your comments:
- HTTP caching
With the above layers in mind if I'm not mistaken HTTP Caching would be 
configured in 4) and the actual caching of HTTP responses would be in 5). This 
is what I meant by HTTP layer. With HTTP Caching the control of how long a 
cached TableMetadata is stored will no longer be in 1) and would be in 4) - 5). 
I don't think that any of the engines that cache table metadata would want to 
have this loss of control.

- Programming language
I'm not sure I get your point with this. The Catalog API seems the same 
regardless of programming language (See the links for 2) ).

- ETags
An ETag is an implementation detail that is relevant for HTTP communication. We 
can't extend the Catalog API in 2) nor in 3) with functions that are aware of 
ETags (e.g. return ETags or accept ETags as param). Those APIs are common for 
all the Catalog implementations including ones that don't leverage ETags for 
HTTP traffic.

Proposal:
- Catalog API
I don't think that the current Catalog.loadTable(TableIdentifier) API in 2) is 
suitable for a freshness-aware table loading use case. It wouldn't be 
transparent to the clients if that actual catalog implementation avoids 
reloading the table if it hasn't changed or if that catalog implementation 
reloads the table unconditionally with this API call.
Also it doesn't seem straightforward what the API should return if the table is 
considered fresh. This API returns a Table object and in case we get a 304 
without a body from the catalog server, we won't have a way to construct a 
Table object as a return value for this API. I already shared my concerns for 
caching the LoadTableResponses within 4) - 5)

So I think we need a new API on the Catalog for this purpose. Thanks Zoltan for 
the naming suggestion, after I sent my mail yesterday I also thought that I 
could've come up with a more intuitive name.
This can either be:
a) Table reloadTable(TableIdentifier, Table)
b) Table loadTableIfChanged(TableIdentifier, Table)

With this Catalog level API we could provide the current known state of that 
particular table as a parameter, and if the client side of the catalog 
implementation finds that the table hasn't changed it can return this Table 
object for the current state. With this approach no caching would be needed in 
2) - 5). It's up to the catalog implementation how it finds out if the table 
has been changed or not.

- REST API, REST spec
The REST API could use the ETag approach to check table freshness. As described 
in previous mails this could reduce the number of round trips to refresh a 
table to one without the need of separately checking the freshness. We could 
use the same endpoint as we do for the current loadTable(), with an additional 
mention of an optional ETag being used and also including the 304 into the 
possible responses.

For this approach we have to cache the [TableIdentifier -> last ETag] mapping 
somewhere. I think 4), the Iceberg specific HTTPClient could be suitable for 
this purpose, however, this seems too low level for this purpose. We can also 
consider RESTSessionCatalog to cache the ETags.

This is something to be considered, but for REST catalog implementations that 
don't support the ETag based implementation, they would just simply perform a 
regular loadTable operation, not bothering with sending 304 codes. We can also 
investigate if we should get an exception if that particular REST 
implementation doesn't support the ETag approach, so that clients could notice 
if there is no freshness-aware table loading under the hood.

- Other catalog types
Currently we focus on the REST catalog implementation but the above Catalog API 
proposal could work for other catalog types too. The internal implementation 
could be different, though. Initially they could throw a 
NotImplementedException.

I hope this makes sense and I haven't missed any details or previous comments.

Regards,
Gabor

On Tue, Nov 19, 2024 at 5:17 AM Taeyun Kim <taeyun....@innowireless.com> wrote:

Hi,

Here are my thoughts:

- HTTP Layer: To my knowledge, there isn’t a separate "HTTP layer" in this 
context, so concerns about control over caching shouldn’t be an issue. The 
header approach I mentioned simply involves handling additional headers when 
using HTTP client libraries to interact with the REST API.

- Programming Language: For reference, I don’t use Java - I use Rust and C++. 
Personally, I hope Iceberg’s specifications avoid including Java-specific 
features and that the cross-language compatible REST catalog becomes the 
primary catalog for Iceberg.

- API Perspective: Given the above, I may not be in the best position to 
comment on Java APIs. However, regarding Gabor’s proposed API (Table 
loadTable(Table existingTable)), I believe it would be difficult for an 
ETag-based REST catalog to support this API since it cannot provide an ETag. 
Instead, I’d like to suggest an alternative API:
Option<Table, tag> loadTableIfNoneMatch(TableIdentifier, Option<tag>)
Initially, the client would provide None as the tag. If the tag is not None and 
matches the latest table tag, the API returns None (= not updated). If the tag 
is None or does not match the latest table tag, the API returns a new (Table, 
tag) pair.

Thank you.

-----Original Message-----
From: "Zoltán Borók-Nagy" <borokna...@cloudera.com.invalid>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2024-11-19 (화) 03:16:05 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Hey Everyone,

Thanks Gábor, I think the proposed interface would be very useful to any engine 
that employs caching, e.g. Impala.
And it is pretty neat that it is catalog-agnostic, i.e. we just give all the 
information we have about the table and let the catalog implementation 
efficiently reload it.

I might have a nitpick suggestion about the name to clearly express the intent: 
loadTable -> reloadTable (or, refreshTable)

Cheers,
    Zoltan

On Mon, Nov 18, 2024 at 5:17 PM Gabor Kaszab <gaborkas...@cloudera.com.invalid> 
wrote:

Hi Iceberg Community,

This is a great conversation so far, and thanks everyone for the valuable 
inputs!
I'd like to articulate 2 things that we have to keep in mind with the design:

1: There are 2 interfaces here that we should consider:
What I mean by this is that so far we have been talking about the REST spec, 
more narrowly the HTTP communication between Iceberg's REST client and the REST 
server. I think the proposed solution with the ETag absolutely makes sense 
within this context.
However, the usual way of a client interacting with an Iceberg catalog 
(including REST) is the Catalog API in the library. This API offers a 
loadTable(TableIdentifier) function that returns a Table object. With the above 
HTTP-based solution in mind I don't think we could give any meaningful results 
if the HTTP layer finds that the table hasn't changed. I argued already against 
pushing the caching responsibilities from the clients into the HTTP layer 
(mostly because of losing the control over the cache, and also observability 
won't be straightforward) so let's assume for now that we won't do caching in 
the HTTP layer, only execute the loadTable calls to the REST catalog by setting 
the ETag. In case we get a 304 we won't be able to construct a Table object to 
answer the Catalog.loadTable(TableIdentifier) call. We could return null or 
throw an exception but I don't find any of them appropriate.

2: There are catalog types other than REST
I started this conversation focusing on the REST spec, but the more I think of 
this the more I feel that the same functionality should be offered for all the 
other catalog types too. Let's assume that we have an engine that caches table 
metadata and initially uses REST catalog. For such an engine the proposed 
solution would solve the problem of checking table freshnes and also reloading 
the table metadata. A simple code for that could be enough if we configured our 
HTTP client properly (just sketched a simple example):

tableCache_.put(catalog_.loadTable(tableIdentifier));

Also let's assume we solve the issue in 1) and we can answer such a call even 
if we get 304 from the server as the table is unchanged. So with this solution 
with the REST catalog we can be sure that the table is only loaded from the 
catalog if changed (or the age expired). But what if we configure another 
catalog, let's say HiveCatalog. The very same code for that catalog would 
trigger a table reload for every execution causing unexpected performance 
issues.
I have to double check but I assume that this HTTP approach wouldn't be 
feasible for other catalog types unfortunately.

I hope these arguments make sense :)

As a partial solution this is what I have in mind: 
We can add another function into the catalog API for this purpose. Let's say 
something like this:
Table loadTable(Table existingTable);

What advantages I see with this:
- This could solve issue 1) above. In case the table hasn't changed we can 
simply return 'existingTable' without using HTTP Cache.
- The clients wouldn't need to explicitly call for isLatest() and such 
functions to check for freshness, and they wouldn't need to trigger table 
reloading for themselve. This API would be expected to cover this under the 
hood.
- The current Catalog.loadTable(TableIdentifier) API wouldn't be enough for all 
the catalog types on it's own, but with this one each catalog implementations 
(e.g. HiveCatalog, REST catalog, etc.) then can implement their own way of 
doing freshness checks and table reloads. For REST we could follow the HTTP 
ETag approach, while for other catalogs we could follow other approaches.

Regards,
Gabor

On Mon, Nov 18, 2024 at 8:48 AM Shani Elharrar <sh...@upsolver.com.invalid> 
wrote:

You're totally right. Perhaps using a "Content-Location" header might be a 
better fit for that. 

Shani.

On 18 Nov 2024, at 9:27, Taeyun Kim <taeyun....@innowireless.com> wrote:

Hi,

Here are my thoughts:

- The value of ETag is (as far as I know) defined as an opaque string by the 
specification, meaning the client shouldn’t interpret or assign any 
significance to it, regardless of what the server specifies. It’s best to avoid 
the client giving any particular meaning to the ETag value.
- One major advantage of the header approach compared to other methods is that 
if an update has occurred, the updated content can be immediately included in 
the response without requiring an additional request. This saves one 
request-response round-trip (although It’s also possible to define a separate 
endpoint with the same functionality).
- Since the Iceberg REST catalog server is effectively a type of HTTP server, 
at least in theory, it may be expected to handle HTTP cache and 
validation-related processes. The header approach can be seen as leveraging 
this mechanism appropriately.
- The header approach doesn’t have to be limited to the 
/v1/{prefix}/namespaces/{namespace}/tables/{table} endpoint. It could also be 
applied to all GET-based endpoints, though this might broaden the scope 
significantly.

Thank you.

-----Original Message-----
From: "Shani Elharrar" <sh...@upsolver.com.invalid>
To: <dev@iceberg.apache.org>;
Cc: <dev@iceberg.apache.org>;
Sent: 2024-11-18 (월) 16:21:16 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Using the metadata file name as ETag is nice way to go. In that case, adding 
HEAD method support to the loadTable endpoint will return the latest metadata 
pointer, which can be used to support "isLatest" without returning the body. It 
can be also leveraged in order to return the latest metadata location of the 
table.

Shani.

On 18 Nov 2024, at 8:52, Yufei Gu <flyrain...@gmail.com> wrote:

Hi Taeyun,

Thank you for the clear explanation.

I agree that the ETag solution is more suitable. If we were going that way, I'd 
propose a customized version number as an ETag—for instance, leveraging the 
metadata.json file name as the identifier.

To summarize, HTTP caching relies on headers (e.g., ETag or Last-Modified) to 
validate whether a version is up-to-date, whereas the alternative approach 
proposed above uses an additional parameter for verification. From my 
perspective, there isn’t a fundamental difference between the two, so I’m OK 
with either.

A couple of points to note:

Both approaches would require changes to the "loadTable" endpoint.
A minor advantage of HTTP caching is that it integrates seamlessly with 
browsers, but since most clients of the Iceberg REST catalog aren’t browsers, 
this may not be a significant factor.
I’d also recommend considering the requirement to retrieve multiple 
tables(e.g., all tables under a namespace, or a list of table names) from the 
catalog. This requires a new endpoint and may not work with HTTP caching.

Let me know your thoughts or if there’s anything else to consider.

Yufei

On Sun, Nov 17, 2024 at 6:43 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
Hi,

To Gabor:
It doesn’t seem necessary to interpret HTTP caching literally in this context.
Simply using the HTTP headers defined by HTTP caching to check the freshness of 
metadata should be sufficient.
There’s no requirement for the client to duplicate or store cached HTTP 
responses.

To Yufei:
As I understand it, the client doesn’t send its own timestamp but instead uses 
the timestamp originally provided by the server in the Last-Modified header.
Therefore, clock synchronization issues should not be a concern.

Here’s the general flow of HTTP cache validation based on If-Modified-Since:

- Client: initial request:

GET (url) HTTP/1.1

- Server response:

HTTP/1.1 200 OK 
Last-Modified: (date1) 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(with response body)

- Client: validation request:

GET (url) HTTP/1.1 
If-Modified-Since: (date1)

- Server response (if unchanged):

HTTP/1.1 304 Not Modified 
Last-Modified: (date1) 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(without response body)

- Server response (if updated):

HTTP/1.1 200 OK 
Last-Modified: (date2) 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(with response body)

However, using time-based freshness checks can present challenges, such as 
parsing time formats or synchronizing file update times across servers.
To address these issues, HTTP cache validation based on ETag is also defined in 
the specification.

Here’s the flow for ETag-based validation:

- Client: initial request:

GET (url) HTTP/1.1

- Server response:

HTTP/1.1 200 OK 
ETag: "(arbitrary string 1 generated by the server)" 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(with response body)

- Client: validation request:

GET (url) HTTP/1.1 
If-None-Match: "(arbitrary string 1 generated by the server)"

- Server response (if unchanged):

HTTP/1.1 304 Not Modified 
ETag: "(arbitrary string 1 generated by the server)" 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(without response body)

- Server response (if updated):

HTTP/1.1 200 OK 
ETag: "(arbitrary string 2 generated by the server)" 
Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate 
(with response body)

The server can choose to use either If-Modified-Since or ETag for freshness 
validation.
Alternatively, to simplify the implementation related to the Iceberg REST 
catalog, it might make sense to define only the more accurate ETag-based 
validation in the spec.
For reference, RFC 9110 recommends specifying both ETag and Last-Modified. When 
both are provided, ETag takes precedence.

Note on Cache-Control Headers:
The Cache-Control values in the examples above are intended to ensure that the 
client validates freshness with the server on every request. Writing the header 
in this extended format is primarily to accommodate outdated HTTP/1.1 
implementations. However, under the HTTP/1.1 specification, the following is 
sufficient:

Cache-Control: no-cache

That’s all for now.
Thank you.

-----Original Message-----
From: "Yufei Gu" <flyrain...@gmail.com>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2024-11-16 (토) 02:51:05 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

How does HTTP caching handle desynchronized clocks between clients and the 
server?

At t0, the client gets the latest table version.
At t1, the server makes a new commit.
At t2, the client sends a request with a timestamp t2, but due to 
desynchronization, it refers to t0.

The server may reply with 304 Not Modified, causing the client to think its 
cache is up-to-date and miss the commit at t1.

Yufei

On Fri, Nov 15, 2024 at 6:37 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
Hi All,

First of all it's great to see that there are others who could benefit from 
giving a solution to this problem. I appreciate all the comments and feedback 
so far.
There were a number of different opinions, so let me start with summarizing the 
different topics that came up:

New endpoint vs using an existing endpoint:
Based on the answers (Fokko, Yufei) I had the impression that we should be 
careful when adding new REST endpoints, and we should examine the re-use of 
existing endpoints first. Let's do that then, and in case we don't find it 
feasible then we can still fall back to any of my initial proposals (isLatest() 
or metadataLocation()).

Granularity of freshness checks:
It was brought up (Dmitri) that we might not want to do the metadata freshness 
checks solely based on metadata location, but we should consider doing more 
granular freshness checks. I personally don't see much benefit of designing 
this solution like that, TBH, but seeing some use-cases could help us 
understand the motivation here.
Let me share my opinion on some of the arguments:

"A change in metadata location does not necessarily mean a change in metadata 
content"

AFAIK whenever Iceberg creates a new metadata file there is some change in the 
metadata itself. There might not be a new snapshot, though in the cases of e.g. 
a schema/partition evolution. But even in these cases triggering a table reload 
could make sense to me (e.g. answering SHOW CREATE TABLE and similar queries). 
Additionally, I'd assume the number of metadata location changes that don't 
create a new snapshot is too negligible to optimize for.
Dmitri, let me know if I misunderstood something.

"it may still be beneficial to permit the client to ask for changes to specific 
areas of metadata"

This seems like a use-case that the partial metadata loading proposal could 
solve. To identify the need to load a specific part of the metadata with 
partial metadata loading seems an overkill to design with my proposal, if this 
is what you have in mind. Also I found that the partial metadata loading 
proposal faces serious headwinds, so I wouldn't rely on it at the moment.

Re-using tableExists
I think there is a consensus here that tableExists returning a metadata 
location could work but seems more like a workaround and could be misleading 
for the users.

Partial metadata loading could solve this:
(Yufei) I agree, it would be perfect for my use-case and I'm following the 
discussion on the proposal. However, for me it seems, as I wrote above, that 
the proposal faces serious headwinds now and I honestly wouldn't expect a 
solution in the short term. But solving the freshness problems is a more urgent 
thing to solve, not just for myself and Impala but apparently to many other 
stakeholders in the community according to the interest on this thread.
Hence, I propose to come up with a separate solution for freshness checks, and 
we can still move to using partial metadata loading once that's out.

Use HTTPCache and If-Modified-Since with loadTable
This solution seems to do the trick for us. Let me do some research myself to 
see if there are any difficulties implementing this. Currently, I have more 
questions than answers wrt this approach :)
- The initial problem is to answer freshness questions for the cached tables on 
the client side. If we introduce HttpCaching wouldn't we introduce the same 
problem but on a different level of representation. We'd then need to decide 
the freshness/staleness of the cached data in the HTTP layer.
- If we cache the HTTP responses for a loadTable then we essentially cache the 
content of the metadata.jsons including the snapshot and metadata log and 
everything, plus the snapshot list (and I think the manifests for the latest 
snapshot). I believe that the size of this can easily reach the low megabytes 
range in memory, so in total keeping them in the HTTP Cache for all the tables 
we have queried can easily mean that we keep a couple of GBs in memory just for 
this purpose.
For engines that already cache table metadata wouldn't this mean that we will 
cache some parts of the metadata redundantly?
- How would we decide what is the max-age of a cached table metadata in the 
HTTP Cache? Would it be configurable so that each engine could use whatever it 
prefers?

Sorry if any of the questions doesn't make sense, I just want to make sure I 
understand all the aspects of this approach.

An additional topic I have in mind:
REST catalog vs other catalogs:
Now we are focusing our discussion on the REST spec, but I think it would be 
beneficial to extend our focus and cover other catalog implementations too. I 
don't think that this problem of data freshness is specific to REST catalog, it 
could affect any table in any other catalog too.

I'll continue my investigation wrt the proposals, I just wanted to flush out 
and sum up what we have now before the weekend.

Regards,
Gabor

On Fri, Nov 15, 2024 at 10:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi,

I like the idea and it makes sense. As soon as it's clearly stated in
the spec (using If-Modified-Since header and 304 status code), it
looks good to me.

Thanks !
Regards
JB

On Fri, Nov 15, 2024 at 1:58 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> Hi,
>
> (Apologies if this email is a duplicate. This is my third attempt.)
>
> I also need a way to ensure that my table data is up-to-date. For now, I’m 
> handling this by setting an expiration period after which I fetch the data 
> again, regardless of its freshness.
>
> Here are my thoughts on the current suggestions. Please correct me if I've 
> misunderstood any of the points.
>
> - isLatest(): This function could be inefficient since it would require an 
> additional round-trip to fetch the metadata if it’s not up-to-date. This 
> would result in two round-trips overall, which seems suboptimal.
> - metadataLocation(): This has a similar issue as isLatest(). BTW, according 
> to the REST catalog API documentation for LoadTableResult schema, it states, 
> "Clients can check whether metadata has changed by comparing metadata 
> locations after the table has been created." 
> (https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
>  This suggests that if the metadata location has changed, the metadata can be 
> considered updated.
> - tableExists(): Based on the name, this function seems to serve a different 
> purpose.
>
> Here is my suggestion:
>
> Since HTTP has built-in caching features 
> (https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST 
> catalogs operate over HTTP, it seems natural to leverage HTTP caching 
> mechanisms. For example, HTTP includes the If-Modified-Since header and the 
> 304 Not Modified status code. Using this approach, we could achieve data 
> freshness with a single round-trip, fetching updated data only if there are 
> modifications.
>
> What do you think about defining the spec in this direction?
>
> Thank you.
>
>
>
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-13 (수) 03:43:24 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
>
>
> Hi Gamber,
>
> Thanks for the proposal! Impala isn’t unique in needing this—I've seen 
> similar requirements from other engines.
>
> As others pointed out, using the “tableExists” endpoint seems like a 
> workaround. I don't consider it a permanent way forward. We could address 
> this by either modifying the current load table endpoint or introducing a new 
> one, but ideally, we should avoid adding endpoints for every specific need. 
> With that, partial metadata loading seems like a strong approach here, we 
> will need certain agreement though. I'd suggest the community consider the 
> use cases seriously. We need a way forward.
>
> I’m also not too concerned about using metadata file paths to verify the 
> latest table version; clients can simply extract metadata filenames, which 
> include the UUID.
>
> Yufei
>
>
>
>
> On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net> 
> wrote:
>
> Hi Fokko
>
> I like the idea, but I think it's more a workaround and could be
> confusing for users :)
>
> Regards
> JB
>
> On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey Gabor,
> >
> > Thanks for raising this. While reading this, my first thought is to 
> > leverage the `tableExists` operation:
> > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> >
> > This doesn't return anything today, but we could return a payload to the 
> > latest metadata.json.
> >
> > Looking forward to what others think.
> >
> > Kind regards,
> > Fokko
> >
> >
> >
> >
> > Op di 12 nov 2024 om 14:33 schreef Shani Elharrar 
> > <sh...@upsolver.com.invalid>:
> >>
> >> I recommend option (b), provided there is no partial metadata loading. We 
> >> implemented option (b) internally to facilitate partial metadata loading, 
> >> as we have tables with hundreds of thousands of snapshots. This results in 
> >> metadata that occupies approximately 500 MB in memory (excluding the 
> >> JsonNodes), which is a significant load for some of our services.
> >>
> >> Shani.
> >>
> >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>
> >> Hey Iceberg Community,
> >>
> >> Background:
> >> Impala is designed in a way to cache the Iceberg table metadata (BaseTable 
> >> objects in practice) for faster access. Currently, Impala is tightly 
> >> coupled with HMS and in turn with the HiveCatalog, and in order to keep 
> >> the cached table objects up-to-date there is a notification mechanism 
> >> driven by HMS to notify Impala about any changes in the table metadata.
> >> The Impala community is actively looking for ways to decouple HMS from 
> >> Impala and provide a way to use Impala without the need for HMS, and get 
> >> the Iceberg table metadata from other catalog Implementations mainly 
> >> focusing now on REST catalogs.
> >>
> >> Problem to solve:
> >> We identified a particular missing functionality in the current REST spec: 
> >> For engines that cache table metadata currently there is no way to check 
> >> if that table metadata is up-to-date or not, and whether the engine should 
> >> reload the metadata for that table or not without getting a whole table 
> >> object from the catalog. For this I think the REST catalog (but in fact I 
> >> think this could apply to any other catalogs) should be able to answer a 
> >> question like:
> >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> >>
> >> Proposal:
> >> I've been following the discussion about partial metadata loading that 
> >> could be also used to answer the above question, but I have the impression 
> >> now that the conversation stopped making any progress.
> >> So instead of waiting for partial metadata loading I propose to have an 
> >> addition to the REST spec now to answer the question I raised above:
> >>
> >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> >> b) String metadataLocation(TableIdentifier ident);
> >>
> >> Any of the above 2 approaches could help engines to decide if they have to 
> >> invalidate/reload particular table metadata in the cache. I personally 
> >> would go for option a) but would be open to hear other opinions.
> >>
> >> I'd like to know if the community could support me extending the REST spec 
> >> with any of the 2 options.
> >>
> >> Regards,
> >> Gabor
> >>
> >>
![](https://innowireless.dooray.com/mail-receipts?img=7547413759332b66-3113a64e77e5c3fb-36ab7c8bd44b0e02-36ab7e7273f7887e.gif)

Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Reply via email to