Hi Gabor,

Thanks for the proposal! Impala isn’t unique in needing this—I've seen
similar requirements from other engines.

As others pointed out, using the “tableExists” endpoint seems like a
workaround, and I don't consider it a permanent way forward. We could
address this by either modifying the current load table endpoint or
introducing a new one, but ideally we should avoid adding an endpoint for
every specific need. With that in mind, partial metadata loading seems like
a strong approach here, though it will need agreement from the community.
I'd suggest the community consider these use cases seriously so that we
can find a way forward.

I’m also not too concerned about using metadata file paths to verify the
latest table version; clients can simply extract metadata filenames, which
include the UUID.
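
To make the filename idea concrete, here is a minimal sketch of how an engine might compare a cached metadata location with the latest one. It assumes metadata files are named like <version>-<UUID>.metadata.json (optionally with a codec suffix before .metadata.json); that layout, the class name MetadataLocationCheck, its methods, and the s3:// paths are all hypothetical illustrations, not part of the REST spec or the Iceberg API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for cache invalidation; not part of the Iceberg API. */
final class MetadataLocationCheck {

    // Assumed file-name layout: <version>-<UUID>[.codec].metadata.json
    private static final Pattern METADATA_FILE = Pattern.compile(
        "^(\\d+)-([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
            + "(?:\\.[A-Za-z0-9]+)?\\.metadata\\.json$");

    private MetadataLocationCheck() {}

    /** Returns the UUID embedded in the file name, or empty if the name does not match. */
    static Optional<String> metadataUuid(String metadataLocation) {
        // Keep only the part after the last '/', i.e. the file name.
        String fileName = metadataLocation.substring(metadataLocation.lastIndexOf('/') + 1);
        Matcher m = METADATA_FILE.matcher(fileName);
        return m.matches()
            ? Optional.of(m.group(2).toLowerCase(Locale.ROOT))
            : Optional.empty();
    }

    /** True when the cached table metadata should be reloaded. */
    static boolean isStale(String cachedLocation, String latestLocation) {
        Optional<String> cached = metadataUuid(cachedLocation);
        Optional<String> latest = metadataUuid(latestLocation);
        if (cached.isPresent() && latest.isPresent()) {
            return !cached.get().equals(latest.get());
        }
        // Fall back to comparing full locations if a name does not match the assumed layout.
        return !cachedLocation.equals(latestLocation);
    }

    public static void main(String[] args) {
        // Made-up example locations.
        String cached = "s3://bucket/db/tbl/metadata/"
            + "00003-0b7e4a5c-1d2f-4e3a-9b6c-7d8e9f0a1b2c.metadata.json";
        String latest = "s3://bucket/db/tbl/metadata/"
            + "00004-9a8b7c6d-5e4f-4a3b-8c2d-1e0f9a8b7c6d.metadata.json";
        System.out.println("reload needed: " + isStale(cached, latest));
    }
}
```

With something like option b), latestLocation would be whatever the catalog returns for the table, and the engine would reload only when isStale(...) is true.
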
Yufei


On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Fokko
>
> I like the idea, but I think it's more a workaround and could be
> confusing for users :)
>
> Regards
> JB
>
> On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey Gabor,
> >
> > Thanks for raising this. While reading this, my first thought is to
> > leverage the `tableExists` operation:
> >
> > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> >
> > This doesn't return anything today, but we could return a payload to the
> > latest metadata.json.
> >
> > Looking forward to what others think.
> >
> > Kind regards,
> > Fokko
> >
> >
> >
> >
> > On Tue, 12 Nov 2024 at 14:33, Shani Elharrar
> > <sh...@upsolver.com.invalid> wrote:
> >>
> >> I recommend option (b), provided there is no partial metadata loading.
> >> We implemented option (b) internally to facilitate partial metadata
> >> loading, as we have tables with hundreds of thousands of snapshots. This
> >> results in metadata that occupies approximately 500 MB in memory
> >> (excluding the JsonNodes), which is a significant load for some of our
> >> services.
> >>
> >> Shani.
> >>
> >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>
> >> Hey Iceberg Community,
> >>
> >> Background:
> >> Impala is designed in a way to cache the Iceberg table metadata
> >> (BaseTable objects in practice) for faster access. Currently, Impala is
> >> tightly coupled with HMS and in turn with the HiveCatalog, and in order
> >> to keep the cached table objects up-to-date there is a notification
> >> mechanism driven by HMS to notify Impala about any changes in the table
> >> metadata.
> >> The Impala community is actively looking for ways to decouple HMS from
> >> Impala and provide a way to use Impala without the need for HMS, and get
> >> the Iceberg table metadata from other catalog implementations, mainly
> >> focusing now on REST catalogs.
> >>
> >> Problem to solve:
> >> We identified a particular missing functionality in the current REST
> >> spec: For engines that cache table metadata currently there is no way to
> >> check if that table metadata is up-to-date or not, and whether the
> >> engine should reload the metadata for that table or not without getting
> >> a whole table object from the catalog. For this I think the REST catalog
> >> (but in fact I think this could apply to any other catalogs) should be
> >> able to answer a question like:
> >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> >>
> >> Proposal:
> >> I've been following the discussion about partial metadata loading that
> >> could be also used to answer the above question, but I have the
> >> impression now that the conversation stopped making any progress.
> >> So instead of waiting for partial metadata loading I propose to have an
> >> addition to the REST spec now to answer the question I raised above:
> >>
> >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> >> b) String metadataLocation(TableIdentifier ident);
> >>
> >> Any of the above 2 approaches could help engines to decide if they have
> >> to invalidate/reload particular table metadata in the cache. I
> >> personally would go for option a) but would be open to hear other
> >> opinions.
> >>
> >> I'd like to know if the community could support me extending the REST
> >> spec with any of the 2 options.
> >>
> >> Regards,
> >> Gabor
> >>
> >>
>
