Re: [DISCUSS] Partial Metadata Loading

Gabor Kaszab Tue, 29 Oct 2024 01:31:46 -0700

Hi Iceberg Community,

I just wanted to mention that I was also going to start a discussion about
getting partial information from LoadTableResponse through the REST API.
My motivation is a bit different here, though:
Impala currently has strong integration with HMS and in turn with the
HiveCatalog. There is ongoing work in the project to make it work with the
REST catalog for Iceberg tables, and there is one piece we are currently
missing in the REST API. Impala caches table metadata and we need a way
to decide whether we have to reload the metadata for a particular table or
not. Currently, with HMS we have a push-based solution where every change
of the table is pushed to Impala from HMS as notifications/events, and with
REST catalog we were thinking of a pull-based approach where Impala
occasionally asks the REST catalog whether a particular table is up-to-date
or not.


*Use-case*: So in Impala's case what would be important is to have a REST
Catalog API to answer a question like:
"I cached this version of this particular table, is it up-to-date or do I
have to reload it?"

*Possible solutions*:
1) This could either be achieved by an API like this:
    boolean isLatest(TableIdentifier ident, String metadataLocation);
2) Another approach could be to get the latest metadata location and let
the engine compare it to the one it holds:
    String metadataLocation(TableIdentifier ident);
3) Similarly to 2) querying metadata location could also be achieved by the
current proposal of partial metadata like: (I just made up some types here)
    Table loadTable(TableIdentifier ident, SomeFilterClass.MetadataLocation);
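
To make options 1) and 2) concrete, here is a minimal sketch of how an engine-side cache could use them for a pull-based staleness check. Everything in it is an assumption for illustration: `MetadataLocationSource` and `TableMetadataCache` are hypothetical names, table identifiers are plain strings rather than `TableIdentifier`, and neither method exists in the REST catalog API today.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical catalog-side view; not part of the Iceberg REST spec.
interface MetadataLocationSource {
    // Option 2: return the latest metadata location for a table.
    String metadataLocation(String tableIdent);

    // Option 1, expressed in terms of option 2.
    default boolean isLatest(String tableIdent, String cachedLocation) {
        return Objects.equals(metadataLocation(tableIdent), cachedLocation);
    }
}

// Hypothetical engine-side cache that remembers which metadata location
// each cached table was loaded from.
final class TableMetadataCache {
    private final MetadataLocationSource catalog;
    private final Map<String, String> cachedLocations = new HashMap<>();

    TableMetadataCache(MetadataLocationSource catalog) {
        this.catalog = catalog;
    }

    // Record the metadata location that the cached table state corresponds to.
    void remember(String tableIdent, String metadataLocation) {
        cachedLocations.put(tableIdent, metadataLocation);
    }

    // True when the table was never cached, or when the catalog reports a
    // metadata location different from the cached one.
    boolean needsReload(String tableIdent) {
        String cached = cachedLocations.get(tableIdent);
        return cached == null || !catalog.isLatest(tableIdent, cached);
    }
}
```

With this shape, the engine would call `needsReload` on whatever schedule it chooses and reload only when it returns true. Option 3) would supply the same metadata location through a filtered `loadTable` call instead of a dedicated method.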

Either way is fine for Impala I think, I just wanted to share our use-case
that could also leverage getting partial metadata.
Now that I have written this mail it seems to hijack the original
conversation a bit. Let me know if I should raise this in a separate
[discuss] thread.

Regards,
Gabor

On Tue, Oct 29, 2024 at 2:16 AM Haizhou Zhao <[email protected]>
wrote:

> Hello Dev list,
>
> I want to update the community on the current thread for the proposal
> "Partially Loading Metadata - LoadTable V2" after hearing more perspectives
> from the community. In general, there is still some distance to go before
> we reach a general consensus, so I hope to foster more conversation and
> hear new input.
>
> *Previous Discussions* (
> https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0
> *)*
>
>
> *10/28/2024, quick google meet discussion*
>
> Thanks, Christian, Dmitri, Eric, JB, Szehon, and Yufei, for your time and
> for voicing your opinions this morning. Here is a quick summary of what we
> discussed (detailed meeting notes are also included in the link above):
>
> Folks agreed that having a REST endpoint allowing clients to filter for
> what they need from LoadTableResult is a useful feature. The preliminary
> use cases that are brought up:
> 1. Load only current snapshot and current schema
> 2. Load only metadata file location
> 3. Load only credentials to access table
> 4. Query historical status of the table when time traveling
> Meanwhile, it is also important for this endpoint to be extensible enough
> to handle similar future use cases that only require a portion of
> LoadTableResult (metadata included).
>
> Points where the group has no strong preference or needs further input:
> 1. Whether to modify the existing loadTable endpoint for partial loading
> or create a new endpoint. The possible concern here is backward
> compatibility.
> 2. Whether to add bulk support to cover cases like loading the current
> schema of all tables belonging to the same namespace.
>
>
> *10/23/2024, Iceberg community sync*
>
> Thanks, Ryan, Dan, Yufei, JB, Russel and Szehon for your inputs here.
>
> Folks are divided on two questions:
> 1. Can we use table maintenance work to keep metadata size in check, thus
> avoiding the need to slice metadata at all?
> 2. Is it the same use case to bulk load part of the information for many
> tables and to load part of the information for one table?
>
>
> *10/09/2024, Dev list*
>
> Thanks, Dan, Eduard for your inputs here.
>
> Folks are aligned here on extending the existing "refs" mode to other
> fields (i.e. metadata-log, snapshot-log, schemas), so that we can lazily
> load those fields when they are needed.
>
>
> I have also discussed this topic with other members of the community. I
> appreciate your input; I could not summarize those discussions here
> because I did not keep a written record of them. If you fall into this
> category, I apologize.
>
>
> *Summary of perspectives*
>
> The original proposal aimed to tackle the growing metadata problem,
> and proposed a loadTable V2 endpoint. As the last thread mentioned, the
> conclusion at the time was that *extending the existing "refs" loading
> mode to more fields is preferable as it introduces less complexity and is
> more feasible to implement*.
>
> The later threads were where the community divided. On the one side, *there's
> a general scepticism about the concept of partial metadata* (i.e. unioning
> results from different requests has been a problem, even for "refs" lazy
> loading in the past); on the other side, *there's a push to generalize
> the partial metadata concept to "LoadTableResult" as a whole* (e.g. to only
> return the metadata file location, or only return table access creds based
> on a client filter).
>
> Related is the concept of a bulk API. The community has raised this use
> case more than once, typically in connection with data warehouse
> management features, such as: 1) querying current schemas of all the tables
> belonging to a namespace; 2) querying certain table properties of many
> tables to see if any maintenance (downstream) jobs should be triggered; 3)
> querying ownership information of all tables to check security compliance
> of all the tables in the data warehouse, etc.
>
> I want to lay everything out and foster more discussion toward a good
> direction:
> 1. extend the current "refs" lazy loading mechanism to be a more generic
> solution
> 2. avoid partial metadata at all costs, and try to contain metadata size
> so that it can always (or most of the time) be loaded in full
> 3. generalize the partial loading concept to the entire "LoadTableResult"
> (e.g. a generic loadTable V2 endpoint), so that users can use the same
> endpoint whether they want part of the metadata or other parts of the
> "LoadTableResult" (e.g. metadata file location; table creds)
> 4. repurpose the last direction to make a bulk API for the REST spec,
> where loading pieces of information from many tables is permitted
> Please also share any other directions I failed to account for here.
>
> Looking forward to feedback/discussion from the community, thanks!
> Haizhou
>
