[DISCUSS] Partial Metadata Loading

Haizhou Zhao Mon, 28 Oct 2024 18:16:03 -0700

Hello Dev list,

I want to update the community on the current thread for the proposal
"Partially Loading Metadata - LoadTable V2" after hearing more perspectives
from the community. In general, there is still some distance to go before
we reach a general consensus, so I hope to foster more conversation and
hear new input.

*Previous Discussions* (
https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0
*)*

*10/28/2024, quick Google Meet discussion*

Thanks, Christian, Dmitri, Eric, JB, Szehon, and Yufei, for your time and
for voicing your opinions this morning. Here is a quick summary of what we
discussed (detailed meeting notes are also included in the link above):

Folks agreed that a REST endpoint allowing clients to filter for what they
need from LoadTableResult is a useful feature. The preliminary use cases
brought up were:
1. Load only current snapshot and current schema
2. Load only metadata file location
3. Load only credentials to access table
4. Query historical status of the table when time traveling
Meanwhile, it is also important for this endpoint to be extensible enough
to handle similar future use cases that require only a portion of
LoadTableResult (metadata included).
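
To make the filtering idea above concrete, here is a minimal Python sketch
of projecting a LoadTableResult down to the requested fields. The
select_fields helper and its dotted-path syntax are hypothetical and not
part of the REST spec, and the sample values are made up; only the general
shape (metadata-location, metadata, config) follows LoadTableResult.

```python
import copy
from typing import Any, Dict, Iterable


def select_fields(result: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict holding only the requested dotted paths.

    Hypothetical helper: unknown paths are skipped rather than rejected.
    """
    out: Dict[str, Any] = {}
    for path in paths:
        keys = path.split(".")
        value: Any = result
        found = True
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                found = False
                break
            value = value[key]
        if not found:
            continue
        # Rebuild the nesting in the output, then copy the selected value.
        dst = out
        for key in keys[:-1]:
            dst = dst.setdefault(key, {})
        dst[keys[-1]] = copy.deepcopy(value)
    return out


# Made-up sample data, for illustration only.
sample = {
    "metadata-location": "s3://example-bucket/db/tbl/metadata/v3.metadata.json",
    "metadata": {
        "current-schema-id": 1,
        "current-snapshot-id": 42,
        "schemas": [{"schema-id": 0}, {"schema-id": 1}],
        "snapshot-log": [{"snapshot-id": 41}, {"snapshot-id": 42}],
    },
    "config": {"token": "example"},
}

# Use case 2 from the list above: load only the metadata file location.
location_only = select_fields(sample, ["metadata-location"])
```

Whether such a filter would live in a query parameter, a request body, or
a new endpoint is exactly the open question in this thread; the sketch
only shows the projection itself.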

Where the group has no strong preference, or needs further input:
1. Whether to modify the existing loadTable endpoint for partial loading or
to create a new endpoint. The possible concern here is backward
compatibility.
2. Whether to add bulk support for cases like loading the current
schema of all tables belonging to the same namespace.

*10/23/2024, Iceberg community sync*

Thanks, Ryan, Dan, Yufei, JB, Russel, and Szehon, for your input here.

Folks are divided on two questions:
1. Can we use table maintenance work to keep metadata size in check, thus
avoiding the need to slice metadata at all?
2. Is bulk loading part of the information for many tables the same use
case as loading part of the information for one table?

*10/09/2024, Dev list*

Thanks, Dan and Eduard, for your input here.

Folks are aligned here on extending the existing "refs" mode to other
fields (i.e. metadata-log, snapshot-log, schemas), so that those fields
are loaded lazily, only when they are needed.
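
For readers less familiar with the "refs" mode: as I understand it,
loadTable accepts a snapshots=refs query parameter that returns only the
snapshots referenced by a branch or tag. Below is a minimal Python sketch
of that trimming, plus a hypothetical lazy_fields argument showing how the
same idea might extend to other fields. The refs_only function and
lazy_fields are illustrative names, not anything defined by the spec.

```python
from typing import Any, Dict, Iterable


def refs_only(metadata: Dict[str, Any], lazy_fields: Iterable[str] = ()) -> Dict[str, Any]:
    """Keep only snapshots referenced by a branch or tag.

    `lazy_fields` is a hypothetical extension: top-level metadata fields
    listed there are omitted so a client could fetch them later on demand.
    """
    referenced = {ref["snapshot-id"] for ref in metadata.get("refs", {}).values()}
    trimmed = dict(metadata)  # shallow copy; the input is left unmodified
    trimmed["snapshots"] = [
        snap for snap in metadata.get("snapshots", [])
        if snap["snapshot-id"] in referenced
    ]
    for field in lazy_fields:
        trimmed.pop(field, None)
    return trimmed
```

For example, refs_only(metadata, lazy_fields=("metadata-log",
"snapshot-log")) would drop both logs in addition to unreferenced
snapshots. The concern raised in the later discussions (merging such a
partial result back with a later request) is not addressed by this sketch.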

I have also discussed this topic with other members of the community. I
appreciate your input, but I have not mentioned those discussions here
because I did not keep a written record of their context. If you fall into
this category, I do apologize.

*Summary of perspectives*

The original proposal aimed to tackle the growing metadata problem and
proposed a loadTable V2 endpoint. As the last thread mentioned, the
conclusion at the time was that *extending the existing "refs" loading mode
to more fields is preferable as it introduces less complexity and is more
feasible to implement*.

The later discussions are where the community divided. On one side, *there
is general scepticism about the concept of partial metadata* (i.e. unioning
results from different requests has been a problem, even for "refs" lazy
loading in the past); on the other side, *there is a push to generalize the
partial metadata concept to "LoadTableResult" as a whole* (e.g. to return
only the metadata file location, or only table access creds, based on a
client filter).

Related is the concept of a bulk API. The community has raised this use
case more than once, typically in relation to data warehouse management
features, such as: 1) querying current schemas of all the tables
belonging to a namespace; 2) querying certain table properties of many
tables to see if any maintenance (downstream) jobs should be triggered; 3)
querying ownership information of all tables to check security compliance
across the data warehouse, etc.
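
As a rough picture of the first bulk use case, here is a minimal Python
sketch that picks out the current schema for every table in a namespace.
It assumes the caller already holds each table's metadata as a dict keyed
by table name; the current_schemas helper is hypothetical and says nothing
about how a bulk endpoint would actually be shaped.

```python
from typing import Any, Dict, Optional


def current_schemas(tables: Dict[str, Dict[str, Any]]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each table name to its current schema, or None if it is absent."""
    out: Dict[str, Optional[Dict[str, Any]]] = {}
    for name, metadata in tables.items():
        current_id = metadata.get("current-schema-id")
        out[name] = next(
            (
                schema for schema in metadata.get("schemas", [])
                if schema.get("schema-id") == current_id
            ),
            None,
        )
    return out
```

Without a bulk or partial-loading API, a client has to issue one full
loadTable call per table to build the input to this function, which is the
cost the bulk proposals aim to avoid.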

I want to lay everything out and foster more discussion toward a good
direction:
1. extend the current "refs" lazy loading mechanism into a more generic
solution
2. avoid partial metadata at all costs, and try to contain metadata size
so that it can always (or most of the time) be loaded in full
3. generalize the partial loading concept to the entire "LoadTableResult"
(e.g. a generic loadTable V2 endpoint), so that users can use the same
endpoint whether they want part of the metadata or another part of the
"LoadTableResult" (e.g. metadata file location; table creds)
4. repurpose the last direction to make a bulk API for the REST spec,
where loading pieces of information from many tables is permitted
Please also raise any other directions I have failed to account for here.

Looking forward to feedback/discussion from the community, thanks!
Haizhou
