Re: [PROPOSAL] Partially Loading Metadata - LoadTable V2

Haizhou Zhao Thu, 10 Oct 2024 17:50:39 -0700

Thanks Eduard and Dan,

At this stage, my main goal is to find out whether the community thinks this
problem is worth solving. Sufficient feedback, or better yet consensus, from
the community would lay a good foundation for progressing this thread.
Implementation details are important but, at this stage, less important than
knowing this is the right direction. I look forward to hashing out
implementation details with folks from the community should there be enough
support for solving this problem.

That being said, if I have to throw in my two cents on the implementation
details now, here they are:

@Eduard, I think the fundamental difference between us is that yours is
"LazilyLoading", while mine is "PartiallyLoading". Having reasoned about it,
I think "LazilyLoading" introduces less intrusive changes to fundamental
contracts like "TableOperations" and "Table", yet still achieves what
"PartiallyLoading" aims to do. On the surface the full metadata is still
there, but any field on the metadata could be lazily loaded, meaning it is
not physically present until it is needed, which in effect is partially
loading the metadata. So yes, that makes a lot of sense. If I have
misunderstood your implementation, let me know. I have just downloaded your
code patch and started to play around with it.
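
To make the lazily loaded idea above concrete, here is a minimal sketch in
Python. It is illustrative only: it is not Eduard's patch and not Iceberg's
API. The class name LazyTableMetadata, the loader callback, and the sample
field values are all assumptions made up for this example.

```python
from typing import Callable, Dict, List, Optional


class LazyTableMetadata:
    """Hypothetical stand-in for table metadata; not Iceberg's TableMetadata."""

    def __init__(self, slim: Dict[str, object],
                 load_snapshots: Callable[[], List[dict]]):
        self._slim = slim                      # small fields, loaded eagerly
        self._load_snapshots = load_snapshots  # assumed callback to the catalog
        self._snapshots: Optional[List[dict]] = None

    @property
    def current_snapshot_id(self):
        # Served from the slim metadata without any extra fetch.
        return self._slim["current-snapshot-id"]

    @property
    def snapshots(self) -> List[dict]:
        # Fetched on first access only, then cached for later reads.
        if self._snapshots is None:
            self._snapshots = self._load_snapshots()
        return self._snapshots


fetch_calls: List[str] = []


def fetch_snapshots_from_catalog() -> List[dict]:
    # Placeholder for a remote call; records that it was invoked.
    fetch_calls.append("snapshots")
    return [{"snapshot-id": 1}, {"snapshot-id": 2}]


metadata = LazyTableMetadata({"current-snapshot-id": 2},
                             fetch_snapshots_from_catalog)
```

Callers see what looks like full metadata, but the snapshot list is only
fetched if something actually reads metadata.snapshots.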

@Dan, expanding the "refs" concept to more fields sounds great, but I worry
that eventually we will need to change the current "refs" implementation at
some level. Ideally, we would aim for a reusable, generic framework for all
these kinds of list/map fields on metadata. We know that "snapshots",
"metadata-log", "snapshot-log" and "schemas" are growth factors in the
current version of the spec, but we might have more such fields in future
versions of the spec (hard to predict, but perhaps "lineage"?). I also think
there will be changes when we convert a solution that is only applicable to
the "snapshots" field into a generic one. That could be a spec change or a
client change (though it might take time to finalize the details there).
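
As a rough illustration of what a reusable, generic mechanism for such
fields might look like, here is a small Python sketch. It is only an
assumption about one possible shape: LazyField, make_lazy_fields and
fetch_field are hypothetical names, and nothing here reflects the REST spec
or any existing client.

```python
from typing import Callable, Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyField(Generic[T]):
    """Hypothetical holder that loads one metadata field on first use."""

    def __init__(self, loader: Callable[[], T]):
        self._loader = loader
        self._loaded = False
        self._value: Optional[T] = None

    @property
    def is_loaded(self) -> bool:
        return self._loaded

    def get(self) -> T:
        # Load once, then serve the cached value.
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value


# Fields named in this thread as growing over time.
GROWING_FIELDS = ("snapshots", "metadata-log", "snapshot-log", "schemas")


def make_lazy_fields(fetch_field: Callable[[str], object]) -> Dict[str, LazyField]:
    # Bind each name at definition time so every loader fetches its own field.
    return {name: LazyField(lambda name=name: fetch_field(name))
            for name in GROWING_FIELDS}


requested: List[str] = []


def fake_fetch(name: str) -> list:
    # Placeholder for a per-field request to the catalog.
    requested.append(name)
    return []


fields = make_lazy_fields(fake_fetch)
fields["schemas"].get()  # only "schemas" is fetched
```

Adding a future field would then mean adding one name to the tuple rather
than writing field-specific loading code.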

@Eduard, @Dan feel free to comment. Thoughts from the rest of the community
are welcome as well.

-Haizhou

On Thu, Oct 10, 2024 at 9:23 AM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Haizhou,
>
> I think you've done a great job of capturing some of the metadata size
> related issues in the doc, but I would echo Eduard's comments that we
> should explore using the existing refs only loading first.  This may
> require adding similar functionality for schemas/logs if we think that is a
> major issue (we have run into cases where that is an issue, but there's
> also maintenance work going on to help address some of these issues).
>
> The current refs only approach does fall back to a full metadata load when
> committing, but that was largely due to the complexity of changing the
> TableMetadata implementation, not necessarily a limitation of the REST spec.
>
> Definitely something we should be exploring, but we might already have
> some approaches that we can build upon.
>
> -Dan
>
> On Thu, Oct 10, 2024 at 6:37 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Hey Haizhou,
>>
>> thanks for working on that proposal. I think my main concern with the
>> current proposal is that it adds quite a lot of complexity at a bunch of
>> places, since you'd need to partially update *TableMetadata*.
>> Additionally, it requires a new endpoint.
>>
>> An alternative to that would be to do something similar to what we
>> already have in *TableMetadata*, where we lazily load *snapshots* when
>> needed. We could expand that approach to lazily load the full
>> *TableMetadata* from the server when necessary and always only show a
>> slim version of *TableMetadata*. I did such a POC a while ago, which can
>> be seen in
>> https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
>> That POC would need to be expanded so that it doesn't only do this for
>> snapshots, but also for other fields.
>> I believe the main fields that can get quite large over time are *snapshots
>> / metadata-log / snapshot-log / schemas*.
>>
>> Might be worth checking how much we could gain by using a lazy table
>> metadata supplier in this scenario, as that would reduce the required
>> complexity.
>>
>> Thanks,
>> Eduard
>>
>>
>>
>> On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
>> wrote:
>>
>>> Hello Dev List,
>>>
>>>
>>> I want to bring this proposal to discussion:
>>>
>>>
>>>
>>> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>>>
>>>
>>>
>>> It proposes a new LoadTable API (branded LoadTableV2 at the moment) in
>>> the REST spec that allows partially loading table metadata. The
>>> motivation is to stabilize and optimize Spark write workloads, especially
>>> on Iceberg tables with big metadata (e.g. due to a huge list of snapshots
>>> or metadata log entries, a complicated schema, etc.). We want to leverage
>>> this proposal to reduce the operational and monetary cost of Iceberg and
>>> REST catalog usage, and achieve higher commit frequencies (DDL & DML
>>> included) on top of Iceberg tables through the REST catalog.
>>>
>>>
>>>
>>> Looking forward to hearing feedback and discussions.
>>>
>>>
>>> Thank you,
>>>
>>> Haizhou
>>>
>>
