Hey Haizhou,

I think you've done a great job of capturing some of the
metadata-size-related issues in the doc, but I would echo Eduard's comments
that we should first explore using the existing refs-only loading.  This may
require adding similar functionality for schemas/logs if we think that is a
major issue (we have run into cases where it is, but there's also
maintenance work going on to help address some of these issues).

The current refs-only approach does fall back to a full metadata load when
committing, but that was largely due to the complexity of changing the
TableMetadata implementation, not necessarily a limitation of the REST spec.

Definitely something we should be exploring, but we might already have some
approaches that we can build upon.
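
To make the lazy-loading idea from Eduard's note (quoted below) a bit more
concrete, here is a rough sketch of the pattern. It is not Iceberg code: the
class LazyTableMetadata, the load_snapshots supplier, and fake_full_load are
all hypothetical names, and the "server" is just a local function. The only
point is that refs are available up front while the large snapshot list is
fetched at most once, on first access.

```python
from typing import Callable, Dict, List, Optional


class LazyTableMetadata:
    """Hypothetical slim metadata view: refs are eager, snapshots are lazy."""

    def __init__(self, refs: Dict[str, int],
                 load_snapshots: Callable[[], List[dict]]):
        # Eagerly loaded: branch/tag name -> snapshot id.
        self.refs = refs
        # Assumed supplier that performs the expensive full load.
        self._load_snapshots = load_snapshots
        self._snapshots: Optional[List[dict]] = None

    @property
    def snapshots(self) -> List[dict]:
        # Fetch the full list on first access only, then reuse the result.
        if self._snapshots is None:
            self._snapshots = self._load_snapshots()
        return self._snapshots


calls: List[str] = []


def fake_full_load() -> List[dict]:
    # Stand-in for a full metadata load from the catalog.
    calls.append("full")
    return [{"snapshot-id": 1}, {"snapshot-id": 2}]


meta = LazyTableMetadata(refs={"main": 2}, load_snapshots=fake_full_load)
main_id = meta.refs["main"]       # answered from the slim view, no full load
loads_before = len(calls)
history = meta.snapshots          # first access triggers the full load
history_again = meta.snapshots    # served from the cached result
```

The same shape could in principle be extended to the other large fields
(metadata-log, snapshot-log, schemas), each behind its own supplier.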

-Dan

On Thu, Oct 10, 2024 at 6:37 AM Eduard Tudenhöfner <etudenhoef...@apache.org>
wrote:

> Hey Haizhou,
>
> thanks for working on that proposal. I think my main concern with the
> current proposal is that it adds quite a lot of complexity at a bunch of
> places, since you'd need to partially update *TableMetadata*.
> Additionally, it requires a new endpoint.
>
> An alternative to that would be to do something similar to what we already
> have in *TableMetadata*, where we lazily load *snapshots* when needed. We
> could expand that approach to lazily load the full *TableMetadata* from
> the server when necessary and always only show a slim version of
> *TableMetadata*. I did such a POC a while ago, which can be seen in
> https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
> That POC would need to be expanded so that it doesn't only do this for
> snapshots, but also for other fields.
> I believe the main fields that can get quite large over time are
> *snapshots* / *metadata-log* / *snapshot-log* / *schemas*.
>
> Might be worth checking how much we could gain by using a lazy table
> metadata supplier in this scenario, as that would reduce the required
> complexity.
>
> Thanks,
> Eduard
>
>
>
> On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
> wrote:
>
>> Hello Dev List,
>>
>>
>> I want to bring this proposal to discussion:
>>
>>
>>
>> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>>
>>
>>
>> It proposes a new LoadTable API (branded LoadTableV2 at the moment) in
>> the REST spec that allows partially loading table metadata. The motivation
>> is to stabilize and optimize Spark write workloads, especially on Iceberg
>> tables with large metadata (e.g. due to a huge list of snapshot/metadata
>> log entries, a complicated schema, etc.). We want to leverage this proposal
>> to reduce the operational and monetary cost of Iceberg & REST catalog
>> usage, and to achieve higher commit frequencies (DDL & DML included) on
>> Iceberg tables through the REST catalog.
>>
>>
>>
>> Looking forward to hearing feedback and discussions.
>>
>>
>> Thank you,
>>
>> Haizhou
>>
>
