Hey Haizhou,

Thanks for working on that proposal. My main concern is that it adds quite
a lot of complexity in several places, since you'd need to partially update
*TableMetadata*. It also requires a new endpoint.

An alternative would be to do something similar to what we already have in
*TableMetadata*, where we lazily load *snapshots* when needed. We could
extend that approach to lazily load the full *TableMetadata* from the server
when necessary, and otherwise only expose a slim version of *TableMetadata*.
I did a POC of this a while ago, which can be seen in
https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
That POC would need to be expanded so that it covers not only snapshots but
also other fields.
I believe the main fields that can get quite large over time are
*snapshots*, *metadata-log*, *snapshot-log*, and *schemas*.

Might be worth checking how much we could gain by using a lazy table
metadata supplier in this scenario, as that would reduce the required
complexity.
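
To make the idea a bit more concrete, here is a rough sketch in Java of what
a lazy supplier could look like. All class and method names in it
(LazyMetadataSketch, FullMetadata, LazyTableMetadata, the loader) are made up
for illustration; this is not the code from the POC and not an existing
Iceberg API. It only shows the general pattern: cheap fields are held
directly, and the large fields are fetched once on first access.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyMetadataSketch {

  // Hypothetical stand-in for the parts of table metadata that grow over time.
  static final class FullMetadata {
    final List<String> snapshots;
    final List<String> metadataLog;

    FullMetadata(List<String> snapshots, List<String> metadataLog) {
      this.snapshots = snapshots;
      this.metadataLog = metadataLog;
    }
  }

  // Slim view of the metadata: large fields are loaded on first access only.
  static final class LazyTableMetadata {
    private final String location;
    private final Supplier<FullMetadata> loader; // assumed to call the server
    private volatile FullMetadata full;

    LazyTableMetadata(String location, Supplier<FullMetadata> loader) {
      this.location = location;
      this.loader = loader;
    }

    // Cheap field, never triggers a load.
    String location() {
      return location;
    }

    List<String> snapshots() {
      return full().snapshots;
    }

    List<String> metadataLog() {
      return full().metadataLog;
    }

    boolean isLoaded() {
      return full != null;
    }

    // Double-checked locking so the loader runs at most once.
    private FullMetadata full() {
      FullMetadata result = full;
      if (result == null) {
        synchronized (this) {
          result = full;
          if (result == null) {
            result = loader.get();
            full = result;
          }
        }
      }
      return result;
    }
  }

  public static void main(String[] args) {
    AtomicInteger serverCalls = new AtomicInteger();
    LazyTableMetadata metadata = new LazyTableMetadata(
        "s3://bucket/table", // placeholder location
        () -> {
          serverCalls.incrementAndGet();
          return new FullMetadata(
              Arrays.asList("snap-1", "snap-2"), Arrays.asList("v1.json"));
        });

    System.out.println(metadata.location() + " loaded=" + metadata.isLoaded());
    System.out.println(metadata.snapshots().size() + " snapshots");
    System.out.println(metadata.metadataLog().size() + " log entries, server calls="
        + serverCalls.get());
  }
}
```

In this sketch, reading the location never touches the loader, and reading
snapshots and the metadata log afterwards triggers a single load. A write
path that only needs the slim fields would never pay for the full metadata.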

Thanks,
Eduard



On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
wrote:

> Hello Dev List,
>
>
> I want to bring this proposal to discussion:
>
>
>
> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>
>
>
> It proposes a new LoadTable API (branded LoadTableV2 at the moment) on
> REST spec that allows partially loading table metadata. The motivation is
> to stabilize and optimize Spark write workloads, especially on Iceberg
> tables with big metadata (e.g. due to huge list of snapshot/metadata log,
> complicated schema, etc.). We want to leverage this proposal to reduce
> operational and monetary cost of Iceberg & REST catalog usages, and achieve
> higher commit frequencies (DDL & DML included) on top of Iceberg tables
> through REST catalog.
>
>
>
> Looking forward to hearing feedback and discussions.
>
>
> Thank you,
>
> Haizhou
>
