[DISCUSS] Storing Table Metadata in the Metastore

Eric Maynard Fri, 23 May 2025 07:57:56 -0700

Hi all,

Some time ago I opened this PR <https://github.com/apache/polaris/pull/433>
which proposes to store/cache TableMetadata in the Polaris metastore,
avoiding a trip to object storage in many cases. Based on this recent
comment <https://github.com/apache/polaris/pull/433#issuecomment-2904298967> I
wanted to start up a mailing list thread for discussion about this feature
as it might be a little hard to follow comment threads on what is now a
very old PR.


The proposal is, in a nutshell, to add a new property,
metadata-cache-content, to IcebergTableLikeEntity's internal properties and
to use it to store the exact contents of a table's metadata.json. The
cached content can be refreshed whenever the metadata.json is read, and
caching can be limited to metadata.json files below some approximate size.
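
As a rough illustration of the flow described above (this is not the code
in the PR), here is a minimal Python sketch. Only the property name
metadata-cache-content comes from the proposal; TableEntity,
load_metadata_json, update_metadata_location, read_from_storage and
max_cache_bytes are hypothetical names made up for the example, and
dropping the cached copy when the metadata location changes is an
assumption about how staleness could be avoided.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Property name taken from the proposal; everything else is hypothetical.
METADATA_CACHE_KEY = "metadata-cache-content"


@dataclass
class TableEntity:
    """Hypothetical stand-in for IcebergTableLikeEntity."""
    metadata_location: str
    internal_properties: Dict[str, str] = field(default_factory=dict)


def load_metadata_json(
    entity: TableEntity,
    read_from_storage: Callable[[str], str],
    max_cache_bytes: int,
) -> str:
    """Return the metadata.json content, preferring the copy on the entity."""
    cached = entity.internal_properties.get(METADATA_CACHE_KEY)
    if cached is not None:
        # Cache hit: no trip to object storage.
        return cached

    # Cache miss: read the file, then keep a copy only if it is small enough.
    content = read_from_storage(entity.metadata_location)
    if len(content.encode("utf-8")) <= max_cache_bytes:
        entity.internal_properties[METADATA_CACHE_KEY] = content
    return content


def update_metadata_location(entity: TableEntity, new_location: str) -> None:
    """Assumed commit path: point at the new file and drop the stale copy."""
    entity.metadata_location = new_location
    entity.internal_properties.pop(METADATA_CACHE_KEY, None)
```

In this sketch the second and later calls to load_metadata_json for the
same entity never invoke read_from_storage, and a file larger than
max_cache_bytes is read from storage on every call.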

I recently used the benchmark suite proposed in this PR
<https://github.com/apache/polaris-tools/pull/21> to measure the impact of
the change and found it to dramatically improve loadTable performance.

Some things that have been brought up which are *not* in scope for this PR:
1. Directly loading the metadata.json content into a LoadTableResponse
without building an in-memory TableMetadata object was previously in the PR
but removed after this comment
<https://github.com/apache/polaris/pull/433#issuecomment-2885074219> from
Russell; it's planned as a followup.
2. Storing individual parts of table metadata.json in persistence, e.g.
just the schema. We can do this if a use case arises, but being able to
store whole table metadata is beneficial immediately.
3. A separate entity for table metadata. Because we add the table metadata
to IcebergTableLikeEntity we immediately benefit from the entity cache and
don't have to worry too much about consistency.
4. A separate cache for table metadata. Similar to the above, this would
make handling consistency more complicated. Having a separate cache, maybe
with its own size or TTL configurations, just for table metadata could be a
good followup but it's not necessary to make things work.

This is a feature that has the potential to deliver tremendous latency
benefits and one that opens up several interesting possibilities for
followup improvements.

If you're interested in the feature, please check out the PR or join the
discussion here. Thanks!

--EM
