An older version of the implementation took this approach, but some months
ago Dennis left comments suggesting that we use an internal property and
the PR was updated. Replies to your specific concerns inline:

> Regarding the specific proposal of using a internal property of entities
for holding the whole metadata - I'm not sure it's the best approach.

Why do we need the best approach to the first implementation? If version 1
of this change goes in and improves loadtable perf by 40% and then you want
to do a followup that improves it another 5%, isn't that okay?

I don't believe we're designing ourselves into the corner by giving admins
the option to put metadata content in the properties, even if we later
switch to a different implementation.

> Not all access to tables requires metadata

Much of it does, maybe all of it? Is there a particular method you're
concerned about which we can try running through the benchmark?

> Metadata can be huge

Yes, that is why the size is limited.

> Only sub-sections of the metadata may be required in some cases
  (e.g. when resolving REPLACE / APPEND conflicts [1])

As noted in my email this is not in scope. The idea here is that we
can create some new TableMetadata entity and give it top-level columns for
the fields we care about. Again, I'd like to do this eventually and this
was done in earlier versions of the PR. However I would point out that we
don't really give other entities such top level fields right now, even
frequently used ones like a table's location, and always rely on
deserialization of potentially heavy properties.

But the proof will be in the pudding. With the changes, the APIs I tested
were much faster. If there's some API that we believe is frequently called,
which is not seeing benefits, and which we believe must see benefits for
there to be any value in storing the metadata in the metastore, let's test
it and then talk about things in concrete terms.

--EM

On Fri, May 23, 2025 at 4:52 PM Dmitri Bourlatchkov <di...@apache.org>
wrote:

> Thanks for starting this discussion, Eric!
>
> Overall, I think the idea of storing (some) table metadata in the Polaris
> database is very relevant and a sound approach to improving query
> performance on the engine side.
>
> Regarding the specific proposal of using a internal property of entities
> for holding the whole metadata - I'm not sure it's the best approach.
>
> I believe it might be preferable to store metadata as separate entries
> in Polaris Persistence and only reference them from table-like entities.
>
> Some points for consideration:
>
> * Not all access to tables requires metadata
> * Metadata can be huge
> * Only sub-sections of the metadata may be required in some cases
>   (e.g. when resolving REPLACE / APPEND conflicts [1])
>
> The striping of large metadata into smaller database entries may be
> best dealt with at the Persistence layer. The proposed NoSQL impl.
> can do that efficiently. As for the current JDBC impl. it might be
> reasonable to store a BLOB (at least initially).
>
> WDYT?
>
> Thanks,
> Dmitri.
>
> [1] https://github.com/apache/polaris/pull/1285
>
> On 2025/05/23 14:57:34 Eric Maynard wrote:
> > Hi all,
> >
> > Some time ago I opened this PR <
> https://github.com/apache/polaris/pull/433>
> > which proposes to store/cache TableMetadata in the Polaris metastore,
> > avoiding a trip to object storage in many cases. Based on this recent
> > comment <
> https://github.com/apache/polaris/pull/433#issuecomment-2904298967> I
> > wanted to start up a mailing list thread for discussion about this
> feature
> > as it might be a little hard to follow comment threads on what is now a
> > very old PR.
> >
> > The proposal is, in a nutshell, to add a new internal property
> > metadata-cache-content to IcebergTableLikeEntity's internal properties
> and
> > to use that to store the exact contents of a table's metadata.json. The
> > content can be updated whenever the metadata.json is read and can be
> > configured to happen only for metadata.json files below some approximate
> > size.
> >
> > I recently used the benchmark suite proposed in this PR
> > <https://github.com/apache/polaris-tools/pull/21> to measure the impact
> of
> > the change and found it to dramatically improve loadTable performance.
> >
> > Some things that have been brought up which are *not* in scope for this
> PR:
> > 1. Directly loading the metadata.json content into a LoadTableResponse
> > without building an in-memory TableMetadata object was previously in the
> PR
> > but removed after this comment
> > <https://github.com/apache/polaris/pull/433#issuecomment-2885074219>
> from
> > Russell; it's planned as a followup.
> > 2. Storing individual parts of table metadata.json in persistence, i.e.
> > just the schema. We can do this if a use case arises, but being able to
> > store whole table metadata is beneficial immediately.
> > 3. A separate entity for table metadata. Because we add the table
> metadata
> > to IcebergTableLikeEntity we immediately benefit from the entity cache
> and
> > don't have to worry too much about consistency.
> > 4. A separate cache for table metadata. Similar to the above, this would
> > make handling consistency more complicated. Having a separate cache,
> maybe
> > with its own size or TTL configurations, just for table metadata could
> be a
> > good followup but it's not necessary to make things work.
> >
> > This is a feature that has the potential to deliver tremendous latency
> > benefits and one that opens up several interesting possibilities for
> > followup improvements.
> >
> > If you're interested in the feature, please check out the PR or join the
> > discussion here. Thanks!
> >
> > --EM
> >
>

Reply via email to