Hi Dan,

The goal is to expose stats and metrics from the metadata (relaying partition stats, etc.), and to give a REST Catalog implementation the option to extend them with additional metrics/stats. The purpose of the REST Catalog interface here is to expose that to the query planner, so it can use these stats/metrics to do better planning.
So, it's a raw idea for now. I would love to brainstorm with the community.

Thanks !
Regards
JB

On Tue, Jan 21, 2025 at 6:42 PM Daniel Weeks <dwe...@apache.org> wrote:
>
> Hey JB,
>
> I'm not sure I fully understand what the proposal is, but I also realise it's
> probably not completely fleshed out yet.
>
> When you say "manage metadata", the first concern that I have is whether you
> mean to just query/get the info or to also modify it. Table metadata is
> immutable and requires a commit to change, so I would assume you largely are
> interested in just getting access to the data. Currently, snapshot summaries
> are included with table load and I'm not clear on how we would expose
> parquet/file stats since file level stats could be huge and largely depend on
> the filters/projections to prune. I think partition stats is probably
> something to consider, but I'm not sure how much faster that would be and the
> size of partitions could really complicate the protocol.
>
> I think server-side pre/plan apis would be able to address a lot of these
> types of situations, but I'm just concerned that we would end up rebuilding
> that same functionality to address all of the issues with exposing this
> information more directly.
>
> I'm interested if there are more concrete proposals, but I'm a little
> hesitant because of these challenges.
>
> -Dan
>
> On Tue, Jan 21, 2025 at 6:40 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>>
>> Hi folks,
>>
>> I know we don't want to "expose" the whole metadata tables in the REST
>> api, but I would like to discuss adding metadata stats and metrics
>> management.
>> We are discussing this as part of the Apache Polaris TMS proposal.
>>
>> The purpose is:
>> 1. To add interfaces to manage metadata stats and metrics (partition
>> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> 2. The catalog implementation can deal with table properties, but can
>> also extend to "extra" stats and metrics if needed
>> 3. Query planners can use these metadata stats and metrics to perform
>> better query plans. It could also be used by the server side planning
>> to provide "pre-plan check"
>>
>> Before going to a proposal document, I would like to get first
>> feedback from the community (if it makes sense or not).
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB