Hi Dan,

The goal is to expose stats & metrics from the metadata (relaying
partition stats, etc.), and to give a REST Catalog implementation the
option to extend these with additional metrics/stats.
The purpose of the REST Catalog interface is to expose them to the
query planner, so that it can use these stats/metrics to do better
planning.

So, it's a raw idea for now. I would love to brainstorm with the community.

Thanks !
Regards
JB

On Tue, Jan 21, 2025 at 6:42 PM Daniel Weeks <dwe...@apache.org> wrote:
>
> Hey JB,
>
> I'm not sure I fully understand what the proposal is, but I also realise it's 
> probably not completely fleshed out yet.
>
> When you say "manage metadata", the first concern that I have is whether you 
> mean to just query/get the info or to also modify it.  Table metadata is 
> immutable and requires a commit to change, so I would assume you largely are 
> interested in just getting access to the data.  Currently, snapshot summaries 
> are included with table load and I'm not clear on how we would expose 
> parquet/file stats since file level stats could be huge and largely depend on 
> the filters/projections to prune.  I think partition stats is probably 
> something to consider, but I'm not sure how much faster that would be and the 
> size of partitions could really complicate the protocol.
>
> I think server-side pre/plan apis would be able to address a lot of these 
> types of situations, but I'm just concerned that we would end up rebuilding 
> that same functionality to address all of the issues with exposing this 
> information more directly.
>
> I'm interested if there are more concrete proposals, but I'm a little 
> hesitant because of these challenges.
>
> -Dan
>
> On Tue, Jan 21, 2025 at 6:40 AM Jean-Baptiste Onofré <j...@nanthrax.net> 
> wrote:
>>
>> Hi folks,
>>
>> I know we don't want to "expose" the whole metadata tables in the REST
>> api, but I would like to discuss adding metadata stats and metrics
>> management.
>> We are discussing this as part of the Apache Polaris TMS proposal.
>>
>> The purpose is:
>> 1. To add interfaces to manage metadata stats and metrics (partition
>> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> 2. The catalog implementation can deal with table properties, but can
>> also extend to "extra" stats and metrics if needed
>> 3. Query planners can use these metadata stats and metrics to produce
>> better query plans. They could also be used by server-side planning
>> to provide a "pre-plan check"
>>
>> Before going to a proposal document, I would like to get first
>> feedback from the community (if it makes sense or not).
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB
