Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

Jean-Baptiste Onofré Wed, 19 Feb 2025 06:22:51 -0800

Hi Fokko

That's an approach I considered, but the problem is that the TMS/query
engine goes through the REST API. So, if the metadata.json exposed by
REST doesn't contain the metrics, how can I get them?
If your proposal is to extend the metadata.json with extra metrics,
that could be an option.
My proposal is rather an extra endpoint to get metrics "unrelated" to
a table or extending the metadata.json, with a way to retrieve only
the metrics needed by the TMS.
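
To make the "retrieve only the metrics needed" part concrete, here is a
minimal Python sketch. It is only an illustration under assumptions: the
endpoint URL, the `names` query parameter, and the metric names are all
hypothetical, and nothing here is part of the Iceberg REST spec today.

```python
from urllib.parse import urlencode


def build_metrics_query(base_url, metric_names):
    """Build the request URL for a hypothetical metrics endpoint.

    Both `base_url` and the `names` query parameter are assumptions made
    for illustration; they are not defined by the Iceberg REST spec.
    """
    if not metric_names:
        # No filter given: ask for whatever the catalog decides to expose.
        return base_url
    return base_url + "?" + urlencode({"names": ",".join(metric_names)})


def select_metrics(all_metrics, metric_names):
    """Catalog-side counterpart: keep only the requested metrics.

    Names the catalog does not expose are silently skipped.
    """
    return {name: all_metrics[name] for name in metric_names if name in all_metrics}


# Hypothetical endpoint location and metric names.
url = build_metrics_query(
    "https://catalog.example.com/extra-metrics",
    ["file-size-distribution", "snapshot-count"],
)
print(url)
```

Filtering by name keeps the response limited to what the TMS asks for,
which should also help with the payload size concern.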


Regards
JB

On Wed, Feb 19, 2025 at 2:20 PM Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey JB,
>
> Thanks for the additional context. My main question is: why wouldn't the TMS
> query the metadata directly, since it should have access to the data
> (otherwise it cannot compact it)? This would be much faster and more
> efficient. I share Daniel's concern that these requests could easily run into
> the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
>
>
>
> On Wed, 19 Feb 2025 at 12:12, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi folks,
>>
>> I realized that my first email on this thread needs context to be
>> better understood :)
>>
>> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
>> Polaris can help to trigger table maintenance jobs:
>> 1. Is table maintenance enabled (in Polaris)?
>> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
>> policy, ...)
>> 3. Polaris events (e.g. tables/views/namespaces updates)
>> 4. Table metadata (via Iceberg REST)
>>     4.1. Table schema/partition spec/properties, etc
>>     4.2. Iceberg table stats and metrics. Only the stats and metrics
>> defined in the Iceberg table spec (e.g. partition stats, snapshot
>> summaries) are available at this moment.
>>
>> Specifically about 4.2, the Table Maintenance Service would need more than 
>> that.
>>
>> My proposal to add a metrics endpoint to the REST spec is meant to
>> expose extra metrics for the TMS and engines. I'm thinking of:
>> - metrics helping with compaction decisions and snapshot GC
>> - "extra" metrics which are very helpful for the TMS (e.g. file size
>> distribution without partitions)
>>
>> I would like to propose a two-step approach:
>> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines,
>> where the exposed metrics are decided by the catalog implementation
>> 2. Enforce a metrics list in the spec, with a clear schema and
>> standardized metric names.
>>
>> I will move forward with a proposal draft about that if there is no 
>> objection.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>>
>> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >
>> > Hi folks,
>> >
>> > I know we don't want to "expose" the whole metadata tables in the REST
>> > api, but I would like to discuss adding metadata stats and metrics
>> > management.
>> > We are discussing this as part of the Apache Polaris TMS proposal.
>> >
>> > The purpose is:
>> > 1. To add interfaces to manage metadata stats and metrics (partition
>> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> > 2. The catalog implementation can deal with table properties, but can
>> > also extend to "extra" stats and metrics if needed
>> > 3. Query planners can use these metadata stats and metrics to produce
>> > better query plans. They could also be used by server-side planning
>> > to provide a "pre-plan check"
>> >
>> > Before writing a proposal document, I would like to get initial
>> > feedback from the community (whether it makes sense or not).
>> >
>> > Thoughts ?
>> >
>> > Thanks !
>> > Regards
>> > JB
