Hi Fokko,

That's an approach I considered, but the problem is that the TMS/query engine goes through the REST API. So, if the metadata.json exposed by REST doesn't contain the metrics, how can I get them?

If your proposal is to extend the metadata.json with extra metrics, that could be an option. My proposal is rather to have an extra endpoint that returns metrics either "unrelated" to a table or extending what is in the metadata.json, along with a way to retrieve only the metrics needed by the TMS.
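To make "retrieve only the metrics needed" a bit more concrete, here is a rough sketch of what the filtering could look like. Everything in it is an assumption for illustration: the `metrics` query parameter, the metric names and the flat name-to-value payload are hypothetical and not part of the Iceberg REST spec.

```python
from urllib.parse import urlencode

# Hypothetical metric names a TMS might ask for (illustration only).
TMS_METRICS = {"file-size-distribution", "snapshot-count"}


def build_metrics_query(wanted):
    """Build the query string for a hypothetical `metrics` filter parameter."""
    # Sort the names so the same request always yields the same query string.
    return urlencode({"metrics": ",".join(sorted(wanted))})


def select_metrics(payload, wanted):
    """Keep only the requested metrics from a flat name -> value payload."""
    return {name: value for name, value in payload.items() if name in wanted}


# Hypothetical payload a catalog could hold for one table.
all_metrics = {
    "file-size-distribution": {"p50": 64, "p99": 512},
    "snapshot-count": 42,
    "partition-count": 128,
}

# The TMS would only get back the two metrics it asked for.
tms_view = select_metrics(all_metrics, TMS_METRICS)
```

The point is only that the caller names the metrics it wants, so the response stays small instead of carrying every metric the catalog knows about.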
Regards
JB

On Wed, Feb 19, 2025 at 2:20 PM Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey JB,
>
> Thanks for the additional context. My main question is, why wouldn't the TMS
> directly query the metadata? Since the TMS should have access to the data
> (otherwise it cannot compact it). This would be much faster and more
> efficient. I share Daniel's concern that these requests could easily run into
> the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
>
>
>
> On Wed, Feb 19, 2025 at 12:12 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi folks,
>>
>> I realized that my first email on this thread needs context to be
>> better understood :)
>>
>> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
>> Polaris can help to trigger table maintenance jobs:
>> 1. Is table maintenance enabled (in Polaris)?
>> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
>> policy, ...)
>> 3. Polaris events (e.g. tables/views/namespaces updates)
>> 4. Table metadata (via Iceberg REST)
>> 4.1. Table schema/partition spec/properties, etc
>> 4.2. Iceberg table Stats and metrics. Only the stats and metrics
>> are defined in the Iceberg table spec, e.g., partition stats, snapshot
>> summaries are available at this moment.
>>
>> Specifically about 4.2, the Table Maintenance Service would need more than
>> that.
>>
>> My proposal about adding metrics endpoint to the REST spec is to
>> expose extra metrics for TMS and engine. I'm thinking of:
>> - metrics helping the compaction decisions and snapshots GC
>> - "extra" metrics which are very helpful for TMS (e.g. file size
>> distribution without partitions)
>>
>> I would like to propose a "two steps" approach:
>> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines
>> but the exposed metrics are decided by the Catalog impl
>> 2. Enforce metrics list in the spec with a clear schema and
>> standardized metrics names.
>>
>> I will move forward with a proposal draft about that if there is no
>> objection.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>>
>> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>> >
>> > Hi folks,
>> >
>> > I know we don't want to "expose" the whole metadata tables in the REST
>> > api, but I would like to discuss adding metadata stats and metrics
>> > management.
>> > We are discussing this as part of the Apache Polaris TMS proposal.
>> >
>> > The purpose is:
>> > 1. To add interfaces to manage metadata stats and metrics (partition
>> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> > 2. The catalog implementation can deal with table properties, but can
>> > also extend to "extra" stats and metrics if needed
>> > 3. Query planners can use these metadata stats and metrics to perform
>> > better query plans. It could also be used by the server side planning
>> > to provide "pre-plan check"
>> >
>> > Before going to a proposal document, I would like to get first
>> > feedback from the community (if it makes sense or not).
>> >
>> > Thoughts ?
>> >
>> > Thanks !
>> > Regards
>> > JB