@Fokko: your point is absolutely valid. We don't want to burden the active catalog instance with returning such big data sets; otherwise the catalog's main responsibility could suffer.
OTOH there is some info which exists only on the catalog side and is not available elsewhere. This is especially true for catalogs that do query planning. For example, I would love to see the query statistics for a table, and how often specific files are accessed/returned in a plan. This would help compaction scheduling/planning highlight the hot spots where applying compaction could really make a difference.

Thanks,
Peter

Fokko Driesprong <fo...@apache.org> wrote (on Wed, 19 Feb 2025, 14:20):

> Hey JB,
>
> Thanks for the additional context. My main question is: why wouldn't the
> TMS directly query the metadata? The TMS should have access to the data
> anyway (otherwise it cannot compact it), and this would be much faster
> and more efficient. I share Daniel's concern that these requests could
> easily run into the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
>
> On Wed, 19 Feb 2025 at 12:12, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi folks,
>>
>> I realized that my first email on this thread needs context to be
>> better understood :)
>>
>> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
>> Polaris can help to trigger table maintenance jobs:
>> 1. Is table maintenance enabled (in Polaris)?
>> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
>> policy, ...)
>> 3. Polaris events (e.g. table/view/namespace updates)
>> 4. Table metadata (via Iceberg REST)
>>   4.1. Table schema/partition spec/properties, etc.
>>   4.2. Iceberg table stats and metrics. Only the stats and metrics
>>   defined in the Iceberg table spec (e.g. partition stats, snapshot
>>   summaries) are available at this moment.
>>
>> Specifically about 4.2, the Table Maintenance Service would need more
>> than that.
>>
>> My proposal to add a metrics endpoint to the REST spec is to expose
>> extra metrics for the TMS and engines. I'm thinking of:
>> - metrics that help with compaction decisions and snapshot GC
>> - "extra" metrics which are very helpful for the TMS (e.g. file size
>> distribution without partitions)
>>
>> I would like to propose a two-step approach:
>> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines,
>> where the exposed metrics are decided by the catalog implementation
>> 2. Enforce a metrics list in the spec, with a clear schema and
>> standardized metric names.
>>
>> I will move forward with a proposal draft about that if there is no
>> objection.
>>
>> Thoughts?
>>
>> Regards
>> JB
>>
>> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >
>> > Hi folks,
>> >
>> > I know we don't want to "expose" the whole metadata tables in the REST
>> > API, but I would like to discuss adding metadata stats and metrics
>> > management.
>> > We are discussing this as part of the Apache Polaris TMS proposal.
>> >
>> > The purpose is:
>> > 1. To add interfaces to manage metadata stats and metrics (partition
>> > stats, snapshot summaries, relaying Parquet stats exposed via REST, ...)
>> > 2. The catalog implementation can deal with table properties, but can
>> > also extend to "extra" stats and metrics if needed
>> > 3. Query planners can use these metadata stats and metrics to build
>> > better query plans. They could also be used by server-side planning
>> > to provide a "pre-plan check"
>> >
>> > Before going to a proposal document, I would like to get first
>> > feedback from the community (on whether it makes sense or not).
>> >
>> > Thoughts?
>> >
>> > Thanks!
>> > Regards
>> > JB
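To make JB's step 1 concrete, here is a minimal sketch of what a client-side fetch from the "wild" metrics endpoint could look like. Nothing here is in the Iceberg REST spec today: the route, the bearer-token auth, and the payload shape are all assumptions for illustration, and in step 1 the metric names would be whatever the catalog implementation decides to expose.

```python
# Hypothetical sketch of a client for the proposed "wild" metrics
# endpoint (step 1). The route and payload shape are assumptions,
# not part of the Iceberg REST spec.
import requests

CATALOG = "https://catalog.example.com/v1"  # placeholder catalog URL


def fetch_table_metrics(namespace: str, table: str, token: str) -> dict:
    """Fetch free-form, catalog-defined metrics for a table.

    In step 1 the metric names and values are chosen by the catalog
    implementation; step 2 would standardize names and schema.
    """
    resp = requests.get(
        f"{CATALOG}/namespaces/{namespace}/tables/{table}/metrics",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# A catalog might return something like (illustrative only):
# {
#   "metrics": {
#     "snapshot-count": 412,
#     "small-file-count": 10583,
#     "file-size-p50-bytes": 1048576,
#     "file-size-p99-bytes": 268435456
#   }
# }
```

Keeping the payload to aggregated metrics like these, rather than relaying full metadata tables, would also sidestep the multi-gigabyte responses Daniel and Fokko are worried about.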
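And a sketch of the hot-spot idea above: if the catalog exposed per-partition access counts and file sizes (a hypothetical metric shape, made up here for illustration), a TMS could rank compaction candidates by combining scan frequency with small-file counts.

```python
# Hypothetical: rank partitions for compaction by how "hot" they are,
# combining scan frequency with small-file counts. The input shape
# (partition -> stats dict) is an assumption, not an existing API.
from typing import Dict, List


def rank_compaction_candidates(
    partition_stats: Dict[str, dict],
    small_file_bytes: int = 32 * 1024 * 1024,
) -> List[str]:
    """Return partitions sorted by (access count x small-file count).

    A partition that is scanned often and is full of small files
    benefits most from compaction, so it sorts first.
    """
    def heat(item):
        _, stats = item
        small = sum(
            1 for size in stats["file_sizes"] if size < small_file_bytes
        )
        return stats["access_count"] * small

    ranked = sorted(partition_stats.items(), key=heat, reverse=True)
    return [partition for partition, _ in ranked]


# Example with made-up numbers: a frequently scanned partition full of
# 1 MiB files ranks ahead of a cold partition with the same layout.
stats = {
    "day=2025-02-18": {"access_count": 950, "file_sizes": [1 << 20] * 400},
    "day=2024-01-01": {"access_count": 3, "file_sizes": [1 << 20] * 400},
}
print(rank_compaction_candidates(stats))  # hot partition first
```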