Hey JB,

Thanks for the additional context. My main question is: why wouldn't the
TMS query the metadata directly? The TMS should already have access to the
data (otherwise it cannot compact it), so this would be much faster and more
efficient. I share Daniel's concern that these requests could easily run
into the gigabytes (assuming JSON?).
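
As a rough illustration of the direct approach: if the TMS reads the per-file
sizes straight from the table's manifests, a metric such as the file size
distribution can be computed locally instead of being shipped over REST. This
is only a sketch. The function name, the bucket boundaries, and the input list
are hypothetical, and reading the manifests themselves is left out.

```python
from collections import Counter

MB = 1024 ** 2

# Hypothetical bucket upper bounds in bytes; not taken from any spec.
BUCKETS = [
    ("<8MB", 8 * MB),
    ("8-64MB", 64 * MB),
    ("64-256MB", 256 * MB),
]


def file_size_distribution(file_sizes):
    """Count data files per size bucket, ignoring partitions.

    `file_sizes` is assumed to be an iterable of file sizes in bytes that
    the TMS has already read from the table's manifest entries.
    """
    counts = Counter()
    for size in file_sizes:
        for label, upper in BUCKETS:
            if size < upper:
                counts[label] += 1
                break
        else:
            # Larger than every bucket bound.
            counts[">=256MB"] += 1
    return dict(counts)


if __name__ == "__main__":
    sizes = [1024, 2048, 10 * MB, 300 * MB]
    print(file_size_distribution(sizes))
```

A table with many files in the smallest bucket would be an obvious compaction
candidate, and the result is a handful of counters rather than a large payload.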

Kind regards,
Fokko



On Wed, Feb 19, 2025 at 12:12, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi folks,
>
> I realized that my first email on this thread needs context to be
> better understood :)
>
> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
> Polaris can help to trigger table maintenance jobs:
> 1. Is table maintenance enabled (in Polaris)?
> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
> policy, ...)
> 3. Polaris events (e.g. tables/views/namespaces updates)
> 4. Table metadata (via Iceberg REST)
>     4.1. Table schema/partition spec/properties, etc
>     4.2. Iceberg table stats and metrics. Only the stats and metrics
> defined in the Iceberg table spec (e.g., partition stats, snapshot
> summaries) are available at this moment.
>
> Specifically about 4.2, the Table Maintenance Service would need more than
> that.
>
> My proposal about adding a metrics endpoint to the REST spec is to
> expose extra metrics for the TMS and engines. I'm thinking of:
> - metrics helping the compaction decisions and snapshots GC
> - "extra" metrics which are very helpful for TMS (e.g. file size
> distribution without partitions)
>
> I would like to propose a two-step approach:
> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines,
> where the exposed metrics are decided by the catalog implementation
> 2. Enforce a metrics list in the spec with a clear schema and
> standardized metric names.
>
> I will move forward with a proposal draft about that if there is no
> objection.
>
> Thoughts ?
>
> Regards
> JB
>
> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >
> > Hi folks,
> >
> > I know we don't want to "expose" the whole metadata tables in the REST
> > API, but I would like to discuss adding metadata stats and metrics
> > management.
> > We are discussing this as part of the Apache Polaris TMS proposal.
> >
> > The purpose is:
> > 1. To add interfaces to manage metadata stats and metrics (partition
> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> > 2. The catalog implementation can deal with table properties, but can
> > also extend to "extra" stats and metrics if needed
> > 3. Query planners can use these metadata stats and metrics to perform
> > better query plans. It could also be used by the server side planning
> > to provide "pre-plan check"
> >
> > Before writing a proposal document, I would like to get initial
> > feedback from the community (whether it makes sense or not).
> >
> > Thoughts ?
> >
> > Thanks !
> > Regards
> > JB
>
