Thanks everyone for continuing to drive this forward. I agree that the
problem is getting complex enough that a more structured discussion would
help.

+1 on setting up a biweekly sync for the metrics architecture. I’m happy to
join.

Yufei


On Tue, Apr 21, 2026 at 2:34 PM EJ Wang <[email protected]>
wrote:

> Also, I've been looking more closely at the *persistence schema in the
> current metrics work*, and I think there's a structural rigidity problem
> worth raising before the shape gets locked in.
>
> Right now we have two separate tables (scan_metrics_report and
> commit_metrics_report), each with ~25 flattened columns that directly
> mirror the Iceberg report fields. The SPI follows the same split:
> writeScanReport and writeCommitReport as separate methods, with per-type
> record classes, converters, and model objects. *The practical cost:
> adding a new metric type (operational metrics, for example) requires a new
> table, a new SPI method, a new record class, a new model class, a new
> converter branch, and a schema migration*. That's a lot of surface area
> for what should be "one more kind of metric."
>
> *My bias* would be toward a single metrics table with *a typed JSON
> payload*. Something like: metric_type (enum), entity_id,
> table_identifier, snapshot_id (nullable), received_ts, schema_version, and
> a payload column for the metric-specific data. The metric_type +
> schema_version pair gives us a forward-compatible contract for the payload
> shape. Adding a new metric type becomes an enum value and a payload schema,
> not a schema migration. One thing I think we need to be deliberate about is
> the partition key design. If all metric types land in one table, scan
> metrics at scale (high concurrency, high frequency across many tables)
> could easily create hot partitions. We'd want the persistence layer to be
> able to shard by entity or time range, and that means the logical schema
> needs to expose enough structure for backends to partition on. I don't
> think the current flattened layout gives us that.
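To make the single-table shape concrete, here is a minimal Java sketch of one logical row. It is only an illustration under stated assumptions: the names MetricType, MetricRecord, payloadContract, and partitionHint are hypothetical and not existing Polaris code, entity_id is assumed to be a long, the payload is kept as an opaque JSON string, and the entity-plus-UTC-day hint is just one possible way to expose structure that a backend could shard on.

```java
import java.time.Instant;
import java.util.Objects;
import java.util.Optional;

// Hypothetical enum: adding a new metric kind means adding a value here.
enum MetricType { SCAN, COMMIT, OPERATIONAL }

// Hypothetical logical row for a single metrics table (not existing Polaris code).
record MetricRecord(
    MetricType metricType,
    long entityId,
    String tableIdentifier,
    Optional<Long> snapshotId, // empty for metrics not tied to a snapshot
    Instant receivedTs,
    int schemaVersion,
    String payloadJson) { // metric-specific data, opaque at this layer

  MetricRecord {
    Objects.requireNonNull(metricType, "metricType");
    Objects.requireNonNull(tableIdentifier, "tableIdentifier");
    Objects.requireNonNull(snapshotId, "snapshotId");
    Objects.requireNonNull(receivedTs, "receivedTs");
    Objects.requireNonNull(payloadJson, "payloadJson");
    if (schemaVersion < 1) {
      throw new IllegalArgumentException("schemaVersion must be >= 1");
    }
  }

  // The (metric_type, schema_version) pair identifies the payload shape.
  String payloadContract() {
    return metricType + ":v" + schemaVersion;
  }

  // Assumed partition hint: entity id plus the UTC day bucket of receivedTs.
  String partitionHint() {
    long dayBucket = Math.floorDiv(receivedTs.getEpochSecond(), 86_400L);
    return entityId + "/" + dayBucket;
  }
}
```

In this sketch a backend could route on partitionHint() while readers pick a payload decoder by payloadContract(); whether the hint belongs in the logical schema or stays backend-private is exactly the kind of question the schema discussion would need to settle.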
>
> This is getting complex enough that I don't think ad-hoc PR/ML threads
> will converge well. *Would people be open to a biweekly sync for metrics
> architecture?* I think 30 minutes every two weeks with interested parties
> would be enough to work through the schema, SPI shape, and read API design
> together. Happy to help set that up.
>
> -ej
>
> On Mon, Apr 20, 2026 at 2:19 PM EJ Wang <[email protected]>
> wrote:
>
>> Reviewed #4115, left a comment on the code organization side.
>>
>> One thing stood out: the metrics write path enters through
>> PolarisMetricsManager on MetaStoreManager, but the new read path bypasses
>> MetaStoreManager entirely and goes straight to BasePersistence via
>> callContext.getMetaStore(). That means the read API only works for backends
>> that implement BasePersistence. NoSQL and remote backends can't participate.
>>
>> Stepping back, I think the metrics subsystem is growing into something
>> real (write + read + REST API + AuthZ + pagination) *but the persistence
>> side is split across two layers in a way that's hard to extend*. I put
>> together two diagrams to show what I mean (my best effort).
>>
>> *Current state* (Diagram 1): three interfaces at three different levels.
>> The engine-facing SPI (PolarisMetricsReporter) is clean. But
>> PolarisMetricsManager on MetaStoreManager is a passthrough to
>> MetricsPersistence on BasePersistence. The @Beta annotation and SPI javadoc
>> are on the BasePersistence layer, while the actual extension points
>> (PolarisMetricsReporter, PolarisMetricsManager) carry no stability
>> annotation. The write path goes through the MetaStoreManager layer; the
>> read path doesn't.
>>
>> *What I envision* (Diagram 2): two SPIs at two levels.
>> PolarisMetricsReporter stays as the engine-facing SPI.
>> PolarisMetricsManager becomes the backend-facing SPI with both write and
>> read methods at the MetaStoreManager level, where any backend (JDBC, NoSQL,
>> remote) can implement them. MetricsPersistence on BasePersistence goes
>> away. Where metrics actually land is an implementation detail, not a core
>> interface.
>>
>> *Minor naming thing*: PolarisMetricsReporter is broader than what it
>> actually handles. It only accepts Iceberg REST Catalog metrics (ScanReport,
>> CommitReport via MetricsReport). Generic table metrics or operational
>> metrics aren't in scope. Not blocking, but worth noting if the metrics
>> surface expands.
>>
>> *Rough sketch of how to get there*:
>>  1.  Add read methods to PolarisMetricsManager (listScanReports,
>> listCommitReports) with default no-op implementations, same as the
>> existing write methods. (Probably also make it more explicit that
>> PolarisMetricsManager is Iceberg-specific, e.g. via the package name or
>> class name.)
>>  2.  Wire MetricsReportsService through MetaStoreManager instead of
>> callContext.getMetaStore().
>>  3.  Extract metrics persistence from JdbcBasePersistenceImpl into its
>> own class. That file carries ~7 responsibilities, metrics being one of them.
>>  4.  Remove MetricsPersistence from BasePersistence.
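A rough Java sketch of what step 1 above could look like, with write and read methods side by side and every method defaulting to a no-op. This is an assumption-heavy illustration, not the actual Polaris interface: only the four method names come from this thread, while the interface name IcebergMetricsManagerSketch, the record types, the Page type, the method signatures, and the offset-style page token are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical placeholder types; the real report models carry many more fields.
record ScanReportRecord(String tableIdentifier, long snapshotId) {}

record CommitReportRecord(String tableIdentifier, long snapshotId) {}

// A page of results plus an opaque token for fetching the next page.
record Page<T>(List<T> items, Optional<String> nextPageToken) {
  static <T> Page<T> empty() {
    return new Page<>(List.of(), Optional.empty());
  }
}

// Backend-facing SPI sketch: writes and reads both default to no-ops.
interface IcebergMetricsManagerSketch {
  default void writeScanReport(ScanReportRecord report) {}

  default void writeCommitReport(CommitReportRecord report) {}

  default Page<ScanReportRecord> listScanReports(
      String tableIdentifier, Optional<String> pageToken, int pageSize) {
    return Page.empty();
  }

  default Page<CommitReportRecord> listCommitReports(
      String tableIdentifier, Optional<String> pageToken, int pageSize) {
    return Page.empty();
  }
}

// A backend that does not store metrics simply inherits every default.
final class NoMetricsBackend implements IcebergMetricsManagerSketch {}

// Toy backend overriding only the scan methods; the page token is a list offset.
final class InMemoryMetricsBackend implements IcebergMetricsManagerSketch {
  private final List<ScanReportRecord> scans = new ArrayList<>();

  @Override
  public void writeScanReport(ScanReportRecord report) {
    scans.add(report);
  }

  @Override
  public Page<ScanReportRecord> listScanReports(
      String tableIdentifier, Optional<String> pageToken, int pageSize) {
    List<ScanReportRecord> matching =
        scans.stream().filter(r -> r.tableIdentifier().equals(tableIdentifier)).toList();
    int offset = Math.min(pageToken.map(Integer::parseInt).orElse(0), matching.size());
    int end = Math.min(offset + pageSize, matching.size());
    Optional<String> next =
        end < matching.size() ? Optional.of(Integer.toString(end)) : Optional.empty();
    return new Page<>(List.copyOf(matching.subList(offset, end)), next);
  }
}
```

The point of the sketch is that a service wired against this one interface (step 2) would work with any backend, including ones that inherit the defaults and return empty pages, without reaching into BasePersistence.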
>>
>> *None of this needs to happen in #4115. But if the direction makes sense,
>> it would be good to align before the metrics surface grows further. Curious
>> what others think.*
>>
>> *My mental model note*: Level 1 MetaStoreManager; level 2 transactional
>> persistence; level 3 base persistence
>>
>> Diagram 1
>> <https://www.plantuml.com/plantuml/uml/bLHDR-Cs4BthLmpIYupw0zbkKQ1r3M-S7Bp8xhhM7WCOb3IM65EaGD9EX2RzxHrHb4CxRelwa4YSDu_lpOVcnZ9jzvM8BBS2uGjQpJC3dtHMSekPtMk44IpsMgEqa5XcCOhCZikQQLP1pR8TAp2n3ILhmZDP20m0fcIvUkAoW2qJXd9z1bpToO9BX3WXu0ucy5rpgGPNm0nW5_epUWtm2Ue3pn3kMOFQmKntGZW0BYtgBSi8k5A2QMwybJNMIbFiGSR9QZc4nUqIvikStF0jHprua5C-amge42aNt3R0f5JaaoivdV2Pkqbx4hee4ymOkBh5BTiB-_uIeGeo8zL8rPsPl4DktdEiK1jkB1NdZCRbrSTecDe_mlHbF0wvBmCkaOH5_S8a_TTTKI6-nmCAkEw4LpxsZ-LbYLKQFKMNOgf_wuM7_bV9gOer5SYMMksBSWXFcbi49KNZXNLicwfe3TETC7gPdPqI7uBcHMb1RSzYq34c6PDUM9mn8HRsUTZEiDBve3NjVZumBj0U7SS37mGO7vcwtiK-_pU7U7L_f-digo9YbhSwIfMRwIITKGXbxdIUTCGF1SeCJxloKsU-3k9ddRbX1eDq1q_fx1JbBGT0glVyXimDuP4TQ5qpCAmnGEj2s_6n5mtn1z-97-63itFQZLPO1Ev2tu_WF7Ju-VPc0Skg5bYXxBhkY1xpD7EM_7fyflSpIsqMgVth5xhVr4eQxWQ8enaSAJQSG16yFSDuJ798rrcXr_3n-lfdk7icQjEBmFujL7AodiP_Y4Z7-YxvtZNs4zMgpNTl6tF8sglyPsmqchrjvQ-m-aP94r-TwCA2Ka8upPJZwtvSpoYCXkYMZU2NXvRMBfq9P3i3Le4VAZUAlUZ_oPKsxPgY0Q_BSKLkyr9bhQhQrJjo_x3TPlIB0DPjnMfcIoYP0QaYw1a0fTKDr8fB6ntNuvmoL1ZGkXa69Njh43zf9GiGxHQrA_jDYWRSzF5--WmTVrN97_Sm8LbLUy_lGBmLanJjFkDlGkRqjA_4tm00>
>>
>> Diagram 2
>> <https://www.plantuml.com/plantuml/uml/VLLDR-8m4BtdLupO2sWBLVU8AaGB7AXAbssGzb896SS9RXqxjKqBwkv_tt7iV43fdaZYDpFlpRmnOsE9jhjSH9PRmM31hERKm8scMsuPjJlDe0yheZDc8RR4iYWoBrmMH9CS2a9VICPYUy1OZN0YCy5Q0BCbYNhdCeEK28En8G8wCvbnoQ0R8_05Bc6bkLIz3X03p1zzH7zR-9ZfDquPt9C3qoNCX2yV4G2NbkcKu5jdgGJHt0GbZwnG6i-UP3TUpk5gM6Ldqke350eZUqzoCft3U9xWHvxoa5-7K4nF1J46EbEMafsmdrCBbQ44gVggy18IZrn_ph5asd1ZiIKdQSgueZvjXrQFSFrdC3YN-nXmBacxbGiYyLVxLaBtdhqn0LSzdBDhqQtQoOJeGyad3z0lUqnYgpGB6Ns8oVyta00Dy_WnX0tIOZ8v6SYxHll1TrH6aejAik-mh-AphVFCwSUQqFypElag5QRGFDjQKEd96K1P8QP41c9TzA_IIQyvdAWyv_RSiS3skb0_EzDDkK2v5xWF6MiGFlvhpFLcD2Dq2pml14gaF67eQkmd8gulDoC4kSOu6KVpkvlUJg1RTbWISU40RdBUUS_9XfRZ2dwxm_SW8LYFISgm_MnlDQ6M9P1gbKEc4X-2pH_FvJCkCqm9pbVjD6LrwdLeOrDWfOaqc8Wh9BE85oNKxkNQ6o4yGRy_Eae0G_G8tZv81d3bHDB23WOdisohVr3nh_j6lbSjbNaLRTc8UgtPbAU1J_tygOfZX9DWEJeHDvYx-qmSi5FgNLPZwHrHcUsncGQ5-skhUclpE5fo4ounpFauYrUbkU6ccfnxMvitwag4IyerhTxj8In_Oj1bDO4pQru674loYrGlULHLEGCjwJJ8gDoVZR8MxO4BT3IzRvIcAQKezC6xpziGnTyImrfEGyJI_OcKfgtxIvnTqFEMS17L9Z-jsARN5FmTheP7HtSdtOMT0B4GY2FYHXxgQmMtj2bRqiLFGapiVe1_QVKDrkqXcm83aFEXnMYCZ-xlyHy>
>>
>>  -ej
>>
>> On Wed, Apr 15, 2026 at 8:22 AM Dmitri Bourlatchkov <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> Heads up: The current state of PR [4115] looks pretty solid to me. I
>>> believe this PR is approaching a mergeable condition.
>>>
>>> Please post your reviews if you have any comments.
>>>
>>> [4115] https://github.com/apache/polaris/pull/4115
>>>
>>> Thanks,
>>> Dmitri.
>>>
>>> On Tue, Mar 3, 2026 at 3:29 PM Anand Kumar Sankaran via dev <
>>> [email protected]> wrote:
>>>
>>> > Hi Yufei and Dmitri,
>>> >
>>> > Here is a proposal for the REST endpoints for metrics and events.
>>> >
>>> > https://github.com/apache/polaris/pull/3924/changes
>>> >
>>> > I did not see any precedent for raising a PR for proposals, so I am
>>> > trying this.  Please let me know what you think.
>>> >
>>> > -
>>> > Anand
>>> >
>>> > From: Anand Kumar Sankaran <[email protected]>
>>> > Date: Monday, March 2, 2026 at 10:25 AM
>>> > To: [email protected] <[email protected]>
>>> > Subject: Re: Polaris Telemetry and Audit Trail
>>> >
>>> > About the REST API, based on my use cases:
>>> >
>>> >
>>> >   1. I want to be able to query commit metrics to track files added /
>>> >      removed per commit, along with record counts. The ingestion
>>> >      pipeline that writes this data is owned by us and we are
>>> >      guaranteed to write this information for each write.
>>> >   2. I want to be able to query scan metrics for read. I understand
>>> >      clients do not fulfill this requirement.
>>> >   3. I want to be able to query the events table (events are persisted) -
>>> >      this may supersede #2, I am not sure yet.
>>> >
>>> > All this information is in the JDBC-based persistence model and is
>>> > persisted in the metastore. I currently don’t have a need to query
>>> > Prometheus or OpenTelemetry. I do publish some events to Prometheus,
>>> > and they are forwarded to our dashboards elsewhere.
>>> >
>>> > About the CLI utilities, I meant the admin user utilities. In one of
>>> > the earliest drafts of my proposal, Prashant mentioned that the metrics
>>> > tables can grow indefinitely and that a similar problem exists with the
>>> > events table as well. We discussed that cleaning up old records from
>>> > both the metrics tables and the events table can be done via a CLI
>>> > utility.
>>> >
>>> > I see that Yufei has covered the discussion about datasources.
>>> >
>>> > -
>>> > Anand
>>> >
>>> >
>>> >
>>> > From: Yufei Gu <[email protected]>
>>> > Date: Friday, February 27, 2026 at 9:54 PM
>>> > To: [email protected] <[email protected]>
>>> > Subject: Re: Polaris Telemetry and Audit Trail
>>> >
>>> >
>>> > As I mentioned in <https://github.com/apache/polaris/issues/3890>,
>>> > supporting multiple data sources is not a trivial change. I would
>>> > strongly recommend starting with a design document to carefully
>>> > evaluate the architectural implications and long term impact.
>>> >
>>> > A REST endpoint to query metrics seems reasonable given the current
>>> > JDBC-based persistence model. That said, we may also consider
>>> > alternative storage models. For example, if we later adopt a time
>>> > series system such as Prometheus to store metrics, the query model and
>>> > access patterns would be fundamentally different. Designing the REST
>>> > API without considering these potential evolutions may limit
>>> > flexibility. I'd suggest starting with the use cases.
>>> >
>>> > Yufei
>>> >
>>> >
>>> > On Fri, Feb 27, 2026 at 3:42 PM Dmitri Bourlatchkov <[email protected]>
>>> > wrote:
>>> >
>>> > > Hi Anand,
>>> > >
>>> > > Sharing my view... subject to discussion:
>>> > >
>>> > > 1. Adding non-IRC REST API to Polaris is perfectly fine.
>>> > >
>>> > > Figuring out specific endpoint URIs and payloads might require a few
>>> > > roundtrips, so opening a separate thread for that might be best.
>>> > > Contributors commonly create Google Docs for new API proposals too
>>> > > (they are fairly easy to update as the email discussion progresses).
>>> > >
>>> > > There was a suggestion to try Markdown (with PRs) for proposals [1] ...
>>> > > feel free to give it a try if you are comfortable with that.
>>> > >
>>> > > 2. Could you clarify whether you mean end user utilities or admin user
>>> > > utilities? In the latter case those might be more suitable for the
>>> > > Admin CLI (Java), not the Python CLI, IMHO.
>>> > >
>>> > > Why would these utilities be common with events? IMHO, event use cases
>>> > > are distinct from scan/commit metrics.
>>> > >
>>> > > 3. I'd prefer separating metrics persistence from MetaStore persistence
>>> > > at the code level, so that they could be mixed and matched
>>> > > independently. The separate datasource question will become a
>>> > > non-issue with that approach, I guess.
>>> > >
>>> > > The rationale for separating scan metrics and metastore persistence is
>>> > > that "cascading deletes" between them are hardly ever required.
>>> > > Furthermore, the data and query patterns are very different, so
>>> > > different technologies might be beneficial in each case.
>>> > >
>>> > > [1] https://lists.apache.org/thread/yto2wp982t43h1mqjwnslswhws5z47cy
>>> > >
>>> > > Cheers,
>>> > > Dmitri.
>>> > >
>>> > > On Fri, Feb 27, 2026 at 6:19 PM Anand Kumar Sankaran via dev <
>>> > > [email protected]> wrote:
>>> > >
>>> > > > Thanks all. This PR is merged now.
>>> > > >
>>> > > > Here are the follow-up features / work needed.  These were all part
>>> > > > of the merged PR at some point in time and were removed to reduce
>>> > > > scope.
>>> > > >
>>> > > > Please let me know what you think.
>>> > > >
>>> > > >
>>> > > >   1.  A REST API to paginate through table metrics. This will be a
>>> > > >       non-IRC-standard addition.
>>> > > >   2.  Utilities for managing old records, which should be common with
>>> > > >       events. There was some discussion that this belongs in the CLI.
>>> > > >   3.  Separate datasource (metrics, events, even other tables?).
>>> > > >
>>> > > >
>>> > > > Anything else?
>>> > > >
>>> > > > -
>>> > > > Anand
>>> > > >
>>> > > >
>>> > >
>>> >
>>> >
>>>
>>
