Hi Anand,

Thanks for sharing this information!
I did not dig deep into it, but just from a quick glance, I believe
synchronous execution on Vert.x threads is a concern. Normally, Polaris
requests should be handled on the "executor" threads.

I'll make a second pass at this later (hopefully not too late :)

Cheers,
Dmitri.

On Tue, May 5, 2026 at 9:33 AM Anand Kumar Sankaran via dev
<[email protected]> wrote:

> Hi all,
>
> I picked up 1.4.1 and turned on table metrics persistence.
>
> This time around, I only wanted to persist the CommitReport (much lower
> volume than the ScanReport). I created a CompositeMetricsReporter with
> three delegates: a logging reporter that logs table metrics as in 1.3.0,
> a Prometheus reporter that exports selected metrics to Prometheus, and a
> persisting reporter that persists just the CommitReport:
>
> public CompositeMetricsReporter(
>     @Identifier("default") final PolarisMetricsReporter logging,
>     @Identifier("persisting") final PolarisMetricsReporter persisting,
>     @Identifier("prometheus") final PolarisMetricsReporter prometheus,
>     final ThreadContext threadContext,
>     final MeterRegistry meterRegistry) {
>   this.logging = logging;
>   this.persisting = persisting;
>   this.prometheus = prometheus;
>   this.threadContext = threadContext;
>   this.meterRegistry = meterRegistry;
> }
>
> CommitReport persistence (the PolarisMetricsReporter "persisting"
> delegate, which writes commit history to the metastore) was executing
> synchronously on the Vert.x worker thread that handled each commit
> request. When Aurora Serverless v2 entered a cold-start/scaling phase,
> each JDBC call took seconds instead of milliseconds. With enough
> concurrent commits, Vert.x's worker thread pool saturated and new
> requests began failing with 503.
>
> To fix this, I had to do the following:
>
> 1. An async, bounded persistence executor in CompositeMetricsReporter.
> CommitReport persistence is now dispatched onto a dedicated
> ThreadPoolExecutor (4 threads, queue capacity 1024) rather than running
> on the calling thread. Tasks are wrapped via ThreadContext (MicroProfile
> Context Propagation) so request-scoped CDI beans (CallContext,
> PolarisPrincipal, RequestIdSupplier) remain accessible on the executor
> thread. When the queue is full, the task is dropped (rather than
> enqueued to grow memory without bound) and counted via a new metric,
> catalog_metrics_persistence_dropped. Queue depth is exposed as
> catalog_metrics_persistence_queue_size. These two metrics give me a
> window into how to tune this further. (A sketch of this setup appears
> below the quoted message.)
>
> 2. ScanReport persistence was disabled. Scan reports represent
> read-path volume and were causing write amplification under read-heavy
> workloads; they are now only passed to the logging and Prometheus
> delegates (see the dispatch sketch below).
>
> 3. I also had to raise the minimum capacity of the serverless Aurora
> instance.
>
> When we were implementing the PR, we discussed a different data source
> for table metrics but abandoned the idea. I request that we look at it
> again now. Anyone using the persisting table metrics identifier needs
> to be careful.
>
> -
> Anand
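
A minimal sketch of the bounded, drop-on-full executor described in item
1 above, assuming Micrometer's MeterRegistry and MicroProfile Context
Propagation's ThreadContext; the class name, field names, and the exact
metric registration style are illustrative, not the actual Polaris patch:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.eclipse.microprofile.context.ThreadContext;

final class BoundedPersistenceExecutor {

  private final ThreadPoolExecutor pool;
  private final ThreadContext threadContext;

  BoundedPersistenceExecutor(ThreadContext threadContext, MeterRegistry registry) {
    this.threadContext = threadContext;

    // Bounded queue: at most 1024 persistence tasks may wait.
    BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);

    // Incremented whenever a task is rejected because the queue is full.
    Counter dropped =
        Counter.builder("catalog_metrics_persistence_dropped").register(registry);

    // Current queue depth, for tuning pool size and capacity.
    registry.gauge("catalog_metrics_persistence_queue_size", queue, q -> q.size());

    // Fixed pool of 4 threads; the rejection handler drops the task and
    // counts it instead of blocking the calling (Vert.x) thread.
    this.pool =
        new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, queue,
            (task, executor) -> dropped.increment());
  }

  // Wrap the task with MicroProfile Context Propagation so request-scoped
  // beans (CallContext, principal, request id) resolve on the worker thread.
  void submit(Runnable persistTask) {
    pool.execute(threadContext.contextualRunnable(persistTask));
  }
}

Dropping on overflow, rather than using a caller-runs policy, is what
keeps the request path fast here: caller-runs would push the slow JDBC
write back onto the Vert.x worker thread and reintroduce the original
saturation.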
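
A companion sketch of the dispatch policy from items 1 and 2: every
report goes to the logging and Prometheus delegates, but only a
CommitReport reaches the persisting delegate, and only via the bounded
executor. Standing in Iceberg's MetricsReporter interface for the
PolarisMetricsReporter delegates is an assumption:

import org.apache.iceberg.metrics.CommitReport;
import org.apache.iceberg.metrics.MetricsReport;
import org.apache.iceberg.metrics.MetricsReporter;

// Illustrative composite; BoundedPersistenceExecutor is the sketch above.
final class CompositeDispatchSketch implements MetricsReporter {

  private final MetricsReporter logging;
  private final MetricsReporter prometheus;
  private final MetricsReporter persisting;
  private final BoundedPersistenceExecutor persistenceExecutor;

  CompositeDispatchSketch(
      MetricsReporter logging,
      MetricsReporter prometheus,
      MetricsReporter persisting,
      BoundedPersistenceExecutor persistenceExecutor) {
    this.logging = logging;
    this.prometheus = prometheus;
    this.persisting = persisting;
    this.persistenceExecutor = persistenceExecutor;
  }

  @Override
  public void report(MetricsReport report) {
    // Every report (commit or scan) is logged and exported to Prometheus.
    logging.report(report);
    prometheus.report(report);

    // Only commit reports are persisted, and only off the request thread;
    // scan reports are deliberately never persisted (read-path write
    // amplification under read-heavy workloads).
    if (report instanceof CommitReport) {
      persistenceExecutor.submit(() -> persisting.report(report));
    }
  }
}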
