Hey folks,

I’m not sold on dropping realm_id.

Why it matters

   1. It's pretty useful for first-line triage in multi-tenant clusters.
   When a latency spike or error burst hits the dashboards, the first SRE
   question is “Which realm(s)?” Having the tag lets us slice the time-series
   instantly instead of hopping tools or waiting for sampled traces.
   2. Cardinality is bounded in practice. Even big deployments seldom break
   a few dozen active realms per Polaris instance. If someone configures
   “hundreds,” that’s an intentional choice we should surface and monitor.
   Some TSDBs scale just fine in that situation.
   3. Metrics and traces serve different loops. Traces are for deep-dive
   forensics once you have a suspect request; metrics drive real-time alerting
   and capacity planning. Relying only on traces every on-call to see which
   tenant is noisy.

If we really need a safety valve, add a boolean flag like
"metrics.realm-id.enabled", default true. Operators who don’t want the
dimension can flip it off; everyone else keeps the out-of-the-box
visibility that keeps multi-tenant ops sane.
Dropping the tag outright trades day-to-day operability for a theoretical
scaling worry most users will never hit. Let’s keep it and offer an opt-out
instead.

Yufei


On Fri, May 23, 2025 at 5:49 AM Alex Dutra <alex.du...@dremio.com.invalid>
wrote:

> Hi,
>
> For reference the PR is here: https://github.com/apache/polaris/pull/1662
>
> Alex
>
> On Fri, May 23, 2025 at 2:42 PM Robert Stupp <sn...@snazy.de> wrote:
> >
> > +1 on adding a config and default to false - and later entirely remove
> > the "issue".
> >
> > Sounds like good plan to me!
> >
> > On 23.05.25 13:52, Alex Dutra wrote:
> > > Hi Mike,
> > >
> > > That sounds like a plan! And yes, I agree with the slow approach. I
> > > will open a PR shortly with the new enablement flag.
> > >
> > > Alex
> > >
> > > On Fri, May 23, 2025 at 2:42 AM Michael Collado <
> collado.m...@gmail.com> wrote:
> > >> My immediate reaction is that a configuration flag would be good
> enough for
> > >> us. Our setup prevents requests without a valid realm, so we're not
> subject
> > >> to the DDOS attack you mentioned, but we should address that in the
> default
> > >> realm context resolver as well. I can do some further investigation
> to see
> > >> if it's something we could get rid of entirely (we also use log-based
> > >> metrics), but I like the slow approach of adding a configuration flag,
> > >> disabling it by default, then removing functionality as a best
> practice,
> > >> anyway.
> > >>
> > >> Mike
> > >>
> > >> On Thu, May 22, 2025 at 11:39 AM Alex Dutra
> <alex.du...@dremio.com.invalid>
> > >> wrote:
> > >>
> > >>> Hi again,
> > >>>
> > >>> Let's go back to the main concern: excessive memory required for
> metrics.
> > >>>
> > >>> To move this discussion forward, would you all agree to introduce a
> > >>> flag that activates the inclusion of realm IDs in the metrics?
> > >>>
> > >>> It seems like this could be a good middle ground that addresses
> > >>> everyone's needs.
> > >>>
> > >>> Let me know your thoughts.
> > >>>
> > >>> Alex
> > >>>
> > >>>
> > >>> On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote:
> > >>>> Ouch! That sounds very simple to exploit and quite serious.
> > >>>>
> > >>>> On 21.05.25 15:38, Alex Dutra wrote:
> > >>>>> Also worth noting: http_server_requests_seconds_* metrics are
> generated
> > >>>>> even for failed requests. So it would be very easy for an attacker
> to
> > >>> forge
> > >>>>> HTTP requests containing invalid realms and take the server down.
> > >>>>>
> > >>>>> Alex
> > >>>>>
> > >>>>> On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de>
> wrote:
> > >>>>>
> > >>>>>> OOMs are bad - it's extremely hard (if not impossible) for any
> user to
> > >>>>>> figure out why that happens. Bug reports would just read like
> "Polaris
> > >>>>>> runs into an OOM - no further information available".
> > >>>>>>
> > >>>>>> IMHO the system should be stable by default, not the other way
> around.
> > >>>>>> Even if there is a way to enable such huge cardinalities, that
> flag
> > >>>>>> should really be hidden and not documented anywhere - unless
> there's a
> > >>>>>> clear disclaimer saying: "Enabling this can lead to an
> unresponsive
> > >>>>>> system and risk the stability of the system. ..."
> > >>>>>>
> > >>>>>> I propose to investigate all metrics and reduce the cardinality
> of all
> > >>>>>> of those as that can easily become a serious issue.
> > >>>>>>
> > >>>>>> On 21.05.25 11:42, Alex Dutra wrote:
> > >>>>>>> Hi Mike,
> > >>>>>>>
> > >>>>>>> If you have around a hundred realms or more, imho you are
> already in
> > >>>>>>> trouble.
> > >>>>>>>
> > >>>>>>> The most problematic metric is the http_server_requests_seconds_*
> > >>> metrics
> > >>>>>>> family, e.g.:
> > >>>>>>>
> > >>>>>>>
> > >>>
> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> > >>>>>>> 1.0
> > >>>>>>>
> > >>>>>>> Since metrics in this family already have a high cardinality
> > >>> potential
> > >>>>>>> given the number of tags it supports by default, adding one more
> > >>>>>> dimension
> > >>>>>>> to them makes things (exponentially) worse.
> > >>>>>>>
> > >>>>>>> It's very easy to demonstrate that by running a small test [1].
> On my
> > >>>>>>> machine, the first 2 iterations (10 and 100 realms) complete, but
> > >>> the 3rd
> > >>>>>>> iteration (1000 realms) runs for about 1 minute then ends up in
> > >>>>>>> java.lang.OutOfMemoryError: Java heap space.
> > >>>>>>>
> > >>>>>>> That's why I advocated for removing the tag. If however you
> really
> > >>> want
> > >>>>>> to
> > >>>>>>> keep it, I'd suggest introducing a configuration flag to disable
> it
> > >>> in
> > >>>>>> two
> > >>>>>>> problematic metric families: the HTTP one shown above, and the
> > >>>>>> per-endpoint
> > >>>>>>> metrics as well.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>>
> > >>>>>>> Alex
> > >>>>>>>
> > >>>>>>> [1]:
> https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <
> > >>> collado.m...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hmm, we do use the realm tag in our metric publishing. I
> understand
> > >>> the
> > >>>>>>>> concern re: cardinality. Maybe we can support filtering metrics
> that
> > >>>>>> have
> > >>>>>>>> realm and support another metric without realm?
> > >>>>>>>>
> > >>>>>>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <
> > >>> di...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Removing realm_id from metrics tags makes sense to me (to avoid
> > >>> high
> > >>>>>>>>> cardinality).
> > >>>>>>>>>
> > >>>>>>>>> If we need to have insight into load differences from realm to
> > >>> realm,
> > >>>>>> it
> > >>>>>>>>> might be preferable to introduce metrics dedicated to that
> rather
> > >>> than
> > >>>>>>>>> increasing the cardinality of every endpoint metric.
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Dmitri.
> > >>>>>>>>>
> > >>>>>>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> > >>>>>> <alex.du...@dremio.com.invalid
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi all,
> > >>>>>>>>>>
> > >>>>>>>>>> I would like to suggest removing the "realm_id" metric tag
> > >>> entirely.
> > >>>>>>>>>> My concern is that this tag has the potential for high
> > >>> cardinality,
> > >>>>>>>> which
> > >>>>>>>>>> is generally considered a bad practice when dealing with
> metrics.
> > >>> High
> > >>>>>>>>>> cardinality can lead to performance issues and increased
> memory
> > >>> usage.
> > >>>>>>>>>> Granted, the default realm resolver in Polaris is tailored for
> > >>> just a
> > >>>>>>>>>> handful of realms, but nothing prevents users from declaring
> > >>> hundreds
> > >>>>>>>> of
> > >>>>>>>>>> realms.
> > >>>>>>>>>>
> > >>>>>>>>>> I believe we can still effectively monitor Polaris servers
> without
> > >>>>>> this
> > >>>>>>>>>> specific tag, since the realm ID is also propagated in traces
> > >>> emitted
> > >>>>>>>> by
> > >>>>>>>>>> Polaris. Tracing is a much better fit for high-cardinality
> > >>> domains.
> > >>>>>>>>>> I'm open to discussing this further; a potential alternative
> > >>> would be
> > >>>>>>>> to
> > >>>>>>>>>> introduce a flag to disable this specific metric tag, but I
> feel
> > >>> like
> > >>>>>>>>>> removing it would be a much cleaner approach.
> > >>>>>>>>>>
> > >>>>>>>>>> Let me know your thoughts.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Alex
> > >>>>>>>>>>
> > >>>>>> --
> > >>>>>> Robert Stupp
> > >>>>>> @snazy
> > >>>>>>
> > >>>>>>
> > >>>> --
> > >>>> Robert Stupp
> > >>>> @snazy
> > >>>>
> > --
> > Robert Stupp
> > @snazy
> >
>

Reply via email to