Hi Mike,

That sounds like a plan! And yes, I agree with the slow approach. I
will open a PR shortly with the new enablement flag.

Alex

On Fri, May 23, 2025 at 2:42 AM Michael Collado <collado.m...@gmail.com> wrote:
>
> My immediate reaction is that a configuration flag would be good enough for
> us. Our setup prevents requests without a valid realm, so we're not subject
> to the DDOS attack you mentioned, but we should address that in the default
> realm context resolver as well. I can do some further investigation to see
> if it's something we could get rid of entirely (we also use log-based
> metrics), but I like the slow approach of adding a configuration flag,
> disabling it by default, then removing functionality as a best practice,
> anyway.
>
> Mike
>
> On Thu, May 22, 2025 at 11:39 AM Alex Dutra <alex.du...@dremio.com.invalid>
> wrote:
>
> > Hi again,
> >
> > Let's go back to the main concern: excessive memory required for metrics.
> >
> > To move this discussion forward, would you all agree to introduce a
> > flag that activates the inclusion of realm IDs in the metrics?
> >
> > It seems like this could be a good middle ground that addresses
> > everyone's needs.
> >
> > Let me know your thoughts.
> >
> > Alex
> >
> >
> > On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote:
> > >
> > > Ouch! That sounds very simple to exploit and quite serious.
> > >
> > > On 21.05.25 15:38, Alex Dutra wrote:
> > > > Also worth noting: http_server_requests_seconds_* metrics are generated
> > > > even for failed requests. So it would be very easy for an attacker to
> > forge
> > > > HTTP requests containing invalid realms and take the server down.
> > > >
> > > > Alex
> > > >
> > > > On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote:
> > > >
> > > >> OOMs are bad - it's extremely hard (if not impossible) for any user to
> > > >> figure out why that happens. Bug reports would just read like "Polaris
> > > >> runs into an OOM - no further information available".
> > > >>
> > > >> IMHO the system should be stable by default, not the other way around.
> > > >> Even if there is a way to enable such huge cardinalities, that flag
> > > >> should really be hidden and not documented anywhere - unless there's a
> > > >> clear disclaimer saying: "Enabling this can lead to an unresponsive
> > > >> system and risk the stability of the system. ..."
> > > >>
> > > >> I propose to investigate all metrics and reduce the cardinality of all
> > > >> of those as that can easily become a serious issue.
> > > >>
> > > >> On 21.05.25 11:42, Alex Dutra wrote:
> > > >>> Hi Mike,
> > > >>>
> > > >>> If you have around a hundred realms or more, imho you are already in
> > > >>> trouble.
> > > >>>
> > > >>> The most problematic metric is the http_server_requests_seconds_*
> > metrics
> > > >>> family, e.g.:
> > > >>>
> > > >>>
> > > >>
> > http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> > > >>> 1.0
> > > >>>
> > > >>> Since metrics in this family already have a high cardinality
> > potential
> > > >>> given the number of tags it supports by default, adding one more
> > > >> dimension
> > > >>> to them makes things (exponentially) worse.
> > > >>>
> > > >>> It's very easy to demonstrate that by running a small test [1]. On my
> > > >>> machine, the first 2 iterations (10 and 100 realms) complete, but
> > the 3rd
> > > >>> iteration (1000 realms) runs for about 1 minute then ends up in
> > > >>> java.lang.OutOfMemoryError: Java heap space.
> > > >>>
> > > >>> That's why I advocated for removing the tag. If however you really
> > want
> > > >> to
> > > >>> keep it, I'd suggest introducing a configuration flag to disable it
> > in
> > > >> two
> > > >>> problematic metric families: the HTTP one shown above, and the
> > > >> per-endpoint
> > > >>> metrics as well.
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Alex
> > > >>>
> > > >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <
> > collado.m...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hmm, we do use the realm tag in our metric publishing. I understand
> > the
> > > >>>> concern re: cardinality. Maybe we can support filtering metrics that
> > > >> have
> > > >>>> realm and support another metric without realm?
> > > >>>>
> > > >>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <
> > di...@apache.org>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Removing realm_id from metrics tags makes sense to me (to avoid
> > high
> > > >>>>> cardinality).
> > > >>>>>
> > > >>>>> If we need to have insight into load differences from realm to
> > realm,
> > > >> it
> > > >>>>> might be preferable to introduce metrics dedicated to that rather
> > than
> > > >>>>> increasing the cardinality of every endpoint metric.
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> Dmitri.
> > > >>>>>
> > > >>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> > > >> <alex.du...@dremio.com.invalid
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> I would like to suggest removing the "realm_id" metric tag
> > entirely.
> > > >>>>>>
> > > >>>>>> My concern is that this tag has the potential for high
> > cardinality,
> > > >>>> which
> > > >>>>>> is generally considered a bad practice when dealing with metrics.
> > High
> > > >>>>>> cardinality can lead to performance issues and increased memory
> > usage.
> > > >>>>>>
> > > >>>>>> Granted, the default realm resolver in Polaris is tailored for
> > just a
> > > >>>>>> handful of realms, but nothing prevents users from declaring
> > hundreds
> > > >>>> of
> > > >>>>>> realms.
> > > >>>>>>
> > > >>>>>> I believe we can still effectively monitor Polaris servers without
> > > >> this
> > > >>>>>> specific tag, since the realm ID is also propagated in traces
> > emitted
> > > >>>> by
> > > >>>>>> Polaris. Tracing is a much better fit for high-cardinality
> > domains.
> > > >>>>>>
> > > >>>>>> I'm open to discussing this further; a potential alternative
> > would be
> > > >>>> to
> > > >>>>>> introduce a flag to disable this specific metric tag, but I feel
> > like
> > > >>>>>> removing it would be a much cleaner approach.
> > > >>>>>>
> > > >>>>>> Let me know your thoughts.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>>
> > > >>>>>> Alex
> > > >>>>>>
> > > >> --
> > > >> Robert Stupp
> > > >> @snazy
> > > >>
> > > >>
> > > --
> > > Robert Stupp
> > > @snazy
> > >
> >

Reply via email to