+1 on adding a config and default to false - and later entirely remove the "issue".

Sounds like good plan to me!

On 23.05.25 13:52, Alex Dutra wrote:
Hi Mike,

That sounds like a plan! And yes, I agree with the slow approach. I
will open a PR shortly with the new enablement flag.

Alex

On Fri, May 23, 2025 at 2:42 AM Michael Collado <collado.m...@gmail.com> wrote:
My immediate reaction is that a configuration flag would be good enough for
us. Our setup prevents requests without a valid realm, so we're not subject
to the DDOS attack you mentioned, but we should address that in the default
realm context resolver as well. I can do some further investigation to see
if it's something we could get rid of entirely (we also use log-based
metrics), but I like the slow approach of adding a configuration flag,
disabling it by default, then removing functionality as a best practice,
anyway.

Mike

On Thu, May 22, 2025 at 11:39 AM Alex Dutra <alex.du...@dremio.com.invalid>
wrote:

Hi again,

Let's go back to the main concern: excessive memory required for metrics.

To move this discussion forward, would you all agree to introduce a
flag that activates the inclusion of realm IDs in the metrics?

It seems like this could be a good middle ground that addresses
everyone's needs.

Let me know your thoughts.

Alex


On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote:
Ouch! That sounds very simple to exploit and quite serious.

On 21.05.25 15:38, Alex Dutra wrote:
Also worth noting: http_server_requests_seconds_* metrics are generated
even for failed requests. So it would be very easy for an attacker to
forge
HTTP requests containing invalid realms and take the server down.

Alex

On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote:

OOMs are bad - it's extremely hard (if not impossible) for any user to
figure out why that happens. Bug reports would just read like "Polaris
runs into an OOM - no further information available".

IMHO the system should be stable by default, not the other way around.
Even if there is a way to enable such huge cardinalities, that flag
should really be hidden and not documented anywhere - unless there's a
clear disclaimer saying: "Enabling this can lead to an unresponsive
system and risk the stability of the system. ..."

I propose to investigate all metrics and reduce the cardinality of all
of those as that can easily become a serious issue.

On 21.05.25 11:42, Alex Dutra wrote:
Hi Mike,

If you have around a hundred realms or more, imho you are already in
trouble.

The most problematic metric is the http_server_requests_seconds_*
metrics
family, e.g.:


http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
1.0

Since metrics in this family already have a high cardinality
potential
given the number of tags it supports by default, adding one more
dimension
to them makes things (exponentially) worse.

It's very easy to demonstrate that by running a small test [1]. On my
machine, the first 2 iterations (10 and 100 realms) complete, but
the 3rd
iteration (1000 realms) runs for about 1 minute then ends up in
java.lang.OutOfMemoryError: Java heap space.

That's why I advocated for removing the tag. If however you really
want
to
keep it, I'd suggest introducing a configuration flag to disable it
in
two
problematic metric families: the HTTP one shown above, and the
per-endpoint
metrics as well.

Thanks,

Alex

[1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d



On Wed, May 21, 2025 at 7:35 AM Michael Collado <
collado.m...@gmail.com>
wrote:

Hmm, we do use the realm tag in our metric publishing. I understand
the
concern re: cardinality. Maybe we can support filtering metrics that
have
realm and support another metric without realm?

On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <
di...@apache.org>
wrote:

Removing realm_id from metrics tags makes sense to me (to avoid
high
cardinality).

If we need to have insight into load differences from realm to
realm,
it
might be preferable to introduce metrics dedicated to that rather
than
increasing the cardinality of every endpoint metric.

Cheers,
Dmitri.

On Thu, May 15, 2025 at 3:30 PM Alex Dutra
<alex.du...@dremio.com.invalid
wrote:

Hi all,

I would like to suggest removing the "realm_id" metric tag
entirely.
My concern is that this tag has the potential for high
cardinality,
which
is generally considered a bad practice when dealing with metrics.
High
cardinality can lead to performance issues and increased memory
usage.
Granted, the default realm resolver in Polaris is tailored for
just a
handful of realms, but nothing prevents users from declaring
hundreds
of
realms.

I believe we can still effectively monitor Polaris servers without
this
specific tag, since the realm ID is also propagated in traces
emitted
by
Polaris. Tracing is a much better fit for high-cardinality
domains.
I'm open to discussing this further; a potential alternative
would be
to
introduce a flag to disable this specific metric tag, but I feel
like
removing it would be a much cleaner approach.

Let me know your thoughts.

Thanks,

Alex

--
Robert Stupp
@snazy


--
Robert Stupp
@snazy

--
Robert Stupp
@snazy

Reply via email to