OOMs are bad - it's extremely hard (if not impossible) for any user to
figure out why that happens. Bug reports would just read like "Polaris
runs into an OOM - no further information available".
IMHO the system should be stable by default, not the other way around.
Even if there is a way to enable such huge cardinalities, that flag
should really be hidden and not documented anywhere - unless there's a
clear disclaimer saying: "Enabling this can lead to an unresponsive
system and risk the stability of the system. ..."
I propose to investigate all metrics and reduce the cardinality of all
of those as that can easily become a serious issue.
On 21.05.25 11:42, Alex Dutra wrote:
Hi Mike,
If you have around a hundred realms or more, imho you are already in
trouble.
The most problematic metric is the http_server_requests_seconds_* metrics
family, e.g.:
http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
1.0
Since metrics in this family already have a high cardinality potential
given the number of tags it supports by default, adding one more dimension
to them makes things (exponentially) worse.
It's very easy to demonstrate that by running a small test [1]. On my
machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd
iteration (1000 realms) runs for about 1 minute then ends up in
java.lang.OutOfMemoryError: Java heap space.
That's why I advocated for removing the tag. If however you really want to
keep it, I'd suggest introducing a configuration flag to disable it in two
problematic metric families: the HTTP one shown above, and the per-endpoint
metrics as well.
Thanks,
Alex
[1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com>
wrote:
Hmm, we do use the realm tag in our metric publishing. I understand the
concern re: cardinality. Maybe we can support filtering metrics that have
realm and support another metric without realm?
On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org>
wrote:
Removing realm_id from metrics tags makes sense to me (to avoid high
cardinality).
If we need to have insight into load differences from realm to realm, it
might be preferable to introduce metrics dedicated to that rather than
increasing the cardinality of every endpoint metric.
Cheers,
Dmitri.
On Thu, May 15, 2025 at 3:30 PM Alex Dutra <alex.du...@dremio.com.invalid
wrote:
Hi all,
I would like to suggest removing the "realm_id" metric tag entirely.
My concern is that this tag has the potential for high cardinality,
which
is generally considered a bad practice when dealing with metrics. High
cardinality can lead to performance issues and increased memory usage.
Granted, the default realm resolver in Polaris is tailored for just a
handful of realms, but nothing prevents users from declaring hundreds
of
realms.
I believe we can still effectively monitor Polaris servers without this
specific tag, since the realm ID is also propagated in traces emitted
by
Polaris. Tracing is a much better fit for high-cardinality domains.
I'm open to discussing this further; a potential alternative would be
to
introduce a flag to disable this specific metric tag, but I feel like
removing it would be a much cleaner approach.
Let me know your thoughts.
Thanks,
Alex
--
Robert Stupp
@snazy