Hi Alex
Thanks! (I will take a look at the PR.) It looks good to me, then.
Regards
JB
On Mon, May 26, 2025 at 5:33 PM Alex Dutra wrote:
Hi JB, hi all,
My PR already has a safety net similar to what you propose: if the
cardinality of realm IDs in HTTP metrics goes above a configurable
threshold (100 by default), a warning is printed and no more HTTP
metrics will be recorded.
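For anyone who wants to see the general shape of such a cap, here is a
minimal sketch using Micrometer's built-in MeterFilter (illustrative only,
not the code in the PR; the meter name, tag key and threshold are
assumptions):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

public final class RealmIdCardinalityGuard {

    // Assumed default; in the PR the threshold is configurable.
    private static final int MAX_REALM_IDS = 100;

    public static void install(MeterRegistry registry) {
        // Once more than MAX_REALM_IDS distinct realm_id values have been
        // seen on http.server.requests meters (http_server_requests_seconds_*
        // in Prometheus), further matching meters are denied.
        registry.config().meterFilter(
            MeterFilter.maximumAllowableTags(
                "http.server.requests", "realm_id",
                MAX_REALM_IDS, MeterFilter.deny()));
    }
}

(The actual PR also logs a warning when the threshold is crossed; the filter
above only stops recording.)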
Thanks,
Alex
On Mon, May 26, 2025 at 5:22 PM Dmitri Bourlatchkov wrote:
If we really need a safety valve, add a boolean flag like
"metrics.realm-id.enabled", default true. Operators who don’t want the
dimension can flip it off; everyone else keeps the out-of-the-box
visibility that keeps multi-tenant ops sane.
This is exactly what PR [1662] proposes, right?
The "bre
Hi,
What about a limit on the realm_id cardinality metric as a compromise?
Like we keep the realm_id metric, but by configuration, the user
defines a limit (to avoid flooding/DoS).
This is just a "temp" workaround: we should record the metric only if
the realm is valid, and it should be coupled
Hi Alex, thanks for being flexible. Given how valuable the tag is for
day-to-day triage, I’d prefer an opt-out model:
1. Default is enabled, so that operators get realm-level visibility “for
free.”
2. Anyone worried about cardinality can turn it off with a single flag
(metrics.realm-id.enabled=false).
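To make the opt-out concrete, here is a minimal sketch of how such a flag
could gate the tag with a Micrometer MeterFilter (the class and wiring are
assumptions, not an actual implementation; the property name is the one
mentioned just above):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

public final class RealmIdTagToggle {

    // realmIdEnabled would be bound to a property such as
    // metrics.realm-id.enabled (name proposed in this thread), true by default.
    public static void install(MeterRegistry registry, boolean realmIdEnabled) {
        if (!realmIdEnabled) {
            // Strip the tag so all per-realm series collapse into one
            // realm-less series per endpoint.
            registry.config().meterFilter(MeterFilter.ignoreTags("realm_id"));
        }
    }
}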
Hi Yufei,
Thanks for sharing your concerns.
In fact, we were already leaning towards an enablement flag; however,
from Mike's message I inferred that the metric tag would be an opt-in
feature, whereas you seem to suggest an opt-out feature instead.
I'm fine either way, the PR I opened uses opt-in, but
Hey folks,
I’m not sold on dropping realm_id.
Why it matters
1. It's pretty useful for first-line triage in multi-tenant clusters.
When a latency spike or error burst hits the dashboards, the first SRE
question is “Which realm(s)?” Having the tag lets us slice the time-series
instant
Hi,
For reference the PR is here: https://github.com/apache/polaris/pull/1662
Alex
On Fri, May 23, 2025 at 2:42 PM Robert Stupp wrote:
+1 on adding a config and default to false - and later entirely remove
the "issue".
Sounds like a good plan to me!
On 23.05.25 13:52, Alex Dutra wrote:
Hi Mike,
That sounds like a plan! And yes, I agree with the slow approach. I
will open a PR shortly with the new enablement flag.
Alex
On Fri, May 23, 2025 at 2:42 AM Michael Collado wrote:
My immediate reaction is that a configuration flag would be good enough for
us. Our setup prevents requests without a valid realm, so we're not subject
to the DDoS attack you mentioned, but we should address that in the default
realm context resolver as well. I can do some further investigation to
Hi again,
Let's go back to the main concern: excessive memory required for metrics.
To move this discussion forward, would you all agree to introduce a
flag that activates the inclusion of realm IDs in the metrics?
It seems like this could be a good middle ground that addresses
everyone's needs.
Ouch! That sounds very simple to exploit and quite serious.
On 21.05.25 15:38, Alex Dutra wrote:
Also worth noting: http_server_requests_seconds_* metrics are generated
even for failed requests. So it would be very easy for an attacker to forge
HTTP requests containing invalid realms and take the server down.
Alex
On Wed, May 21, 2025 at 3:04 PM Robert Stupp wrote:
OOMs are bad - it's extremely hard (if not impossible) for any user to
figure out why that happens. Bug reports would just read like "Polaris
runs into an OOM - no further information available".
IMHO the system should be stable by default, not the other way around.
Even if there is a way to e
Hi Mike,
If you have around a hundred realms or more, IMHO you are already in
trouble.
The most problematic metric is the http_server_requests_seconds_* metrics
family, e.g.:
http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="4
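To put rough, purely illustrative numbers on it (not measured from Polaris):
each tag multiplies the number of time series, so for this family you get
roughly

    series ≈ realms × methods × URIs × status codes × 3 (count/sum/max)

e.g. 100 realms × 5 methods × 20 URIs × 6 statuses × 3 ≈ 180,000 series for
this one metric family alone.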
Hmm, we do use the realm tag in our metric publishing. I understand the
concern re: cardinality. Maybe we can support filtering out the metrics that
have the realm tag and also support an equivalent metric without it?
On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov wrote:
Removing realm_id from metrics tags makes sense to me (to avoid high
cardinality).
If we need to have insight into load differences from realm to realm, it
might be preferable to introduce metrics dedicated to that rather than
increasing the cardinality of every endpoint metric.
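To illustrate that alternative, a dedicated per-realm counter could look
roughly like this (the metric name and shape are assumptions, not an existing
Polaris metric):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public final class PerRealmRequestCounter {

    private final MeterRegistry registry;

    public PerRealmRequestCounter(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordRequest(String realmId) {
        // One counter family keyed only by realm, so the extra cardinality is
        // confined to this metric instead of multiplying every endpoint metric.
        Counter.builder("polaris.realm.requests") // assumed metric name
            .tag("realm_id", realmId)
            .register(registry)
            .increment();
    }
}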
Cheers,
Dmitri.
Hi,
I think it's okay to remove the realm-ID from the metric tags and leave
it in traces. So +1 from me on doing this.
High cardinality values are not good for metrics (or metrics systems)
and can easily cause a lot of "interesting situations" in production
systems - things that are hard to
Hi all,
I would like to suggest removing the "realm_id" metric tag entirely.
My concern is that this tag has the potential for high cardinality, which
is generally considered a bad practice when dealing with metrics. High
cardinality can lead to performance issues and increased memory usage.
Gra