Hi Mike, That sounds like a plan! And yes, I agree with the slow approach. I will open a PR shortly with the new enablement flag.
Alex On Fri, May 23, 2025 at 2:42 AM Michael Collado <collado.m...@gmail.com> wrote: > > My immediate reaction is that a configuration flag would be good enough for > us. Our setup prevents requests without a valid realm, so we're not subject > to the DDOS attack you mentioned, but we should address that in the default > realm context resolver as well. I can do some further investigation to see > if it's something we could get rid of entirely (we also use log-based > metrics), but I like the slow approach of adding a configuration flag, > disabling it by default, then removing functionality as a best practice, > anyway. > > Mike > > On Thu, May 22, 2025 at 11:39 AM Alex Dutra <alex.du...@dremio.com.invalid> > wrote: > > > Hi again, > > > > Let's go back to the main concern: excessive memory required for metrics. > > > > To move this discussion forward, would you all agree to introduce a > > flag that activates the inclusion of realm IDs in the metrics? > > > > It seems like this could be a good middle ground that addresses > > everyone's needs. > > > > Let me know your thoughts. > > > > Alex > > > > > > On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote: > > > > > > Ouch! That sounds very simple to exploit and quite serious. > > > > > > On 21.05.25 15:38, Alex Dutra wrote: > > > > Also worth noting: http_server_requests_seconds_* metrics are generated > > > > even for failed requests. So it would be very easy for an attacker to > > forge > > > > HTTP requests containing invalid realms and take the server down. > > > > > > > > Alex > > > > > > > > On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote: > > > > > > > >> OOMs are bad - it's extremely hard (if not impossible) for any user to > > > >> figure out why that happens. Bug reports would just read like "Polaris > > > >> runs into an OOM - no further information available". > > > >> > > > >> IMHO the system should be stable by default, not the other way around. > > > >> Even if there is a way to enable such huge cardinalities, that flag > > > >> should really be hidden and not documented anywhere - unless there's a > > > >> clear disclaimer saying: "Enabling this can lead to an unresponsive > > > >> system and risk the stability of the system. ..." > > > >> > > > >> I propose to investigate all metrics and reduce the cardinality of all > > > >> of those as that can easily become a serious issue. > > > >> > > > >> On 21.05.25 11:42, Alex Dutra wrote: > > > >>> Hi Mike, > > > >>> > > > >>> If you have around a hundred realms or more, imho you are already in > > > >>> trouble. > > > >>> > > > >>> The most problematic metric is the http_server_requests_seconds_* > > metrics > > > >>> family, e.g.: > > > >>> > > > >>> > > > >> > > http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > > > >>> 1.0 > > > >>> > > > >>> Since metrics in this family already have a high cardinality > > potential > > > >>> given the number of tags it supports by default, adding one more > > > >> dimension > > > >>> to them makes things (exponentially) worse. > > > >>> > > > >>> It's very easy to demonstrate that by running a small test [1]. On my > > > >>> machine, the first 2 iterations (10 and 100 realms) complete, but > > the 3rd > > > >>> iteration (1000 realms) runs for about 1 minute then ends up in > > > >>> java.lang.OutOfMemoryError: Java heap space. > > > >>> > > > >>> That's why I advocated for removing the tag. If however you really > > want > > > >> to > > > >>> keep it, I'd suggest introducing a configuration flag to disable it > > in > > > >> two > > > >>> problematic metric families: the HTTP one shown above, and the > > > >> per-endpoint > > > >>> metrics as well. > > > >>> > > > >>> Thanks, > > > >>> > > > >>> Alex > > > >>> > > > >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > > > >>> > > > >>> > > > >>> > > > >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado < > > collado.m...@gmail.com> > > > >>> wrote: > > > >>> > > > >>>> Hmm, we do use the realm tag in our metric publishing. I understand > > the > > > >>>> concern re: cardinality. Maybe we can support filtering metrics that > > > >> have > > > >>>> realm and support another metric without realm? > > > >>>> > > > >>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov < > > di...@apache.org> > > > >>>> wrote: > > > >>>> > > > >>>>> Removing realm_id from metrics tags makes sense to me (to avoid > > high > > > >>>>> cardinality). > > > >>>>> > > > >>>>> If we need to have insight into load differences from realm to > > realm, > > > >> it > > > >>>>> might be preferable to introduce metrics dedicated to that rather > > than > > > >>>>> increasing the cardinality of every endpoint metric. > > > >>>>> > > > >>>>> Cheers, > > > >>>>> Dmitri. > > > >>>>> > > > >>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra > > > >> <alex.du...@dremio.com.invalid > > > >>>>> wrote: > > > >>>>> > > > >>>>>> Hi all, > > > >>>>>> > > > >>>>>> I would like to suggest removing the "realm_id" metric tag > > entirely. > > > >>>>>> > > > >>>>>> My concern is that this tag has the potential for high > > cardinality, > > > >>>> which > > > >>>>>> is generally considered a bad practice when dealing with metrics. > > High > > > >>>>>> cardinality can lead to performance issues and increased memory > > usage. > > > >>>>>> > > > >>>>>> Granted, the default realm resolver in Polaris is tailored for > > just a > > > >>>>>> handful of realms, but nothing prevents users from declaring > > hundreds > > > >>>> of > > > >>>>>> realms. > > > >>>>>> > > > >>>>>> I believe we can still effectively monitor Polaris servers without > > > >> this > > > >>>>>> specific tag, since the realm ID is also propagated in traces > > emitted > > > >>>> by > > > >>>>>> Polaris. Tracing is a much better fit for high-cardinality > > domains. > > > >>>>>> > > > >>>>>> I'm open to discussing this further; a potential alternative > > would be > > > >>>> to > > > >>>>>> introduce a flag to disable this specific metric tag, but I feel > > like > > > >>>>>> removing it would be a much cleaner approach. > > > >>>>>> > > > >>>>>> Let me know your thoughts. > > > >>>>>> > > > >>>>>> Thanks, > > > >>>>>> > > > >>>>>> Alex > > > >>>>>> > > > >> -- > > > >> Robert Stupp > > > >> @snazy > > > >> > > > >> > > > -- > > > Robert Stupp > > > @snazy > > > > >