Hi, For reference the PR is here: https://github.com/apache/polaris/pull/1662
Alex On Fri, May 23, 2025 at 2:42 PM Robert Stupp <sn...@snazy.de> wrote: > > +1 on adding a config and default to false - and later entirely remove > the "issue". > > Sounds like good plan to me! > > On 23.05.25 13:52, Alex Dutra wrote: > > Hi Mike, > > > > That sounds like a plan! And yes, I agree with the slow approach. I > > will open a PR shortly with the new enablement flag. > > > > Alex > > > > On Fri, May 23, 2025 at 2:42 AM Michael Collado <collado.m...@gmail.com> > > wrote: > >> My immediate reaction is that a configuration flag would be good enough for > >> us. Our setup prevents requests without a valid realm, so we're not subject > >> to the DDOS attack you mentioned, but we should address that in the default > >> realm context resolver as well. I can do some further investigation to see > >> if it's something we could get rid of entirely (we also use log-based > >> metrics), but I like the slow approach of adding a configuration flag, > >> disabling it by default, then removing functionality as a best practice, > >> anyway. > >> > >> Mike > >> > >> On Thu, May 22, 2025 at 11:39 AM Alex Dutra <alex.du...@dremio.com.invalid> > >> wrote: > >> > >>> Hi again, > >>> > >>> Let's go back to the main concern: excessive memory required for metrics. > >>> > >>> To move this discussion forward, would you all agree to introduce a > >>> flag that activates the inclusion of realm IDs in the metrics? > >>> > >>> It seems like this could be a good middle ground that addresses > >>> everyone's needs. > >>> > >>> Let me know your thoughts. > >>> > >>> Alex > >>> > >>> > >>> On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote: > >>>> Ouch! That sounds very simple to exploit and quite serious. > >>>> > >>>> On 21.05.25 15:38, Alex Dutra wrote: > >>>>> Also worth noting: http_server_requests_seconds_* metrics are generated > >>>>> even for failed requests. So it would be very easy for an attacker to > >>> forge > >>>>> HTTP requests containing invalid realms and take the server down. > >>>>> > >>>>> Alex > >>>>> > >>>>> On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote: > >>>>> > >>>>>> OOMs are bad - it's extremely hard (if not impossible) for any user to > >>>>>> figure out why that happens. Bug reports would just read like "Polaris > >>>>>> runs into an OOM - no further information available". > >>>>>> > >>>>>> IMHO the system should be stable by default, not the other way around. > >>>>>> Even if there is a way to enable such huge cardinalities, that flag > >>>>>> should really be hidden and not documented anywhere - unless there's a > >>>>>> clear disclaimer saying: "Enabling this can lead to an unresponsive > >>>>>> system and risk the stability of the system. ..." > >>>>>> > >>>>>> I propose to investigate all metrics and reduce the cardinality of all > >>>>>> of those as that can easily become a serious issue. > >>>>>> > >>>>>> On 21.05.25 11:42, Alex Dutra wrote: > >>>>>>> Hi Mike, > >>>>>>> > >>>>>>> If you have around a hundred realms or more, imho you are already in > >>>>>>> trouble. > >>>>>>> > >>>>>>> The most problematic metric is the http_server_requests_seconds_* > >>> metrics > >>>>>>> family, e.g.: > >>>>>>> > >>>>>>> > >>> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > >>>>>>> 1.0 > >>>>>>> > >>>>>>> Since metrics in this family already have a high cardinality > >>> potential > >>>>>>> given the number of tags it supports by default, adding one more > >>>>>> dimension > >>>>>>> to them makes things (exponentially) worse. > >>>>>>> > >>>>>>> It's very easy to demonstrate that by running a small test [1]. On my > >>>>>>> machine, the first 2 iterations (10 and 100 realms) complete, but > >>> the 3rd > >>>>>>> iteration (1000 realms) runs for about 1 minute then ends up in > >>>>>>> java.lang.OutOfMemoryError: Java heap space. > >>>>>>> > >>>>>>> That's why I advocated for removing the tag. If however you really > >>> want > >>>>>> to > >>>>>>> keep it, I'd suggest introducing a configuration flag to disable it > >>> in > >>>>>> two > >>>>>>> problematic metric families: the HTTP one shown above, and the > >>>>>> per-endpoint > >>>>>>> metrics as well. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> > >>>>>>> Alex > >>>>>>> > >>>>>>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Wed, May 21, 2025 at 7:35 AM Michael Collado < > >>> collado.m...@gmail.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hmm, we do use the realm tag in our metric publishing. I understand > >>> the > >>>>>>>> concern re: cardinality. Maybe we can support filtering metrics that > >>>>>> have > >>>>>>>> realm and support another metric without realm? > >>>>>>>> > >>>>>>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov < > >>> di...@apache.org> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Removing realm_id from metrics tags makes sense to me (to avoid > >>> high > >>>>>>>>> cardinality). > >>>>>>>>> > >>>>>>>>> If we need to have insight into load differences from realm to > >>> realm, > >>>>>> it > >>>>>>>>> might be preferable to introduce metrics dedicated to that rather > >>> than > >>>>>>>>> increasing the cardinality of every endpoint metric. > >>>>>>>>> > >>>>>>>>> Cheers, > >>>>>>>>> Dmitri. > >>>>>>>>> > >>>>>>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra > >>>>>> <alex.du...@dremio.com.invalid > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> I would like to suggest removing the "realm_id" metric tag > >>> entirely. > >>>>>>>>>> My concern is that this tag has the potential for high > >>> cardinality, > >>>>>>>> which > >>>>>>>>>> is generally considered a bad practice when dealing with metrics. > >>> High > >>>>>>>>>> cardinality can lead to performance issues and increased memory > >>> usage. > >>>>>>>>>> Granted, the default realm resolver in Polaris is tailored for > >>> just a > >>>>>>>>>> handful of realms, but nothing prevents users from declaring > >>> hundreds > >>>>>>>> of > >>>>>>>>>> realms. > >>>>>>>>>> > >>>>>>>>>> I believe we can still effectively monitor Polaris servers without > >>>>>> this > >>>>>>>>>> specific tag, since the realm ID is also propagated in traces > >>> emitted > >>>>>>>> by > >>>>>>>>>> Polaris. Tracing is a much better fit for high-cardinality > >>> domains. > >>>>>>>>>> I'm open to discussing this further; a potential alternative > >>> would be > >>>>>>>> to > >>>>>>>>>> introduce a flag to disable this specific metric tag, but I feel > >>> like > >>>>>>>>>> removing it would be a much cleaner approach. > >>>>>>>>>> > >>>>>>>>>> Let me know your thoughts. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Alex > >>>>>>>>>> > >>>>>> -- > >>>>>> Robert Stupp > >>>>>> @snazy > >>>>>> > >>>>>> > >>>> -- > >>>> Robert Stupp > >>>> @snazy > >>>> > -- > Robert Stupp > @snazy >