Hi,

For reference the PR is here: https://github.com/apache/polaris/pull/1662

Alex

On Fri, May 23, 2025 at 2:42 PM Robert Stupp <sn...@snazy.de> wrote:
>
> +1 on adding a config and default to false - and later entirely remove
> the "issue".
>
> Sounds like good plan to me!
>
> On 23.05.25 13:52, Alex Dutra wrote:
> > Hi Mike,
> >
> > That sounds like a plan! And yes, I agree with the slow approach. I
> > will open a PR shortly with the new enablement flag.
> >
> > Alex
> >
> > On Fri, May 23, 2025 at 2:42 AM Michael Collado <collado.m...@gmail.com> 
> > wrote:
> >> My immediate reaction is that a configuration flag would be good enough for
> >> us. Our setup prevents requests without a valid realm, so we're not subject
> >> to the DDOS attack you mentioned, but we should address that in the default
> >> realm context resolver as well. I can do some further investigation to see
> >> if it's something we could get rid of entirely (we also use log-based
> >> metrics), but I like the slow approach of adding a configuration flag,
> >> disabling it by default, then removing functionality as a best practice,
> >> anyway.
> >>
> >> Mike
> >>
> >> On Thu, May 22, 2025 at 11:39 AM Alex Dutra <alex.du...@dremio.com.invalid>
> >> wrote:
> >>
> >>> Hi again,
> >>>
> >>> Let's go back to the main concern: excessive memory required for metrics.
> >>>
> >>> To move this discussion forward, would you all agree to introduce a
> >>> flag that activates the inclusion of realm IDs in the metrics?
> >>>
> >>> It seems like this could be a good middle ground that addresses
> >>> everyone's needs.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> Alex
> >>>
> >>>
> >>> On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote:
> >>>> Ouch! That sounds very simple to exploit and quite serious.
> >>>>
> >>>> On 21.05.25 15:38, Alex Dutra wrote:
> >>>>> Also worth noting: http_server_requests_seconds_* metrics are generated
> >>>>> even for failed requests. So it would be very easy for an attacker to
> >>> forge
> >>>>> HTTP requests containing invalid realms and take the server down.
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>> On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote:
> >>>>>
> >>>>>> OOMs are bad - it's extremely hard (if not impossible) for any user to
> >>>>>> figure out why that happens. Bug reports would just read like "Polaris
> >>>>>> runs into an OOM - no further information available".
> >>>>>>
> >>>>>> IMHO the system should be stable by default, not the other way around.
> >>>>>> Even if there is a way to enable such huge cardinalities, that flag
> >>>>>> should really be hidden and not documented anywhere - unless there's a
> >>>>>> clear disclaimer saying: "Enabling this can lead to an unresponsive
> >>>>>> system and risk the stability of the system. ..."
> >>>>>>
> >>>>>> I propose to investigate all metrics and reduce the cardinality of all
> >>>>>> of those as that can easily become a serious issue.
> >>>>>>
> >>>>>> On 21.05.25 11:42, Alex Dutra wrote:
> >>>>>>> Hi Mike,
> >>>>>>>
> >>>>>>> If you have around a hundred realms or more, imho you are already in
> >>>>>>> trouble.
> >>>>>>>
> >>>>>>> The most problematic metric is the http_server_requests_seconds_*
> >>> metrics
> >>>>>>> family, e.g.:
> >>>>>>>
> >>>>>>>
> >>> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> >>>>>>> 1.0
> >>>>>>>
> >>>>>>> Since metrics in this family already have a high cardinality
> >>> potential
> >>>>>>> given the number of tags it supports by default, adding one more
> >>>>>> dimension
> >>>>>>> to them makes things (exponentially) worse.
> >>>>>>>
> >>>>>>> It's very easy to demonstrate that by running a small test [1]. On my
> >>>>>>> machine, the first 2 iterations (10 and 100 realms) complete, but
> >>> the 3rd
> >>>>>>> iteration (1000 realms) runs for about 1 minute then ends up in
> >>>>>>> java.lang.OutOfMemoryError: Java heap space.
> >>>>>>>
> >>>>>>> That's why I advocated for removing the tag. If however you really
> >>> want
> >>>>>> to
> >>>>>>> keep it, I'd suggest introducing a configuration flag to disable it
> >>> in
> >>>>>> two
> >>>>>>> problematic metric families: the HTTP one shown above, and the
> >>>>>> per-endpoint
> >>>>>>> metrics as well.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <
> >>> collado.m...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hmm, we do use the realm tag in our metric publishing. I understand
> >>> the
> >>>>>>>> concern re: cardinality. Maybe we can support filtering metrics that
> >>>>>> have
> >>>>>>>> realm and support another metric without realm?
> >>>>>>>>
> >>>>>>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <
> >>> di...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Removing realm_id from metrics tags makes sense to me (to avoid
> >>> high
> >>>>>>>>> cardinality).
> >>>>>>>>>
> >>>>>>>>> If we need to have insight into load differences from realm to
> >>> realm,
> >>>>>> it
> >>>>>>>>> might be preferable to introduce metrics dedicated to that rather
> >>> than
> >>>>>>>>> increasing the cardinality of every endpoint metric.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Dmitri.
> >>>>>>>>>
> >>>>>>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> >>>>>> <alex.du...@dremio.com.invalid
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> I would like to suggest removing the "realm_id" metric tag
> >>> entirely.
> >>>>>>>>>> My concern is that this tag has the potential for high
> >>> cardinality,
> >>>>>>>> which
> >>>>>>>>>> is generally considered a bad practice when dealing with metrics.
> >>> High
> >>>>>>>>>> cardinality can lead to performance issues and increased memory
> >>> usage.
> >>>>>>>>>> Granted, the default realm resolver in Polaris is tailored for
> >>> just a
> >>>>>>>>>> handful of realms, but nothing prevents users from declaring
> >>> hundreds
> >>>>>>>> of
> >>>>>>>>>> realms.
> >>>>>>>>>>
> >>>>>>>>>> I believe we can still effectively monitor Polaris servers without
> >>>>>> this
> >>>>>>>>>> specific tag, since the realm ID is also propagated in traces
> >>> emitted
> >>>>>>>> by
> >>>>>>>>>> Polaris. Tracing is a much better fit for high-cardinality
> >>> domains.
> >>>>>>>>>> I'm open to discussing this further; a potential alternative
> >>> would be
> >>>>>>>> to
> >>>>>>>>>> introduce a flag to disable this specific metric tag, but I feel
> >>> like
> >>>>>>>>>> removing it would be a much cleaner approach.
> >>>>>>>>>>
> >>>>>>>>>> Let me know your thoughts.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>> --
> >>>>>> Robert Stupp
> >>>>>> @snazy
> >>>>>>
> >>>>>>
> >>>> --
> >>>> Robert Stupp
> >>>> @snazy
> >>>>
> --
> Robert Stupp
> @snazy
>

Reply via email to