After discussing this issue in the Pulsar community meeting today, I realized that I might not have described it well enough. The fundamental problem is that brokers create 4 metrics for every cursor, and there is no way to disable them. The change to make these metrics optional has already been merged into branch-2.7.
I decided to build a custom broker image that included the fix, so I am no longer blocked by this issue. However, I do think other users will be impacted by it if they have many topic subscriptions.

Thanks,
Michael

On Thu, Apr 1, 2021 at 12:09 AM Michael Marshall <mikemars...@gmail.com> wrote:

> Thank you for your offer, Enrico.
>
> > If you are now blocked, then other users will be blocked as well.
>
> I looked into this more today, and I believe that I am blocked on using 2.7.1. I configured prometheus's server side filtering for the high cardinality metrics, but my prometheus instance is still getting OOMKilled due to the collective size of the metrics payload returned by my brokers. In my use case, I encountered problems with around 40k topics each with a single subscription. For reference, I ran the same load against 2.7.0 brokers and had no issues with my prometheus instance.
>
> Sijie,
>
> Thanks for your reply.
>
> > The bugfix releases are usually made monthly based on demand. We can
> > probably wait 1~2 weeks to see if there are any other fixes to include
> > before cutting a 2.7.2 release. Does that make sense?
>
> Are there known bug fixes that you are looking to get merged in the next 1 or 2 weeks?
>
> I agree with the general timeline of doing bug fix releases monthly based on demand. I also think there should be room for extraordinary circumstances where we should release early to fix an issue that impacts many users. Given Pulsar's advertised ability to handle up to a million topics, I think this is such a situation. Let me know what you think.
>
> Thanks,
> Michael
>
> On Wed, Mar 31, 2021 at 6:31 PM Sijie Guo <guosi...@gmail.com> wrote:
>
>> Michael,
>>
>> The bugfix releases are usually made monthly based on demand. We can probably wait 1~2 weeks to see if there are any other fixes to include before cutting a 2.7.2 release. Does that make sense?
>>
>> Thanks,
>> Sijie
>>
>> On Tue, Mar 30, 2021 at 9:55 PM Michael Marshall <mikemars...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > I propose and request that we release version 2.7.2 to fix a regression introduced in 2.7.1.
>> >
>> > Pulsar 2.7.1 introduced cursor level metrics without including the ability to disable them (https://github.com/apache/pulsar/pull/9618). I recently discovered the metrics when I created a Pulsar 2.7.1 cluster, created thousands of topics and subscriptions, and then started to have problems with my prometheus instance because of an influx of metrics. The fix to make these metrics optional and disabled by default has already been merged to the "branch-2.7" branch (https://github.com/apache/pulsar/pull/9814).
>> >
>> > Given the cardinality of the metrics produced for every cursor and the fact that Pulsar is supposed to handle many topics and subscriptions with ease, I consider the creation of too many metrics a regression, and I think it is important to release a new, latest version.
>> >
>> > Further, 2.7.1 included several important bug fixes (e.g. one to fix tiered storage to AWS S3), so I would prefer to move forward instead of back to 2.7.0.
>> >
>> > What do others think about cutting a 2.7.2 release now? Do others agree that creating metrics for every cursor should be considered a regression? If not, does the community have a helpful guide to determine what should be considered a regression?
>> >
>> > Before writing this email, I consulted PIP 47, Pulsar's time based release plan. (https://github.com/apache/pulsar/wiki/PIP-47%3A-Time-Based-Release-Plan). The PIP mentions that there will be bug fix releases for the last 4 releases, but it doesn't mention a cadence.
>> >
>> > Tangentially, I am wondering why the 2.7.1 release wasn't held up to include this configuration fix. PR 9814 was submitted before the 2.7.1 tag was created and was merged just 2 days after the tag's creation. What are the criteria for holding up a release?
>> >
>> > Thanks for considering my request, and thanks for any feedback you can provide.
>> >
>> > Best,
>> > Michael Marshall