After discussing this issue in the Pulsar community meeting today, I
realized that I might not have described the issue well enough. The
fundamental problem is that brokers create 4 metrics for every cursor and
there is no way to disable them. The change to make these metrics optional
has already been merged into branch-2.7.

I decided to build a custom broker image that included the fix, so I am no
longer blocked by this issue. However, I do think other users will be
impacted by it if they have many topic subscriptions.

Thanks,
Michael



On Thu, Apr 1, 2021 at 12:09 AM Michael Marshall <mikemars...@gmail.com>
wrote:

> Thank you for your offer, Enrico.
>
> > If you are now blocked, then other users will be blocked as well.
>
> I looked into this more today, and I believe that I am blocked on using
> 2.7.1. I configured prometheus's server side filtering for the high
> cardinality metrics, but my prometheus instance is still getting OOMKilled
> due to the collective size of the metrics payload returned by my brokers.
> In my use case, I encountered problems with around 40k topics each with a
> single subscription. For reference, I ran the same load against
> 2.7.0 brokers and had no issues with my prometheus instance.
>
> Sijie,
>
> Thanks for your reply.
>
> > The bugfix releases are usually made monthly based on demand. We can
> > probably wait 1~2 weeks to see if there are any other fixes to include
> > before cutting a 2.7.2 release. Does that make sense?
>
> Are there known bug fixes that you are looking to get merged in the next 1
> or 2 weeks?
>
> I agree with the general timeline of doing bug fix releases monthly based
> on demand. I also think there should be room for extraordinary
> circumstances where we should release early to fix an issue that impacts
> many users. Given Pulsar's advertised ability to handle up to a million
> topics, I think this is such a situation. Let me know what you think.
>
> Thanks,
> Michael
>
> On Wed, Mar 31, 2021 at 6:31 PM Sijie Guo <guosi...@gmail.com> wrote:
>
>> Michael,
>>
>> The bugfix releases are usually made monthly based on demand. We can
>> probably wait 1~2 weeks to see if there are any other fixes to include
>> before cutting a 2.7.2 release. Does that make sense?
>>
>> Thanks,
>> Sijie
>>
>> On Tue, Mar 30, 2021 at 9:55 PM Michael Marshall <mikemars...@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> > I propose and request that we release version 2.7.2 to fix a regression
>> > introduced in 2.7.1.
>> >
>> > Pulsar 2.7.1 introduced cursor level metrics without including the
>> > ability to disable them (https://github.com/apache/pulsar/pull/9618). I
>> > recently discovered the metrics when I created a Pulsar 2.7.1 cluster,
>> > created thousands of topics and subscriptions, and then started to have
>> > problems with my prometheus instance because of an influx of metrics.
>> > The fix to make these metrics optional and disabled by default has
>> > already been merged to the "branch-2.7" branch
>> > (https://github.com/apache/pulsar/pull/9814).
>> >
>> > Given the cardinality of the metrics produced for every cursor and the
>> > fact that Pulsar is supposed to handle many topics and subscriptions
>> > with ease, I consider the creation of too many metrics a regression,
>> > and I think it is important to release a new, latest version.
>> >
>> > Further, 2.7.1 included several important bug fixes (e.g. one to fix
>> > tiered storage to AWS S3), so I would prefer to move forward instead of
>> > back to 2.7.0.
>> >
>> > What do others think about cutting a 2.7.2 release now? Do others agree
>> > that creating metrics for every cursor should be considered a
>> > regression? If not, does the community have a helpful guide to
>> > determine what should be considered a regression?
>> >
>> > Before writing this email, I consulted PIP 47, Pulsar's time based
>> > release plan
>> > (https://github.com/apache/pulsar/wiki/PIP-47%3A-Time-Based-Release-Plan).
>> > The PIP mentions that there will be bug fix releases for the last 4
>> > releases, but it doesn't mention a cadence.
>> >
>> > Tangentially, I am wondering why the 2.7.1 release wasn't held up to
>> > include this configuration fix. PR 9814 was submitted before the 2.7.1
>> > tag was created and was merged just 2 days after the tag's creation.
>> > What are the criteria for holding up a release?
>> >
>> > Thanks for considering my request, and thanks for any feedback you can
>> > provide.
>> >
>> > Best,
>> > Michael Marshall
>> >
>>
>
