Hi Nikolai, I completely understand your concern and pain. But let's understand ordering semantics for exclusive/fail-over subscription. This subscription makes sure to dispatch new messages which the broker is reading using readPosition must be in order but that doesn't mean the broker will never handle redelivery of unack message and broker will not stop dispatching by considering the end of the world. However, key-Shared sub thinks it's the end of the world. key-shared sub extends the shared sub behavior with additional speciality of consumer dispatch affinity based on message key. Broker tries its best to perform this responsibility but it's still not guaranteed as it will be really costly to give such guarantee and if one requires it then we can use a failover sub. In shared or even in other subscriptions, brokers don't restrict or stop dispatching of redelivery messages. However, in key-shared sub, broker stops forever dispatching new and redelivery messages in certain conditions and additionally broker goes into an infinite iterative sequence of reading duplicate cold messages and discarding them which causes IO and CPU impact both on broker and bookie. Pulsar must not have such broken and incorrect semantic behavior in the system. I have created an issue: https://github.com/apache/pulsar/issues/21656 which has a simple unit test case to reproduce this behavior and that's what is happening in multiple production Pulsar systems and users are not happy with it.
Therefore, we should not stop dispatching and restricting redeliver messages which will automatically address almost all stuck issues and the broker can still follow the ordering guarantee for new messages which it promises for other subs as well. Thanks, Rajan On Fri, Dec 1, 2023 at 12:03 AM Nikolai <tadeus...@gmail.com> wrote: > Hi Rajan, > > Thanks for the review! > > It looks like the root cause of "stuck" keyword is the nature of > Key_Shared subscription which must guarantee messages ordering. I > understand it is really hard to achieve such guarantees without having > separate physical partitions but this feature is very essential for many > use cases. I also understand that the solution I've provided is more a > hack than a solid solution. I was trying to touch as little code as I > can and, unfortunately, I see no other options to avoid stuck dispatches > in the current implementation approach. > > I also want to highlight that the current implementation leads entire > system to hang due to a single message corruption in case consumers > disconnects. This breaks one of a big Pulsar advantage: independent keys > processing. > > Do you have plans for Key_Shared dispatcher rework? I would be happy to > contribute to such a project if I can help somehow. > > > Best Regards, > > Nikolai > > > > On 2023/12/01 06:34:59 Rajan Dhabalia wrote: > > Hi Nikolai, > > > > Thanks for bringing this to the attention. I have seen multiple PIP > and bug > > fixes for the Key-Shared subscription and I feel Key_Shared subscription > > dispatcher is developed with multiple fundamental issues which > required to > > introduce scheduled-task to unblock stuck reads, introducing stuck > delivery > > flag, and now we are seeing at least 2-3 PIP to add few more hacks to > > control such stuck dispatch. After reviewing code, it seems > > > > PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..) > > function can still return 0 dispatch messages with multiple different > edge > > cases and fundamental dispatch is not clean at all. This is the only > > dispatcher introduced after open sourcing Pulsar and unfortunately it > was > > not developed with a cleaner approach. I understand the pain of such > stuck > > dispatch and even the current release is having a bugs where dispatch is > > getting stuck and many production systems are struggling with it and you > > must be suggesting this PIP to get work around for this stuck issue. > But I > > don't think this is the right approach to follow. > > I think the correct approach is to get rid of this "stuck" keyword > from the > > dispatcher because having "stuck" concept in dispatcher means it has a > > known flow and bug which we are not solving but we are adding more > debt to > > perform the work around. We never had any such concept in other > dispatchers > > which was introduced when Pulsar was introduced originally and it just > > works with clear semantics and permit flow mechanism , and we should > follow > > the same practice for this subscription. > > > > Thanks, > > Rajan > > > > On Thu, Nov 30, 2023 at 9:18 AM Nikolai <ta...@gmail.com> wrote: > > > > > Hello! > > > > > > I submitted a new PIP to add a configuration option which allows to > skip > > > blocking recently joined consumers for Key_Shared subscriptions > > > > > > It introduces additional memory consumption, so it will be disabled by > > > default. It would fix issues like > > > https://github.com/apache/pulsar/issues/21199 > > > > > > Link to the PIP: https://github.com/apache/pulsar/pull/21615 > > > > > > PR with the PIP implementation: > > > https://github.com/apache/pulsar/pull/21579 > > > > > > > > > Best Regards, > > > > > > Nikolai > > > > > > > > >