Mooshua opened a new issue, #24676:
URL: https://github.com/apache/pulsar/issues/24676

   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   Suppose we have three nodes, each receiving updates for some resources and producing the resulting values to Pulsar consumers. We want the load to be balanced evenly across these three nodes, in order to reduce the load on both the brokers and the nodes themselves. Basically, instead of using `WaitForExclusive` for *leader elections*, we use it for load balancing.
   
   Currently, I assign each topic a primary node. When that node creates a producer for that topic, it uses the `ExclusiveWithFencing` access mode, and the two "standby" nodes for that topic establish `WaitForExclusive` producers.
   
   Therefore, the connection process is:
   - Iterate all watched resources
   - For each resource we are the primary for, establish an `ExclusiveWithFencing` producer for that topic
   - For each resource we are a standby for, establish a `WaitForExclusive` 
producer for that topic
   - When any of our producers is created, connect to the external resource update channel and stream updates into Pulsar, effectively taking on the load for that resource
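
The per-resource decision above can be sketched as follows (a minimal, self-contained simulation; the `PRIMARY_OF` map, node names, and resource names are hypothetical stand-ins, not Pulsar API):

```java
import java.util.*;

public class ConnectionPlan {
    enum AccessMode { ExclusiveWithFencing, WaitForExclusive }

    // Hypothetical assignment: which node is primary for each resource.
    static final Map<String, String> PRIMARY_OF = Map.of(
            "resource-1", "node-a",
            "resource-2", "node-b",
            "resource-3", "node-c");

    // For each watched resource, pick the access mode this node should
    // request when creating the producer for that resource's topic.
    static Map<String, AccessMode> plan(String self, Set<String> watched) {
        Map<String, AccessMode> modes = new HashMap<>();
        for (String resource : watched) {
            modes.put(resource, self.equals(PRIMARY_OF.get(resource))
                    ? AccessMode.ExclusiveWithFencing  // primary: fence out others
                    : AccessMode.WaitForExclusive);    // standby: queue for takeover
        }
        return modes;
    }

    public static void main(String[] args) {
        System.out.println(plan("node-a", PRIMARY_OF.keySet()));
    }
}
```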
   
   For my use case, I want nodes to be able to regularly drop out in order to perform rolling updates or other maintenance. **However, when a node drops, its producer channels tend to go to *one* node: the first node to register `WaitForExclusive` access with the brokers.** This leaves one node running two-thirds of the cluster's work while the other carries only one-third.
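
The skew can be reproduced with a toy model of the broker-side wait list (a sketch only; broker internals are simplified to one FIFO of waiting producers per topic):

```java
import java.util.*;

public class FifoSkew {
    // Toy broker model: one FIFO of waiting producers per topic; when the
    // exclusive producer fails, the head of each queue takes over.
    static Map<String, Integer> failover(Map<String, Deque<String>> waitQueues) {
        Map<String, Integer> load = new HashMap<>();
        for (Deque<String> queue : waitQueues.values()) {
            load.merge(queue.pollFirst(), 1, Integer::sum);
        }
        return load;
    }

    public static void main(String[] args) {
        // node-a was primary for all three topics; node-b registered its
        // WaitForExclusive producer before node-c on every topic.
        Map<String, Deque<String>> queues = new HashMap<>();
        for (String topic : List.of("t1", "t2", "t3")) {
            queues.put(topic, new ArrayDeque<>(List.of("node-b", "node-c")));
        }
        // All three topics land on node-b; node-c gets nothing.
        System.out.println(failover(queues)); // {node-b=3}
    }
}
```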
   
   I'm fine with this sort of failure state, but I feel like this change would 
be a simple way to improve the failure scenario for use cases like mine.
   
   ### Solution
   
   When a producer requests `WaitForExclusive` access from the broker and is not immediately granted exclusivity, it should be put into a *random* spot in the queue, rather than the same spot every time. This would ensure that, should a primary node fail, its work would be randomly distributed among the remaining nodes.
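
On the same toy queue model, the proposed change amounts to inserting each waiting producer at a random index instead of always appending at the tail (a sketch; `java.util.Random` stands in for whatever randomness the broker would use):

```java
import java.util.*;

public class RandomQueue {
    // Proposed behavior: insert the waiting producer at a random position
    // in the queue instead of always appending at the tail.
    static void enqueueRandom(List<String> queue, String producer, Random rnd) {
        queue.add(rnd.nextInt(queue.size() + 1), producer);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        Map<String, Integer> load = new HashMap<>();
        // 30 topics; node-b always registers before node-c, as before.
        for (int t = 0; t < 30; t++) {
            List<String> queue = new ArrayList<>();
            enqueueRandom(queue, "node-b", rnd);
            enqueueRandom(queue, "node-c", rnd);
            load.merge(queue.get(0), 1, Integer::sum); // head takes over on failure
        }
        // The failed primary's work now splits roughly evenly between b and c.
        System.out.println(load);
    }
}
```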
   
   ### Alternatives
   
   One alternative is to sort the queue based on a producer-provided priority. 
This would allow producers to establish an order of which producer gets 
exclusivity when the exclusive producer closes. This exact functionality could 
then be implemented on the producer side by having each producer pick a random 
priority, creating a random order. This could also have use cases far beyond 
randomizing producer rebalancing.
   
   For example, a priority queue could enable establishing a whole other cluster of failover producers in a backup region, de-prioritized so that they only gain control when the entire "primary" cluster goes down.
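
The alternative can be sketched with an ordinary priority queue standing in for the broker's wait list (the `priority` field, its range, and the region names are illustrative assumptions, not an existing Pulsar option):

```java
import java.util.*;

public class PriorityWaiters {
    // Hypothetical producer-provided priority; lower value takes over first.
    record Waiter(String node, int priority) {}

    public static void main(String[] args) {
        Random rnd = new Random();
        PriorityQueue<Waiter> queue =
                new PriorityQueue<>(Comparator.comparingInt(Waiter::priority));
        // Primary-region standbys pick a random priority in [0, 100),
        // reproducing the randomized-rebalancing behavior among themselves.
        queue.add(new Waiter("primary-region-b", rnd.nextInt(100)));
        queue.add(new Waiter("primary-region-c", rnd.nextInt(100)));
        // The backup region sits behind them and only wins takeover
        // once every primary-region waiter is gone.
        queue.add(new Waiter("backup-region-x", 1000));
        System.out.println(queue.poll().node()); // some primary-region node
    }
}
```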
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]