I think there are a handful of different ways to address this... In general, it seems like your consumption isn't keeping up with your production; otherwise you wouldn't have such a large build-up of messages on one of the brokers. It's a good idea to balance message production with adequate consumption so the number of messages on the broker stays as low as possible. Obviously this can't always be done, which is why solutions like paging exist, but the ideal situation is low message accumulation in the broker. With that in mind, I recommend you explore flow control for your producers. If there isn't such a large build-up of messages, then redistribution will be a much smaller problem (if a problem at all).
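For example, one broker-side way to apply back-pressure to producers is an address-setting that blocks (rather than pages) once an address hits a size limit. This is just a sketch, not a drop-in config; the "#" match and the size limit are placeholders you'd tune for your own addresses:

  <address-settings>
     <!-- "#" matches every address; narrow the match for real use -->
     <address-setting match="#">
        <!-- once roughly 100 MiB of messages accumulate on an address... -->
        <max-size-bytes>104857600</max-size-bytes>
        <!-- ...block producers instead of paging to disk -->
        <address-full-policy>BLOCK</address-full-policy>
     </address-setting>
  </address-settings>

Client-side, the core/JMS client also supports window-based producer flow control (e.g. the producerWindowSize setting on the connection factory or URL), which limits how much data a producer can send before it receives more credits from the broker.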
Another option would be to have master/slave pairs rather than individual cluster nodes. That way, when a node fails, its consumers fail over to the corresponding slave and stay relatively balanced across the two nodes rather than all piling up on a single node. You could even go as far as setting message-load-balancing to STRICT to avoid redistribution altogether. You could also increase the size of your cluster so that redistribution spreads across two nodes rather than just one; theoretically that cuts the relative burden on each remaining node in half with 3 nodes vs. 2. (I've appended a rough broker.xml sketch of these settings below your quoted message.)

Justin

On Tue, Jun 18, 2019 at 4:22 PM Dan Langford <danlangf...@gmail.com> wrote:

> We are using Artemis 2.8.1 and we have 2 nodes in a cluster (JGroups, TCP ping, load balancing = On Demand). We build each queue and address on each node and put address settings and security settings on each node (via the Jolokia HTTP API). The two nodes are behind a single VIP, so each incoming connection doesn't know which node it will be assigned to.
>
> A producer can connect to NodeA and send a fair number of messages, maybe 24 million. If NodeA goes down for whatever reason (memory or disk problems, or scheduled OS patching), the consumers on NodeA will be disconnected. As they try to reconnect, the VIP will direct them all to the other available node, NodeB. When NodeA comes back online it notices all the consumers over on NodeB and redistributes all the messages in their queues.
>
> That can cause NodeA to take a long time and a lot of memory to start. It also causes the cluster/redistribution queue to become very deep, and it can take many hours for the messages to all get redistributed over to NodeB. If NodeB has any problems as a result of the onslaught of messages and becomes unavailable or goes down, then all the consumers will be disconnected, they will reconnect to NodeA, and the problem starts all over.
>
> What advice would you have for us? Is there a better cluster/HA design we could go with that would allow messages to redistribute across a cluster but also not bottleneck the cluster/redistribution queue on startup? We considered at one time using backups that would become live and serve those messages immediately, but ran into a lot of problems with the once-stopped nodes failing to come up in a clean state. I can expound on that more if that's the direction I should be exploring.
>
> Any insight you have is very much appreciated.
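To illustrate the master/slave and STRICT suggestions above, here's a rough broker.xml sketch. The cluster name, connector, and discovery group names are placeholders and the HA policy is trimmed down, so treat it as a starting point rather than a tested configuration:

  <!-- on the live/master node -->
  <ha-policy>
     <replication>
        <master/>
     </replication>
  </ha-policy>

  <!-- on the backup node you'd use <slave/> instead of <master/> -->

  <cluster-connections>
     <cluster-connection name="my-cluster">
        <connector-ref>netty-connector</connector-ref>
        <!-- STRICT round-robins messages across the cluster when they are
             sent and does not redistribute them afterwards -->
        <message-load-balancing>STRICT</message-load-balancing>
        <max-hops>1</max-hops>
        <discovery-group-ref discovery-group-name="my-discovery-group"/>
     </cluster-connection>
  </cluster-connections>

With shared-store HA you'd use <shared-store> instead of <replication>; either way, the point is that a failed node's consumers land on its own backup rather than piling onto the other live node.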