We are using Artemis 2.8.1 with two nodes in a cluster (JGroups with TCPPING discovery, message load balancing = ON_DEMAND). We create every queue and address on each node, and apply the address settings and security settings on each node as well (via the Jolokia HTTP API). The two nodes sit behind a single VIP, so an incoming connection has no control over which node it is assigned to.
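For reference, here is a minimal sketch of the kind of cluster setup described above. It is not a copy of our actual broker.xml: the names (`bg-group1`, `dg-group1`, `my-cluster`, the `artemis` connector, `jgroups-tcp.xml`, `artemis_channel`) and the `refresh-timeout`, `max-hops` and `redistribution-delay` values are placeholders and assumptions for illustration. In our setup the address settings are applied through Jolokia rather than written into the file.

```xml
<!-- broker.xml fragment (inside <core>); all names and values are illustrative placeholders -->
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <jgroups-file>jgroups-tcp.xml</jgroups-file>       <!-- JGroups stack using TCPPING -->
      <jgroups-channel>artemis_channel</jgroups-channel>
      <connector-ref>artemis</connector-ref>
   </broadcast-group>
</broadcast-groups>

<discovery-groups>
   <discovery-group name="dg-group1">
      <jgroups-file>jgroups-tcp.xml</jgroups-file>
      <jgroups-channel>artemis_channel</jgroups-channel>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>

<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>artemis</connector-ref>
      <!-- only forward messages to nodes that have matching consumers -->
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <discovery-group-ref discovery-group-name="dg-group1"/>
   </cluster-connection>
</cluster-connections>

<address-settings>
   <address-setting match="#">
      <!-- assumed value: redistribute immediately once a queue has no local consumers -->
      <redistribution-delay>0</redistribution-delay>
   </address-setting>
</address-settings>
```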
A producer can connect to NodeA and send a fair number of messages, maybe 24 million. If NodeA goes down for whatever reason (memory or disk problems, or scheduled OS patching), the consumers on NodeA are disconnected. When they try to reconnect, the VIP directs them all to the only available node, NodeB.

When NodeA comes back online, it notices that all the consumers are now on NodeB and starts redistributing all the messages in its queues. This can cause NodeA to take a long time and a lot of memory to start. It also makes the cluster/redistribution queue very deep, and it can take many hours for all of the messages to be redistributed to NodeB. If NodeB has problems as a result of the onslaught of messages and becomes unavailable or goes down, all the consumers are disconnected again, reconnect to NodeA, and the problem starts all over in the other direction.

What advice would you have for us? Is there a better cluster/HA design that would still allow messages to redistribute across the cluster without bottlenecking the cluster/redistribution queue on startup?

At one point we considered using backups that would become live and serve those messages immediately, but we ran into a lot of problems with the previously stopped nodes failing to come back up in a clean state. I can expand on that if that is the direction I should be exploring.

Any insight you have is very much appreciated.