We are using Artemis 2.8.1 with two nodes in a cluster (JGroups discovery via
TCPPING, message load balancing set to ON_DEMAND). We create each queue and
address on both nodes and apply the address settings and security settings on
each node via the Jolokia HTTP API. The two nodes sit behind a single VIP, so
an incoming connection does not know which node it will be assigned to.
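
For reference, the clustering portion of each broker.xml looks roughly like
the sketch below; the connector, discovery-group, and JGroups file names are
placeholders rather than our exact values:

    <discovery-groups>
       <discovery-group name="dg-group1">
          <!-- placeholder JGroups stack file; ours uses TCPPING -->
          <jgroups-file>jgroups-tcpping.xml</jgroups-file>
          <jgroups-channel>artemis_cluster</jgroups-channel>
          <refresh-timeout>10000</refresh-timeout>
       </discovery-group>
    </discovery-groups>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>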

A producer can connect to NodeA and send a fair number of messages, maybe
24 million. If NodeA goes down for whatever reason (memory or disk problems,
or scheduled OS patching), the consumers on NodeA are disconnected. As they
try to reconnect, the VIP directs them all to the other available node,
NodeB. When NodeA comes back online, it notices that all of its consumers are
now on NodeB and redistributes all the messages in those queues over to
NodeB.
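
Redistribution is enabled through our address settings (which we apply via
Jolokia); the broker.xml equivalent would be something like the following,
with the match pattern and delay shown here being illustrative rather than
our exact values:

    <address-settings>
       <address-setting match="#">
          <!-- 0 = redistribute as soon as a queue has no local consumers -->
          <redistribution-delay>0</redistribution-delay>
       </address-setting>
    </address-settings>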

That can cause NodeA to take a long time and a lot of memory to start. It
also causes the internal cluster/redistribution queue to become very deep,
and it can take many hours for all the messages to be redistributed over to
NodeB. If NodeB has any problems as a result of that onslaught of messages
and becomes unavailable or goes down, then all the consumers are disconnected
again, reconnect to NodeA, and the whole problem starts over.

What advice would you have for us? Is there a better cluster/HA design we
could go with that would still allow messages to redistribute across the
cluster but not bottleneck the cluster/redistribution queue on startup? At
one point we considered using backups that would become live and serve those
messages immediately, but we ran into a lot of problems with the once-stopped
nodes failing to come back up in a clean state. I can expound on that more if
that's the direction I should be exploring.
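
To sketch what we mean by that, the sort of configuration we were looking at
was a live/backup pair; the snippet below shows the replication flavor only
as an illustration and is simplified, not our exact settings:

    <!-- on the live broker -->
    <ha-policy>
       <replication>
          <master>
             <check-for-live-server>true</check-for-live-server>
          </master>
       </replication>
    </ha-policy>

    <!-- on the backup broker -->
    <ha-policy>
       <replication>
          <slave>
             <allow-failback>true</allow-failback>
          </slave>
       </replication>
    </ha-policy>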

Any insight you have is very much appreciated.
