Can you provide the <cluster-connections> from your broker.xml? I suspect you're using the default <reconnect-attempts> value of -1, which means that when a broker drops out of the cluster the other nodes it was connected to will attempt to reconnect forever and, in the meantime, will continue routing messages for that node to the internal store-and-forward queue.
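For reference, a cluster-connection with a bounded reconnect window might look roughly like this (the connector name, discovery-group name, and values here are only illustrative, not taken from your broker.xml):

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>netty-connector</connector-ref>
         <retry-interval>500</retry-interval>
         <!-- give up after 10 attempts instead of retrying forever (default is -1) -->
         <reconnect-attempts>10</reconnect-attempts>
         <message-load-balancing>ON_DEMAND</message-load-balancing>
         <max-hops>1</max-hops>
         <discovery-group-ref discovery-group-name="dg-group1"/>
      </cluster-connection>
   </cluster-connections>

With a finite value the cluster bridge eventually gives up on a node that is never coming back rather than retrying indefinitely.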
Also, if you're using multicast discovery then you're likely sharing the same multicast address and port between your different environments (e.g. dev & prod), which typically isn't desirable as it allows cross-environment clustering like you're seeing (a sketch of per-environment discovery settings is included after the quoted message below).

Lastly, if you experienced split-brain then I suspect you're using replication for HA. If that's true then you should definitely be mitigating split-brain as discussed in the documentation [1].


Justin

[1] https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html#network-isolation-split-brain

On Tue, Oct 22, 2024 at 8:51 AM Macias, Erick <emac...@ti.com.invalid> wrote:

> Hello,
>
> We had a strange error on ActiveMQ last week, and wanted to check if
> someone has experienced this before.
>
> Background
>
> A couple of weeks ago we patched the ActiveMQ Prod VMs. After they were
> restarted the wrong configuration was set up, causing a "split brain"
> problem between the master and the slave.
>
> To troubleshoot the invalid configuration before going to production we
> had 2 test VMs created to verify the update process from the previous
> (static) configuration to the new configuration using multicast. The
> testing worked as expected and we were ready to update the configuration
> on production.
>
> On Sept 27th the correct configuration was applied (the same one we are
> currently using), and we ended up having 2 masters and 2 slaves on at the
> same time - this happened because the test VMs had not been turned off
> yet. When we realized this, we turned the test VMs off immediately. There
> were no errors or warnings in the ActiveMQ or Activity Manager logs, thus
> we thought there would not be an issue.
>
> A couple of days later (Oct 1st) the test VMs were decommissioned, and
> ERRORs started being generated in the ActiveMQ logs because the broker
> could no longer resolve the test VM hostnames:
>
> Example Error Message
>
> 2024-10-01 12:40:19,056 ERROR [org.apache.activemq.artemis.core.client]
> AMQ214016: Failed to create netty connection
> java.net.UnknownHostException: amq11test
>         at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1533) ~[?:?]
>         at java.net.InetAddress.getAllByName(InetAddress.java:1386) ~[?:?]
>         at java.net.InetAddress.getAllByName(InetAddress.java:1307) ~[?:?]
>         at java.net.InetAddress.getByName(InetAddress.java:1257) ~[?:?]
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
>         at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
>         ....
>
> On Oct 3rd at 8:15 AM the program scheduling work continued communicating
> with ActiveMQ; however, no jobs were being pulled from the ActiveMQ
> queues. The ActiveMQ logs only included the error shown above, and there
> were no errors from the program scheduling work.
>
> Solution
>
>    * Restarted the master ActiveMQ - this solved the "Failed to create
>      netty connection" ERROR
>    * Added a monitor (checkAMQLog) script to ActiveMQ to get notified
>      if an ERROR or warning is triggered
>    * For future ActiveMQ debugging in test VMs - use a different port
>      for troubleshooting
>
> We are working on a root cause analysis for this issue; however, we are
> not able to find a specific error in the artemis log from when the jobs
> stopped being pulled from the queue.
>
> Please let me know if this behavior is expected, or if there are
> additional commands that can be used to troubleshoot in the future if it
> were to happen again.
>
> Thanks for your help!
>
> Erick
>
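For what it's worth, the environment separation mentioned above usually just means giving each environment its own UDP group address and/or port in broker.xml. The names, addresses, and ports below are placeholders for illustration, not values from your configuration:

   <!-- production brokers: one multicast group for broadcast and discovery -->
   <broadcast-groups>
      <broadcast-group name="bg-group1">
         <group-address>231.7.7.7</group-address>
         <group-port>9876</group-port>
         <broadcast-period>5000</broadcast-period>
         <connector-ref>netty-connector</connector-ref>
      </broadcast-group>
   </broadcast-groups>

   <discovery-groups>
      <discovery-group name="dg-group1">
         <group-address>231.7.7.7</group-address>
         <group-port>9876</group-port>
         <refresh-timeout>10000</refresh-timeout>
      </discovery-group>
   </discovery-groups>

   <!-- test/dev brokers: a different group-address and/or group-port
        (e.g. 231.7.7.8 / 9877) so they never broadcast to or discover
        the production cluster -->

If two environments share the same group-address and group-port, any broker that comes up on that network will join the cluster, which is consistent with the 2-masters/2-slaves situation described in the quoted message.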