Hello,

We had a strange error on ActiveMQ last week and wanted to check whether anyone has experienced this before.

Background

A couple of weeks ago we patched the ActiveMQ Prod VMs. After they were restarted, the wrong configuration was applied, causing a "split brain" problem between the master and the slave.

To troubleshoot the invalid configuration before going to production, we had 2 test VMs created to verify the update process from the previous (static) configuration to the new configuration using multicast. The testing worked as expected and we were ready to update the configuration on production.

On Sept 27th the correct configuration was applied (the same one you are currently using), but we ended up with 2 masters and 2 slaves running at the same time. This happened because the test VMs had not been turned off yet. When we realized this, we turned the test VMs off immediately. There were no errors or warnings in the ActiveMQ or Activity Manager logs, so we thought there would not be an issue.

A couple of days later (Oct 1st) the test VMs were decommissioned, and ERRORs started appearing in the ActiveMQ logs because it could not find the test VMs.

Example Error Message

2024-10-01 12:40:19,056 ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: amq11test
    at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
    at java.net.InetAddress.getAllByName0(InetAddress.java:1533) ~[?:?]
    at java.net.InetAddress.getAllByName(InetAddress.java:1386) ~[?:?]
    at java.net.InetAddress.getAllByName(InetAddress.java:1307) ~[?:?]
    at java.net.InetAddress.getByName(InetAddress.java:1257) ~[?:?]
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    ....

On Oct 3rd at 8:15 AM the program that schedules work was still communicating with ActiveMQ, but no jobs were being pulled from the ActiveMQ queues. The ActiveMQ logs contained only the error shown above, and there were no errors on the scheduling program's side.

Solution

* Restarted the master ActiveMQ - this resolved the "Failed to create netty connection" ERROR
* Added a monitor script (checkAMQLog) to ActiveMQ so that we get notified if an ERROR or warning is logged
* For future ActiveMQ debugging on test VMs - use a different port for troubleshooting

We are working on a root cause analysis of this issue. However, we have not been able to find a specific error in the artemis log from the time the jobs stopped being pulled from the queue.

Please let me know whether this behavior is expected, or whether there are additional commands we can use to troubleshoot if it happens again.

Thanks for your help!
Erick
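
For reference, here is a minimal sketch of the kind of check a log monitor like the checkAMQLog script mentioned above might perform. It is not the actual script. It assumes the log lines follow the layout of the example error message (timestamp, level, logger in square brackets, message), and the names `find_alerts`, `scan_file`, and `Alert` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed layout, taken from the example error above:
#   2024-10-01 12:40:19,056 ERROR [logger.name] message text
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>ERROR|WARN)\s+\[(?P<logger>[^\]]+)\]\s+(?P<msg>.*)$"
)


class Alert(NamedTuple):
    timestamp: str
    level: str
    logger: str
    message: str


def find_alerts(lines: Iterable[str]) -> List[Alert]:
    """Return one Alert per ERROR/WARN header line; stack-trace lines are skipped."""
    alerts = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            alerts.append(
                Alert(match["ts"], match["level"], match["logger"], match["msg"])
            )
    return alerts


def scan_file(path: str) -> List[Alert]:
    """Scan a log file (hypothetical path supplied by the caller) for alerts."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_alerts(handle)
```

A scheduled job could call `scan_file` on the artemis log and send a notification whenever the returned list is not empty. A check like this only catches what the broker logs, so it would not by itself have caught the Oct 3rd condition, where consumption stopped without a new log entry.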