Hello,

We had a strange error on ActiveMQ last week and wanted to check whether 
anyone has experienced this before.

Background
A couple of weeks ago we patched the ActiveMQ Prod VMs. After they were 
restarted, the wrong configuration was set up, causing a "split brain" problem 
between the master and the slave.

To troubleshoot the invalid configuration before going to production, we had 2 
test VMs created to verify the update process from the previous (static) 
configuration to the new configuration using multicast. The testing worked as 
expected and we were ready to update the configuration on production.

On Sept 27th the correct configuration was applied to production (same as you 
are currently using), but we ended up having 2 masters and 2 slaves running at 
the same time. This happened because the test VMs had not been turned off yet. 
When we realized this, we turned the test VMs off immediately. There were no 
errors or warnings in the ActiveMQ or Activity Manager logs, so we thought 
there would not be an issue.

A couple of days later (Oct 1st) the test VMs were decommissioned, and ERRORs 
started appearing in the ActiveMQ logs because the broker could not find the 
test VMs:

Example Error Message
2024-10-01 12:40:19,056 ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: amq11test
        at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
        at java.net.InetAddress.getAllByName0(InetAddress.java:1533) ~[?:?]
        at java.net.InetAddress.getAllByName(InetAddress.java:1386) ~[?:?]
        at java.net.InetAddress.getAllByName(InetAddress.java:1307) ~[?:?]
        at java.net.InetAddress.getByName(InetAddress.java:1257) ~[?:?]
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        ....
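
The UnknownHostException above means the broker VM could no longer resolve the 
name of a test VM. A minimal sketch of how that can be confirmed from the 
broker VM, assuming Python 3 is available there. The function name is 
hypothetical; amq11test is the host name taken from the log above.

```python
import socket


def unresolved_hosts(hosts):
    """Return the subset of host names the local resolver cannot resolve."""
    missing = []
    for host in hosts:
        try:
            # Same kind of lookup the broker performs before connecting.
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            missing.append(host)
    return missing


if __name__ == "__main__":
    # amq11test comes from the log; other cluster member names are up to you.
    for host in unresolved_hosts(["amq11test"]):
        print(f"cannot resolve: {host}")
```

If a decommissioned host shows up here while the broker is still trying to 
connect to it, the broker is most likely still holding a stale cluster member.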

On Oct 3rd at 8:15 AM the program that schedules work was still communicating 
with ActiveMQ; however, no jobs were being pulled from the ActiveMQ queues. The 
ActiveMQ logs contained only the error shown above, and there were no errors in 
the scheduling program's logs.

Solution

  *   Restarted the master ActiveMQ - this resolved the "Failed to create netty 
connection" ERROR
  *   Added a monitor script (checkAMQLog) to ActiveMQ to get notified if an 
ERROR or warning is logged
  *   For future ActiveMQ debugging in test VMs - use a different port for 
troubleshooting
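
To illustrate the monitoring idea, here is a minimal sketch of a log scan. It 
is not the checkAMQLog script itself; it only assumes the log lines start with 
the timestamp and level format shown in the error above. The names LEVEL_RE 
and find_alerts are hypothetical, and the log path is passed in by the caller.

```python
import re
import sys

# Matches the prefix seen in the log excerpt, e.g.
# "2024-10-01 12:40:19,056 ERROR [org.apache...]".
LEVEL_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (ERROR|WARN(?:ING)?)\b"
)


def find_alerts(lines):
    """Return the log lines whose level is ERROR or WARN/WARNING."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log file path is supplied on the command line.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for alert in find_alerts(fh):
            print(alert)
```

Stack trace lines do not start with a timestamp, so only the first line of 
each ERROR or WARN entry is reported. How the result is delivered (email, 
pager, etc.) is left out of the sketch.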

We are working on a root cause analysis of this issue. However, we have not 
been able to find a specific error in the artemis log from the time the jobs 
stopped being pulled from the queue. Please let me know whether this behavior 
is expected, or whether there are additional commands we can use to 
troubleshoot if it happens again.

Thanks for your help!
Erick
