Artemis Replication / blocking state

Jean-Pascal Briquet Wed, 01 Nov 2023 09:04:00 -0700

Hello,



I'm running an Artemis 2.24.0 cluster with an external ZooKeeper quorum
and have noticed a problem where the primary node blocks.

It happens in a loop on the primary node, just after message replication
with the backup node completes.



Sequence:

   - Primary is up and live
   - Backup connects to the primary and starts message replication
   - Replication ends successfully (backup can become live if needed) /
   AMQ221024
   - Primary blocks
   - After 10 sec, all cluster network connections on the primary time out /
   AMQ224088
   - The Artemis critical analyser detects this state, takes a thread dump and
   stops the Artemis process

AMQ224079: The process for the virtual machine will be killed, as component
org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager@617c90de
 is not responsive

   - Backup becomes live
   - Primary is restarted (it is installed as a Linux service)
   - Primary is in non-live state
   - Primary connects to the backup and message replication starts
   - Replication to the primary ends successfully
   - Live role comes back to the primary, as failback is enabled
   - (loop to first step)
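
To line up the timing of the steps above, something like the following
could pull the relevant entries out of the broker log. This is only a
minimal sketch: the three AMQ codes are the ones quoted in the sequence,
while the function name extract_events and the log path passed on the
command line are hypothetical, and it assumes each log entry carries its
AMQ code on the same line as its timestamp.

```python
import re
import sys

# Codes quoted in the sequence above; adjust the set as needed.
CODES_OF_INTEREST = {"AMQ221024", "AMQ224088", "AMQ224079"}

# Artemis log codes look like "AMQ" followed by six digits.
CODE_RE = re.compile(r"\bAMQ\d{6}\b")


def extract_events(lines, codes=CODES_OF_INTEREST):
    """Return (line_number, code, text) for each line mentioning a code of interest."""
    events = []
    for number, line in enumerate(lines, start=1):
        for code in CODE_RE.findall(line):
            if code in codes:
                events.append((number, code, line.rstrip()))
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python amq_timeline.py /path/to/artemis.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for number, code, text in extract_events(log):
            print(f"{number}: {text}")
```

Running it against the primary's and the backup's logs side by side should
show how much time passes between the end of replication and the connection
timeouts.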



As of now, I have no idea why it happens at the end of the replication
process. I could be wrong, but I don't see any evidence in the thread dump
produced by the critical analyser.

I also wonder why it only happens on the primary and never on the
backup, since the backup sends replication data too.


Does anyone have any ideas or insight into the cause of this behaviour?


Thanks
