Configuration and logs…
The relevant broker.xml configuration from the primary (broker 1) is:
…
<connectors>
   <connector name="broker1">tcp://1.1.1.2:61616</connector>
   <connector name="broker2">tcp://1.1.1.3:61616</connector>
</connectors>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>broker1</connector-ref>
      <static-connectors>
         <connector-ref>broker2</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
<ha-policy>
   <shared-store>
      <primary>
         <failover-on-shutdown>true</failover-on-shutdown>
      </primary>
   </shared-store>
</ha-policy>
…
The relevant broker.xml configuration from the backup (broker 2) is:
…
<connectors>
   <connector name="broker1">tcp://1.1.1.2:61616</connector>
   <connector name="broker2">tcp://1.1.1.3:61616</connector>
</connectors>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>broker2</connector-ref>
      <static-connectors>
         <connector-ref>broker1</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
<ha-policy>
   <shared-store>
      <backup>
         <allow-failback>false</allow-failback>
      </backup>
   </shared-store>
</ha-policy>
…
Startup on the primary:
…
2025-05-01 12:51:23,567 INFO [org.apache.activemq.artemis.core.server]
AMQ221006: Waiting to obtain primary lock
…
2025-05-01 12:51:23,747 INFO [org.apache.activemq.artemis.core.server]
AMQ221034: Waiting indefinitely to obtain primary lock
2025-05-01 12:51:23,748 INFO [org.apache.activemq.artemis.core.server]
AMQ221035: Primary Server Obtained primary lock
…
2025-05-01 12:51:24,289 INFO [org.apache.activemq.artemis.core.server]
AMQ221007: Server is now active
…
Backup is started and somehow becomes primary:
…
2025-05-01 12:51:48,473 INFO [org.apache.activemq.artemis.core.server]
AMQ221032: Waiting to become backup node
2025-05-01 12:51:48,474 INFO [org.apache.activemq.artemis.core.server]
AMQ221033: ** got backup lock
…
2025-05-01 12:51:48,659 INFO [org.apache.activemq.artemis.core.server]
AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.40.0
[339308e1-25f3-11f0-996a-0200ec1b9c8e] started; waiting for primary to fail
before activating
2025-05-01 12:51:48,809 INFO [org.apache.activemq.artemis.core.server]
AMQ221031: backup announced
…
2025-05-01 12:52:06,129 INFO [org.apache.activemq.artemis.core.server]
AMQ221010: Backup Server is now active
…
2025-05-01 12:52:07,158 INFO [org.apache.activemq.artemis.core.client]
AMQ214036: Connection closure to 1.1.1.2/1.1.1.2:61616 has been detected:
AMQ219015: The connection was disconnected because of server shutdown
[code=DISCONNECTED]
…
Primary loses the lock unexpectedly and shuts down:
…
2025-05-01 12:52:16,352 WARN
[org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the
lock according to the monitor, notifying listeners
2025-05-01 12:52:16,353 ERROR [org.apache.activemq.artemis.core.server]
AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager
lock, message=NULL
java.io.IOException: lost lock
	at org.apache.activemq.artemis.core.server.impl.SharedStorePrimaryActivation.lambda$registerActiveLockListener$0(SharedStorePrimaryActivation.java:124) ~[artemis-server-2.40.0.jar:2.40.0]
	at org.apache.activemq.artemis.core.server.NodeManager.lambda$notifyLostLock$0(NodeManager.java:167) ~[artemis-server-2.40.0.jar:2.40.0]
…
2025-05-01 12:52:16,528 INFO [org.apache.activemq.artemis.core.server]
AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.40.0
[339308e1-25f3-11f0-996a-0200ec1b9c8e] stopped, uptime 52.978 seconds
…
Regards,
William Crowell
From: William Crowell <[email protected]>
Date: Thursday, May 1, 2025 at 9:20 AM
To: [email protected] <[email protected]>
Subject: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
Good morning,
Disclaimer: This is not a bug, but a configuration issue.
We are using Apache Artemis 2.40.0 on Rocky Linux 9. We are configuring a
primary/backup pair on separate hosts with the data directory on an NFSv4
mount, and we are experiencing problems with the locking mechanism. I know
that NFS is not recommended for production use, but that is what we are
limited to.
We are following this documentation:
https://activemq.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
What is happening is that the primary loses the lock and goes down shortly
after the backup node is started. The mount options we are using on both
brokers are:
vers=4.1,defaults,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac
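For completeness, the corresponding /etc/fstab entry would look something like this (the server name and export path are placeholders, not our actual values):

```
# hypothetical entry; nfsserver:/export/artemis is a placeholder
nfsserver:/export/artemis  /data  nfs4  defaults,vers=4.1,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac  0  0
```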
We then tried starting the nodes sequentially, but the primary still lost the
lock and went down shortly after the backup node started.
We also tested file locking on brokers 1 and 2:
Broker 1:
$ date; flock -x /data/test.lock -c "sleep 30"; echo $?; date
Thu May 1 12:42:52 PM GMT 2025
0
Thu May 1 12:43:22 PM GMT 2025
Broker 2:
$ date; flock -n /data/test.lock -c "echo lock acquired"; echo $?; date
Thu May 1 12:42:46 PM GMT 2025
1
Thu May 1 12:42:47 PM GMT 2025
This means that broker 2 was unable to acquire the lock because broker 1
already held it, which is not consistent with the behavior we see from the
Apache Artemis brokers. I also tested this setup on AWS, and there failover
works as expected.
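Incidentally, flock(1) takes a flock(2)-style lock, whereas the broker's FileLockNodeManager acquires its lock through java.nio's FileChannel.tryLock, which on Linux is implemented with fcntl-style POSIX record locks; the two lock types can behave differently over NFS. A hypothetical probe closer to what the broker does (the class name and default path are placeholders, not Artemis code):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Hypothetical probe: attempts an exclusive byte-range lock via java.nio,
// which on Linux uses fcntl-style POSIX record locks rather than the
// flock(2) locks taken by the flock(1) utility.
public class LockProbe {

    // Returns true if an exclusive lock on the first byte could be acquired.
    public static boolean tryExclusiveLock(File file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock(0, 1, false); // exclusive lock
            if (lock == null) {
                return false; // another process holds the lock
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        File f = new File(args.length > 0 ? args[0] : "/data/test.lock");
        System.out.println(tryExclusiveLock(f) ? "lock acquired" : "lock busy");
    }
}
```

Running this from both hosts against the NFS mount, in the same sequence as the flock test, may show whether byte-range locking behaves differently from flock(2) on this mount.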
What am I missing here?
Regards,
William Crowell
This e-mail may contain information that is privileged or confidential. If you
are not the intended recipient, please delete the e-mail and any attachments
and notify us immediately.