Would you mind turning on DEBUG logging for org.apache.activemq.artemis.core.server.impl.FileLockNodeManager [1], reproducing, and uploading the full logs someplace accessible?

Regarding your NFS mount options... I was perusing the NFS man page [2] and saw this note about noatime & nodiratime:

  In particular, the atime/noatime, diratime/nodiratime, relatime/norelatime, and strictatime/nostrictatime mount options have no effect on NFS mounts.

I could not find any reference to lazytime, and I also noticed that you weren't using all the recommendations from the ActiveMQ Artemis documentation. I wonder if you might try something like this:

  vers=4.1,soft,sync,intr,noac,lookupcache=none,timeo=50,retrans=3

Is NFS being used in your AWS use-case? If so, what mount options are being used?

To be clear, I'm not an NFS expert by any means. Usually this stuff just works with the recommended settings.


Justin

[1] https://activemq.apache.org/components/artemis/documentation/latest/logging.html#configuring-a-specific-level-for-a-logger
[2] https://www.man7.org/linux/man-pages/man5/nfs.5.html

On Thu, May 1, 2025 at 9:02 AM William Crowell <wcrow...@perforce.com.invalid> wrote:

> Configuration and logs…
>
> The relevant broker.xml configuration from the primary (broker 1) is:
>
> …
> <connectors>
>   <connector name="broker1">tcp://1.1.1.2:61616</connector>
>   <connector name="broker2">tcp://1.1.1.3:61616</connector>
> </connectors>
>
> <cluster-connections>
>   <cluster-connection name="my-cluster">
>     <connector-ref>broker1</connector-ref>
>     <static-connectors>
>       <connector-ref>broker2</connector-ref>
>     </static-connectors>
>   </cluster-connection>
> </cluster-connections>
>
> <ha-policy>
>   <shared-store>
>     <primary>
>       <failover-on-shutdown>true</failover-on-shutdown>
>     </primary>
>   </shared-store>
> </ha-policy>
> …
>
> The relevant broker.xml configuration from the backup (broker 2) is:
>
> …
> <connectors>
>   <connector name="broker1">tcp://1.1.1.2:61616</connector>
>   <connector name="broker2">tcp://1.1.1.3:61616</connector>
> </connectors>
>
> <cluster-connections>
>   <cluster-connection name="my-cluster">
>     <connector-ref>broker2</connector-ref>
>     <static-connectors>
>       <connector-ref>broker1</connector-ref>
>     </static-connectors>
>   </cluster-connection>
> </cluster-connections>
>
> <ha-policy>
>   <shared-store>
>     <backup>
>       <allow-failback>false</allow-failback>
>     </backup>
>   </shared-store>
> </ha-policy>
> …
>
> Startup on the primary:
>
> …
> 2025-05-01 12:51:23,567 INFO [org.apache.activemq.artemis.core.server] AMQ221006: Waiting to obtain primary lock
> …
> 2025-05-01 12:51:23,747 INFO [org.apache.activemq.artemis.core.server] AMQ221034: Waiting indefinitely to obtain primary lock
> 2025-05-01 12:51:23,748 INFO [org.apache.activemq.artemis.core.server] AMQ221035: Primary Server Obtained primary lock
> …
> 2025-05-01 12:51:24,289 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active
> …
>
> Backup is started and somehow becomes primary:
>
> …
> 2025-05-01 12:51:48,473 INFO [org.apache.activemq.artemis.core.server] AMQ221032: Waiting to become backup node
> 2025-05-01 12:51:48,474 INFO [org.apache.activemq.artemis.core.server] AMQ221033: ** got backup lock
> …
> 2025-05-01 12:51:48,659 INFO [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.40.0 [339308e1-25f3-11f0-996a-0200ec1b9c8e] started; waiting for primary to fail before activating
> 2025-05-01 12:51:48,809 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
> …
> 2025-05-01 12:52:06,129 INFO [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is now active
> …
> 2025-05-01 12:52:07,158 INFO [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to 1.1.1.2/1.1.1.2:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> …
>
> Primary loses the lock unexpectedly and shuts down:
>
> …
> 2025-05-01 12:52:16,352 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the lock according to the monitor, notifying
> listeners
> 2025-05-01 12:52:16,353 ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
> java.io.IOException: lost lock
>     at org.apache.activemq.artemis.core.server.impl.SharedStorePrimaryActivation.lambda$registerActiveLockListener$0(SharedStorePrimaryActivation.java:124) ~[artemis-server-2.40.0.jar:2.40.0]
>     at org.apache.activemq.artemis.core.server.NodeManager.lambda$notifyLostLock$0(NodeManager.java:167) ~[artemis-server-2.40.0.jar:2.40.0]
> …
> 2025-05-01 12:52:16,528 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.40.0 [339308e1-25f3-11f0-996a-0200ec1b9c8e] stopped, uptime 52.978 seconds
> …
>
> Regards,
>
> William Crowell
>
> From: William Crowell <wcrow...@perforce.com.INVALID>
> Date: Thursday, May 1, 2025 at 9:20 AM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
> Good morning,
>
> Disclaimer: This is not a bug, but a configuration issue.
>
> We are using Apache Artemis 2.40.0 on Rocky Linux 9. We are configuring a primary/backup pair on separate hosts and putting the data directory on an NFSv4 mount, and we are experiencing problems with the locking mechanism. I do know that NFS is not recommended for production use, but that is what we are limited to.
>
> We are following this documentation:
> https://activemq.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
>
> What is happening is that the primary loses the lock and goes down after the backup node was started. The mount options on both brokers we are using are:
>
> vers=4.1,defaults,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac
>
> We then tried to start up the nodes sequentially. The primary lost the lock and went down shortly after the backup node was started.
>
> We also tested file locking on brokers 1 and 2:
>
> Broker 1:
>
> $ date; flock -x /data/test.lock -c "sleep 30"; echo $?; date
> Thu May 1 12:42:52 PM GMT 2025
> 0
> Thu May 1 12:43:22 PM GMT 2025
>
> Broker 2:
>
> $ date; flock -n /data/test.lock -c "echo lock acquired"; echo $?; date
> Thu May 1 12:42:46 PM GMT 2025
> 1
> Thu May 1 12:42:47 PM GMT 2025
>
> This means that broker 2 was unable to acquire the lock because broker 1 already had it, which is not consistent with the behavior on the Apache Artemis brokers. I also tested this on AWS, and the failover works fine as expected.
>
> What am I missing here?
>
> Regards,
>
> William Crowell
>
>
>
> This e-mail may contain information that is privileged or confidential. If you are not the intended recipient, please delete the e-mail and any attachments and notify us immediately.
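
As a quick cross-check of the mount options quoted in this thread against the nfs(5) note about atime-style options, something like the following sketch may help. The function name `ignored_nfs_opts` is made up for illustration, and the option list it matches is only the one named in the man page excerpt above (it does not say anything about lazytime or any other option).

```shell
#!/bin/sh
# Hypothetical helper (not part of Artemis or nfs-utils): print every option
# in a comma-separated mount string that the nfs(5) excerpt says has no
# effect on NFS mounts (atime/noatime, diratime/nodiratime,
# relatime/norelatime, strictatime/nostrictatime).
ignored_nfs_opts() {
  # One option per line, then keep only whole-line matches of the listed names.
  # "|| true" keeps the exit status at 0 when nothing matches.
  printf '%s\n' "$1" | tr ',' '\n' \
    | grep -E -x '(no)?(atime|diratime|relatime|strictatime)' || true
}

# The mount options reported for both brokers in this thread.
ignored_nfs_opts "vers=4.1,defaults,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac"
# prints:
# noatime
# nodiratime
```

The string passed in here is copied from the message above; the options the kernel actually applied to the mount can differ from what was requested, so it is probably worth feeding in the effective option string for the /data mount (for example as reported by findmnt or /proc/mounts) rather than the fstab entry.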