Would you mind turning on DEBUG logging for
org.apache.activemq.artemis.core.server.impl.FileLockNodeManager [1],
reproducing, and uploading the full logs someplace accessible?
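
For example, something like this in the broker's etc/log4j2.properties
should do it (the "filelock" logger key name is arbitrary):

    logger.filelock.name = org.apache.activemq.artemis.core.server.impl.FileLockNodeManager
    logger.filelock.level = DEBUG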

Regarding your NFS mount options... I was perusing the NFS man page [2] and
saw this note about noatime & nodiratime:

    In particular, the atime/noatime, diratime/nodiratime,
relatime/norelatime, and strictatime/nostrictatime mount options have no
effect on NFS mounts.

I could not find any reference to lazytime, and I also noticed that you
weren't using all the recommendations from the ActiveMQ Artemis
documentation. I wonder if you might try something like this:

   vers=4.1,soft,sync,intr,noac,lookupcache=none,timeo=50,retrans=3
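
For example (a sketch, not something I've tested; the server, export, and
mount point below are placeholders for whatever your environment uses):

   sudo mount -t nfs4 -o vers=4.1,soft,sync,intr,noac,lookupcache=none,timeo=50,retrans=3 nfs-server:/export /data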

Is NFS being used in your AWS use-case? If so, what mount options are being
used?

To be clear, I'm not an NFS expert by any means. Usually this stuff just
works with the recommended settings.
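
One more thought on your flock test: as far as I know, Artemis acquires its
lock via Java NIO file locks, which on Linux are POSIX (fcntl) record locks,
whereas the flock(1) utility uses the flock(2) call, so the two tests may not
exercise the same locking path over NFS. If you want to try a POSIX lock
directly, a sketch like this (reusing the /data/test.lock path from your
test) should work on each broker host:

python3 - <<'EOF'
import fcntl, sys, time

# Non-blocking exclusive POSIX (fcntl) lock, roughly what Java's
# FileChannel.tryLock() does on Linux.
f = open("/data/test.lock", "a")
try:
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    print("POSIX lock NOT acquired")
    sys.exit(1)
print("POSIX lock acquired; holding for 30 seconds")
time.sleep(30)
EOF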


Justin

[1]
https://activemq.apache.org/components/artemis/documentation/latest/logging.html#configuring-a-specific-level-for-a-logger
[2] https://www.man7.org/linux/man-pages/man5/nfs.5.html

On Thu, May 1, 2025 at 9:02 AM William Crowell
<wcrow...@perforce.com.invalid> wrote:

> Configuration and logs…
>
> The relevant broker.xml configuration from the primary (broker 1) is:
>
> …
>       <connectors>
>          <connector name="broker1">tcp://1.1.1.2:61616</connector>
>          <connector name="broker2">tcp://1.1.1.3:61616</connector>
>       </connectors>
>
>       <cluster-connections>
>          <cluster-connection name="my-cluster">
>             <connector-ref>broker1</connector-ref>
>             <static-connectors>
>                <connector-ref>broker2</connector-ref>
>             </static-connectors>
>          </cluster-connection>
>       </cluster-connections>
>
>       <ha-policy>
>          <shared-store>
>             <primary>
>                <failover-on-shutdown>true</failover-on-shutdown>
>             </primary>
>          </shared-store>
>       </ha-policy>
> …
>
> The relevant broker.xml configuration from the backup (broker 2) is:
>
> …
>       <connectors>
>          <connector name="broker1">tcp://1.1.1.2:61616</connector>
>          <connector name="broker2">tcp://1.1.1.3:61616</connector>
>       </connectors>
>
>       <cluster-connections>
>          <cluster-connection name="my-cluster">
>             <connector-ref>broker2</connector-ref>
>             <static-connectors>
>                <connector-ref>broker1</connector-ref>
>             </static-connectors>
>          </cluster-connection>
>       </cluster-connections>
>
>       <ha-policy>
>          <shared-store>
>             <backup>
>                <allow-failback>false</allow-failback>
>             </backup>
>          </shared-store>
>       </ha-policy>
> …
>
> Startup on the primary:
>
> …
> 2025-05-01 12:51:23,567 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221006: Waiting to obtain primary lock
> …
> 2025-05-01 12:51:23,747 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221034: Waiting indefinitely to obtain primary lock
> 2025-05-01 12:51:23,748 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221035: Primary Server Obtained primary lock
> …
> 2025-05-01 12:51:24,289 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221007: Server is now active
> …
>
> Backup is started and somehow becomes primary:
>
> …
> 2025-05-01 12:51:48,473 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221032: Waiting to become backup node
> 2025-05-01 12:51:48,474 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221033: ** got backup lock
> …
> 2025-05-01 12:51:48,659 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.40.0
> [339308e1-25f3-11f0-996a-0200ec1b9c8e] started; waiting for primary to fail
> before activating
> 2025-05-01 12:51:48,809 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221031: backup announced
> …
> 2025-05-01 12:52:06,129 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221010: Backup Server is now active
> …
> 2025-05-01 12:52:07,158 INFO  [org.apache.activemq.artemis.core.client]
> AMQ214036: Connection closure to 1.1.1.2/1.1.1.2:61616 has been detected:
> AMQ219015: The connection was disconnected because of server shutdown
> [code=DISCONNECTED]
> …
>
> Primary loses the lock unexpectedly and shuts down:
>
> …
> 2025-05-01 12:52:16,352 WARN
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the
> lock according to the monitor, notifying listeners
> 2025-05-01 12:52:16,353 ERROR [org.apache.activemq.artemis.core.server]
> AMQ222010: Critical IO Error, shutting down the server. file=Lost
> NodeManager lock, message=NULL
> java.io.IOException: lost lock
>         at
> org.apache.activemq.artemis.core.server.impl.SharedStorePrimaryActivation.lambda$registerActiveLockListener$0(SharedStorePrimaryActivation.java:124)
> ~[artemis-server-2.40.0.jar:2.40.0]
>         at
> org.apache.activemq.artemis.core.server.NodeManager.lambda$notifyLostLock$0(NodeManager.java:167)
> ~[artemis-server-2.40.0.jar:2.40.0]
> …
> 2025-05-01 12:52:16,528 INFO  [org.apache.activemq.artemis.core.server]
> AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.40.0
> [339308e1-25f3-11f0-996a-0200ec1b9c8e] stopped, uptime 52.978 seconds
> …
>
> Regards,
>
> William Crowell
>
> From: William Crowell <wcrow...@perforce.com.INVALID>
> Date: Thursday, May 1, 2025 at 9:20 AM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
> Good morning,
>
> Disclaimer: This is not a bug, but a configuration issue.
>
> We are using Apache Artemis 2.40.0 on Rocky Linux 9.  We are configuring a
> primary/backup pair on separate hosts with the data directory on an NFSv4
> mount, and we are experiencing problems with the locking mechanism.  I know
> that NFS is not recommended for production use, but that is what we are
> limited to.
>
> We are following this documentation:
> https://activemq.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
>
> What is happening is that the primary loses the lock and goes down after
> the backup node was started.  The mount options on both brokers we are
> using are:
>
>
> vers=4.1,defaults,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac
>
> We then tried to start up the nodes sequentially.  The primary lost the
> lock and went down shortly after the backup node was started.
>
> We also tested file locking on brokers 1 and 2:
>
> Broker 1:
>
> $ date; flock -x /data/test.lock  -c "sleep 30"; echo $?; date
> Thu May  1 12:42:52 PM GMT 2025
> 0
> Thu May  1 12:43:22 PM GMT 2025
>
> Broker 2:
>
> $ date; flock -n /data/test.lock  -c "echo lock acquired"; echo $?; date
> Thu May  1 12:42:46 PM GMT 2025
> 1
> Thu May  1 12:42:47 PM GMT 2025
>
> This means that broker 2 was unable to acquire the lock because broker 1
> already held it, which is inconsistent with the behavior we see from the
> Apache Artemis brokers.  I also tested this on AWS, and there the failover
> works as expected.
>
> What am I missing here?
>
> Regards,
>
> William Crowell
>
>
>
> This e-mail may contain information that is privileged or confidential. If
> you are not the intended recipient, please delete the e-mail and any
> attachments and notify us immediately.