Justin,

Everything works fine on AWS, and I cannot reproduce the issue there.  I am 
working with someone who has on-premises hosts where the issue does occur.

The issue is that the backup takes the lock away from the primary when it is 
not supposed to.  It is as if the backup host gets impatient and just grabs 
the lock.  I sent you the logs directly.
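
For anyone following along: the DEBUG lines in those logs come from enabling 
the FileLockNodeManager logger you pointed to.  A minimal sketch of that 
change in the broker instance's etc/log4j2.properties (the logger id 
"file_lock" and the instance path are just placeholders):

cat >> /path/to/broker-instance/etc/log4j2.properties <<'EOF'
logger.file_lock.name = org.apache.activemq.artemis.core.server.impl.FileLockNodeManager
logger.file_lock.level = DEBUG
EOF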

All of the options in that link are valid except for intr, which was removed.  
I am verifying the Linux kernel version, but it should be very recent.
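
A quick check of the kernel point on the problem hosts (a sketch; per the 
nfs(5) man page, intr is simply ignored after kernel 2.6.25, so leaving it in 
the mount options should be harmless):

uname -r
cat /etc/rocky-release   # distro release, for reference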

Regards,

William Crowell

From: Justin Bertram <jbert...@apache.org>
Date: Thursday, May 1, 2025 at 12:57 PM
To: users@activemq.apache.org <users@activemq.apache.org>
Subject: Re: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
> The option listed in the link regarding NFS that will work is sync.

I'm not following you here. Which exact options are you referring to?

> intr was removed and is ignored in later versions of the Linux kernel.

That's correct. It remains in the recommended settings because we don't
know what version of the Linux kernel folks are using.

> I did try mounting the two brokers using these options (on AWS)...I was
> not able to reproduce the issue on AWS.

I'm lost here. Can you clarify exactly what is and isn't working on AWS and
in your problematic environment?


Justin

On Thu, May 1, 2025 at 11:36 AM William Crowell
<wcrow...@perforce.com.invalid> wrote:

> Justin,
>
> Yes, I am already ahead of you on this.
>
> We are also checking whether the clocks are in sync between the brokers and
> the NFS share.  I know you can have some clock drift between the servers,
> but it cannot be very large.
>
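> One way we are comparing them (a sketch; this assumes chrony is the time
> daemon on these hosts; timedatectl shows similar information otherwise):
>
> # Run on each broker host and on the NFS server, then compare the output
> date -u +%s.%N
> chronyc tracking | grep -E 'System time|Last offset'
>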
> The option listed in the link regarding NFS that will work is sync.
> intr was removed and is ignored in later versions of the Linux kernel.
>
> I did try mounting the two brokers using these options (on AWS):
>
>
> sudo mount -o
> vers=4.1,soft,sync,intr,noac,lookupcache=none,timeo=50,retrans=3  -t nfs
> 1.1.1.4:/data /data
>
> I was not able to reproduce the issue on AWS.
>
> If you want the full logs, I can upload them to a gist or something; I am
> working on that.
>
> This is what I am seeing on the other system, the one having the problem.
> Here is the primary:
>
> …
> 2025-05-01 15:07:15,441 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock
> appears to be valid; double check by reading status
> 2025-05-01 15:07:15,441 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] getting
> state...
> 2025-05-01 15:07:15,441 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying
> to lock position: 0
> 2025-05-01 15:07:15,441 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] locked
> position: 0
> 2025-05-01 15:07:15,441 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] lock:
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]
> 2025-05-01 15:07:15,442 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] state: L
> (THIS MEANS ACTIVE)
> 2025-05-01 15:07:15,442 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock
> appears to be valid; triple check by comparing timestamp
> 2025-05-01 15:07:15,444 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lock
> file /data/server.lock originally locked at 2025-05-01T15:06:16.787+0000
> was modified at 2025-05-01T15:07:05.184+0000
> 2025-05-01 15:07:15,445 WARN
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the
> lock according to the monitor, notifying listeners
> 2025-05-01 15:07:15,445 ERROR [org.apache.activemq.artemis.core.server]
> AMQ222010: Critical IO Error, shutting down the server. file=Lost
> NodeManager lock, message=NULL
> java.io.IOException: lost lock
> …
>
> On the backup, it tries once a second to get the lock and cannot, which is
> good, but something finally happens at 3:07:05 PM, and it mistakenly thinks
> it can get the lock:
>
> …
> 2025-05-01 15:07:05,183 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] locked
> position: 0
> 2025-05-01 15:07:05,183 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] lock:
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]
> 2025-05-01 15:07:05,183 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] state: L
> (THIS MEANS ACTIVE)
> 2025-05-01 15:07:05,184 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] acquired
> primary node lock state = L
> 2025-05-01 15:07:05,186 DEBUG
> [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] touched
> /data/server.lock; new time: 1746112025184
> …
>
> Regards,
>
> William Crowell
>
> From: Justin Bertram <jbert...@apache.org>
> Date: Thursday, May 1, 2025 at 11:44 AM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Re: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
> Would you mind turning on DEBUG logging for
> org.apache.activemq.artemis.core.server.impl.FileLockNodeManager [1],
> reproducing, and uploading the full logs someplace accessible?
>
> Regarding your NFS mount options...I was perusing the NFS man page [2] and
> saw this note about noatime & nodiratime:
>
>     In particular, the atime/noatime, diratime/nodiratime,
> relatime/norelatime, and strictatime/nostrictatime mount options have no
> effect on NFS mounts.
>
> I could not find any reference to lazytime, and I also noticed that you
> weren't using all the recommendations from the ActiveMQ Artemis
> documentation. I wonder if you might try something like this:
>
>    vers=4.1,soft,sync,intr,noac,lookupcache=none,timeo=50,retrans=3
>
> Is NFS being used in your AWS use-case? If so, what mount options are being
> used?
>
> To be clear, I'm not an NFS expert by any means. Usually this stuff just
> works with the recommended settings.
>
>
> Justin
>
> [1]
> https://activemq.apache.org/components/artemis/documentation/latest/logging.html#configuring-a-specific-level-for-a-logger
> [2]
> https://www.man7.org/linux/man-pages/man5/nfs.5.html
>
> On Thu, May 1, 2025 at 9:02 AM William Crowell
> <wcrow...@perforce.com.invalid> wrote:
>
> > Configuration and logs…
> >
> > The relevant broker.xml configuration from the primary (broker 1) is:
> >
> > …
> >       <connectors>
> >          <connector name="broker1">tcp://1.1.1.2:61616</connector>
> >          <connector name="broker2">tcp://1.1.1.3:61616</connector>
> >       </connectors>
> >
> >       <cluster-connections>
> >          <cluster-connection name="my-cluster">
> >             <connector-ref>broker1</connector-ref>
> >             <static-connectors>
> >                <connector-ref>broker2</connector-ref>
> >             </static-connectors>
> >          </cluster-connection>
> >       </cluster-connections>
> >
> >       <ha-policy>
> >          <shared-store>
> >             <primary>
> >                <failover-on-shutdown>true</failover-on-shutdown>
> >             </primary>
> >          </shared-store>
> >       </ha-policy>
> > …
> >
> > The relevant broker.xml configuration from the backup (broker 2) is:
> >
> > …
> >       <connectors>
> >          <connector name="broker1">tcp://1.1.1.2:61616</connector>
> >          <connector name="broker2">tcp://1.1.1.3:61616</connector>
> >       </connectors>
> >
> >       <cluster-connections>
> >          <cluster-connection name="my-cluster">
> >             <connector-ref>broker2</connector-ref>
> >             <static-connectors>
> >                <connector-ref>broker1</connector-ref>
> >             </static-connectors>
> >          </cluster-connection>
> >       </cluster-connections>
> >
> >       <ha-policy>
> >          <shared-store>
> >             <backup>
> >                <allow-failback>false</allow-failback>
> >             </backup>
> >          </shared-store>
> >       </ha-policy>
> > …
> >
> > Startup on the primary:
> >
> > …
> > 2025-05-01 12:51:23,567 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221006: Waiting to obtain primary lock
> > …
> > 2025-05-01 12:51:23,747 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221034: Waiting indefinitely to obtain primary lock
> > 2025-05-01 12:51:23,748 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221035: Primary Server Obtained primary lock
> > …
> > 2025-05-01 12:51:24,289 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221007: Server is now active
> > …
> >
> > Backup is started and somehow becomes primary:
> >
> > …
> > 2025-05-01 12:51:48,473 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221032: Waiting to become backup node
> > 2025-05-01 12:51:48,474 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221033: ** got backup lock
> > …
> > 2025-05-01 12:51:48,659 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.40.0
> > [339308e1-25f3-11f0-996a-0200ec1b9c8e] started; waiting for primary to fail
> > before activating
> > 2025-05-01 12:51:48,809 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221031: backup announced
> > …
> > 2025-05-01 12:52:06,129 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221010: Backup Server is now active
> > …
> > 2025-05-01 12:52:07,158 INFO  [org.apache.activemq.artemis.core.client]
> > AMQ214036: Connection closure to 1.1.1.2/1.1.1.2:61616 has been detected:
> > AMQ219015: The connection was disconnected because of server shutdown
> > [code=DISCONNECTED]
> > …
> >
> > Primary loses the lock unexpectedly and shuts down:
> >
> > …
> > 2025-05-01 12:52:16,352 WARN
> > [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Lost the
> > lock according to the monitor, notifying listeners
> > 2025-05-01 12:52:16,353 ERROR [org.apache.activemq.artemis.core.server]
> > AMQ222010: Critical IO Error, shutting down the server. file=Lost
> > NodeManager lock, message=NULL
> > java.io.IOException: lost lock
> >         at org.apache.activemq.artemis.core.server.impl.SharedStorePrimaryActivation.lambda$registerActiveLockListener$0(SharedStorePrimaryActivation.java:124) ~[artemis-server-2.40.0.jar:2.40.0]
> >         at org.apache.activemq.artemis.core.server.NodeManager.lambda$notifyLostLock$0(NodeManager.java:167) ~[artemis-server-2.40.0.jar:2.40.0]
> > …
> > 2025-05-01 12:52:16,528 INFO  [org.apache.activemq.artemis.core.server]
> > AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.40.0
> > [339308e1-25f3-11f0-996a-0200ec1b9c8e] stopped, uptime 52.978 seconds
> > …
> >
> > Regards,
> >
> > William Crowell
> >
> > From: William Crowell <wcrow...@perforce.com.INVALID>
> > Date: Thursday, May 1, 2025 at 9:20 AM
> > To: users@activemq.apache.org <users@activemq.apache.org>
> > Subject: Apache Artemis 2.40.0: Strange File Locking Behavior On NFSv4
> > Good morning,
> >
> > Disclaimer: This is not a bug, but a configuration issue.
> >
> > We are using Apache Artemis 2.40.0 on Rocky Linux 9.  We are configuring a
> > primary/backup pair on separate hosts and putting the data directory on an
> > NFSv4 mount, and we are experiencing problems with the locking mechanism.
> > I do know that NFS is not recommended for production use, but that is what
> > we are limited to.
> >
> > We are following this documentation:
> >
> > https://activemq.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
> >
> > What is happening is that the primary loses the lock and goes down after
> > the backup node is started.  The mount options we are using on both
> > brokers are:
> >
> > vers=4.1,defaults,lazytime,noatime,nodiratime,rsize=1048576,wsize=1048576,sync,intr,noac
> >
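> > To confirm which of those options are actually in effect on the mount, we
> > can check it with something like this (a sketch; assumes the share is
> > mounted at /data):
> >
> > nfsstat -m | grep -A1 /data
> > findmnt -T /data -o SOURCE,FSTYPE,OPTIONS
> >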
> > We then tried to start the nodes up sequentially.  The primary lost the
> > lock and went down shortly after the backup node was started.
> >
> > We also tested file locking on brokers 1 and 2:
> >
> > Broker 1:
> >
> > $ date; flock -x /data/test.lock  -c "sleep 30"; echo $?; date
> > Thu May  1 12:42:52 PM GMT 2025
> > 0
> > Thu May  1 12:43:22 PM GMT 2025
> >
> > Broker 2:
> >
> > $ date; flock -n /data/test.lock  -c "echo lock acquired"; echo $?; date
> > Thu May  1 12:42:46 PM GMT 2025
> > 1
> > Thu May  1 12:42:47 PM GMT 2025
> >
> > This means that broker 2 was unable to acquire the lock because broker 1
> > already had it, which is not consistent with the behavior of the Apache
> > Artemis brokers.  I also tested this on AWS, and the failover works as
> > expected.
> >
> > What am I missing here?
> >
> > Regards,
> >
> > William Crowell
> >

