Anuj

Have a look at https://issues.apache.org/jira/browse/AMQ-5549

You cannot avoid having either two or zero master brokers online during a
failover. The question is how long this situation lasts (see Arthur
Naseef's comment on AMQ-5549).

In my failover tests with NFS shared storage I was able to reproduce very
different scenarios:
- the former master broker *never* shuts down
- the former master broker shuts down after 15 minutes
- the former master broker shuts down after 20 seconds

The only difference between these scenarios was the NFS settings. My overall
impression is that failover only works with highly available shared
storage. As soon as one or more brokers lose the NFS connection, things get
unpredictable, and I even "managed" to corrupt the persistence store during
my tests.

Also note the both-brokers-down problem (
https://issues.apache.org/jira/browse/AMQ-5568) that I discovered during my
tests.

Cheers
Stephan


On Thu, Apr 30, 2015 at 2:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:

> An NFS problem was the first thing I thought of when I saw out-of-order log
> lines, especially since you've had that problem before.  And this outage
> lasted for over two minutes (which doesn't count as "slow" in my book;
> that's "unavailable" or "down" to me), which is pretty crazy; hopefully
> your ops team has looked into how that happened and taken steps to ensure
> it doesn't happen again.
>
> An NFS outage does justify a failover to the backup broker; to understand
> why, think about what prevents failover during normal operation.  The
> master broker holds a file system lock on a DB lock file, and the slave
> broker tries repeatedly to acquire the same lock.  As long as it can't, it
> knows the master broker is up and it can't become the master; at the point
> where the lock disappears because the master broker can't access NFS, the
> slave becomes active (at least, if it can access NFS; if not, then it
> doesn't know that it could become active and it can't read the messages
> from disk anyway).  This is exactly what you would want to happen.
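>
> To make that concrete, the slave's side of the mechanism boils down to a
> try-lock loop roughly like the sketch below (illustrative only, not the
> actual KahaDB locker code; the lock file path and polling interval are
> assumptions):
>
>     import java.io.RandomAccessFile;
>     import java.nio.channels.FileChannel;
>     import java.nio.channels.FileLock;
>
>     public class SharedFileLockSketch {
>         public static void main(String[] args) throws Exception {
>             // Lock file in the shared KahaDB directory on NFS (path assumed).
>             RandomAccessFile file =
>                 new RandomAccessFile("/shared/kahadb/lock", "rw");
>             FileChannel channel = file.getChannel();
>             FileLock lock = null;
>             while (lock == null) {
>                 // tryLock() returns null while another process (the master)
>                 // still holds the lock on the shared file.
>                 lock = channel.tryLock();
>                 if (lock == null) {
>                     Thread.sleep(10_000); // keep polling until the master lets go
>                 }
>             }
>             // Lock acquired: this broker may now become the master and start
>             // its transport connectors.
>             System.out.println("Acquired shared lock; becoming master");
>         }
>     }
>
> Note that nothing in this loop notifies the current master that its lock
> has effectively gone away; that detection is a separate mechanism.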
>
> The real problem here is the one in your last paragraph: when the slave
> acquires the lock because the master can't access NFS, the master isn't
> detecting that and becoming the slave.  I'd suggest you try to recreate
> this failure (in a dev environment) by causing the master broker to be
> unable to access NFS and confirming that the master remains active even
> after the slave becomes the master.  Assuming that happens, submit a JIRA
> bug report to describe the problem.  Make sure you provide lots of details
> about your NFS setup (include version numbers, file system type, etc.) and
> about the O/Ses of the machines the brokers run on, since the behavior
> might vary based on some of those things and you want to make sure that
> whoever investigates this can reproduce it.  But make sure you can
> reproduce it first.
>
> Tim
> Hi,
>
> I got the logs in this order only, and after further checking the system I
> learned that NFS (where we put KahaDB and the broker logs) was slow during
> that time.
>
> I can understand that logging and I/O operations were slow during that
> time, but that does not explain why the failover broker also opened its
> transport connector.
>
> The main concern here is that the master/slave shared-storage topology is
> broken, which should not happen in any case. If I/O operations are not
> happening, the master broker should stop and let the failover broker serve
> the clients, but here the master didn't stop and both brokers opened their
> connectors.
>
> Thanks,
> Anuj
>
