I also thought that bringing the slave up before the master would solve the
problem, but it didn't ...

The slave waits with the message
"AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.6.2 [null] started, 
waiting live to fail before it gets active"

As soon as the master is started, it says:

AMQ221024: Backup server 
ActiveMQServerImpl::serverUUID=e0c8c135-8834-11e8-a326-0a0027000014 is 
synchronized with live-server.
AMQ221031: backup announced


As we want fail-back functionality, we have used the following in the slave:

                             
<max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>
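
For reference, that element sits inside the slave's replication policy in
broker.xml; a sketch, with the surrounding elements following the standard
Artemis HA schema:

   <ha-policy>
      <replication>
         <slave>
            <allow-failback>true</allow-failback>
            <max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>
         </slave>
      </replication>
   </ha-policy>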

I have a strong feeling that this setting may be messing it up; please confirm.

Thanks

--- Udayan Sahu

-----Original Message-----
From: Clebert Suconic [mailto:clebert.suco...@gmail.com] 
Sent: Wednesday, July 18, 2018 6:28 AM
To: Udayan Sahu <udayan.s...@oracle.com>
Cc: users@activemq.apache.org
Subject: Re: Potential message loss seen with HA topology in Artemis 2.6.2 on 
failback

You could have another passive backup that would take over when M1 is killed;
M1 could then become the backup when it is restarted.

But if the node is alone and you killed it, you need to start it first.
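
A sketch of what that extra backup's broker.xml could look like, assuming you
tie it to M1's group via group-name (the group name itself is a placeholder):

   <ha-policy>
      <replication>
         <slave>
            <group-name>group-m1</group-name>
            <allow-failback>true</allow-failback>
         </slave>
      </replication>
   </ha-policy>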

On Wed, Jul 18, 2018 at 9:27 AM, Clebert Suconic <clebert.suco...@gmail.com> 
wrote:
> At the moment you have to start the server that was live most recently first.
>
> I know there's a task to compare the age of the journals before
> synchronizing, but it's not done yet.
>
> On Tue, Jul 17, 2018 at 6:48 PM, Udayan Sahu <udayan.s...@oracle.com> wrote:
>> It's a simple HA subsystem, with a simple ask: in a replicated state
>> system, it should start from the last committed state…
>>
>>
>>
>> Step 1: Master (M1) & standby (S1) alive
>>
>> Step 2: Producer sends 10 messages -> M1 receives them and replicates them
>> to S1
>>
>> Step 3: Kill master (M1) -> S1 becomes the new master
>>
>> Step 4: Producer sends 10 messages -> S1 receives the messages, but they are
>> not replicated since M1 is down
>>
>> Step 5: Kill standby (S1)
>>
>> Step 6: Start master (M1)
>>
>> Step 7: Start standby (S1) (it syncs with master (M1), discarding its own
>> internal state)
>>
>> This is wrong. M1 should sync with S1 since S1 represents the current 
>> state of the queue.
>>
>>
>>
>> How can we protect the Step 4 messages from being lost? We are using a
>> transacted session and calling commit to make sure messages are persisted.
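>>
>> Roughly what our producer does; a sketch, with the URL, queue name, and
>> class wiring as placeholders:
>>
>> import javax.jms.*;
>> import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;
>>
>> public class TransactedSend {
>>     public static void main(String[] args) throws JMSException {
>>         ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
>>         try (Connection connection = cf.createConnection()) {
>>             // Transacted session: sends are buffered until commit().
>>             Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
>>             MessageProducer producer =
>>                 session.createProducer(session.createQueue("exampleQueue"));
>>             producer.setDeliveryMode(DeliveryMode.PERSISTENT);
>>             for (int i = 0; i < 10; i++) {
>>                 producer.send(session.createTextMessage("message " + i));
>>             }
>>             session.commit(); // persisted on the broker only after this returns
>>         }
>>     }
>> }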
>>
>>
>>
>> --- Udayan Sahu
>>
>>
>>
>>
>>
>> From: Clebert Suconic [mailto:clebert.suco...@gmail.com]
>> Sent: Tuesday, July 17, 2018 2:50 PM
>> To: users@activemq.apache.org
>> Cc: Udayan Sahu <udayan.s...@oracle.com>
>> Subject: Re: Potential message loss seen with HA topology in Artemis 
>> 2.6.2 on failback
>>
>>
>>
>> HA is about preserving the journals between failures.
>>
>>
>>
>> When you read and send messages you may still have a failure during
>> the reading. I would need to understand what you do in case of a
>> failure with your consumer and producer.
>>
>>
>>
>> Retries on send and duplicate detection are key for your case.
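>>
>> A sketch of what that could look like, assuming Artemis' standard
>> _AMQ_DUPL_ID duplicate-detection property (MAX_RETRIES and the
>> session/producer wiring are placeholders):
>>
>> import java.util.UUID;
>> import javax.jms.*;
>>
>> public class ReliableSend {
>>     static final int MAX_RETRIES = 3;
>>
>>     static void sendWithRetry(Session session, MessageProducer producer,
>>                               String body) throws JMSException {
>>         TextMessage message = session.createTextMessage(body);
>>         // The broker drops any later message carrying an _AMQ_DUPL_ID it has
>>         // already seen, so a retried send cannot create a duplicate.
>>         message.setStringProperty("_AMQ_DUPL_ID", UUID.randomUUID().toString());
>>         JMSException last = null;
>>         for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
>>             try {
>>                 producer.send(message);
>>                 session.commit();
>>                 return;
>>             } catch (JMSException e) {
>>                 last = e;
>>                 session.rollback();
>>             }
>>         }
>>         throw last;
>>     }
>> }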
>>
>>
>>
>> You could also play with XA and a transaction manager.
>>
>>
>>
>> On Tue, Jul 17, 2018 at 5:01 PM Neha Sareen <neha.sar...@oracle.com> wrote:
>>
>> Hi,
>>
>>
>>
>> We are setting up a cluster of 6 brokers using Artemis 2.6.2.
>>
>>
>>
>> The cluster has 3 groups.
>>
>> - Each group has one master and one slave broker pair.
>>
>> - The HA uses replication.
>>
>> - Each master broker configuration has the flag 
>> 'check-for-live-server' set to true.
>>
>> - Each slave broker configuration has the flag 'allow-failback' set to true.
>>
>> - We use static connectors for allowing cluster topology discovery.
>>
>> - Each broker's static connector list includes the connectors to the 
>> other 5 servers in the cluster.
>>
>> - Each broker declares its acceptor.
>>
>> - Each broker exports its own connector information via the 'connector-ref'
>> configuration element.
>>
>> - The acceptor and the connector URLs for each broker are identical
>> with respect to the host and port information (see the sketch below).
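>>
>> A sketch of one master broker's broker.xml, with connector names, hosts,
>> and ports as placeholders:
>>
>>    <connectors>
>>       <connector name="master1">tcp://master1:61616</connector>
>>       <connector name="slave1">tcp://slave1:61616</connector>
>>       <!-- ...connectors for the other four brokers... -->
>>    </connectors>
>>
>>    <ha-policy>
>>       <replication>
>>          <master>
>>             <check-for-live-server>true</check-for-live-server>
>>          </master>
>>       </replication>
>>    </ha-policy>
>>
>>    <cluster-connections>
>>       <cluster-connection name="my-cluster">
>>          <connector-ref>master1</connector-ref>
>>          <static-connectors>
>>             <connector-ref>slave1</connector-ref>
>>             <!-- ...the other four connectors... -->
>>          </static-connectors>
>>       </cluster-connection>
>>    </cluster-connections>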
>>
>>
>>
>> We have a standalone test application that creates producers and
>> consumers to write and receive messages respectively, using a
>> transacted JMS session.
>>
>>
>>
>> We are trying to execute an automatic failover test case followed by
>> failback, as follows:
>>
>> Test Case 1
>>
>> Step 1: Master & standby alive
>>
>> Step 2: Producer sends messages, say 9 messages
>>
>> Step 3: Kill master
>>
>> Step 4: Producer sends messages, say another 9 messages
>>
>> Step 5: Kill standby
>>
>> Step 6: Start master
>>
>> Step 7: Start standby.
>>
>> What we see is that the standby syncs with the master, discarding its own
>> internal state, and we are able to consume only 9 messages, leading to a
>> loss of 9 messages.
>>
>>
>>
>>
>>
>> Test Case 2
>>
>> Step 1: Master & standby alive
>>
>> Step 2: Producer sends messages
>>
>> Step 3: Kill master
>>
>> Step 4: Producer sends messages
>>
>> Step 5: Kill standby
>>
>> Step 6: Start standby (it waits for the master)
>>
>> Step 7: Start master (question: does it wait for the slave?)
>>
>> Step 8: Consume messages
>>
>>
>>
>> Can someone provide any insights here regarding the potential message loss?
>>
>> Also, are there alternative topologies we could use here
>> to get around this issue?
>>
>>
>>
>> Thanks
>>
>> Neha
>>
>>
>>
>> --
>>
>> Clebert Suconic
>
>
>
> --
> Clebert Suconic



--
Clebert Suconic
