Thanks, Daemeon!

I will capture the netstat output and share it in the next few days. We were 
thinking of taking TCP dumps as well. If it is a network issue and increasing 
the request timeout helped, I am not sure how Cassandra could be dropping 
messages based on a timeout. Repair messages are non-droppable and are not 
supposed to time out.


Two of the three nodes in the DC are able to complete repair without any issue. 
Only one node is problematic.


I also observed frequent messages in the logs of the other nodes saying that 
hint replay timed out, and the node to which hints were being replayed is 
always a remote DC node. Is this related somehow?


Thanks

Anuj


From: "daemeon reiydelle" <daeme...@gmail.com>
Date: Thu, 12 Nov, 2015 at 10:34 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Have you checked the network statistics on that machine (netstat -tas) while 
attempting to repair? If netstat shows ANY issues, you have a problem. Could 
you put the command in a loop running every 60 seconds for maybe 15 minutes 
and post back the output?
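
One way to run that loop is a small Python script. This is only a sketch: the function name sample_command, the output file name netstat_samples.log, and the defaults are assumptions chosen for illustration, and it assumes netstat is on the PATH of the affected node.

```python
import subprocess
import time
from datetime import datetime


def sample_command(cmd, interval_s=60, samples=15, out_path="netstat_samples.log"):
    """Run cmd `samples` times, `interval_s` seconds apart, appending
    timestamped output to out_path. All names and defaults are illustrative."""
    with open(out_path, "a") as log:
        for i in range(samples):
            stamp = datetime.now().isoformat(timespec="seconds")
            result = subprocess.run(cmd, capture_output=True, text=True)
            # One header line per sample so the runs can be told apart later.
            log.write(f"=== {stamp} (exit {result.returncode}) ===\n")
            log.write(result.stdout)
            log.write(result.stderr)
            log.flush()
            # Sleep between samples, but not after the last one.
            if i < samples - 1:
                time.sleep(interval_s)
    return out_path


# Example call (start it just before kicking off the repair):
# sample_command(["netstat", "-tas"], interval_s=60, samples=15)
```

Comparing the counters between consecutive samples (retransmits, resets, failed connection attempts) should show whether they grow while the repair is hanging.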

Out of curiosity, how many remote DC nodes are getting successfully repaired?




Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872


On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi,


We are using Cassandra 2.0.14. We have 2 DCs at remote locations with 10GBps 
connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only 
one node in DC2, we are unable to complete repair, as it always hangs. The node 
sends Merkle tree requests, but one or more nodes in DC1 (remote) never show 
that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

Any comments on why this is happening on just one node? In 
OutboundTcpConnection.java, the isTimeOut method always returns false for a 
non-droppable verb such as a Merkle tree request (verb=REPAIR_MESSAGE), so why 
did increasing the request timeout solve the problem on one occasion?
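
To make the question concrete, here is a toy model of the behaviour described above. It is not Cassandra's source code: the class QueuedMessage, its fields, and the choice of READ as an example of a droppable verb are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class QueuedMessage:
    """Hypothetical stand-in for a message waiting in an outbound queue."""
    verb: str
    enqueued_at_ms: int
    droppable: bool

    def is_timed_out(self, now_ms: int, timeout_ms: int) -> bool:
        # A non-droppable message never reports a timeout, however old it is,
        # so the timeout value cannot cause it to be dropped at this point.
        return self.droppable and (now_ms - self.enqueued_at_ms) > timeout_ms


read = QueuedMessage("READ", enqueued_at_ms=0, droppable=True)
repair = QueuedMessage("REPAIR_MESSAGE", enqueued_at_ms=0, droppable=False)

# 15 s after enqueueing, with a 10 s timeout:
print(read.is_timed_out(now_ms=15_000, timeout_ms=10_000))
print(repair.is_timed_out(now_ms=15_000, timeout_ms=10_000))
```

Under this model, raising the timeout changes the outcome only for droppable verbs, which is why the effect on repair is surprising.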



Thanks

Anuj Wadehra 



