Repair hangs when merkle tree request is not acknowledged

Paul Sudol Thu, 04 Apr 2013 06:30:02 -0700

Hello,

I have a cluster with 4 nodes, 2 in each of 2 data centers. I had a hardware 
failure in one DC and had to replace the nodes. I'm running 1.2.3 on all of the 
nodes now. I was able to run nodetool rebuild on the two replacement nodes, but 
now I cannot finish a repair on any of them. I have 18 column families. If I 
run a repair on a single CF at a time, I can get the node repaired eventually: 
a repair on a given CF will fail, but if I run it again and again, it 
eventually succeeds.
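
The per-CF retry workaround described above could be scripted roughly as 
follows. This is only a sketch: the keyspace and column family names are 
placeholders, the attempt count and timeout are arbitrary, and it assumes 
nodetool is on the PATH and exits non-zero when a repair fails. Killing a hung 
nodetool client may not cancel the repair session on the node itself.

```python
import subprocess


def repair_with_retries(keyspace, column_families, max_attempts=5,
                        timeout_s=6 * 3600, run=subprocess.run):
    """Repair each column family on its own, retrying failures and hangs.

    Returns a dict mapping each CF name to True (repaired) or False (gave up).
    `run` is injectable so the loop can be exercised without a cluster.
    """
    results = {}
    for cf in column_families:
        results[cf] = False
        for _attempt in range(max_attempts):
            try:
                # Same as running: nodetool repair <keyspace> <cf>
                proc = run(["nodetool", "repair", keyspace, cf],
                           timeout=timeout_s)
            except subprocess.TimeoutExpired:
                # Treat a hang as a failed attempt. This only stops the
                # client from waiting; the node may still hold the session.
                continue
            if proc.returncode == 0:
                results[cf] = True
                break
    return results


# Hypothetical usage (names are placeholders, not from my cluster):
#   status = repair_with_retries("my_keyspace", ["cf_a", "cf_b"])
#   print([cf for cf, ok in status.items() if not ok])
```

A timeout well below the roughly 24 hours it currently takes for a hung 
repair to give up is the main point of wrapping the call.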


I've got an RF of 2, 1 copy in each DC, so the repair needs to pull data from 
the other DC to finish its repair.

The problem seems to be that the merkle tree request sometimes is not received 
by the node in the other DC. Usually when the merkle tree request is sent, the 
nodes that it was sent to start a compaction/validation. In certain cases this 
does not happen: only the node that I ran the repair on begins 
compaction/validation and sends the merkle tree to itself. It then waits for 
a merkle tree from the other node, which it never gets. After about 24 
hours it times out and says the node in question died.

Is there a setting I can use to force the merkle tree request to be 
acknowledged, or resent if it's not acknowledged? I set up ntpd on all the 
nodes and tried the cross_node_timeout setting, but that did not help.

Thanks in advance,

Paul
