> A repair on a certain CF will fail, and I run it again and again, eventually 
> it will succeed.
How does it fail?

Can you see the repair start on the other node?
If you are getting errors in the log about streaming failing because a node 
died, and the FailureDetector is in the call stack, change the 
phi_convict_threshold. You can set it in the yaml file or via JMX on the 
FailureDetectorMBean; in either case, raise it from 8 to 16 to get the repair 
through. A higher value makes it less likely that a node is marked as down. 
Normally you probably want to run with 8 or a little higher.
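
To give a rough feel for what going from 8 to 16 buys you, here is a small 
sketch. It is not Cassandra code. It assumes a simplified exponential model 
of heartbeat arrivals, where phi grows linearly with the time since the last 
heartbeat, and it assumes a mean gossip interval of 1 second. The function 
names are made up for the example.

```python
import math

def phi(silence_s: float, mean_interval_s: float) -> float:
    """Suspicion level after `silence_s` seconds without a heartbeat.

    Assumption: heartbeat gaps are exponentially distributed, so
    phi = -log10(P(gap > silence)) = silence / (mean * ln 10).
    """
    return silence_s / (mean_interval_s * math.log(10))

def silence_to_convict(threshold: float, mean_interval_s: float) -> float:
    """Seconds of silence needed before phi reaches `threshold`."""
    return threshold * mean_interval_s * math.log(10)

if __name__ == "__main__":
    mean = 1.0  # assumed mean gap between heartbeats, in seconds
    for threshold in (8, 16):
        secs = silence_to_convict(threshold, mean)
        print(f"threshold {threshold}: ~{secs:.1f}s of silence before conviction")
    # prints:
    # threshold 8: ~18.4s of silence before conviction
    # threshold 16: ~36.8s of silence before conviction
```

Under those assumptions, doubling the threshold doubles how long a node can 
go quiet (a long GC pause or a slow cross-DC link, say) before it is marked 
down.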

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/04/2013, at 6:41 PM, Paul Sudol <paulsu...@gmail.com> wrote:

> Hello,
> 
> I have a cluster with 4 nodes, 2 nodes in 2 data centers. I had a hardware 
> failure in one DC and had to replace the nodes. I'm running 1.2.3 on all of 
> the nodes now. I was able to run nodetool rebuild on the two replacement 
> nodes, but now I cannot finish a repair on any of them. I have 18 column 
> families, if I run a repair on a single CF at a time, I can get the node 
> repaired eventually. A repair on a certain CF will fail, and I run it again 
> and again, eventually it will succeed.
> 
> I've got an RF of 2, 1 copy in each DC, so the repair needs to pull data from 
> the other DC to finish its repair.
> 
> The problem seems to be that the merkle tree request sometimes is not 
> received by the node in the other DC. Usually when the merkle tree request is 
> sent, the nodes that it was sent to start a compaction/validation. In certain 
> cases this does not happen, only the node that I ran the repair on will begin 
> compaction/validation and send the merkle tree to itself. Then it's waiting 
> for a merkle tree from the other node, and it will never get it. After about 
> 24 hours it will time out and say the node in question died.
> 
> Is there a setting I can use to force the merkle tree request to be 
> acknowledged or resent if it's not acknowledged? I set up NTPD on all the 
> nodes and tried the cross_node_timeout, but that did not help.
> 
> Thanks in advance,
> 
> Paul
