Re: Repair failure under 0.8.6

Maxim Potekhin Sun, 04 Dec 2011 10:11:39 -0800

I capped the heap and the error is still there. I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
the timeouts?



9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
 INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)



On 12/3/2011 8:34 PM, Maxim Potekhin wrote:

Thank you Peter. Before I look into details as you suggest,
may I ask what you mean by "automatically restarted"? The way
the box and Cassandra are set up in my case is such that the
death of either is final.

Also, how do I look for full GC? I just realized that in the latest
install, I might have omitted capping the heap size -- and the
nodes have 48GB each. I guess this could be a problem, precipitating
GC death, right?

Thank you

Maxim


On 12/3/2011 7:46 PM, Peter Schuller wrote:

quite understand how Cassandra declared a node dead (in the below). Was it a
timeout? How do I fix that?

I was about to respond to say that repair doesn't fail just due to
failure detection, but this appears to have been broken by
CASSANDRA-2433 :(

Unless there is a subtle bug, the exception you're seeing should
indicate that the node really was considered Down. You might
grep the log for references to the node in question (UP or DOWN) to
confirm. The question is why, though. I would check whether the node has
maybe automatically restarted, or gone into a full GC, etc.
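
The log check suggested above could be sketched roughly as follows. This is
only an illustration, not anything shipped with Cassandra: the function name
state_changes is made up, the pattern is modelled on the "is now UP" line in
the log quoted in this thread, and the "dead"/"DOWN" wording for the opposite
transition is an assumption that may need adjusting to what your log actually
prints.

```python
import re

# Pattern modelled on the Gossiper line quoted in this thread:
#   INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683)
#   InetAddress /130.199.185.193 is now UP
# The "dead"/"DOWN" alternatives are an assumption about how the opposite
# transition is worded; adjust them to match the real log.
STATE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"InetAddress /(?P<addr>[\d.]+) is now (?P<state>UP|DOWN|dead)",
    re.IGNORECASE,
)


def state_changes(lines, endpoint):
    """Return (timestamp, state) pairs for one endpoint, in log order."""
    changes = []
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("addr") == endpoint:
            changes.append((match.group("ts"), match.group("state").upper()))
    return changes
```

To use it, pass an open log file (any iterable of lines works) and the address
without the leading slash, e.g. state_changes(open(path), "130.199.185.194"),
where path points at the node's system log. A DEAD/UP pair only seconds apart
would suggest a long pause or a network blip rather than a real outage.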
