I capped the heap and the error is still there. I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
the timeouts?
9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
    at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
    at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
    at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
    at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
    at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
    at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
On 12/3/2011 8:34 PM, Maxim Potekhin wrote:
Thank you Peter. Before I look into details as you suggest,
may I ask what you mean by "automatically restarted"? The way
the box and Cassandra are set up in my case is such that the
death of either is final.
Also, how do I look for full GC? I just realized that in the latest
install, I might have omitted capping the heap size -- and the
nodes have 48GB each. I guess this could be a problem, precipitating
GC death, right?
Thank you
Maxim
On 12/3/2011 7:46 PM, Peter Schuller wrote:
quite understand how Cassandra declared a node dead (in the below).
Was it a timeout? How do I fix that?
I was about to respond to say that repair doesn't fail just due to
failure detection, but this appears to have been broken by
CASSANDRA-2433 :(
Unless there is a subtle bug the exception you're seeing should be
indicative that it really was considered Down by the node. You might
grep the log for references to the node in question (UP or DOWN) to
confirm. The question is why, though. I would check whether the node has
maybe automatically restarted, or gone into full GC, etc.
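
As a rough sketch of the log check suggested above, something like the following could list every gossip state change recorded for one endpoint. Only the "is now UP" wording is taken from the log in this thread; the "dead" and "DOWN" variants, the function name state_changes, and the log path in the usage note are assumptions to adjust for your install.

```python
import re

# Matches gossip state-change lines such as
#   "... Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP"
# ASSUMPTION: the down-state wording ("dead" or "DOWN") may differ by version.
STATE_RE = re.compile(r"InetAddress (/\S+) is now (UP|DOWN|dead)", re.IGNORECASE)


def state_changes(lines, endpoint):
    """Return (state, line) pairs for every state change of `endpoint`.

    `lines` is any iterable of log lines (an open file works);
    `endpoint` is written as it appears in the log, e.g. "/130.199.185.193".
    """
    events = []
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group(1) == endpoint:
            # Normalise the state to upper case so UP/DOWN/DEAD compare easily.
            events.append((match.group(2).upper(), line.strip()))
    return events
```

To use it, open the node's system log (the path /var/log/cassandra/system.log is only a guess at a typical location) and call state_changes(log_file, "/130.199.185.194"). A down event shortly before the repair exception, followed by an UP a moment later, would suggest the node was briefly convicted rather than truly gone.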