I capped the heap and the error is still there. I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
the timeouts?


9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)


On 12/3/2011 8:34 PM, Maxim Potekhin wrote:
Thank you Peter. Before I look into details as you suggest,
may I ask what you mean by "automatically restarted"? The way
the box and Cassandra are set up in my case is such that the
death of either is final.

Also, how do I look for full GC? I just realized that in the latest
install, I might have omitted capping the heap size -- and the
nodes have 48GB each. I guess this could be a problem, precipitating
GC death, right?

Thank you

Maxim


On 12/3/2011 7:46 PM, Peter Schuller wrote:
quite understand how Cassandra declared a node dead (in the below). Was it a
timeout? How do I fix that?
I was about to respond to say that repair doesn't fail just due to
failure detection, but this appears to have been broken by
CASSANDRA-2433 :(

Unless there is a subtle bug, the exception you're seeing should
indicate that the endpoint really was considered down by the node. You
might grep the log for references to the node in question (UP or DOWN)
to confirm. The question is why, though. I would check whether the node
has maybe automatically restarted, or gone into a full GC, etc.
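
The log check suggested above (grep for state changes of the node in question) could be sketched roughly as follows in Python. This is only an illustration: the line shape is assumed from the single Gossiper line quoted in this thread ("InetAddress /<ip> is now UP"), the function name node_state_changes is made up for the example, and the wording Cassandra uses for the down state is not shown here, so the pattern simply captures whatever word follows "is now".

```python
import re
from typing import Iterable, List, Tuple

# Assumed line shape, modeled on the one Gossiper line quoted in the thread:
#   INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683)
#   InetAddress /130.199.185.193 is now UP
STATE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"InetAddress /(?P<addr>[0-9.]+) is now (?P<state>\w+)"
)


def node_state_changes(lines: Iterable[str], addr: str) -> List[Tuple[str, str]]:
    """Return (timestamp, state) pairs for one endpoint, in log order."""
    changes = []
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("addr") == addr:
            changes.append((match.group("ts"), match.group("state")))
    return changes


# The one state-change line that appears in the log excerpt of this thread.
sample = [
    "INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) "
    "InetAddress /130.199.185.193 is now UP",
]
print(node_state_changes(sample, "130.199.185.193"))
```

To run it against a real log, pass an open file object instead of the sample list. Comparing the timestamps it returns with the time of the repair failure should show whether the node really was marked down just before the session died.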

