I bet the problem is with the other tasks on the executor that the Gossip heartbeat runs on.
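To illustrate (a toy sketch, not the actual Cassandra code; the class and task names are made up): a heartbeat scheduled on a shared single-threaded executor simply stops firing while any other task on that executor blocks, whereas a heartbeat on its own dedicated executor keeps ticking.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatStarvationDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        // Shared single-threaded scheduler: heartbeat plus maintenance tasks.
        ScheduledExecutorService shared = Executors.newSingleThreadScheduledExecutor();
        shared.scheduleAtFixedRate(
                () -> System.out.println("shared heartbeat    " + System.currentTimeMillis()),
                0, 1, TimeUnit.SECONDS);

        // Stand-in for a maintenance task that blocks (think of forceFlush
        // waiting on a full flush queue).  While it sleeps, no shared
        // heartbeats run at all.
        shared.schedule(() -> {
            try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
        }, 3, TimeUnit.SECONDS);

        // The fix: give the heartbeat its own executor so nothing else can block it.
        ScheduledExecutorService dedicated = Executors.newSingleThreadScheduledExecutor();
        dedicated.scheduleAtFixedRate(
                () -> System.out.println("dedicated heartbeat " + System.currentTimeMillis()),
                0, 1, TimeUnit.SECONDS);

        Thread.sleep(15_000);
        shared.shutdownNow();
        dedicated.shutdownNow();
    }
}

While the blocking task sleeps, the dedicated heartbeat keeps printing every second and the shared one goes silent, which is essentially what the failure detectors on the other nodes are reacting to.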
I see at least two that could cause blocking: hint cleanup post-delivery
and flush-expired-memtables, both of which call forceFlush, which will
block if the flush queue + threads are full. We've run into this before
(CASSANDRA-2253); we should move Gossip back to its own dedicated
executor, or it will keep happening whenever someone accidentally puts
something on the "shared" executor that can block.

Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix this.

Thanks for tracking down the problem!

On Mon, Apr 25, 2011 at 11:51 AM, Terje Marthinussen
<tmarthinus...@gmail.com> wrote:
> Got just enough time to look at this today to verify that:
>
> Sometimes nodes (under pressure) fail to send heartbeats for long
> enough to get marked as dead by other nodes (why is a good question,
> which I need to check better. It does not seem to be GC).
>
> The node does, however, start sending heartbeats again, and other nodes
> log that they receive the heartbeats, but this will not get it marked
> as UP again until it is restarted.
>
> So, it seems like there are 2 issues:
> - Nodes pausing (may just be node overload)
> - Nodes are not marked as UP unless restarted
>
> Regards,
> Terje
>
> On 24 Apr 2011, at 23:24, Terje Marthinussen <tmarthinus...@gmail.com> wrote:
>
>> The world as seen from .81 in the ring below:
>> .81           Up     Normal  85.55 GB    8.33%  Token(bytes[30])
>> .82           Down   Normal  83.23 GB    8.33%  Token(bytes[313230])
>> .83           Up     Normal  70.43 GB    8.33%  Token(bytes[313437])
>> .84           Up     Normal  81.7 GB     8.33%  Token(bytes[313836])
>> .85           Up     Normal  108.39 GB   8.33%  Token(bytes[323336])
>> .86           Up     Normal  126.19 GB   8.33%  Token(bytes[333234])
>> .87           Up     Normal  127.16 GB   8.33%  Token(bytes[333939])
>> .88           Up     Normal  135.92 GB   8.33%  Token(bytes[343739])
>> .89           Up     Normal  117.1 GB    8.33%  Token(bytes[353730])
>> .90           Up     Normal  101.67 GB   8.33%  Token(bytes[363635])
>> .91           Down   Normal  88.33 GB    8.33%  Token(bytes[383036])
>> .92           Up     Normal  129.95 GB   8.33%  Token(bytes[6a])
>>
>> From .82:
>> .81           Down   Normal  85.55 GB    8.33%  Token(bytes[30])
>> .82           Up     Normal  83.23 GB    8.33%  Token(bytes[313230])
>> .83           Up     Normal  70.43 GB    8.33%  Token(bytes[313437])
>> .84           Up     Normal  81.7 GB     8.33%  Token(bytes[313836])
>> .85           Up     Normal  108.39 GB   8.33%  Token(bytes[323336])
>> .86           Up     Normal  126.19 GB   8.33%  Token(bytes[333234])
>> .87           Up     Normal  127.16 GB   8.33%  Token(bytes[333939])
>> .88           Up     Normal  135.92 GB   8.33%  Token(bytes[343739])
>> .89           Up     Normal  117.1 GB    8.33%  Token(bytes[353730])
>> .90           Up     Normal  101.67 GB   8.33%  Token(bytes[363635])
>> .91           Down   Normal  88.33 GB    8.33%  Token(bytes[383036])
>> .92           Up     Normal  129.95 GB   8.33%  Token(bytes[6a])
>>
>> From .84:
>> 10.10.42.81   Down   Normal  85.55 GB    8.33%  Token(bytes[30])
>> 10.10.42.82   Down   Normal  83.23 GB    8.33%  Token(bytes[313230])
>> 10.10.42.83   Up     Normal  70.43 GB    8.33%  Token(bytes[313437])
>> 10.10.42.84   Up     Normal  81.7 GB     8.33%  Token(bytes[313836])
>> 10.10.42.85   Up     Normal  108.39 GB   8.33%  Token(bytes[323336])
>> 10.10.42.86   Up     Normal  126.19 GB   8.33%  Token(bytes[333234])
>> 10.10.42.87   Up     Normal  127.16 GB   8.33%  Token(bytes[333939])
>> 10.10.42.88   Up     Normal  135.92 GB   8.33%  Token(bytes[343739])
>> 10.10.42.89   Up     Normal  117.1 GB    8.33%  Token(bytes[353730])
>> 10.10.42.90   Up     Normal  101.67 GB   8.33%  Token(bytes[363635])
>> 10.10.42.91   Down   Normal  88.33 GB    8.33%  Token(bytes[383036])
>> 10.10.42.92   Up     Normal  129.95 GB   8.33%  Token(bytes[6a])
>>
>> All of the nodes seem to be working when looked at individually, and on
>> .84, for instance, I can see
>>
>> INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611)
>> InetAddress /.81 is now dead.
>>
>> but there are no other messages related to the nodes "disappearing" as far
>> as I can see in the 18 hours since that message occurred.
>>
>> Restarting seems to recover things, but nodes seem to go away again (0.8
>> also seems to be prone to commit logs being unreadable in some cases?).
>>
>> This is a 0.8 build from trunk last Friday.
>>
>> I will try to enable some more debugging tomorrow to see if there is
>> something interesting; just curious if anyone else has noticed something
>> like this.
>>
>> Regards,
>> Terje
>>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com