I bet the problem is with the other tasks on the executor that the
Gossip heartbeat runs on.

I see at least two that could cause blocking: hint cleanup
post-delivery and flush-expired-memtables, both of which call
forceFlush, which will block if the flush queue and flush threads are full.
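
To make the failure mode concrete, here is a rough, self-contained sketch
(plain JDK executors, not the actual Cassandra code; the class and task
names are made up for illustration) of how one blocking task on a shared
single-threaded scheduled executor stalls a periodic heartbeat scheduled
on the same pool:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedExecutorStarvation {
    public static void main(String[] args) {
        // One thread standing in for the shared "ScheduledTasks" pool.
        ScheduledExecutorService shared = Executors.newSingleThreadScheduledExecutor();

        // Periodic heartbeat, analogous to the gossip task that should run every second.
        shared.scheduleWithFixedDelay(
                () -> System.out.println("heartbeat " + System.currentTimeMillis()),
                0, 1, TimeUnit.SECONDS);

        // Stand-in for forceFlush blocking on a full flush queue: while this
        // sleeps, the heartbeat cannot run because both share the single thread.
        shared.schedule(() -> {
            try {
                Thread.sleep(30_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 2, TimeUnit.SECONDS);
    }
}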

We've run into this before (CASSANDRA-2253); we should move Gossip
back to its own dedicated executor or it will keep happening whenever
someone accidentally puts something on the "shared" executor that can
block.
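
The fix amounts to giving the heartbeat its own thread. A minimal sketch of
that idea (again just illustrative JDK code with an invented "GossipTasks"
thread name, not the actual patch):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DedicatedGossipExecutor {
    public static void main(String[] args) {
        // Dedicated single-threaded executor just for gossip; nothing else is
        // ever scheduled here, so a stuck flush can no longer delay heartbeats.
        ScheduledExecutorService gossip = Executors.newSingleThreadScheduledExecutor(
                r -> new Thread(r, "GossipTasks"));

        // Anything that may block (hint cleanup, flushing expired memtables, ...)
        // stays on the shared pool.
        ScheduledExecutorService shared = Executors.newSingleThreadScheduledExecutor();

        gossip.scheduleWithFixedDelay(
                () -> System.out.println("gossip round " + System.currentTimeMillis()),
                0, 1, TimeUnit.SECONDS);

        shared.schedule(() -> {
            try {
                Thread.sleep(30_000);   // simulated blocking flush; gossip keeps running
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 2, TimeUnit.SECONDS);
    }
}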

Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix
this.  Thanks for tracking down the problem!

On Mon, Apr 25, 2011 at 11:51 AM, Terje Marthinussen
<tmarthinus...@gmail.com> wrote:
> Got just enough time to look at this today to verify that:
>
> Sometimes nodes (under pressure) fail to send heartbeats for long
> enough to get marked as dead by other nodes (why is a good question,
> which I need to check more carefully; it does not seem to be GC).
>
> The node does, however, start sending heartbeats again, and other nodes
> log that they receive the heartbeats, but this does not get it marked
> as UP again until it is restarted.
>
> So, it seems like there are two issues:
> - Nodes pausing (may be just node overload)
> - Nodes are not marked as UP unless restarted
>
> Regards,
> Terje
>
> On 24 Apr 2011, at 23:24, Terje Marthinussen <tmarthinus...@gmail.com> wrote:
>
>> World as seen from .81 in the below ring
>> .81     Up     Normal  85.55 GB        8.33%   Token(bytes[30])
>> .82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
>> .83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
>> .84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
>> .85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
>> .86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
>> .87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
>> .88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
>> .89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
>> .90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
>> .91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
>> .92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>>
>>
>> From .82
>> .81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
>> .82     Up     Normal  83.23 GB        8.33%   Token(bytes[313230])
>> .83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
>> .84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
>> .85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
>> .86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
>> .87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
>> .88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
>> .89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
>> .90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
>> .91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
>> .92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>>
>> From .84
>> 10.10.42.81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
>> 10.10.42.82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
>> 10.10.42.83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
>> 10.10.42.84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
>> 10.10.42.85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
>> 10.10.42.86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
>> 10.10.42.87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
>> 10.10.42.88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
>> 10.10.42.89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
>> 10.10.42.90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
>> 10.10.42.91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
>> 10.10.42.92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>>
>> All of the nodes seem to be working when looked at individually, and I can
>> see on .84, for instance, that
>>  INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611)
>> InetAddress /.81 is now dead.
>>
>> but there are no other messages related to the nodes "disappearing" as far
>> as I can see in the 18 hours since that message occurred.
>>
>> Restarting seems to recover things, but the nodes seem to go away again (0.8
>> also seems to be prone to commit logs being unreadable in some cases?)
>>
>> This is 0.8 build from trunk last Friday.
>>
>> I will try to enable some more debugging tomorrow to see if there is
>> anything interesting; I am just curious whether anyone else has noticed
>> something like this.
>>
>> Regards,
>> Terje
>>
>>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
