Could also be the app not detecting that the host is down, so it keeps trying to use it as a coordinator.
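If that's what's happening, making the client fail over faster can help a lot. A rough sketch with the DataStax Java driver 3.x (contact point, timings, and the query are placeholders, tune for your workload; note speculative executions only kick in for statements marked idempotent):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class FastFailoverExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.10")   // placeholder contact point
                // if a coordinator hasn't answered within 500 ms, retry on the
                // next host, up to 2 extra attempts
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(500, 2))
                // don't let a request hang on an unresponsive host for more than 2 s
                .withSocketOptions(new SocketOptions().setReadTimeoutMillis(2000))
                .build();

        // speculative executions only apply to statements marked idempotent
        Statement stmt = new SimpleStatement("SELECT release_version FROM system.local")
                .setIdempotent(true);
        System.out.println(cluster.connect().execute(stmt).one());
        cluster.close();
    }
}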
--
Jeff Jirsa

> On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>
> In what way does the cluster become unstable (i.e. more specifically, what
> are the symptoms)? My first thought would be the loss of the node causing
> the other nodes to become overloaded, but that doesn't seem to fit with
> your point 2.
>
> Cheers
> Ben
> ---
> Ben Slater
> Chief Product Officer
>
> Read our latest technical blog posts here.
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
> This email and any attachments may contain confidential and legally
> privileged information. If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>> On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:
>>
>> Hello all,
>>
>> Setup:
>>
>> - 18-node Cassandra cluster, Cassandra version 2.2.8
>> - Amazon c3.2xlarge machines
>> - Replication factor of 3 (in 3 different AZs)
>> - Reads and writes at QUORUM
>>
>> Use case:
>>
>> - Short-lived data with heavy updates (I know we are abusing Cassandra
>>   here) with a gc_grace_seconds of 15 minutes (I know it sounds
>>   ridiculous). Leveled compaction strategy.
>> - Time-series data, no updates, short-lived (1 hour). TTLed out using
>>   date-tiered compaction strategy.
>> - Time-series data, no updates, long-lived (7 days). TTLed out using
>>   date-tiered compaction strategy.
>>
>> Overall high read and write throughput (100,000/second).
>>
>> Problem:
>>
>> 1. An EC2 machine becomes unreachable (we reproduced the issue by taking
>>    down the network card) and the entire cluster becomes unstable until
>>    the down node is removed from the cluster. The node shows as DN in
>>    nodetool status. Our understanding was that a single node down in one
>>    AZ should not impact the other nodes. We are unable to understand why
>>    a single node going down causes the entire cluster to become unstable.
>>    Is there any open bug around this?
>> 2. We tried another experiment by killing the Cassandra process, but in
>>    this case we only saw a blip in latencies and all the other nodes
>>    stayed healthy and responsive (as expected).
>>
>> Any thoughts/comments on what could be the issue here?
>>
>> Thanks,
>> Pratik
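P.S. For anyone reproducing this client-side, the setup described above (QUORUM reads and writes against an RF=3 keyspace spread over three AZs) would look roughly like the sketch below with the Java driver 3.x. The contact points are placeholders and the single-DC, token-aware load balancing policy is an assumption about the deployment.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class QuorumClusterSetup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                // one contact point per AZ; addresses are placeholders
                .addContactPoints("10.0.0.10", "10.0.1.10", "10.0.2.10")
                // token-aware routing on top of the single-DC round-robin policy
                .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                // QUORUM for both reads and writes, as in the setup described above
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        Session session = cluster.connect();
        System.out.println(session.execute("SELECT cluster_name FROM system.local").one());
        cluster.close();
    }
}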