Hi, can you post a Java stack trace (thread dump)?
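For reference, a thread dump can usually be captured straight from the Cassandra JVM with jstack. A minimal sketch follows; the pgrep pattern and output path are only placeholders, assuming a standard JDK install on the node:

    # Find the Cassandra JVM's PID, then dump all thread stacks to a file.
    pid=$(pgrep -f CassandraDaemon)
    jstack -l "$pid" > /tmp/cassandra_threads.txt

    # If jstack is unavailable, SIGQUIT makes the JVM print the dump to its stdout log instead.
    kill -3 "$pid"

A few dumps taken a couple of seconds apart while latencies are elevated are usually more useful than a single snapshot.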
On Tue, Dec 11, 2018 at 10:26 PM, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

Hello all,

I've been doing more analysis and I have a few questions:

1. We observed that most of the requests are blocked on the NTR (Native-Transport-Requests) queue. I increased the queue size from 128 (the default) to 1024, and this time the system recovers automatically (latencies go back to normal) without removing the node from the cluster.
2. Is there a way to fail the NTR requests fast rather than having them block on the NTR queue when the queue is full?

Thanks,
Pratik

From: "Agrawal, Pratik" <paagr...@amazon.com>
Date: Monday, December 3, 2018 at 11:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Marc Selwan <marc.sel...@datastax.com>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Hello,

1. Cassandra latencies spiked to 5-6 times normal (both reads and writes). The latencies were in the high single-digit seconds.
2. As I said in my previous email, we don't bound the NTR threads and queue. The NTR queue on the Cassandra nodes started piling up and requests started getting blocked. 8 (mainly 4) out of 18 nodes in the cluster had NTR requests blocked.
3. As a result of 1) and 2), the Cassandra system resources spiked up (CPU, IO, system load, number of SSTables (10 times, 250 -> 2500), memtable switch count, pending compactions, etc.).
4. One interesting thing we observed was that read calls with quorum consistency were not failing outright (only the high latencies and requests backing up), while read calls with serial consistency were consistently failing on the client side due to C* timeouts.
5. We used the nodetool removenode command to remove the node from the cluster. The node wasn't reachable (IP down).

One thing we don't understand is that as soon as we remove the dead node from the cluster, the system recovers within a minute or so. My main question is: is there a bug in C* where serial consistency calls get blocked on some dead-node resource, with that resource only being released once the dead node is removed from the cluster, or are we hitting some limit here?

Also, as the cluster size increases, the impact of the dead node on serial consistency reads decreases (the latencies spike for a minute or two and then the system recovers automatically).

Any pointers?

Thanks,
Pratik

From: Marc Selwan <marc.sel...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, December 3, 2018 at 1:09 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Ben's question is a good one - what are the exact symptoms you're experiencing? Is it latency spikes? Nodes flapping? That'll help us figure out where to look.

When you removed the down node, which command did you use?

Best,
Marc

On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

One other thing I forgot to add:

native_transport_max_threads: 128

We have this setting commented out; should we bound it? I am planning to experiment with bounding it.

Thanks,
Pratik
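For anyone experimenting with the same knobs, a minimal sketch of the two settings discussed in this thread, assuming Cassandra 2.2.8 where (if I recall correctly) the NTR queue depth is governed by the cassandra.max_queued_native_transport_requests system property added by CASSANDRA-11363; the values and file locations below are only examples, and as far as I'm aware both settings take effect only on restart:

    # cassandra.yaml: bound the number of native transport (CQL) request threads.
    # When left commented out, 2.2 defaults this to 128.
    native_transport_max_threads: 128

    # cassandra-env.sh: raise the NTR queue depth from the default of 128 to 1024.
    JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"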
From: "Agrawal, Pratik" <paagr...@amazon.com>
Date: Sunday, December 2, 2018 at 4:33 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

I looked into some of the logs and saw that at the time of the event the native transport requests started getting blocked, e.g.:

    [INFO] org.apache.cassandra.utils.StatusLogger: Native-Transport-Requests 128 133 51795821 16 19114

The number of blocked requests kept increasing over a period of 5 minutes and then became constant.

As soon as we remove the dead node from the cluster, things recover pretty quickly and the cluster becomes stable.

Any pointers on what to look for when debugging why requests get blocked when a node goes down?

Also, one other thing to note: we reproduced this scenario in our test environment, and as we scale up the cluster, it recovers automatically in a matter of minutes without removing the node from the cluster. It seems like we are reaching some vertical scalability limit (maybe because of our configuration).

Thanks,
Pratik

From: Jeff Jirsa <jji...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, November 27, 2018 at 9:37 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Could also be the app not detecting that the host is down and continuing to try to use it as a coordinator.

--
Jeff Jirsa

On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:

In what way does the cluster become unstable (i.e. more specifically, what are the symptoms)? My first thought would be the loss of the node causing the other nodes to become overloaded, but that doesn't seem to fit with your point 2.

Cheers,
Ben

---
Ben Slater
Chief Product Officer, Instaclustr
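On Jeff's point about the application continuing to route requests to the down host as a coordinator: with the DataStax Java driver, coordinator selection is governed by the load balancing policy, which normally skips hosts the driver has marked down. A minimal sketch for driver 3.x follows; the contact point and the specific policy choice are only illustrative, not necessarily what the original poster runs:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ClusterConnect {
        public static void main(String[] args) {
            // Token-aware routing on top of a DC-aware round-robin policy:
            // requests go to replicas for the partition, and hosts the driver
            // has marked down are excluded from the query plan until they recover.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1") // placeholder contact point
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            Session session = cluster.connect();
            System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
            session.close();
            cluster.close();
        }
    }

Whether the driver actually marks the host down quickly in the network-card-failure case is a separate question (connection heartbeats and read timeouts come into play), so this is only a starting point for checking the client side.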
On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

Hello all,

Setup:

18-node Cassandra cluster, Cassandra version 2.2.8.
Amazon c3.2xlarge machines.
Replication factor of 3 (in 3 different AZs).
Reads and writes using QUORUM.

Use case:

1. Short-lived data with heavy updates (I know we are abusing Cassandra here) with a gc_grace period of 15 minutes (I know it sounds ridiculous). Leveled compaction strategy.
2. Time-series data, no updates, short-lived (1 hr). TTLed out, using the date-tiered compaction strategy.
3. Time-series data, no updates, long-lived (7 days). TTLed out, using the date-tiered compaction strategy.

Overall high read and write throughput (100,000/second).

Problem:

1. The EC2 machine becomes unreachable (we reproduced the issue by taking down the network card) and the entire cluster becomes unstable until the down node is removed from the cluster. The node is shown as DN in nodetool status. Our understanding was that a single node down in one AZ should not impact the other nodes. We are unable to understand why a single node going down causes the entire cluster to become unstable. Is there any open bug around this?
2. We tried another experiment by killing the Cassandra process, but in this case we only see a blip in latencies and all the other nodes stay healthy and responsive (as expected).

Any thoughts/comments on what could be the issue here?

Thanks,
Pratik

--
Marc Selwan | DataStax | Product Management | (925) 413-7079
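For completeness, the removal step discussed throughout the thread typically looks like the following; the host ID is a placeholder to be taken from the nodetool status output for the node shown as DN:

    # Check ring state; the unreachable node shows up with status "DN" (Down/Normal).
    nodetool status

    # Remove the dead node using the Host ID column from the status output.
    nodetool removenode <host-id-of-the-DN-node>

    # If the removal appears stuck, its progress can be checked with:
    nodetool removenode status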