Hi, can you post a Java stack trace (thread dump)?
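For reference, a thread dump can usually be captured straight from the Cassandra JVM with jstack. A minimal sketch follows; the pgrep pattern and output path are only placeholders, assuming a standard JDK install on the node:

    # Find the Cassandra JVM's PID, then dump all thread stacks to a file.
    pid=$(pgrep -f CassandraDaemon)
    jstack -l "$pid" > /tmp/cassandra_threads.txt

    # If jstack is unavailable, SIGQUIT makes the JVM print the dump to its stdout log instead.
    kill -3 "$pid"

A few dumps taken a couple of seconds apart while latencies are elevated are usually more useful than a single snapshot.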
On Tue, Dec 11, 2018 at 10:26 PM, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

Hello all,

I've been doing more analysis and I have a few questions:

1. We observed that most of the requests are blocked on the NTR (Native-Transport-Requests) queue. I increased the queue size from 128 (the default) to 1024, and this time the system recovers automatically (latencies go back to normal) without removing the node from the cluster.
2. Is there a way to fail the NTR requests fast rather than having them block on the NTR queue when the queue is full?

Thanks,
Pratik

From: "Agrawal, Pratik" <paagr...@amazon.com>
Date: Monday, December 3, 2018 at 11:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Marc Selwan <marc.sel...@datastax.com>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Hello,

1. Cassandra latencies spiked to 5-6 times normal (both reads and writes). The latencies were in the high single-digit seconds.
2. As I said in my previous email, we don't bound the NTR threads and queue. The NTR queue on the Cassandra nodes started piling up and requests started getting blocked. 8 (mainly 4) out of 18 nodes in the cluster had NTR requests blocked.
3. As a result of 1) and 2), the Cassandra system resources spiked up (CPU, IO, system load, number of SSTables (10 times, 250 -> 2500), memtable switch count, pending compactions, etc.).
4. One interesting thing we observed was that read calls with quorum consistency were not failing outright (only the high latencies and requests backing up), while read calls with serial consistency were consistently failing on the client side due to C* timeouts.
5. We used the nodetool removenode command to remove the node from the cluster. The node wasn't reachable (IP down).

One thing we don't understand is that as soon as we remove the dead node from the cluster, the system recovers within a minute or so. My main question is: is there a bug in C* where serial consistency calls get blocked on some dead-node resource, with that resource only being released once the dead node is removed from the cluster, or are we hitting some limit here?

Also, as the cluster size increases, the impact of the dead node on serial consistency reads decreases (the latencies spike for a minute or two and then the system recovers automatically).

Any pointers?

Thanks,
Pratik

From: Marc Selwan <marc.sel...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, December 3, 2018 at 1:09 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Ben's question is a good one - what are the exact symptoms you're experiencing? Is it latency spikes? Nodes flapping? That'll help us figure out where to look.

When you removed the down node, which command did you use?

Best,
Marc

On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

One other thing I forgot to add:

native_transport_max_threads: 128

We have this setting commented out; should we bound it? I am planning to experiment with bounding it.

Thanks,
Pratik
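For anyone experimenting with the same knobs, a minimal sketch of the two settings discussed in this thread, assuming Cassandra 2.2.8 where (if I recall correctly) the NTR queue depth is governed by the cassandra.max_queued_native_transport_requests system property added by CASSANDRA-11363; the values and file locations below are only examples, and as far as I'm aware both settings take effect only on restart:

    # cassandra.yaml: bound the number of native transport (CQL) request threads.
    # When left commented out, 2.2 defaults this to 128.
    native_transport_max_threads: 128

    # cassandra-env.sh: raise the NTR queue depth from the default of 128 to 1024.
    JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"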
From: "Agrawal, Pratik" <paagr...@amazon.com>
Date: Sunday, December 2, 2018 at 4:33 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

I looked into some of the logs and saw that at the time of the event the native transport requests started getting blocked, e.g.:

    [INFO] org.apache.cassandra.utils.StatusLogger: Native-Transport-Requests 128 133 51795821 16 19114

The number of blocked requests kept increasing over a period of 5 minutes and then became constant.

As soon as we remove the dead node from the cluster, things recover pretty quickly and the cluster becomes stable.

Any pointers on what to look for when debugging why requests get blocked when a node goes down?

Also, one other thing to note: we reproduced this scenario in our test environment, and as we scale up the cluster, it recovers automatically in a matter of minutes without removing the node from the cluster. It seems like we are reaching some vertical scalability limit (maybe because of our configuration).

Thanks,
Pratik

From: Jeff Jirsa <jji...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, November 27, 2018 at 9:37 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Could also be the app not detecting that the host is down and continuing to try to use it as a coordinator.

--
Jeff Jirsa

On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:

In what way does the cluster become unstable (i.e. more specifically, what are the symptoms)? My first thought would be the loss of the node causing the other nodes to become overloaded, but that doesn't seem to fit with your point 2.

Cheers,
Ben

---
Ben Slater
Chief Product Officer, Instaclustr
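On Jeff's point about the application continuing to route requests to the down host as a coordinator: with the DataStax Java driver, coordinator selection is governed by the load balancing policy, which normally skips hosts the driver has marked down. A minimal sketch for driver 3.x follows; the contact point and the specific policy choice are only illustrative, not necessarily what the original poster runs:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ClusterConnect {
        public static void main(String[] args) {
            // Token-aware routing on top of a DC-aware round-robin policy:
            // requests go to replicas for the partition, and hosts the driver
            // has marked down are excluded from the query plan until they recover.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1") // placeholder contact point
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            Session session = cluster.connect();
            System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
            session.close();
            cluster.close();
        }
    }

Whether the driver actually marks the host down quickly in the network-card-failure case is a separate question (connection heartbeats and read timeouts come into play), so this is only a starting point for checking the client side.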
On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:

Hello all,

Setup:

18-node Cassandra cluster, Cassandra version 2.2.8.
Amazon c3.2xlarge machines.
Replication factor of 3 (in 3 different AZs).
Reads and writes using QUORUM.

Use case:

1. Short-lived data with heavy updates (I know we are abusing Cassandra here) with a gc_grace period of 15 minutes (I know it sounds ridiculous). Leveled compaction strategy.
2. Time-series data, no updates, short-lived (1 hr). TTLed out, using the date-tiered compaction strategy.
3. Time-series data, no updates, long-lived (7 days). TTLed out, using the date-tiered compaction strategy.

Overall high read and write throughput (100,000/second).

Problem:

1. The EC2 machine becomes unreachable (we reproduced the issue by taking down the network card) and the entire cluster becomes unstable until the down node is removed from the cluster. The node is shown as DN in nodetool status. Our understanding was that a single node down in one AZ should not impact the other nodes. We are unable to understand why a single node going down causes the entire cluster to become unstable. Is there any open bug around this?
2. We tried another experiment by killing the Cassandra process, but in this case we only see a blip in latencies and all the other nodes stay healthy and responsive (as expected).

Any thoughts/comments on what could be the issue here?

Thanks,
Pratik

--
Marc Selwan | DataStax | Product Management | (925) 413-7079
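For completeness, the removal step discussed throughout the thread typically looks like the following; the host ID is a placeholder to be taken from the nodetool status output for the node shown as DN:

    # Check ring state; the unreachable node shows up with status "DN" (Down/Normal).
    nodetool status

    # Remove the dead node using the Host ID column from the status output.
    nodetool removenode <host-id-of-the-DN-node>

    # If the removal appears stuck, its progress can be checked with:
    nodetool removenode status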