For context, READ is one of the verbs Cassandra is allowed to drop once a
request has been queued longer than the request timeout (DROPPABLE_VERBS in
MessagingService):

/**
 * Verbs it's okay to drop if the request has been queued longer than the request timeout. These
 * all correspond to client requests or something triggered by them; we don't want to
 * drop internal messages like bootstrap or repair notifications.
 */
public static final EnumSet<Verb> DROPPABLE_VERBS = EnumSet.of(Verb.BINARY,
                                                               Verb._TRACE,
                                                               Verb.MUTATION,
                                                               Verb.READ_REPAIR,
                                                               Verb.READ,
                                                               Verb.RANGE_SLICE,
                                                               Verb.PAGED_RANGE,
                                                               Verb.REQUEST_RESPONSE);

The short-term solution would probably be to increase the timeouts in your
yaml file (the relevant settings are sketched below), but I suggest you get
the monitoring graphs (internode ping, block IO) ready first, so they give a
better indication of what the exact problem might be. A few commands for
quick manual checks are at the bottom of this mail.
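For reference, these are the timeout knobs I mean (the setting names are as
they appear in the stock cassandra.yaml; the values below are only an
illustration of raising them, not a recommendation):

    # cassandra.yaml: read-path timeouts (illustrative values)
    read_request_timeout_in_ms: 10000    # coordinator wait for reads; default 5000
    range_request_timeout_in_ms: 20000   # range/paged scans; default 10000
    request_timeout_in_ms: 20000         # catch-all for other request types; default 10000

Raising these only hides whichever part of the read path is slow, which is
why the graphs are the more useful place to spend effort.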
Jason

On Tue, Jan 7, 2014 at 2:30 AM, Blake Eggleston <bl...@shift.com> wrote:
> That’s a good point. CPU steal time is very low, but I haven’t observed
> internode ping times during one of the peaks; I’ll have to check that out.
> Another thing I’ve noticed is that Cassandra starts dropping read messages
> during the spikes, as reported by tpstats. This indicates that there are
> too many queries for Cassandra to handle. However, as I mentioned earlier,
> the spikes aren’t correlated with an increase in reads.
>
> On Jan 5, 2014, at 3:28 PM, Blake Eggleston <bl...@shift.com> wrote:
>
> > Hi,
> >
> > I’ve been having a problem with 3 neighboring nodes in our cluster
> > having their read latencies jump up to 9000-18000 ms for a few minutes
> > (as reported by OpsCenter), then come back down.
> >
> > We’re running a 6-node cluster on AWS hi1.4xlarge instances, with
> > Cassandra reading and writing to 2 RAIDed SSDs.
> >
> > I’ve added 2 nodes to the struggling part of the cluster, and aside
> > from the latency spikes shifting onto the new nodes, it has had no
> > effect. I suspect that a single key living on the first stressed node
> > may be getting read very heavily.
> >
> > The spikes in latency don’t seem to be correlated with an increase in
> > reads. The cluster usually handles a maximum of 4200 reads/sec per
> > node, with writes significantly lower at ~200/sec per node. It is
> > usually fine with this, with read latencies at around 3.5-10 ms/read,
> > but once or twice an hour the latencies on the 3 nodes shoot through
> > the roof.
> >
> > The disks aren’t showing serious use, with read and write rates on the
> > SSD volume at around 1350 kBps and 3218 kBps, respectively. Each
> > Cassandra process is maintaining 1000-1100 open connections. GC logs
> > aren’t showing any serious GC pauses.
> >
> > Any ideas on what might be causing this?
> >
> > Thanks,
> >
> > Blake
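As mentioned at the top, a rough way to eyeball the same things by hand
while the graphs are being set up (standard tools only; the refresh
intervals and the peer address below are just placeholders):

    # dropped READs show up in the "Message type / Dropped" table at the end of the output
    watch -n 10 'nodetool tpstats'

    # extended block IO stats for the SSD volume, refreshed every 5 seconds (needs sysstat)
    iostat -x 5

    # internode latency from this node to one of its neighbours (placeholder address)
    ping -c 100 10.0.0.2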