The lion's share of your drops are from cross-node timeouts, which rely on clock synchronization, so check that first. If your clocks are in sync, then the nodes really are eagerly dropping messages based on elapsed time, and despite that eager dropping you are still facing overload.
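Before digging further, it can help to confirm the clock story directly. Something along these lines (a rough sketch, assuming passwordless SSH and ntpd/ntpq on every node; the hostnames are placeholders for your own inventory) will print each node's NTP offset:

#!/usr/bin/env python3
# Minimal sketch: report the NTP offset of each node's selected sync peer.
# Assumes passwordless SSH and ntpq installed; hostnames are placeholders.
import subprocess

HOSTS = ["cass-node-01", "cass-node-02"]  # replace with your node list

def ntp_offset_ms(host):
    # 'ntpq -pn' prints one line per peer; the selected peer is marked '*'
    # and its offset (column 9) is reported in milliseconds.
    out = subprocess.run(
        ["ssh", host, "ntpq", "-pn"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("*"):
            return float(line.split()[8])
    return None  # no selected peer -> the clock may be free-running

for host in HOSTS:
    offset = ntp_offset_ms(host)
    print(f"{host}: " + ("no sync peer!" if offset is None
                         else f"offset {offset:+.3f} ms"))

Large offsets between nodes will skew the cross-node timeout accounting.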
That local, non-GC pause is also troubling. (I assume non-GC, since there wasn't anything logged by the GC inspector.)

On Mon, Jan 23, 2017 at 12:36 AM, Dikang Gu <dikan...@gmail.com> wrote:
> Hello there,
>
> We have a ~100-node cluster, and I find that there are dropped messages on
> random nodes in the cluster, which cause error spikes and P99 latency
> spikes as well.
>
> I tried to figure out the cause. I do not see any obvious bottleneck in
> the cluster; the C* nodes still have plenty of idle CPU and disk IO. But I
> do see some suspicious gossip events around that time, not sure if it's
> related.
>
> 2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking
> nodes down due to local pause of 13079498815 > 5000000000
> 2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION
> messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> for cross node timeout
> 2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages
> were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> node timeout
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name      Active   Pending      Completed   Blocked  All Time Blocked
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage     128     47794     1015525068         0                 0
> 2017-01-21_16:43:56.85535
> 2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage          64     20202      450508940         0                 0
>
> Any suggestions?
>
> Thanks!
>
> --
> Dikang
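P.S. If you want to see, per node, whether internal or cross-node timeouts dominate over time, a rough tally over the dropped-message log lines (format as shown above; the default log path is an assumption, adjust for your install) could look something like this:

#!/usr/bin/env python3
# Rough sketch: sum dropped-message counts per verb and timeout type from
# a Cassandra system.log, using the log format quoted above.
import re
import sys

PATTERN = re.compile(
    r"(\w+) messages were dropped in last \d+ ms: "
    r"(\d+) for internal timeout and (\d+) for cross node timeout"
)

# Default path is a guess for packaged installs; pass your own as argv[1].
log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"

totals = {}  # verb -> [internal, cross-node]
with open(log_path) as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            counts = totals.setdefault(m.group(1), [0, 0])
            counts[0] += int(m.group(2))
            counts[1] += int(m.group(3))

for verb, (internal, cross) in sorted(totals.items()):
    print(f"{verb:10s} internal={internal:8d} cross-node={cross:8d}")

Running it on each node (or over collected logs) should show where the cross-node timeouts concentrate.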