The lion's share of your drops is from cross-node timeouts. Those are
measured against the sending node's timestamp, so they are only meaningful
if clocks are synchronized across the cluster; check that first. If your
clocks are synced, that means not only are nodes eagerly dropping messages
based on elapsed time, but despite that eager dropping you are still
facing overload.
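To see why clock sync matters: any skew between sender and receiver adds
directly to the apparent message age. A minimal sketch of that effect, where
the 5000 ms threshold mirrors the default write_request_timeout_in_ms but the
function and names are illustrative, not Cassandra's actual code:

```python
TIMEOUT_MS = 5000  # mirrors the default write_request_timeout_in_ms

def is_cross_node_timeout(sent_at_ms, now_ms, timeout_ms=TIMEOUT_MS):
    """Drop the message if it appears older than the timeout,
    judged by the receiver's clock against the sender's timestamp."""
    return (now_ms - sent_at_ms) > timeout_ms

# A mutation that took 1 s in flight, measured with synced clocks: kept.
print(is_cross_node_timeout(sent_at_ms=10_000, now_ms=11_000))

# The same 1 s flight, but the receiver's clock runs 4.5 s ahead of the
# sender's: the message looks 5.5 s old and is dropped spuriously.
skew_ms = 4_500
print(is_cross_node_timeout(sent_at_ms=10_000, now_ms=11_000 + skew_ms))
```

So before chasing overload, rule out skew: with synced clocks the cross-node
counts reflect real queueing delay; without them they can be pure noise.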

That local, non-GC pause is also troubling. (I assume non-GC since there
wasn't anything logged by the GC inspector.)
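For reference, the "local pause of 13079498815 > 5000000000" figure is in
nanoseconds: the gossip task saw roughly 13.1 s elapse between runs against a
5 s threshold, which usually means the whole JVM or box stalled (GC, swap,
blocked I/O). A rough sketch of that check, with hypothetical names rather
than Cassandra's actual implementation:

```python
# Threshold matching the 5000000000 ns (5 s) value in the log line.
MAX_LOCAL_PAUSE_NS = 5_000_000_000

def detect_local_pause(last_run_ns, now_ns, max_pause_ns=MAX_LOCAL_PAUSE_NS):
    """Return the gap between successive runs if it exceeds the threshold,
    else None. A large gap means this process itself was paused, so peer
    heartbeats should not be trusted to mark nodes down."""
    elapsed = now_ns - last_run_ns
    return elapsed if elapsed > max_pause_ns else None

# The gap reported in the log: 13079498815 ns, about 13.1 seconds.
pause = detect_local_pause(0, 13_079_498_815)
print(pause, round(pause / 1e9, 1))
```

When that fires, the node also stops marking peers down for the duration,
since it cannot distinguish a dead peer from its own stall.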

On Mon, Jan 23, 2017 at 12:36 AM, Dikang Gu <dikan...@gmail.com> wrote:

> Hello there,
>
> We have a 100 nodes ish cluster, I find that there are dropped messages on
> random nodes in the cluster, which caused error spikes and P99 latency
> spikes as well.
>
> I tried to figure out the cause. I do not see any obvious bottleneck in
> the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> see some suspicious gossip events around that time, not sure if it's
> related.
>
> 2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking
> nodes down due to local pause of 13079498815 > 5000000000
> 2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
> messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> for cross node timeout
> 2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ messages
> were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> node timeout
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
>                Active   Pending      Completed   Blocked  All Time Blocked
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage
>                   128     47794     1015525068         0                 0
> 2017-01-21_16:43:56.85535
> 2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
>                    64     20202      450508940         0                 0
>
> Any suggestions?
>
> Thanks!
>
> --
> Dikang
>
>
