Re: Dropped messages on random nodes.

2017-01-24 Thread Dikang Gu
Thanks, guys! Jeff Jirsa helped me take a look, and I found a 10-second young GC pause in the GC log:
3071128K->282000K(3495296K), 0.1144648 secs] 25943529K->23186623K(66409856K), 9.8971781 secs] [Times: user=2.33 sys=0.00, real=9.89 secs]
I'm trying to get a histogram or heap dump. Thanks! On Mo
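For anyone who wants to scan a whole GC log for pauses like this one, here is a minimal sketch that pulls the numbers out of the line quoted above. It assumes the usual layout where the first `before->after(capacity)` segment is the young generation and the second is the whole heap; the function and field names (`parse_gc_line`, `before_kb`, and so on) are made up for illustration and are not part of any Cassandra or JVM tooling.

```python
import re

# Matches segments such as "3071128K->282000K(3495296K), 0.1144648 secs".
SEGMENT = re.compile(r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs")
# Matches the wall-clock time in the trailing "[Times: ...]" block.
REAL = re.compile(r"real=([\d.]+) secs")


def parse_gc_line(line):
    """Extract occupancy/time segments and wall-clock time from one GC log line."""
    segments = [
        {
            "before_kb": int(before),
            "after_kb": int(after),
            "capacity_kb": int(capacity),
            "secs": float(secs),
        }
        for before, after, capacity, secs in SEGMENT.findall(line)
    ]
    real = REAL.search(line)
    return {
        "segments": segments,
        "real_secs": float(real.group(1)) if real else None,
    }


# The log fragment exactly as posted in this thread.
LINE = (
    "3071128K->282000K(3495296K), 0.1144648 secs] "
    "25943529K->23186623K(66409856K), 9.8971781 secs] "
    "[Times: user=2.33 sys=0.00, real=9.89 secs]"
)

result = parse_gc_line(LINE)
young, heap = result["segments"]  # assumed order: young generation, then whole heap
print(f"young gen segment: {young['secs']} s")
print(f"whole heap segment: {heap['secs']} s")
print(f"wall clock: {result['real_secs']} s")
```

Running the same function over every line of the log and filtering on `real_secs` is one way to list the long pauses before digging into a histogram or heap dump.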

Re: Dropped messages on random nodes.

2017-01-23 Thread Brandon Williams
The lion's share of your drops are from cross-node timeouts, which require clock synchronization, so check that first. If your clocks are synced, that means not only are you showing eager dropping based on time, but despite the eager dropping you are still facing overload. That local, non-gc paus
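To illustrate why cross-node timeouts depend on clock synchronization, here is a rough sketch of the idea, not Cassandra's actual code. It assumes the sender stamps each message with its own clock and the receiver compares that stamp to its own clock; `should_drop`, `receiver_view`, and the 2000 ms timeout are illustrative assumptions.

```python
TIMEOUT_MS = 2000  # illustrative timeout, not taken from the thread


def should_drop(sent_at_ms, received_at_ms, timeout_ms=TIMEOUT_MS):
    """Drop when the message age, as seen by the receiver, exceeds the timeout."""
    return received_at_ms - sent_at_ms > timeout_ms


def receiver_view(true_send_ms, transit_ms, skew_ms):
    """Return (sender stamp, receiver clock reading) for one message.

    skew_ms is how far the receiver's clock runs ahead of the sender's.
    """
    sent_at = true_send_ms
    received_at = true_send_ms + transit_ms + skew_ms
    return sent_at, received_at


# Same 50 ms network transit, with and without 3 s of clock skew.
synced = should_drop(*receiver_view(1_000_000, transit_ms=50, skew_ms=0))
skewed = should_drop(*receiver_view(1_000_000, transit_ms=50, skew_ms=3000))
print("dropped with synced clocks:", synced)
print("dropped with 3 s skew:", skewed)
```

With skew in the other direction (receiver behind the sender), the same check would keep messages that are already stale, which is why checking clock sync comes first.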

Re: Dropped messages on random nodes.

2017-01-23 Thread Roopa Tangirala
Dikang, did you take a look at the heap health on those nodes? A quick heap histogram or dump would help you figure out whether it is related to a data issue (wide rows or a bad data model), where a few nodes may be coming under heap pressure and dropping messages. Thanks, Roopa

Re: Dropped messages on random nodes.

2017-01-23 Thread Blake Eggleston
Hi Dikang, Do you have any GC logging or metrics you can correlate with the dropped messages? A 13 second pause sounds like a bad GC pause. Thanks, Blake On January 22, 2017 at 10:37:22 PM, Dikang Gu (dikan...@gmail.com) wrote: Btw, the C* version is 2.2.5, with several backported patches.
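One way to do the correlation suggested here is to check, for each dropped-message timestamp, whether a GC pause overlaps it within a small window. This is only a sketch: the function name, the 5-second window, and the sample timestamps are hypothetical and not taken from the cluster in question.

```python
from datetime import datetime, timedelta


def pauses_near_drops(pauses, drops, window=timedelta(seconds=5)):
    """Map each drop timestamp to the GC pauses that overlap it within `window`.

    pauses: list of (start datetime, duration in seconds)
    drops:  list of datetimes at which dropped messages were logged
    """
    matches = {}
    for drop in drops:
        matches[drop] = [
            (start, secs)
            for start, secs in pauses
            if start - window <= drop <= start + timedelta(seconds=secs) + window
        ]
    return matches


# Hypothetical sample data, for illustration only.
pauses = [
    (datetime(2017, 1, 22, 10, 0, 0), 9.9),
    (datetime(2017, 1, 22, 10, 30, 0), 0.1),
]
drops = [
    datetime(2017, 1, 22, 10, 0, 8),
    datetime(2017, 1, 22, 11, 0, 0),
]

matches = pauses_near_drops(pauses, drops)
for drop, hits in matches.items():
    print(drop.isoformat(), "->", len(hits), "nearby pause(s)")
```

Drops with no nearby pause point away from GC and toward other causes, such as the clock and overload issues raised elsewhere in this thread.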

Re: Dropped messages on random nodes.

2017-01-22 Thread Dikang Gu
Btw, the C* version is 2.2.5, with several backported patches.
On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu wrote:
> Hello there,
>
> We have a 100-node-ish cluster. I find that there are dropped messages on
> random nodes in the cluster, which caused error spikes and P99 latency
> spikes as wel