On Mon, Nov 24, 2014 at 12:57 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> I’m trying to track down some exceptions in our production cluster. I
> bumped up our write load and now I’m getting a non-trivial number of these
> exceptions. Somewhere on the order of 100 per hour.
>
> All machines have a somewhat high CPU load because they’re doing other
> tasks. I’m worried that perhaps my background tasks are just overloading
> cassandra and one way to mitigate this is to nice them to least favorable
> priority (this is my first tasks).

Two out of three of them are timeouts or lack of availability. Seeing this
across your cluster is usually associated with hitting a "pre-fail"
condition in terms of GC, where the amount of data stored per node makes
the steady state working set larger than available non-fragmented heap.

If you're graphing GC time, I would expect to see a concomitant spike
there.

=Rob