On Fri, Nov 28, 2014 at 12:55 PM, Paulo Ricardo Motta Gomes <
paulo.mo...@chaordicsystems.com> wrote:

> We restart the whole cluster every 1 or 2 months, to avoid machines
> getting into this crazy state. We tried tuning GC size and parameters,
> different Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps
> happening. More recently, during Black Friday, we received about 5x our
> normal load, and some machines started presenting this behavior. Once
> again, we restart the nodes and the GC behaves normally again.
> ...
> You can clearly notice some memory is actually reclaimed during GC in
> healthy nodes, while in sick machines very little memory is reclaimed.
> Also, since GC is executed more frequently in sick machines, it uses about
> 2x more CPU than non-sick nodes.
>
> Have you ever observed this behavior in your cluster? Could this be
> related to heap fragmentation? Would using the G1 collector help in this
> case? Any GC tuning or monitoring advice to troubleshoot this issue?
>

The specific combo of symptoms does in fact sound like your working set
is leaving you close to heap exhaustion, with fragmentation then putting
you over the top.

I would probably start by increasing your heap, which will help avoid the
pre-fail condition from your working set.

But for tuning, examine the contents of each generation when the JVM gets
into this state. You are probably exhausting permanent generation, but
depending on what you find, you could change the relative sizing of the
generations.
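
As a rough illustration of "examine the contents of each generation", here
is a minimal sketch using the standard java.lang.management API. The class
and method names (GenerationUsage, describePools, usedFraction) are made up
for this example, and the pool names it prints depend on which collector the
JVM is running. Note that it reports on the JVM it runs inside; to look at a
live Cassandra node you would need to read the same MemoryPool MBeans over
JMX, or use something like jstat -gcutil against the node's pid.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists each JVM memory pool (eden, survivor, old gen,
// perm gen / metaspace, ...) with current use and use after the last GC.
public class GenerationUsage {

    // Fraction of the pool in use, or -1 when the JVM reports no maximum.
    public static double usedFraction(long used, long max) {
        return max > 0 ? (double) used / max : -1.0;
    }

    public static List<String> describePools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage now = pool.getUsage();
            if (now == null) {
                continue; // pool is no longer valid
            }
            // Usage measured right after the most recent collection of this
            // pool; null when the pool does not support it.
            MemoryUsage afterGc = pool.getCollectionUsage();
            String afterGcText = (afterGc == null)
                    ? "n/a"
                    : String.valueOf(afterGc.getUsed());
            lines.add(String.format(
                    "%s: used=%d max=%d fraction=%.2f usedAfterLastGc=%s",
                    pool.getName(), now.getUsed(), now.getMax(),
                    usedFraction(now.getUsed(), now.getMax()), afterGcText));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

A pool whose usedAfterLastGc stays close to its max on a sick node, but
drops well below it on a healthy one, would be the generation to look at
first when deciding how to resize.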

=Rob
