Have you tried using the G1 garbage collector instead of CMS?

We had the same problem: things were normally fine, but as soon as
something extraordinary happened, a node could go into GC hell and never
recover. That could then spread to other nodes as they took up the
slack, trapping them in GC hell too, and so on.

We did two things that helped us a lot: we switched to G1GC, and we
switched to off-heap memtables. The second is pretty much a no-brainer; it
might even be the default in 2.2.x, but do it if not. Switching to G1 needs
to be monitored closely, since it has very different characteristics from
CMS, but it helped us in our case.

Both things are very easy to try out: each is just a config change and a
node restart, and if you have good monitoring you should be able to see how
they compare in the regular case and the extraordinary case.
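
For reference, a rough sketch of what the two changes might look like on a
2.x node. It assumes the stock layout, where GC flags live in
conf/cassandra-env.sh and memtable settings in conf/cassandra.yaml. The pause
target is purely an illustrative value, not a recommendation, so check both
files for your exact version before copying anything.

```shell
# conf/cassandra.yaml (YAML, shown here as a comment):
#   memtable_allocation_type: offheap_objects
#
# conf/cassandra-env.sh: comment out the existing CMS-related flags
# (e.g. -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC and the CMS tuning
# options that follow them), then enable G1 instead:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"

# Assumed pause-time goal in milliseconds; tune against your own monitoring.
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
```

With G1 it is generally advised not to pin the young generation size (the
-Xmn flag derived from HEAP_NEWSIZE), since a fixed size interferes with
G1's pause-time targeting. Roll the change out one node at a time and
compare GC pause metrics against a CMS node under the same load.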


/Henrik

On Wed, Aug 3, 2016 at 11:09 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Kevin,
>
> "Our scheme uses large buckets of content where we write to a
> bucket/partition for 5 minutes, then move to a new one."
>
> Are you writing to a single partition and only that partition for 5
> minutes?  If so, you should really rethink your data model.  This method
> does not scale as you add nodes, it can only scale vertically.
>
> On Wed, Aug 3, 2016 at 9:24 AM Reynald Bourtembourg <
> reynald.bourtembo...@esrf.fr> wrote:
>
>> Hi,
>>
>> Maybe Ben was referring to this issue which has been mentioned recently
>> on this mailing list:
>> https://issues.apache.org/jira/browse/CASSANDRA-11887
>>
>> Cheers,
>> Reynald
>>
>>
>> On 03/08/2016 18:09, Romain Hardouin wrote:
>>
>> > Curious why the 2.2 to 3.x upgrade path is risky at best.
>> I guess that upgrade from 2.2 is less tested by DataStax QA because DSE4
>> used C* 2.1, not 2.2.
>> I would say the safest upgrade is 2.1 to 3.0.x
>>
>> Best,
>>
>> Romain
>>
>>
>>
