Graham,

This is largely fixed in 2.1 with the introduction of partially off-heap
memtables - the slabs reside off-heap, so they do not cause any GC issues.

As it happens, the changes would also permit us to recycle on-heap slabs
reasonably easily, so feel free to file a ticket for that, although it
won't be backported to 2.0.
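
For reference, the essence of the off-heap approach is that each 1MB slab is
backed by memory outside the Java heap, so the region itself is never promoted
and never fragments the old gen. A minimal sketch, assuming direct ByteBuffers
(illustrative only - the class and names below are made up and are not the
actual 2.1 implementation):

    import java.nio.ByteBuffer;

    // Illustrative off-heap slab: the 1MB region lives outside the Java heap,
    // so it neither fragments the old generation nor adds promotion pressure.
    final class OffHeapSlab
    {
        static final int SLAB_SIZE = 1 << 20; // 1MB region, as in the slab allocator

        private final ByteBuffer region = ByteBuffer.allocateDirect(SLAB_SIZE);
        private int next = 0; // single-threaded sketch; a real allocator would bump atomically

        /** Bump-allocate 'size' bytes from the slab, or return null if it is full. */
        ByteBuffer allocate(int size)
        {
            if (next + size > SLAB_SIZE)
                return null;
            ByteBuffer dup = region.duplicate();
            dup.position(next);
            dup.limit(next + size);
            next += size;
            return dup.slice(); // an independent view of just this allocation
        }
    }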


On 21 May 2014 00:57, graham sanderson <gra...@vast.com> wrote:

> So I’ve been tinkering a bit with CMS config because we are still seeing
> fairly frequent full compacting GCs due to fragmentation/promotion failure.
>
> As mentioned below, the old gen is usually too fragmented to promote the new
> in-flight memtables.
>
> This is likely caused by sudden write spikes (which we do have), though
> actually the problems don’t generally happen at the time of our largest
> write spikes (that said, any write spike likely spills both new memtables
> and many other new objects of unknown size into the tenured gen, so they
> cause fragmentation even if not an immediate GC issue). We have lots of
> things going on in this multi-tenant cluster (GC pauses are of course extra
> bad, since they cause a spike in hinted handoff on other nodes which were
> already busy, etc…)
>
> Anyway, considering possibilities:
>
> 0) Try to make our application behavior more steady-state - this is
> probably possible, but there are lots of other things to think about too
> (e.g. compaction, opscenter, repair, etc.), which are both tunable and
> generally throttle-able.
> 1) Play with tweaking PLAB configs to see if we can ease fragmentation -
> see the example flags after this list (I’d be curious what the “crud” is in
> particular that is getting spilled - presumably it is larger objects, since
> it affects the binary tree of large objects)
> 2) Given the above, if we can guarantee even > 24 hours without a full GC,
> I don’t think we’d mind running a regular rolling restart of the servers
> during off hours (note that usually the GCs don’t have a visible impact,
> but when they hit multiple machines at once they can)
> 3) Zing is seriously an option if it would save us large amounts of tuning
> and constant worry about the “next” thing tweaking the allocation patterns -
> does anyone have any experience with Zing & Cassandra?
> 4) Given that we expect periodic bursts of writes, that
> memtable_total_space_in_mb is bounded, and that we are not actually short of
> memory (it just gets fragmented), I’m wondering if anyone has played with
> pinning (up to or initially?) that many 1MB chunks of memory via
> SlabAllocator and re-using them… Each chunk would get promoted once, and
> then these 1MB chunks won’t be part of the subsequent promotion hassle… it
> will probably also allow more crud to die in eden under write load, since we
> aren’t allocating these large chunks in eden at the same time. Anyway, I had
> a little look at the code, and the life cycle of memtables is not trivial,
> but I was considering attempting a patch to play with… does anyone have any
> thoughts?
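>
> For 1), the kind of settings and diagnostics I mean (assuming HotSpot CMS on
> JDK 7 - exact flag availability varies by JVM build and version) would be
> along the lines of:
>
>     -XX:+PrintPromotionFailure   # log the word size of allocations that fail to promote
>     -XX:PrintFLSStatistics=1     # dump old-gen free-list / fragmentation stats around GCs
>     -XX:OldPLABSize=<n>          # size of the old-gen promotion buffers (PLABs)
>     -XX:-ResizeOldPLAB           # disable dynamic PLAB resizing if it makes things worse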
>
> Basically, in summary: the slab allocator helps by allocating and freeing
> lots of objects at the same time; however, any time slabs are allocated
> under load, we end up promoting them along with whatever other live objects
> are still in eden. If we only do this once and reuse the slabs, we are
> likely to minimize our promotion problem later (at least for these large
> objects).
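>
> To make 4) a bit more concrete, the rough shape I have in mind (purely a
> sketch with made-up names, not an actual patch) is a free list of
> already-promoted 1MB regions that the slab allocator could draw from instead
> of allocating fresh byte[]s under load:
>
>     import java.util.concurrent.ConcurrentLinkedQueue;
>
>     // Sketch only: recycle the 1MB regions so each one is promoted at most once.
>     final class RecycledRegionPool
>     {
>         static final int REGION_SIZE = 1 << 20; // the same 1MB regions the SlabAllocator uses
>
>         private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<byte[]>();
>
>         byte[] acquire()
>         {
>             byte[] region = free.poll();
>             return region != null ? region : new byte[REGION_SIZE]; // allocate only on a miss
>         }
>
>         void release(byte[] region)
>         {
>             free.offer(region); // hand the region back when its memtable is discarded
>         }
>     }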
>
> On May 16, 2014, at 9:37 PM, graham sanderson <gra...@vast.com> wrote:
>
> > Excellent - thank you…
> >
> > On May 16, 2014, at 7:08 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr> wrote:
> >
> >> Hi,
> >> This is arena allocation of memtables. See here for more info:
> >> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
> >>
> >>
> >>
> >>
> >> From:    graham sanderson <gra...@vast.com>
> >> To:      dev@cassandra.apache.org,
> >> Date:    16/05/2014 14:03
> >> Subject: Things that are about 1M big
> >>
> >>
> >>
> >> So just throwing this out there for those for whom this might ring a bell.
> >>
> >> I’m debugging some CMS memory fragmentation issues on 2.0.5 - and
> >> interestingly enough most of the objects giving us promotion failures are
> >> of size 131074 (dwords) - GC logging obviously doesn’t say what those are,
> >> but I’d wager money they are either 1MB byte arrays, or less likely
> >> 256k-entry object arrays backing large maps
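> >>
> >> (As a sanity check on that guess, assuming the log reports sizes in 8-byte
> >> heap words: 131074 * 8 = 1,048,592 bytes, which is exactly a 1MB byte[]
> >> payload of 1,048,576 bytes plus a 16-byte array header with compressed
> >> oops.)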
> >>
> >> So not strictly critical to solving my problem, but I was wondering if
> >> anyone can think of any heap-allocated C* objects which are (with no
> >> significant changes to standard cassandra config) allocated in 1MB chunks.
> >> (It would save me scouring the code, or a 9 gig heap dump, if I need to
> >> figure it out!)
> >>
> >> Thanks,
> >>
> >> Graham
> >
>
>
