Graham,

This is largely fixed in 2.1 with the introduction of partially off-heap memtables - the slabs reside off-heap, so they do not cause any GC issues.
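(For illustration only - a minimal sketch of the recycling idea discussed below, using hypothetical names rather than the actual patch: keep a bounded pool of retired 1MB regions so that each region is promoted to the old gen once and is then reused, instead of being reallocated in eden under write load.)

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a recycling pool for 1MB slab regions (not Cassandra code).
    public class RecyclingSlabPool
    {
        public static final int REGION_SIZE = 1 << 20; // 1MB, the SlabAllocator region size

        private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<>();
        private final AtomicInteger pooled = new AtomicInteger();
        private final int maxPooled;

        public RecyclingSlabPool(int maxPooled)
        {
            this.maxPooled = maxPooled;
        }

        // Hand out a recycled region if one is available, otherwise allocate a new one.
        public byte[] acquire()
        {
            byte[] region = free.poll();
            if (region != null)
            {
                pooled.decrementAndGet();
                return region;
            }
            return new byte[REGION_SIZE];
        }

        // Return a region when its memtable is discarded; drop it once the pool is full.
        public void release(byte[] region)
        {
            if (region.length != REGION_SIZE)
                return;
            if (pooled.incrementAndGet() <= maxPooled)
                free.offer(region);
            else
                pooled.decrementAndGet(); // pool is full, let the GC reclaim this one
        }
    }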
As it happens the changes would also permit us to recycle on-heap slabs reasonably easily as well, so feel free to file a ticket for that, although it won't be backported to 2.0.

On 21 May 2014 00:57, graham sanderson <gra...@vast.com> wrote:

> So I've been tinkering a bit with CMS config because we are still seeing fairly frequent full compacting GC due to fragmentation/promotion failure.
>
> As mentioned below, we are usually too fragmented to promote new in-flight memtables.
>
> This is likely caused by sudden write spikes (which we do have), though actually the problems don't generally happen at the time of our largest write spikes (though any write spike likely causes a spill of both new memtables and many other new objects of unknown size into the tenured gen, so they cause fragmentation if not an immediate GC issue). We have lots of things going on in this multi-tenant cluster (GC pauses are of course extra bad, since they cause a spike in hinted handoff on other nodes which were already busy, etc.).
>
> Anyway, considering possibilities:
>
> 0) Try to make our application behavior more steady-state - this is probably possible, but there are lots of other things (e.g. compaction, OpsCenter, repair, etc.) which are both tunable and generally throttle-able to think about too.
>
> 1) Play with tweaking PLAB configs to see if we can ease fragmentation (I'd be curious what the "crud" is in particular that is getting spilled - presumably it is larger objects, since it affects the binary tree of large objects).
>
> 2) Given the above, if we can guarantee even > 24 hours without full GC, I don't think we'd mind running a regular rolling restart on the servers during off hours (note that usually the GCs don't have a visible impact, but when they hit multiple machines at once they can).
>
> 3) Zing is seriously an option, if it would save us large amounts of tuning and constant worry about the "next" thing tweaking the allocation patterns - does anyone have any experience with Zing & Cassandra?
>
> 4) Given that we expect periodic bursts of writes, memtable_total_space_in_mb is bounded, and we are not actually short of memory (it just gets fragmented), I'm wondering if anyone has played with pinning (up to, or initially?) that many 1MB chunks of memory via SlabAllocator and re-using them… They will get promoted once, and then these 1M chunks won't be part of the subsequent promotion hassle… it will probably also allow more crud to die in eden under write load, since we aren't allocating these large chunks in eden at the same time. Anyway, I had a little look at the code, and the life cycle of memtables is not trivial, but I was considering attempting a patch to play with… anyone have any thoughts?
>
> Basically, in summary, the slab allocator helps by allocating and freeing lots of objects at the same time; however, any time slabs are allocated under load, we end up promoting them with whatever other live stuff is still in eden. If we only do this once and reuse the slabs, we are likely to minimize our promotion problem later (at least for these large objects).
>
> On May 16, 2014, at 9:37 PM, graham sanderson <gra...@vast.com> wrote:
>
> > Excellent - thank you…
> >
> > On May 16, 2014, at 7:08 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr> wrote:
> >
> >> Hi,
> >> This is arena allocation of memtables.
> >> See here for more info:
> >> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
> >>
> >> From: graham sanderson <gra...@vast.com>
> >> To: dev@cassandra.apache.org
> >> Date: 16/05/2014 14:03
> >> Subject: Things that are about 1M big
> >>
> >> So just throwing this out there for those for whom this might ring a bell.
> >>
> >> I'm debugging some CMS memory fragmentation issues on 2.0.5 - and interestingly enough most of the objects giving us promotion failures are of size 131074 (dwords). GC logging obviously doesn't say what those are, but I'd wager money they are either 1M-big byte arrays, or less likely 256k-entry object arrays backing large maps.
> >>
> >> So it's not strictly critical to solving my problem, but I was wondering if anyone can think of any heap-allocated C* objects which are (with no significant changes to standard Cassandra config) allocated in 1M chunks. (It would save me scouring the code, or a 9 gig heap dump, if I need to figure it out!)
> >>
> >> Thanks,
> >>
> >> Graham
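As a sanity check on the 131074 (dword) figure above - assuming the CMS promotion-failure sizes are reported in 8-byte heap words and the usual 64-bit HotSpot layout where a byte[] carries a 16-byte header - the number lines up exactly with the 1MB regions the SlabAllocator hands out:

    public class SlabSizeCheck
    {
        public static void main(String[] args)
        {
            long headerBytes = 16;            // 12-byte object header + 4-byte array length (compressed oops)
            long regionBytes = 1L << 20;      // 1MB SlabAllocator region backing a byte[]
            long totalBytes = headerBytes + regionBytes; // 1,048,592 bytes
            long heapWords = totalBytes / 8;  // promotion-failure sizes are 8-byte words
            System.out.println(heapWords);    // prints 131074, matching the GC log
        }
    }

So the "things that are about 1M big" are almost certainly the memtable slab regions themselves, which is consistent with Samuel's pointer to arena allocation.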