Hi Benedict,

So I had a look at the code, and as you say it looked pretty easy to recycle
on-heap slabs… there is already RACE_ALLOCATED, which keeps a strongly
referenced pool; however, in this case I was thinking of just using
WeakReferences.

In terms of on-heap slabs, it seemed to me that recycling the oldest slab you
have is probably the best heuristic, since it is the least likely to still be
in eden (and of course re-using one that is in eden is no worse than the worst
case today). Since the problem tends to be promotion failure of slabs due to
fragmentation of the old gen, recycling one that is already there is even
better - better still if it has been compacted somewhere pretty stable. I
think this heuristic would also work well for G1, though I believe the
recommendation is still not to use that with Cassandra.

For the implementation of that I was thinking of using a ConcurrentSkipListMap
from a Long representing the allocation order of the Region to a WeakReference
to the Region (just regular 1M-sized ones)… allocators can pull the oldest
entry and discard cleared references (we might need a scrubber if the map got
too big and we were only ever checking the first entry). Beyond that I don’t
think there is any need for a configurable-length collection of strongly
referenced reusable slabs. A rough sketch of what I mean is below.
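Very roughly (class and method names are purely illustrative, and ByteBuffer
stands in for the SlabAllocator’s 1M Region):

    import java.lang.ref.WeakReference;
    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicLong;

    final class SlabRecycler
    {
        private static final AtomicLong ALLOCATION_ORDER = new AtomicLong();

        // allocation order -> weakly referenced slab, oldest first
        private static final ConcurrentSkipListMap<Long, WeakReference<ByteBuffer>> POOL =
                new ConcurrentSkipListMap<>();

        // offer a retired slab for re-use; we only keep a weak reference to it
        static void recycle(ByteBuffer slab)
        {
            POOL.put(ALLOCATION_ORDER.incrementAndGet(), new WeakReference<>(slab));
        }

        // try to reuse the oldest surviving slab (most likely already in old gen);
        // cleared references are discarded as we go, though a periodic scrub could
        // also prune them if the map ever grew large
        static ByteBuffer reuse()
        {
            Map.Entry<Long, WeakReference<ByteBuffer>> oldest;
            while ((oldest = POOL.pollFirstEntry()) != null)
            {
                ByteBuffer slab = oldest.getValue().get();
                if (slab != null)
                {
                    slab.clear();
                    return slab;
                }
            }
            return null; // nothing to recycle; caller allocates a fresh 1M slab
        }
    }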

Question 1:

This is easy enough to implement, and should probably just be turned on by an
orthogonal setting… I guess on-heap slabs are the current default, so this
feature will be useful. For illustration, I imagine something like the flag
below.
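(The name here is entirely made up - just to show what I mean by an orthogonal
setting in cassandra.yaml:)

    # hypothetical option, name purely illustrative
    memtable_heap_slab_recycling_enabled: true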

Question 2:

Something similar could be done for off-heap slabs… that case seems more like
it would want a size limit on the number of re-usable slabs… strong references
with an explicit clean() are probably better than using weak references and
letting the PhantomReference-based cleaner on DirectByteBuffer do the cleanup
later. A rough sketch of that is below too.
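Again only a sketch (names illustrative): a bounded pool of strongly
referenced slabs, with an eager explicit free when the pool is full rather
than waiting on the DirectByteBuffer cleaner:

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;

    final class OffHeapSlabPool
    {
        private static final int SLAB_SIZE = 1 << 20; // 1M, same as today
        private static final int MAX_POOLED = 64;     // the size limit mentioned above
        private static final ArrayBlockingQueue<ByteBuffer> POOL =
                new ArrayBlockingQueue<>(MAX_POOLED);

        // reuse a pooled slab if one is available, otherwise allocate a new one
        static ByteBuffer acquire()
        {
            ByteBuffer slab = POOL.poll();
            if (slab == null)
                return ByteBuffer.allocateDirect(SLAB_SIZE);
            slab.clear();
            return slab;
        }

        // return a slab to the pool; if the pool is full, free its native
        // memory eagerly instead of leaving it to the GC's cleaner
        static void release(ByteBuffer slab)
        {
            if (!POOL.offer(slab))
                ((sun.nio.ch.DirectBuffer) slab).cleaner().clean(); // pre-Java-9 explicit free
        }
    }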

Let me know any thoughts and I’ll open an issue (probably two - one for
on-heap and one for off-heap)… let me know whether you’d like the first
assigned to you or me (I couldn’t work on it before next week).

Thanks,

Graham.

On May 21, 2014, at 2:20 AM, Benedict Elliott Smith 
<belliottsm...@datastax.com> wrote:

> Graham,
> 
> This is largely fixed in 2.1 with the introduction of partially off-heap
> memtables - the slabs reside off-heap, so do not cause any GC issues.
> 
> As it happens the changes would also permit us to recycle on-heap slabs
> reasonably easily as well, so feel free to file a ticket for that, although
> it won't be back ported to 2.0.
> 
> 
> On 21 May 2014 00:57, graham sanderson <gra...@vast.com> wrote:
> 
>> So I’ve been tinkering a bit with CMS config because we are still seeing
>> fairly frequent full compacting GC due to fragmentation/promotion failure.
>> 
>> As mentioned below, we are usually too fragmented to promote new in-flight
>> memtables.
>> 
>> This is likely caused by sudden write spikes (which we do have), though
>> actually the problems don’t generally happen at the time of our largest
>> write spikes (any write spike likely spills both new memtables and many
>> other new objects of unknown size into the tenured gen, so it causes
>> fragmentation even if not an immediate GC issue). We have lots of things
>> going on in this multi-tenant cluster (GC pauses are of course extra bad,
>> since they cause a spike in hinted handoff on other nodes which were
>> already busy, etc…)
>> 
>> Anyway, considering possibilities:
>> 
>> 0) Try and make our application behavior more steady state - this is
>> probably possible, but there are lots of other things (e.g. compaction,
>> opscenter, repair etc.) which are both tunable and generally throttle-able
>> to think about too.
>> 1) Play with tweaking PLAB configs to see if we can ease fragmentation
>> (I’d be curious what the “crud” is in particular that is getting spilled -
>> presumably it is larger objects since it affects the binary tree of large
>> objects)
>> 2) Given the above, if we can guarantee even > 24 hours without full GC, I
>> don’t think we’d mind running a regular rolling re-start on the servers
>> during off hours (note usually the GCs don’t have a visible impact, but
>> when they hit multiple machines at once they can)
>> 3) Zing is seriously an option, if it would save us large amounts of
>> tuning and constant worry about the “next” thing tweaking the allocation
>> patterns - does anyone have any experience with Zing & Cassandra?
>> 4) Given that we expect periodic bursts of writes, that
>> memtable_total_space_in_mb is bounded, and that we are not actually short
>> of memory (it just gets fragmented), I’m wondering if anyone has played
>> with pinning (up to, or initially?) that many 1MB chunks of memory via
>> SlabAllocator and re-using them… they get promoted once, and then these 1M
>> chunks won’t be part of the subsequent promotion hassle… it will probably
>> also allow more crud to die in eden under write load since we aren’t
>> allocating these large chunks in eden at the same time. Anyway, I had a
>> little look at the code, and the life cycle of memtables is not trivial,
>> but I was considering attempting a patch to play with… anyone have any
>> thoughts?
>> 
>> Basically, in summary: the slab allocator helps by allocating and freeing
>> lots of objects at the same time; however, any time slabs are allocated
>> under load, we end up promoting them along with whatever other live stuff
>> is still in eden. If we only do this once and reuse the slabs, we are
>> likely to minimize our promotion problem later (at least for these large
>> objects).
>> 
>> On May 16, 2014, at 9:37 PM, graham sanderson <gra...@vast.com> wrote:
>> 
>>> Excellent - thank you…
>>> 
>>> On May 16, 2014, at 7:08 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr>
>> wrote:
>>> 
>>>> Hi,
>>>> This is arena allocation of memtables. See here for more infos :
>>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
>>>> 
>>>> 
>>>> 
>>>> 
>>>> From:   graham sanderson <gra...@vast.com>
>>>> To:     dev@cassandra.apache.org,
>>>> Date:   16/05/2014 14:03
>>>> Subject: Things that are about 1M big
>>>> 
>>>> 
>>>> 
>>>> So just throwing this out there for those for whom this might ring a
>> bell.
>>>> 
>>>> I’m debugging some CMS memory fragmentation issues on 2.0.5 - and
>>>> interestingly enough most of the objects giving us promotion failures
>>>> are of size 131074 (dwords) - GC logging obviously doesn’t say what
>>>> those are, but I’d wager money they are either 1M byte arrays, or (less
>>>> likely) 256k-entry object arrays backing large maps.
>>>> 
>>>> So not strictly critical to solving my problem, but I was wondering if
>>>> anyone can think of any heap-allocated C* objects which are (with no
>>>> significant changes to standard Cassandra config) allocated in 1M
>>>> chunks. (It would save me scouring the code, or a 9-gig heap dump, if I
>>>> need to figure it out!)
>>>> 
>>>> Thanks,
>>>> 
>>>> Graham
>>> 
>> 
>> 
