Running on 2.0.5, we’ve noticed occasional MASSIVE spikes in memory allocation under write load, way out of proportion to the amount of data actually being written.
My suspicion is that the problem is related to hinted handoff, and basically follows the pattern “some sort of non-trivial GC pause on one node (probably caused by a promotion failure of a memtable due to slow/eventual fragmentation of the tenured gen) causes other nodes to start writing hints for that node”, and the hinting turns out to be much worse memory-wise than the writes themselves. Obviously there is the JIRA issue for 3.0 to stop storing hints in a table, but in the meanwhile I thought I’d take a quick peek. I’m in the process of writing a test to confirm what I think is happening, but I thought I’d throw this out there anyway for discussion in the context of the 2.1 changes.

====

In this case, we are adding a single column (of small delta data) to a lot of partitions at once (let’s say a hundred thousand partitions per node after replication) pretty fast, with very little data for each, probably a few hundred bytes at most... this really puts very little strain on things at all. However, if we are hinting, then ALL of these updates for one target node end up in the same partition in system.hints.

My assumption is that this, combined with SnapTree, is a huge problem. While the SnapTree clone itself may be quick, we are essentially appending each time, so there is probably a huge amount of rebalancing going on, along with the lazy copy-on-write and the inherent concurrent-race waste. I am guesstimating that we are actually allocating several orders of magnitude more memory than the regular inserts themselves do. The fact that our hardware can handle a very high concurrent write volume probably makes the problem much worse.

So my first thought was: fine, why not just change hinted handoff to shard (an extra partition key column alongside the existing node UUID) based on, say, a fixed-length 8-bit hash of the original row mutation’s partition key (rough sketch in the P.S. below)? This would spread the load somewhat and probably solve the problem, but it would require a schema change to system.hints, which is annoying. That said, appending (sorted) data concurrently to the same partition is the sort of thing many people might do, though probably not as fast as it is done in this hinting case, so it probably needs to handle this better anyway.

Searching around, I came across https://issues.apache.org/jira/browse/CASSANDRA-6271 for 2.1, which replaces SnapTree... so my final question is (and I will try to test this on 2.1 soon): do we expect that this change will make the hint problem moot anyway, i.e. heavy concurrent writes to the same partition key are A-OK, or does it make sense to shard hints until 3.x makes everything groovy?

Thanks,

Graham
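P.S. To make the sharding idea concrete, here is a minimal sketch of the kind of shard computation I mean. This is illustrative Java only, not actual Cassandra internals; the names (HintShard, shardFor, hintPartitionKey) and the choice of FNV-1a are mine, and any stable hash folded down to 8 bits would do:

    import java.nio.ByteBuffer;
    import java.util.UUID;

    // Hypothetical sketch of sharded hint partitioning: instead of keying
    // system.hints on the target node UUID alone, add an 8-bit shard derived
    // from the original mutation's partition key, spreading one node's hints
    // across up to 256 memtable partitions. Names are illustrative only.
    public final class HintShard
    {
        public static final int NUM_SHARDS = 256; // fixed-length 8-bit hash

        // Compute a stable 8-bit shard from the mutation's partition key bytes.
        public static byte shardFor(ByteBuffer mutationPartitionKey)
        {
            // FNV-1a over the key bytes, folded down to 8 bits. Any stable
            // hash would do; it just needs to spread the load evenly.
            int h = 0x811c9dc5;
            ByteBuffer key = mutationPartitionKey.duplicate();
            while (key.hasRemaining())
            {
                h ^= key.get() & 0xff;
                h *= 0x01000193;
            }
            return (byte) ((h ^ (h >>> 8) ^ (h >>> 16) ^ (h >>> 24)) & 0xff);
        }

        // Illustrative composite hints partition key: (target node, shard)
        // instead of just the target node UUID.
        public static String hintPartitionKey(UUID targetNode, ByteBuffer mutationPartitionKey)
        {
            return targetNode + ":" + (shardFor(mutationPartitionKey) & 0xff);
        }
    }

The point is just that all the contention on a single memtable partition (one SnapTree) per target node would be spread across up to 256 independent partitions, and replay would only have to walk shards 0-255 for a given node.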