Running on 2.0.5, we’ve noticed occasional MASSIVE spikes in memory allocation under write load, way out of proportion to the amount of data actually being written.
My suspicion is that the problem is related to hinted handoff, and basically follows the pattern “some sort of non-trivial GC pause on one node (probably caused by a promotion failure of a memtable due to slow/eventual fragmentation of the tenured gen) causes other nodes to start writing hints for that node”, and the hinting turns out to be much worse memory-wise than the writes themselves. Obviously there is the JIRA issue for 3.0 to stop storing hints in a table, but in the meanwhile I thought I’d take a quick peek. I’m in the process of writing a test to confirm what I think is happening, but I thought I’d throw this out there anyway for discussion in the context of the 2.1 changes.

====

In this case, we are adding a single column (of small delta data) to a lot of partitions at once (let’s say a hundred thousand partitions per node after replication) pretty fast, with very little data for each, probably a few hundred bytes at most... this really puts very little strain on things at all. However, if we are hinting, then ALL of these updates for one target node end up in the same partition in system.hints.

My assumption is that this, combined with SnapTree, is a huge problem. While the SnapTree clone itself may be quick, we are essentially appending each time, so there is probably a huge amount of rebalancing going on, along with the lazy copy-on-write and the inherent concurrent-race waste. I am guesstimating that we are actually allocating several orders of magnitude more memory than the regular inserts themselves do. The fact that our hardware can handle a very high concurrent write volume probably makes the problem much worse.

So my first thought was: fine, why not just change hinted handoff to shard (an extra partition key column alongside the existing node UUID) based on, say, a fixed-length 8-bit hash of the original row mutation’s partition key (rough sketch in the P.S. below)? This would spread the load somewhat and probably solve the problem, but it would require a schema change to system.hints, which is annoying. That said, appending (sorted) data concurrently to the same partition is the sort of thing many people might do, though probably not as fast as it is done in this hinting case, so it probably needs to handle this better anyway.

Searching around, I came across https://issues.apache.org/jira/browse/CASSANDRA-6271 for 2.1, which replaces SnapTree... so my final question is (and I will try to test this on 2.1 soon): do we expect that this change will make the hint problem moot anyway, i.e. heavy concurrent writes to the same partition key are A-OK, or does it make sense to shard hints until 3.x makes everything groovy?

Thanks,

Graham
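P.S. To make the sharding idea concrete, here is a minimal sketch of the kind of shard computation I mean. This is illustrative Java only, not actual Cassandra internals; the names (HintShard, shardFor, hintPartitionKey) and the choice of FNV-1a are mine, and any stable hash folded down to 8 bits would do:

    import java.nio.ByteBuffer;
    import java.util.UUID;

    // Hypothetical sketch of sharded hint partitioning: instead of keying
    // system.hints on the target node UUID alone, add an 8-bit shard derived
    // from the original mutation's partition key, spreading one node's hints
    // across up to 256 memtable partitions. Names are illustrative only.
    public final class HintShard
    {
        public static final int NUM_SHARDS = 256; // fixed-length 8-bit hash

        // Compute a stable 8-bit shard from the mutation's partition key bytes.
        public static byte shardFor(ByteBuffer mutationPartitionKey)
        {
            // FNV-1a over the key bytes, folded down to 8 bits. Any stable
            // hash would do; it just needs to spread the load evenly.
            int h = 0x811c9dc5;
            ByteBuffer key = mutationPartitionKey.duplicate();
            while (key.hasRemaining())
            {
                h ^= key.get() & 0xff;
                h *= 0x01000193;
            }
            return (byte) ((h ^ (h >>> 8) ^ (h >>> 16) ^ (h >>> 24)) & 0xff);
        }

        // Illustrative composite hints partition key: (target node, shard)
        // instead of just the target node UUID.
        public static String hintPartitionKey(UUID targetNode, ByteBuffer mutationPartitionKey)
        {
            return targetNode + ":" + (shardFor(mutationPartitionKey) & 0xff);
        }
    }

The point is just that all the contention on a single memtable partition (one SnapTree) per target node would be spread across up to 256 independent partitions, and replay would only have to walk shards 0-255 for a given node.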