I recently upgraded my 8-node cluster from 0.6.6 to 0.7.0 (and, even more
recently, to 0.7.1) for ExpiringColumn, among the many other spectacular
improvements.
 
Retuned the GC settings based on experience from 0.6.6 and new defaults.
 
After about a week, two of the nodes were very far behind on minor compactions 
(2k+ SSTables per CF and growing, 20k+ pending compactions).  The SSTable 
switch rate on these two nodes was about 10x higher than on the other nodes.  
I also observed rolling long-pause deaths (Gossip reporting node X as dead): 
seemingly every three minutes, one of the nodes would hit a long GC pause.  I 
saw the same behavior when I upgraded from 0.6.6 to 0.6.8, but I rolled back 
to 0.6.6 because I did not have time to investigate further. (I found this: 
https://issues.apache.org/jira/browse/CASSANDRA-1656)
 
I eventually traced this behavior back to a nasty interaction between Hinted 
Handoff and GC tuned for normal operating conditions.  
 
If I understand the code correctly, when a node replays a hint it reads the 
hinted data directly from the application tables (read: my ColumnFamily).  If 
the replaying node also happens to be a replica, it will resend the entire 
row, even if only one column was mutated.  Because of the rolling GC pause 
deaths, the hint replays rarely succeeded, and when they did it wasn’t long 
before a new set of hints was recorded.
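
To put a rough number on the whole-row resend, here is a back-of-the-envelope
sketch in Python.  It assumes the reading of the code above is correct; the
function name and the row width are made up for illustration and are neither
Cassandra APIs nor measurements from this cluster.

```python
def replay_amplification(columns_in_row, columns_mutated):
    """Ratio of columns sent on hint replay to columns actually mutated.

    Assumption: replay resends the whole row rather than only the
    mutated columns, as described above.
    """
    if columns_mutated <= 0 or columns_mutated > columns_in_row:
        raise ValueError("columns_mutated must be in 1..columns_in_row")
    return columns_in_row / columns_mutated


# Hypothetical row: 10,000 columns, one of which was written during the outage.
print(replay_amplification(10_000, 1))  # 10000.0
```

On wide rows, a single hinted column could therefore turn into a very large
write on the target node, repeated each time a replay fails and restarts.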
 
Disabling Hinted Handoff has fixed this problem for me.
 
When I looked into the intermittent GC issues further, the verbose GC log 
showed ParNew promotion failures, so I conservatively lowered 
CMSInitiatingOccupancyFraction, MAX_NEWSIZE, and 
in_memory_compaction_limit_in_mb.  I’m now seeing long CMS times (8000ms+) but 
no failures, which leads me to believe a 6G heap may be too large for the 
current tuning.
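
For anyone who wants to check their own logs for the same symptom, a minimal
Python sketch follows.  It assumes the log was written with -XX:+PrintGCDetails
and simply counts lines containing the text "promotion failed" or "concurrent
mode failure"; the function name is hypothetical and this is not a full GC log
parser.

```python
def count_gc_failures(lines):
    """Count GC log lines that mention either failure marker."""
    markers = ("promotion failed", "concurrent mode failure")
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

Pass it an open file handle for whatever path -Xloggc points at.  A non-zero
count before a tuning change and zero afterwards is a quick sanity check that
the change had the intended effect.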
 
It’s worth noting that I saw no increase in ColumnFamily WriteCount or 
StorageProxy.WriteOperations.  Only ColumnFamily MemtableColumnsCount and 
MemtableDataSize were increasing, and very rapidly, on the target node while 
hinted handoffs were replaying.

--
Chris
