Hi! We are having problems with one node (out of 56 in total) misbehaving. Symptoms are:
* A high number of full CMS old-space collections during the early morning, when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few Thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for both CMS and ParNew.
* Higher CPU usage during the early morning hours compared to other nodes.
* The large number of garbage collections *seems* to correspond to doing a lot of compactions (SizeTiered for most of our CFs, Leveled for a few small ones).
* The node losing track of which other nodes are up, and keeping that state until restart (I think this is a bug caused by the GC behaviour, with the stop-the-world pauses keeping the node from accepting gossip connections from other nodes).

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with a few (2-3) full CMS old-space collections in the same 3h period in which the trouble node does some 30.

Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem was even worse (it seems; this is a bit hard to debug as it happens *almost* every night).

nodetool status shows that although we have a certain imbalance in the cluster, this node is neither the most nor the least loaded. I.e. we have between 1.6% and 2.1% in the "Owns" column, and the troublesome node reports 1.7%.

All nodes are under puppet control, so configuration is the same everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a deviation from recommended settings. We have a slightly varying number of nodes in the racks:

  15 cssa01
  15 cssa02
  13 cssa03
  13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some kind of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what is causing this.

Regards,
\EF
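
To put rough numbers on the rack question, here is a minimal sketch of the arithmetic. It rests on an assumption that is not verified for this cluster: that rack-aware placement ends up giving each of the four racks about the same share of the replica data, so nodes in the smaller racks each carry more. The function name `per_node_share` is made up for illustration; only the rack sizes come from the list above.

```python
# Nodes per rack, as listed in the post.
racks = {"cssa01": 15, "cssa02": 15, "cssa03": 13, "cssa04": 13}


def per_node_share(racks):
    """Fraction of all replica data held by one node in each rack.

    ASSUMPTION (not measured): every rack holds an equal share of the
    replica data, and nodes within a rack split that share evenly.
    """
    rack_share = 1.0 / len(racks)
    return {name: rack_share / nodes for name, nodes in racks.items()}


shares = per_node_share(racks)

# Share each node would hold if all 56 nodes were loaded identically.
even_share = 1.0 / sum(racks.values())

for name, share in sorted(shares.items()):
    ratio = share / even_share
    print(f"{name}: {share:.2%} per node ({ratio:.2f}x the even share)")
```

Under that assumption a node in a 13-node rack carries 15/13 of what a node in a 15-node rack carries, i.e. roughly 15% more. That would apply equally to all 26 nodes in cssa03 and cssa04, so by itself it would not single out one node.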