Hi Erik,

Setting aside for a moment that it's only a single node: does this node store any super-long rows? The first things that come to my mind after reading your e-mail are unthrottled compaction (a possible issue, but it would affect the other nodes too) and very large rows. Or a mix of both? This might also be of interest for investigating the GC issues and pinning them down further (if you haven't seen it yet): http://aryanet.com/blog/cassandra-garbage-collector-tuning
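If it helps, this is roughly how I'd check for both on the affected node (keyspace/CF names below are placeholders, substitute your own; if I remember correctly, cfstats on 2.0 reports the "Compacted row maximum size" line):

    nodetool cfstats <keyspace>.<cf>        # look at "Compacted row maximum size"
    nodetool cfhistograms <keyspace> <cf>   # row size distribution, spot outliers
    nodetool compactionstats                # pending/running compactions during the GC storms
    nodetool setcompactionthroughput 16     # re-apply throttling if it's been set to 0 (unlimited)

The persistent equivalent of the last one is compaction_throughput_mb_per_sec in cassandra.yaml.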
M.

Kind regards,
Michał Michalski
michal.michal...@boxever.com

On 15 April 2015 at 13:15, Erik Forsberg <forsb...@opera.com> wrote:
> Hi!
>
> We're having problems with one node (out of 56 in total) misbehaving.
> Symptoms are:
>
> * A high number of full CMS old-space collections during the early morning,
>   when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
>   thrift insertions.
> * Really long stop-the-world GC events (I've seen up to 50 seconds) for
>   both CMS and ParNew.
> * CPU usage higher during the early morning hours compared to other nodes.
> * The large number of garbage collections *seems* to correspond to doing
>   a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
>   small ones).
> * The node losing track of which other nodes are up and keeping that state
>   until restart (this I think is a bug triggered by the GC behaviour, with
>   the stop-the-world pauses making the node not accept gossip connections
>   from other nodes).
>
> This is on 2.0.13 with vnodes (256 per node).
>
> All other nodes behave normally, with a few (2-3) full CMS old-space
> collections in the same 3h period in which the troublesome node does some
> 30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem
> was even worse (it seems; this is a bit hard to debug as it happens
> *almost* every night).
>
> nodetool status shows that although we have a certain imbalance in the
> cluster, this node is neither the most nor the least loaded. I.e. we have
> between 1.6% and 2.1% in the "Owns" column, and the troublesome node
> reports 1.7%.
>
> All nodes are under puppet control, so the configuration is the same
> everywhere.
>
> We're running NetworkTopologyStrategy with rack awareness, and here's a
> deviation from recommended settings - we have a slightly varying number of
> nodes in the racks:
>
> 15 cssa01
> 15 cssa02
> 13 cssa03
> 13 cssa04
>
> The affected node is in the cssa04 rack. Could this mean I have some kind
> of hotspot situation? Why would that show up as more GC work?
>
> I'm quite puzzled here, so I'm looking for hints on how to identify what
> is causing this.
>
> Regards,
> \EF
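P.S. To pin down the pauses you're describing above, the GC logging options that (if I remember correctly) ship commented out in cassandra-env.sh on 2.0 are usually enough; a minimal sketch, the log path is just an example:

    # in cassandra-env.sh (requires restart; path is an example, adjust to your setup)
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

If the long ParNew pauses in the log come with promotion failures, that would suggest objects (e.g. compaction garbage or large rows) being promoted into old gen faster than CMS can clean it, which would line up with the full collections you're seeing.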