Hi Erik,

Setting aside for a moment that it's only a single node: does this node
store any super-long rows?
The first things that come to my mind after reading your e-mail are
unthrottled compaction (sounds like a possible issue, but it would affect
other nodes too) or very large rows. Or a mix of both?
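
If you haven't checked these yet, here's a quick way to look at both
theories (the exact cfstats label and the log path may differ on your
install, so adjust accordingly):

  nodetool cfstats      # look for "Compacted row maximum size" per CF
  grep -i "Compacting large row" /var/log/cassandra/system.log

  nodetool compactionstats               # what's compacting right now
  nodetool setcompactionthroughput 16    # re-throttle at runtime (MB/s)

IIRC on 2.0 the "Compacting large row" warnings fire for rows larger than
in_memory_compaction_limit_in_mb, so seeing lots of them would point at
wide rows; compaction_throughput_mb_per_sec in cassandra.yaml is the
configured throttle.
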
This might also be of interest for investigating GC issues and pinning
them down further (if you haven't seen it yet):
http://aryanet.com/blog/cassandra-garbage-collector-tuning
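
To get hard numbers on the pauses you could also enable GC logging on
that node and compare it against a healthy one - cassandra-env.sh ships
with options along these lines commented out (the exact set depends on
your version of the file):

  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
  JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
  JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

PrintTenuringDistribution / PrintPromotionFailure in particular should
tell you whether objects are being promoted to the old gen too quickly,
which would fit both the compaction and the large-rows theory.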

M.



Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 15 April 2015 at 13:15, Erik Forsberg <forsb...@opera.com> wrote:

> Hi!
>
> We're having problems with one node (out of 56 in total) misbehaving.
> Symptoms are:
>
> * High number of full CMS old space collections during early morning
> when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
> thrift insertions.
> * Really long stop-the-world GC events (I've seen up to 50 seconds) for
> both CMS and ParNew.
> * CPU usage higher during early morning hours compared to other nodes.
> * The large number of garbage collections *seems* to correspond to
> periods when we're doing a lot of compactions (SizeTiered for most of
> our CFs, Leveled for a few small ones)
> * Node losing track of which other nodes are up and keeping that state
> until restart (this I think is a bug triggered by the GC behaviour, with
> the stop-the-world pauses preventing the node from accepting gossip
> connections from other nodes)
>
> This is on 2.0.13 with vnodes (256 per node).
>
> All other nodes behave normally, with a few (2-3) full CMS old space
> collections in the same 3h period in which the troubled node does some
> 30. Heap size is 8G, with NEW_SIZE set to 800M. With 6G/800M the
> problem was even worse (it seems; this is a bit hard to debug as it
> happens *almost* every night).
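> (In cassandra-env.sh terms that's roughly MAX_HEAP_SIZE="8G" and
> HEAP_NEWSIZE="800M".)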
>
> nodetool status shows that although we have a certain imbalance in the
> cluster, this node is neither the most nor the least loaded. I.e. we
> have between 1.6% and 2.1% in the "Owns" column, and the troublesome
> node reports 1.7%.
>
> All nodes are under puppet control, so configuration is the same
> everywhere.
>
> We're running NetworkTopologyStrategy with rack awareness, and here's a
> deviation from the recommended setup - we have a slightly varying number
> of nodes in the racks:
>
>      15 cssa01
>      15 cssa02
>      13 cssa03
>      13 cssa04
>
> The affected node is in the cssa04 rack. Could this mean I have some
> kind of hotspot situation? Why would that show up as more GC work?
>
> I'm quite puzzled here, so I'm looking for hints on how to identify what
> is causing this.
>
> Regards,
> \EF
>