Hi C* user list,

I have a curious recurring problem with Cassandra 1.2 that looks like a GC issue.

The cluster looks somewhat well balanced. All nodes are running HotSpot JVM 1.6.0_31-b04 and Cassandra 1.2.3.

Address  Rack  Status  State   Load      Owns
         RAC6  Up      Normal  15.13 GB  12.71%
         RAC5  Up      Normal  16.87 GB  13.57%
         RAC8  Up      Normal  13.27 GB  13.71%
         RAC1  Up      Normal  16.46 GB  14.08%
         RAC7  Up      Normal  11.59 GB  14.34%
         RAC2  Up      Normal  23.15 GB  15.12%
         RAC4  Up      Normal  16.52 GB  16.47%

Every now and then (currently about once a month), two nodes (always the same two) need to be restarted after they start eating all available CPU cycles and read and write latencies increase dramatically. A restart fixes this every time.

The only metric that deviates significantly from the average across all nodes shows GC doing something: http://bou.si/rest/parnew.png

Is there a way to debug this? After searching online, it appears that nobody has really solved this problem, and I have no idea what could cause such behaviour in just two particular cluster nodes.

I'm now thinking of decommissioning the problematic nodes and bootstrapping them anew, but I can't decide whether that would help.

Thanks in advance for any insight anyone might offer,

-- 
Jure Koren, DevOps
http://www.zemanta.com/