Compaction, Slow Ring, and bad behavior

Drew from Zhrodague Mon, 29 Apr 2013 10:34:08 -0700

Hi, we have a 9-node ring on m1.xlarge AWS hosts. We started having some trouble a while ago, and it's making me pull out all of my hair.

The host in position #3 has been replaced 4 times. Each time, the host joins the ring, I do a nodetool repair -pr, and she seems fine for about a day. Then she gets real slow, sometimes OOMs, sometimes takes down the host in position #5, sometimes gets stuck on a compaction with near-idle disk throughput, and eventually dies without any kind of error message or reason for failing.

Sometimes our cluster gets so slow that it is almost unusable - we get timeout errors from our application, and AWS sends us voluminous alerts about latency.

I've tried changing the amount of RAM between 8G and 12G, changing the MAX_HEAP_SIZE and HEAP_NEWSIZE, repeatedly forcing a stop compaction, setting astronomical ulimit values, and praying to available gods. I'm a bit confused. We're not using super-wide rows, most things are default.


        EL5, Cassandra 1.1.9, Java 1.6.0


--

Drew from Zhrodague
lolcat divinator
d...@zhrodague.net
