Hi, we have a 9-node ring on m1.xlarge AWS hosts. We started having
some trouble a while ago, and it's making me pull out all of my hair.
The host in position #3 has been replaced 4 times. Each time, the host
joins the ring, I do a nodetool repair -pr, and she seems fine for about
a day. Then she gets real slow, sometimes OOMs, sometimes takes down the
host in position #5, sometimes gets stuck on a compaction with near-idle
disk throughput, and eventually dies without any kind of error message
or reason for failing.
Sometimes our cluster gets so slow that it is almost unusable - we get
timeout errors from our application, and AWS sends us voluminous alerts
about latency.
I've tried changing the amount of RAM between 8G and 12G, changing the
MAX_HEAP_SIZE and HEAP_NEWSIZE, repeatedly forcing a stop compaction,
setting astronomical ulimit values, and praying to available gods. I'm a
bit confused. We're not using super-wide rows, and most settings are default.
Environment: EL5, Cassandra 1.1.9, Java 1.6.0
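For reference, the two heap knobs mentioned above live in
conf/cassandra-env.sh. This is only a minimal sketch with placeholder
values, not our exact settings: 8G is the low end of the range I tried,
and the 400M new size is an assumption based on the rule of thumb in
that file's comments (roughly 100 MB per core, 4 cores on an m1.xlarge).

```shell
# conf/cassandra-env.sh (fragment): placeholder values for illustration only
# The stock script expects these two to be set (or left unset) as a pair.
MAX_HEAP_SIZE="8G"    # total JVM heap; low end of the 8G-12G range I tried
HEAP_NEWSIZE="400M"   # young generation; assumed ~100 MB per core, 4 cores
```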
--
Drew from Zhrodague
lolcat divinator
d...@zhrodague.net