Check the logs for warnings from the GCInspector. If you see messages that correlate with compaction running, limit compaction to help stabilise things:
* set concurrent_compactors to 2
* if you have wide rows, reduce in_memory_compaction_limit
* reduce compaction_throughput

If you have a lot of rows (more than 200 million), check the size of the bloom filters using nodetool cfstats. If it's around 1GB, consider increasing the bloom_filter_fp_chance per CF to 0.01 or 0.1.

> I've tried changing the amount of RAM between 8G and 12G,
More JVM memory is not always the answer. Try to get back to stable on the defaults, or something close to them, and then tune from there.

> sometimes gets stuck on a compaction with near-idle disk throughput
Wide rows can slow down compaction. Check the row size with nodetool cfstats or nodetool cfhistograms.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 30/04/2013, at 5:33 AM, Drew from Zhrodague <drewzhroda...@zhrodague.net> wrote:

> Hi, we have a 9-node ring on m1.xlarge AWS hosts. We started having
> some trouble a while ago, and it's making me pull out all of my hair.
>
> The host in position #3 has been replaced 4 times. Each time, the host
> joins the ring, I do a nodetool repair -pr, and she seems fine for about a
> day. Then she gets real slow, sometimes OOMs, sometimes takes down the host
> in position #5, sometimes gets stuck on a compaction with near-idle disk
> throughput, and eventually dies without any kind of error message or reason
> for failing.
>
> Sometimes our cluster gets so slow that it is almost unusable - we get
> timeout errors from our application, AWS sends us voluminous alerts about
> latency.
>
> I've tried changing the amount of RAM between 8G and 12G, changing the
> MAX_HEAP_SIZE and HEAP_NEWSIZE, repeatedly forcing a stop compaction, setting
> astronomical ulimit values, and praying to available gods. I'm a bit
> confused. We're not using super-wide rows, most things are default.
>
> EL5, Cassandra 1.1.9, Java 1.6.0
>
>
> --
>
> Drew from Zhrodague
> lolcat divinator
> d...@zhrodague.net
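
To put rough numbers on the bloom filter advice in the reply, here is a back-of-the-envelope sketch using the textbook formula for an optimally sized bloom filter. It is only an estimate: Cassandra keeps one filter per SSTable and sizes them with its own rounding, so the real figure is whatever nodetool cfstats reports. The function name and the 0.001 baseline are assumptions for illustration, not values from the thread.

```python
import math


def bloom_filter_bytes(num_keys: int, fp_chance: float) -> float:
    """Approximate size in bytes of an optimally sized bloom filter.

    Textbook formula: bits per key = -ln(p) / (ln 2)^2.
    This is NOT Cassandra's exact sizing logic, just an estimate.
    """
    bits_per_key = -math.log(fp_chance) / (math.log(2) ** 2)
    return num_keys * bits_per_key / 8


# 200 million rows is the "a lot of rows" threshold mentioned in the reply.
rows = 200_000_000

# 0.01 and 0.1 are the values suggested in the reply;
# 0.001 is a hypothetical stricter baseline for comparison.
for fp in (0.001, 0.01, 0.1):
    size_mib = bloom_filter_bytes(rows, fp) / 2**20
    print(f"fp_chance={fp}: ~{size_mib:.0f} MiB")
```

Under this formula the filter size scales with -ln(fp_chance), so going from 0.01 to 0.1 halves the memory used, at the cost of more false positives and therefore more unnecessary SSTable reads.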