Hey all,
We have Cassandra running on a 5-server ring with a replication factor
(RF) of 3, and we're regularly seeing major slowdowns in read and write
performance while nodetool repair is running: a single write can take
more than 10 seconds. We also see the occasional Cassandra crash during
the repair window.
The repair runs nightly, on a different server each night, so each
server gets repaired roughly once a week.
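
To make the schedule concrete, here is a minimal sketch of that kind of rotation. The host names are placeholders, and picking the node by day number is an assumption for illustration, not necessarily how our job does it. A strict nightly round-robin over five nodes comes back to the same node every five nights. The keyspace argument to nodetool repair is left out here.

```python
import datetime

# Hypothetical host names standing in for the five ring members.
NODES = ["cass1", "cass2", "cass3", "cass4", "cass5"]

def node_for_night(day, nodes=NODES):
    """Pick the node whose turn it is to run repair on the given date."""
    # Assumption: simple round-robin keyed on the day's ordinal number.
    return nodes[day.toordinal() % len(nodes)]

def repair_command(host):
    """Build the nodetool invocation for one node (no keyspace given)."""
    return ["nodetool", "-h", host, "repair"]

if __name__ == "__main__":
    tonight = datetime.date.today()
    host = node_for_night(tonight)
    print(" ".join(repair_command(host)))
```

Run from cron once a night, this would only print the command; actually executing it (for example with subprocess) is left out on purpose.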
We're running 0.7.0 currently, and we'll be upgrading to 0.7.6 shortly.
System load on the Cassandra servers is never more than 10% CPU, with
utterly minimal IO usage, so I wouldn't expect to be seeing issues
quite like this.
What sort of knobs should I be looking at tuning to reduce the impact
that nodetool repair has on Cassandra? What questions should I be asking
as to why Cassandra slows down to the level that it does, and what I
should be optimizing?
Additionally, what should I be looking for in the logs when this is
happening? There's a lot in the logs, but I'm not sure what to look for.
Cassandra is, in this instance, backing a system that supports around a
million requests a day, so not terribly heavy traffic.
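
For a sense of scale, a quick back-of-the-envelope conversion of that daily figure into an average rate. This assumes traffic is spread evenly over the day, which it surely isn't, so peaks will be higher than this number.

```python
# Rough average request rate from the daily total.
REQUESTS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
print(f"average: {avg_rps:.1f} requests/second")  # average: 11.6 requests/second
```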
Thanks,
Aurynn