First, thanks for the quick reply and the JIRA links! It's helpful to know we are not the only ones experiencing these issues.
"Are you sure you actually want/need to run repair as frequently as you currently are? Reducing the frequency won't make it work any better, but it will reduce the number of times you have to babysit its failure." I think we are dealing with streaming issues specifically, we have been successfully running repairs such that each node runs once a week in all of our clusters (to stay within gc_grace_seconds per best practices). In this particular case, we are trying to backfill data to a new second datacenter from our first datacenter using manual repairs (total cluster load ~11TB). It is becoming more and more evident that the most reliable option at this point would be to do an out-of-band rsync of a snapshot on dc1, with a custom sstable id de-duplication script paired with a refresh/compaction/cleanup on dc2 nodes as in [1]. It should also be noted that our initial plan (nodetool rebuild) failed on this cluster with a stack overrun, likely due to the massive amount of CF's (2800+) we are running (an admitted data model issue that is being worked out). I would love to consider dropping scheduled anti-entropy repairs completely if we have enough other fail-safes in place. We run RF=3 and LOCAL_QUORUM reads/writes. We also have read repair chance set to 1.0 on most CFs (something we recently realized was carried over as a default from the 0.8 days, this cluster is indeed that old...). Our usage sees deletes, but worst case if data came back, I suspect it would just trigger duplicate processing. We did notice our repair process timings went from about 8 hours in 1.1 to over 12 hours in 1.2. Our biggest concern at this point is can we effectively rebuild a failed node with streaming/bootstrap or do we need to devise custom workflows (like above mentioned rsync) to quickly and reliably bring a node back to full load. It sounds like there are some considerable improvements to bootstrap/repair/streaming in 2.0, excluding the current performance problems with vnodes. We are planning on upgrading to 2.0, but as with most things, this wont happen overnight. We obviously need to get to 1.2.16 as a pre-req to upgrade to 2.0 which will probably get more priority now :) [1] http://planetcassandra.org/blog/post/bulk-loading-options-for-cassandra/ -Andrew