Hi all,

We're still on 0.6 and are facing problems with repairs.
Specifically, a repair for one CF takes around 60h, and we have to do that twice (RF=3, 5 nodes). During that time the cluster is under pretty heavy IO load. It kind of works, but during peak times we see lots of dropped messages (including writes). So we are actually creating the very inconsistencies that we are trying to fix with the repair.

We already have a very simple hadoopish framework in place that lets us do token range walks with multiple workers and restart at a given position in case of failure, so I created a simple worker that reads everything with CL_ALL. With only one worker and almost no performance impact, one scan took 7h.

My understanding is that at that point, due to read repair, I got the same result as I would have achieved with repair runs. Is that true or am I missing something?

Cheers,
Daniel
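
P.S. For illustration, a rough sketch of the kind of resumable token range walk described above might look like this. It is not the actual framework: the 2**127 ring size assumes RandomPartitioner, and `split_ring`, `walk`, and the `read_range` callback are hypothetical names made up for this example. The callback stands in for whatever client code pages through a token range at consistency level ALL.

```python
# Assumption: RandomPartitioner, whose tokens span 0 .. 2**127.
RING_SIZE = 2 ** 127


def split_ring(num_ranges, ring_size=RING_SIZE):
    """Split the token ring into num_ranges contiguous (start, end] ranges."""
    bounds = [ring_size * i // num_ranges for i in range(num_ranges + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def walk(ranges, read_range, checkpoint=0):
    """Read each range in order, starting at index `checkpoint`.

    read_range(start, end) is a hypothetical callback that would read all
    rows in the token range at consistency level ALL. After each completed
    range this yields the index to resume from, so the caller can persist
    it and restart there after a failure.
    """
    for i in range(checkpoint, len(ranges)):
        start, end = ranges[i]
        read_range(start, end)
        yield i + 1
```

A worker would store each yielded index somewhere durable and pass it back as `checkpoint` on restart. Several workers could each take a disjoint slice of the list returned by `split_ring`.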