We had a 15-node cluster spread across three zones, and cluster-wide repairs using 'nodetool repair -pr' took about 3 hours to finish. Recently we shrank the cluster to 12 nodes. Since then, the same repair job has taken up to 12 hours, and most of the time it never finishes at all.
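For context, this is roughly how the repair is invoked (exact scheduling details omitted; we just run the plain command):

    # Run on every node in the cluster, one node at a time, since -pr
    # only repairs the primary token ranges owned by the local node.
    nodetool repair -pr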
More importantly, at some point during the repair cycle we see read latencies jump to 1-2 seconds, and applications immediately notice the impact. stream_throughput_outbound_megabits_per_sec is set to 200 and compaction_throughput_mb_per_sec to 64. The /data directory on each node holds around 500 GB at 44% disk usage. When we shrank the cluster, 'nodetool decommission' was uneventful: it completed successfully with no issues.

What could cause repairs to have this impact after downsizing the cluster? Removing three nodes does not seem like it should have such a drastic effect on repair time and read latency. Any expert insights would be appreciated.

----------------
Thank you
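P.S. For reference, the two throttle settings mentioned above as they appear in our cassandra.yaml (nothing else of note, as far as I can tell):

    # cassandra.yaml excerpt -- only the throttles referenced above
    stream_throughput_outbound_megabits_per_sec: 200
    compaction_throughput_mb_per_sec: 64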