We had a 15-node cluster across three zones, and cluster repairs using 
'nodetool repair -pr' took about 3 hours to finish. Recently, we shrunk the 
cluster to 12 nodes. Since then, the same repair job has taken up to 12 hours, 
and most of the time it never finishes at all. 

More importantly, at some point during the repair cycle, read latencies jump 
to 1-2 seconds and applications immediately notice the impact.

stream_throughput_outbound_megabits_per_sec is set to 200 and 
compaction_throughput_mb_per_sec to 64. The /data directory on each node is 
~500GB at 44% usage. 
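For context on whether the streaming throttle alone could explain the slowdown, here is a back-of-the-envelope sketch (my arithmetic, not something from the thread; the variable names are illustrative, not Cassandra settings):

```python
# Rough estimate: time to stream one node's entire live dataset at the
# configured throttle. Constants come from the figures quoted above.

DATA_DIR_GB = 500      # /data size per node
USAGE = 0.44           # 44% usage
STREAM_MBITS = 200     # stream_throughput_outbound_megabits_per_sec

used_gb = DATA_DIR_GB * USAGE               # ~220 GB of live data per node
stream_mb_per_s = STREAM_MBITS / 8          # 200 Mbit/s -> 25 MB/s
seconds = used_gb * 1000 / stream_mb_per_s  # decimal GB -> MB
hours = seconds / 3600

print(f"~{used_gb:.0f} GB at {stream_mb_per_s:.0f} MB/s -> {hours:.1f} h")
```

Even streaming a node's full dataset at the 200 Mbit/s throttle would take only ~2.5 hours, so the throttle by itself does not account for 12-hour repairs; whatever is going on (e.g. overstreaming or compaction falling behind) is doing more work than a straight copy.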

When shrinking the cluster, the 'nodetool decommission' runs were uneventful; 
each completed successfully with no issues.

What could possibly cause repairs to have this impact following cluster 
downsizing? Taking three nodes out does not seem consistent with such a 
drastic effect on repair time and read latency. 

Any expert insights would be appreciated. 
----------------
Thank you
