Sorry for the idiot questions... Was the data allowed to fully rebalance/repair/drain before the next node was taken off?
Did you take one off per rack/AZ?

On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
>> What is your replication factor?
>> Single datacenter, three availability zones, is that right?
>> You removed one node at a time or three at once?
>>
>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>>> We had a 15-node cluster across three zones, and cluster repairs using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrank the cluster to 12 nodes. Since then, the same repair job has taken up to 12 hours to finish, and most of the time it never does.
>>>
>>> More importantly, at some point during the repair cycle, we see read latencies jumping to 1-2 seconds, and applications immediately notice the impact.
>>>
>>> stream_throughput_outbound_megabits_per_sec is set to 200 and compaction_throughput_mb_per_sec to 64. The /data dir on the nodes is around ~500 GB at 44% usage.
>>>
>>> When shrinking the cluster, ‘nodetool decommission’ was uneventful; it completed successfully with no issues.
>>>
>>> What could possibly cause repairs to have this impact following the cluster downsizing? Taking three nodes out does not seem compatible with such a drastic effect on repair and read latency.
>>>
>>> Any expert insights will be appreciated.
>>>
>>> ----------------
>>> Thank you
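
Not an answer to the root cause, but in case it helps while you investigate: the two throughput knobs mentioned above can be checked and adjusted at runtime with nodetool (no restart needed), and netstats/compactionstats show whether repair streaming or compaction activity lines up with the latency spikes. This is only a sketch; the keyspace name is a placeholder and the throttle values are examples, not recommendations.

    # check the current throttles on a node
    nodetool getstreamthroughput
    nodetool getcompactionthroughput

    # temporarily lower them while repair runs, if they turn out to be starving reads
    nodetool setstreamthroughput 100
    nodetool setcompactionthroughput 32

    # watch what repair is actually doing when latency spikes
    nodetool netstats
    nodetool compactionstats
    nodetool tpstats

    # run the primary-range repair one keyspace at a time to narrow down the impact
    nodetool repair -pr my_keyspace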