Sorry for the basic questions...

Was the data allowed to fully rebalance/repair/drain before the next node
was taken off?

Did you take one off per rack/AZ?
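
If not, a couple of standard nodetool checks before taking the next node
off would confirm that. A minimal sketch (all of these are stock nodetool
subcommands; run them on the relevant node):

  nodetool netstats          # wait until no stream sessions remain
  nodetool status            # every node should report UN (Up/Normal)
  nodetool compactionstats   # let pending compactions drain back down

There is also a short throttle sketch below the quoted thread.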


On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:

> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com>
> wrote:
>
>> What is your replication factor?
>> Single datacenter, three availability zones, is that right?
>> You removed one node at a time or three at once?
>>
>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>>
>>> We had a 15-node cluster across three zones, and cluster repairs
>>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we
>>> shrunk the cluster to 12 nodes. Since then, the same repair job has
>>> taken up to 12 hours to finish, and most times it never does.
>>>
>>>
>>>
>>> More importantly, at some point during the repair cycle, read
>>> latencies jump to 1-2 seconds and the applications immediately notice
>>> the impact.
>>>
>>>
>>>
>>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>>> compaction_throughput_mb_per_sec at 64. The /data dir on each node
>>> holds around 500 GB, at 44% disk usage.
>>>
>>>
>>>
>>> When shrinking the cluster, ‘nodetool decommission’ was uneventful; it
>>> completed successfully with no issues.
>>>
>>>
>>>
>>> What could possibly cause repairs to have this impact following the
>>> cluster downsizing? Taking three nodes out does not seem like it should
>>> have such a drastic effect on repair times and read latency.
>>>
>>>
>>>
>>> Any expert insights will be appreciated.
>>>
>>> ----------------
>>> Thank you
>>>
>>>
>>>
>>
>>
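
Re: the two throttles quoted above, both can be adjusted at runtime
without a restart, which makes it easy to experiment while a repair is
running. A minimal sketch (the values are only examples, not
recommendations):

  nodetool setstreamthroughput 100      # stream throughput, in Mb/s
  nodetool setcompactionthroughput 32   # compaction throughput, in MB/s
  nodetool getstreamthroughput          # confirm the new value took effect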
