Re: A cluster (RF=3) not recovering after two nodes are stopped

Hiroyuki Yamada Wed, 24 Apr 2019 01:13:39 -0700

Hello Ben,

Thank you for the quick reply.
I haven't tried that case, but it does't recover even if I stopped the
stress.


Thanks,
Hiro

On Wed, Apr 24, 2019 at 3:36 PM Ben Slater <ben.sla...@instaclustr.com>
wrote:

> Is it possible that stress is overloading node 1 so it’s not recovering
> state properly when node 2 comes up? Have you tried running with a lower
> load (say 2 or 3 threads)?
>
> Cheers
> Ben
>
> ---
>
>
> *Ben Slater*
> *Chief Product Officer*
>
>
> <https://www.facebook.com/instaclustr>   <https://twitter.com/instaclustr>
>    <https://www.linkedin.com/company/instaclustr>
>
> Read our latest technical blog posts here
> <https://www.instaclustr.com/blog/>.
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>
> On Wed, 24 Apr 2019 at 16:28, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>
>> Hello,
>>
>> I faced a weird issue when recovering a cluster after two nodes are
>> stopped.
>> It is easily reproduce-able and looks like a bug or an issue to fix,
>> so let me write down the steps to reproduce.
>>
>> === STEPS TO REPRODUCE ===
>> * Create a 3-node cluster with RF=3
>>    - node1(seed), node2, node3
>> * Start requests to the cluster with cassandra-stress (it continues
>> until the end)
>>    - what we did: cassandra-stress mixed cl=QUORUM duration=10m
>> -errors ignore -node node1,node2,node3 -rate threads\>=16
>> threads\<=256
>> * Stop node3 normally (with systemctl stop)
>>    - the system is still available because the quorum of nodes is
>> still available
>> * Stop node2 normally (with systemctl stop)
>>    - the system is NOT available after it's stopped.
>>    - the client gets `UnavailableException: Not enough replicas
>> available for query at consistency QUORUM`
>>    - the client gets errors right away (so few ms)
>>    - so far it's all expected
>> * Wait for 1 mins
>> * Bring up node2
>>    - The issue happens here.
>>    - the client gets ReadTimeoutException` or WriteTimeoutException
>> depending on if the request is read or write even after the node2 is
>> up
>>    - the client gets errors after about 5000ms or 2000ms, which are
>> request timeout for write and read request
>>    - what node1 reports with `nodetool status` and what node2 reports
>> are not consistent. (node2 thinks node1 is down)
>>    - It takes very long time to recover from its state
>> === STEPS TO REPRODUCE ===
>>
>> Is it supposed to happen ?
>> If we don't start cassandra-stress, it's all fine.
>>
>> Some workarounds we found to recover the state are the followings:
>> * Restarting node1 and it recovers its state right after it's restarted
>> * Setting lower value in dynamic_snitch_reset_interval_in_ms (to 60000
>> or something)
>>
>> I don't think either of them is a really good solution.
>> Can anyone explain what is going on and what is the best way to make
>> it not happen or recover ?
>>
>> Thanks,
>> Hiro
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>

Re: A cluster (RF=3) not recovering after two nodes are stopped

Reply via email to