Re: A cluster (RF=3) not recovering after two nodes are stopped

Hiroyuki Yamada Wed, 24 Apr 2019 07:32:32 -0700

Sorry, I didn't write the version and the configurations.
I've tested with C* 3.11.4, and
the configurations are mostly set to default except for the replication
factor and listen_address for proper networking.


Thanks,
Hiro

On Wed, Apr 24, 2019 at 5:12 PM Hiroyuki Yamada <mogwa...@gmail.com> wrote:

> Hello Ben,
>
> Thank you for the quick reply.
> I haven't tried that case, but it does't recover even if I stopped the
> stress.
>
> Thanks,
> Hiro
>
> On Wed, Apr 24, 2019 at 3:36 PM Ben Slater <ben.sla...@instaclustr.com>
> wrote:
>
>> Is it possible that stress is overloading node 1 so it’s not recovering
>> state properly when node 2 comes up? Have you tried running with a lower
>> load (say 2 or 3 threads)?
>>
>> Cheers
>> Ben
>>
>> ---
>>
>>
>> *Ben Slater*
>> *Chief Product Officer*
>>
>>
>> <https://www.facebook.com/instaclustr>
>> <https://twitter.com/instaclustr>
>> <https://www.linkedin.com/company/instaclustr>
>>
>> Read our latest technical blog posts here
>> <https://www.instaclustr.com/blog/>.
>>
>> This email has been sent on behalf of Instaclustr Pty. Limited
>> (Australia) and Instaclustr Inc (USA).
>>
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
>>
>>
>> On Wed, 24 Apr 2019 at 16:28, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I faced a weird issue when recovering a cluster after two nodes are
>>> stopped.
>>> It is easily reproduce-able and looks like a bug or an issue to fix,
>>> so let me write down the steps to reproduce.
>>>
>>> === STEPS TO REPRODUCE ===
>>> * Create a 3-node cluster with RF=3
>>>    - node1(seed), node2, node3
>>> * Start requests to the cluster with cassandra-stress (it continues
>>> until the end)
>>>    - what we did: cassandra-stress mixed cl=QUORUM duration=10m
>>> -errors ignore -node node1,node2,node3 -rate threads\>=16
>>> threads\<=256
>>> * Stop node3 normally (with systemctl stop)
>>>    - the system is still available because the quorum of nodes is
>>> still available
>>> * Stop node2 normally (with systemctl stop)
>>>    - the system is NOT available after it's stopped.
>>>    - the client gets `UnavailableException: Not enough replicas
>>> available for query at consistency QUORUM`
>>>    - the client gets errors right away (so few ms)
>>>    - so far it's all expected
>>> * Wait for 1 mins
>>> * Bring up node2
>>>    - The issue happens here.
>>>    - the client gets ReadTimeoutException` or WriteTimeoutException
>>> depending on if the request is read or write even after the node2 is
>>> up
>>>    - the client gets errors after about 5000ms or 2000ms, which are
>>> request timeout for write and read request
>>>    - what node1 reports with `nodetool status` and what node2 reports
>>> are not consistent. (node2 thinks node1 is down)
>>>    - It takes very long time to recover from its state
>>> === STEPS TO REPRODUCE ===
>>>
>>> Is it supposed to happen ?
>>> If we don't start cassandra-stress, it's all fine.
>>>
>>> Some workarounds we found to recover the state are the followings:
>>> * Restarting node1 and it recovers its state right after it's restarted
>>> * Setting lower value in dynamic_snitch_reset_interval_in_ms (to 60000
>>> or something)
>>>
>>> I don't think either of them is a really good solution.
>>> Can anyone explain what is going on and what is the best way to make
>>> it not happen or recover ?
>>>
>>> Thanks,
>>> Hiro
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>

Re: A cluster (RF=3) not recovering after two nodes are stopped

Reply via email to