Hello Ben,

Thank you for the quick reply.
I haven't tried that case, but it doesn't recover even if I stop the stress.

Thanks,
Hiro

On Wed, Apr 24, 2019 at 3:36 PM Ben Slater <ben.sla...@instaclustr.com> wrote:

> Is it possible that stress is overloading node 1 so it’s not recovering
> state properly when node 2 comes up? Have you tried running with a lower
> load (say 2 or 3 threads)?
>
> Cheers
> Ben
>
> On Wed, 24 Apr 2019 at 16:28, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>
>> Hello,
>>
>> I faced a weird issue when recovering a cluster after two nodes are
>> stopped.
>> It is easily reproducible and looks like a bug or an issue to fix,
>> so let me write down the steps to reproduce.
>>
>> === STEPS TO REPRODUCE ===
>> * Create a 3-node cluster with RF=3
>>    - node1(seed), node2, node3
>> * Start requests to the cluster with cassandra-stress (it continues
>>   until the end)
>>    - what we did: cassandra-stress mixed cl=QUORUM duration=10m
>>      -errors ignore -node node1,node2,node3 -rate threads\>=16
>>      threads\<=256
>> * Stop node3 normally (with systemctl stop)
>>    - the system is still available because a quorum of nodes is
>>      still available
>> * Stop node2 normally (with systemctl stop)
>>    - the system is NOT available after it's stopped.
>>    - the client gets `UnavailableException: Not enough replicas
>>      available for query at consistency QUORUM`
>>    - the client gets errors right away (within a few ms)
>>    - so far it's all expected
>> * Wait for 1 min
>> * Bring up node2
>>    - The issue happens here.
>>    - the client gets `ReadTimeoutException` or `WriteTimeoutException`,
>>      depending on whether the request is a read or a write, even after
>>      node2 is up
>>    - the client gets errors after about 5000ms or 2000ms, which are
>>      the request timeouts for write and read requests
>>    - what node1 reports with `nodetool status` and what node2 reports
>>      are not consistent. (node2 thinks node1 is down)
>>    - It takes a very long time to recover from this state
>> === STEPS TO REPRODUCE ===
>>
>> Is it supposed to happen?
>> If we don't start cassandra-stress, it's all fine.
>>
>> Some workarounds we found to recover the state are the following:
>> * Restarting node1; it recovers its state right after it's restarted
>> * Setting a lower value in dynamic_snitch_reset_interval_in_ms (to 60000
>>   or something)
>>
>> I don't think either of them is a really good solution.
>> Can anyone explain what is going on and what is the best way to make
>> it not happen or to recover?
>>
>> Thanks,
>> Hiro
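
For reference, the availability expectations in the steps above can be checked with simple quorum arithmetic. This is only a minimal sketch in Python: the function names and the step labels are hypothetical, and it does not model Cassandra's failure detector or gossip at all. It assumes every node agrees on which replicas are alive, which is exactly what does not hold in the report (node2 thinks node1 is down).

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at consistency QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are alive to serve a QUORUM request."""
    return live_replicas >= quorum(replication_factor)


# Walk through the reproduction steps with RF=3 (labels are illustrative).
RF = 3
steps = [
    ("all nodes up", 3),
    ("node3 stopped", 2),
    ("node2 stopped", 1),
    ("node2 back up", 2),
]
for label, live in steps:
    status = "available" if quorum_available(RF, live) else "unavailable"
    print(f"{label}: live={live}, quorum={quorum(RF)}, {status}")
```

By this arithmetic, QUORUM should be reachable again once node2 is back, so the timeouts after that point would come from the nodes disagreeing about each other's state rather than from too few replicas being up.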