Hello,

Thank you for the feedback.
> Ben: Thank you. I've tested with lower concurrency on my side, and the
issue still occurs. We are using 3 x T3.xlarge instances for C* and a
small, separate instance for the client program. However, when we ran all
3 C* nodes on a single host, the issue didn't occur.

> Alok: We also thought so and tested with hints disabled, but it doesn't
make any difference (the issue still occurs).

Thanks,
Hiro

On Fri, Apr 26, 2019 at 8:19 AM Alok Dwivedi <alok.dwiv...@instaclustr.com> wrote:

> Could it be related to hinted handoffs being stored on node1 and then
> replayed to node2 when it comes back, causing more load while new
> mutations are also being applied from cassandra-stress at the same time?
>
> Alok Dwivedi
> Senior Consultant
> https://www.instaclustr.com/
>
> On 26 Apr 2019, at 09:04, Ben Slater <ben.sla...@instaclustr.com> wrote:
>
> In the absence of anyone else having any bright ideas - it still sounds
> to me like the kind of scenario that can occur in a heavily overloaded
> cluster. I would try again with a lower load.
>
> What size machines are you using for the stress client and the nodes?
> Are they all on separate machines?
>
> Cheers
> Ben
>
> Ben Slater
> Chief Product Officer
> https://www.instaclustr.com/platform/
>
> On Thu, 25 Apr 2019 at 17:26, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>
>> Hello,
>>
>> Sorry again.
>> We found yet another weird thing in this.
>> If we stop nodes with systemctl or a plain kill (SIGTERM), it causes the
>> problem, but if we kill -9, it doesn't.
>>
>> Thanks,
>> Hiro
>>
>> On Wed, Apr 24, 2019 at 11:31 PM Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>>
>>> Sorry, I didn't mention the version and the configuration.
>>> I've tested with C* 3.11.4, and the configuration is mostly left at the
>>> defaults, except for the replication factor and listen_address (for
>>> proper networking).
>>>
>>> Thanks,
>>> Hiro
>>>
>>> On Wed, Apr 24, 2019 at 5:12 PM Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>>>
>>>> Hello Ben,
>>>>
>>>> Thank you for the quick reply.
>>>> I haven't tried that case, but it doesn't recover even if I stop the
>>>> stress.
>>>>
>>>> Thanks,
>>>> Hiro
>>>>
>>>> On Wed, Apr 24, 2019 at 3:36 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
>>>>
>>>>> Is it possible that stress is overloading node1 so that it's not
>>>>> recovering state properly when node2 comes up? Have you tried running
>>>>> with a lower load (say 2 or 3 threads)?
>>>>>
>>>>> Cheers
>>>>> Ben
>>>>>
>>>>> Ben Slater
>>>>> Chief Product Officer
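For reference, the hint-disabling test and the TERM-vs-KILL comparison
mentioned earlier in the thread can be reproduced roughly as follows. This
is only a sketch: the systemd unit name (cassandra) and the use of pgrep to
find the Cassandra process are assumptions about the environment.

    # Disable hinted handoff on every node before the test, either at runtime ...
    nodetool disablehandoff
    # ... or permanently in cassandra.yaml (requires a restart):
    #   hinted_handoff_enabled: false

    # Stop variant 1: clean shutdown (reproduces the problem).
    # systemctl sends SIGTERM, so Cassandra's shutdown hook can run and the
    # node announces its shutdown to the rest of the cluster via gossip.
    sudo systemctl stop cassandra

    # Stop variant 2: hard kill (does not reproduce the problem).
    # No shutdown hook runs; peers only notice the node is gone via the
    # failure detector.
    sudo kill -9 "$(pgrep -f CassandraDaemon)"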
>>>>> On Wed, 24 Apr 2019 at 16:28, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I faced a weird issue when recovering a cluster after two nodes were
>>>>>> stopped. It is easily reproducible and looks like a bug or at least an
>>>>>> issue worth fixing, so let me write down the steps to reproduce.
>>>>>>
>>>>>> === STEPS TO REPRODUCE ===
>>>>>> * Create a 3-node cluster with RF=3
>>>>>>   - node1 (seed), node2, node3
>>>>>> * Start requests to the cluster with cassandra-stress (it keeps
>>>>>>   running until the end)
>>>>>>   - what we did: cassandra-stress mixed cl=QUORUM duration=10m
>>>>>>     -errors ignore -node node1,node2,node3 -rate threads\>=16
>>>>>>     threads\<=256
>>>>>> * Stop node3 normally (with systemctl stop)
>>>>>>   - the system is still available because a quorum of replicas is
>>>>>>     still available
>>>>>> * Stop node2 normally (with systemctl stop)
>>>>>>   - the system is NOT available after it's stopped
>>>>>>   - the client gets `UnavailableException: Not enough replicas
>>>>>>     available for query at consistency QUORUM`
>>>>>>   - the client gets the errors right away (within a few ms)
>>>>>>   - so far this is all expected
>>>>>> * Wait for 1 minute
>>>>>> * Bring up node2
>>>>>>   - the issue happens here
>>>>>>   - the client gets ReadTimeoutException or WriteTimeoutException,
>>>>>>     depending on whether the request is a read or a write, even after
>>>>>>     node2 is up
>>>>>>   - the client gets the errors only after about 5000 ms or 2000 ms,
>>>>>>     which are the request timeouts for write and read requests
>>>>>>   - what node1 reports with `nodetool status` and what node2 reports
>>>>>>     are not consistent (node2 thinks node1 is down)
>>>>>>   - it takes a very long time to recover from this state
>>>>>> === END STEPS TO REPRODUCE ===
>>>>>>
>>>>>> Is this supposed to happen?
>>>>>> If we don't start cassandra-stress, everything is fine.
>>>>>>
>>>>>> Some workarounds we found to recover from this state are the following:
>>>>>> * Restarting node1; it recovers right after the restart
>>>>>> * Setting a lower value for dynamic_snitch_reset_interval_in_ms (e.g.,
>>>>>>   60000)
>>>>>>
>>>>>> I don't think either of them is a really good solution.
>>>>>> Can anyone explain what is going on, and what is the best way to
>>>>>> prevent it or recover from it?
>>>>>>
>>>>>> Thanks,
>>>>>> Hiro
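To see the inconsistent cluster views and to try the workarounds from the
original post above, something like the following should work. Again just a
sketch: it assumes the hosts are reachable as node1/node2, that JMX is
reachable remotely for nodetool -h (otherwise run nodetool status locally on
each node), and a systemd unit named cassandra.

    # After node2 is back up, compare the cluster view from both nodes.
    # In the problem state they disagree: node2 reports node1 as DN.
    nodetool -h node1 status
    nodetool -h node2 status

    # Workaround 1: restart node1; it recovers right after the restart.
    sudo systemctl restart cassandra     # run on node1

    # Workaround 2: lower the dynamic snitch reset interval (10 minutes by
    # default) in cassandra.yaml on each node, then restart:
    #   dynamic_snitch_reset_interval_in_ms: 60000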