Thanks Daemeon, will do that and post the results. I also found an open JIRA with a similar issue: https://issues.apache.org/jira/browse/CASSANDRA-13984
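For reference, here is a minimal sketch of the monitoring loop Daemeon describes below. The output path is a placeholder, and the extra netstat -s call is my own addition for the retransmit/timeout counters, not something agreed in this thread:

    #!/bin/sh
    # Dump a timestamp and a netstat snapshot every 100 seconds while the
    # remote node is stopped and restarted. Output path is a placeholder.
    while true; do
        {
            date
            netstat -arn   # socket/routing view, flags as suggested
            netstat -s     # protocol counters: retransmits, resets, timeouts
        } >> /tmp/netstat-watch.log
        sleep 100
    done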
On Wed, Nov 6, 2019 at 1:49 PM daemeon reiydelle <daeme...@gmail.com> wrote:

> No connection timeouts? No TCP-level retries? I am sorry, truly sorry, but
> you have exceeded my capability. I have never seen a java.io timeout
> without either a half-open session failure (no response) or multiple
> retries.
>
> I am out of my depth, so please feel free to ignore, but did you see the
> packets that are making the initial connection (which must have timed
> out)? Out of curiosity, a netstat -arn should be showing bad packets,
> timeouts, etc. To see progress, create a simple shell script that dumps
> the date, dumps netstat, sleeps 100 seconds, and repeats. During that
> window stop the remote node, wait 10 seconds, and restart it.
>
> On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>
>> Thank you.
>>
>> I have stopped the instance in east. I see that all other instances can
>> gossip to that instance, and only one instance in west is having issues
>> gossiping to that node. When I enable debug mode on the west node, I see
>> the messages below from 16:32 to 16:47:
>>
>> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>>
>> Later I see a timeout:
>>
>> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 OutboundTcpConnection.java:350 - Error writing to /eastip
>> java.io.IOException: Connection timed out
>>
>> Then:
>>
>> INFO [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL
>>
>> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>>
>> I tried running some tcpdump during that window and I don't see any
>> packet loss. I am still unsure why the east instance, which was stopped
>> and started, was unreachable from the west node for almost 15 minutes.
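For what it's worth, a capture narrowed to the internode traffic between the two nodes would show whether connection attempts and retransmits actually cross the link during that window. This is only a sketch: the interface name and east-node IP are placeholders, and it assumes the default internode/gossip port 7000.

    # Capture internode (gossip) traffic to/from the east node; Ctrl-C to stop.
    EAST_IP=203.0.113.10   # placeholder: real east node IP
    sudo tcpdump -i eth0 -w /tmp/gossip.pcap host "$EAST_IP" and tcp port 7000

    # Afterwards, look for SYN retransmits or resets around the 16:32-16:47 window:
    tcpdump -r /tmp/gossip.pcap 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'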
>> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daeme...@gmail.com> wrote:
>>
>>> 10 minutes is 600 seconds, and there are several timeouts that are set
>>> to that, including the data center timeout as I recall.
>>>
>>> You may be forced to tcpdump the interface(s) to see where the chatter
>>> is. Out of curiosity, when you restart the node, have you snapped the
>>> JVM's memory to see if e.g. the heap is even in use?
>>>
>>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>
>>>> Thanks Ben,
>>>> Before stopping the EC2 instance I did run nodetool drain, so I ruled
>>>> that out, and system.log also doesn't show commit logs being applied.
>>>>
>>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
>>>>
>>>>> The logs between first start and handshaking should give you a clue,
>>>>> but my first guess would be replaying commit logs.
>>>>>
>>>>> Cheers
>>>>> Ben
>>>>>
>>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>
>>>>>> I can reproduce the issue.
>>>>>>
>>>>>> I did drain the Cassandra node, then stopped and started the
>>>>>> Cassandra instance. The instance comes up, but the other nodes stay
>>>>>> in DN state for around 10 minutes.
>>>>>>
>>>>>> I don't see errors in the system log, and nodetool status shows:
>>>>>>
>>>>>> DN  xx.xx.xx.59   420.85 MiB  256  48.2%  id  2
>>>>>> UN  xx.xx.xx.30   432.14 MiB  256  50.0%  id  0
>>>>>> UN  xx.xx.xx.79   447.33 MiB  256  51.1%  id  4
>>>>>> DN  xx.xx.xx.144  452.59 MiB  256  51.6%  id  1
>>>>>> DN  xx.xx.xx.19   431.7 MiB   256  50.1%  id  5
>>>>>> UN  xx.xx.xx.6    421.79 MiB  256  48.9%
>>>>>>
>>>>>> When I do nodetool status, 3 nodes are still showing down, and I
>>>>>> don't see errors in system.log. After 10 minutes it shows the other
>>>>>> node as up as well.
>>>>>>
>>>>>> INFO [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted node
>>>>>> INFO [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichitwasshowing down is now UP
>>>>>>
>>>>>> What is causing the 10-minute delay before the node is seen as
>>>>>> reachable?
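The reproduction steps above, roughly as a script. The service commands are assumptions about the environment (a systemd-managed Cassandra here), and the 30-second polling interval is arbitrary:

    # On the node being bounced:
    nodetool drain                   # flush memtables, stop accepting new writes
    sudo systemctl stop cassandra    # assumption: systemd-managed service
    # (stop and start the EC2 instance here when reproducing the EBS case)
    sudo systemctl start cassandra

    # From another node, watch how long the bounced node stays DN and
    # compare with its gossip state:
    watch -n 30 'nodetool status; nodetool gossipinfo | head -40'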
>>>>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>
>>>>>>> Also, an AWS EC2 stop and start brings up a new instance with the
>>>>>>> same IP, and all our file systems are on EBS and mounted fine. Does
>>>>>>> the new instance coming up with the same IP cause any gossip issues?
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL
>>>>>>>> LOCAL_QUORUM, and we stopped and started only one instance at a
>>>>>>>> time. Though nodetool status says all nodes are UN and system.log
>>>>>>>> says Cassandra started and is listening, the JMX exporter shows the
>>>>>>>> instance stayed down longer. How do we determine what made
>>>>>>>> Cassandra unavailable even though the log says it started and is
>>>>>>>> listening?
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We have our infrastructure on AWS and we use EBS storage, and
>>>>>>>>>> AWS was retiring one of the nodes. Since our storage is
>>>>>>>>>> persistent, we did a nodetool drain and then stopped and started
>>>>>>>>>> the instance. This caused 500 errors in the service. We have
>>>>>>>>>> LOCAL_QUORUM and RF=3; why does stopping one instance cause the
>>>>>>>>>> application to have issues?
>>>>>>>>>
>>>>>>>>> Can you still look up what the underlying error from the
>>>>>>>>> Cassandra driver was in the application logs? Was it a request
>>>>>>>>> timeout or not enough replicas?
>>>>>>>>>
>>>>>>>>> For example, if you only had 3 Cassandra nodes, restarting one of
>>>>>>>>> them reduces your cluster capacity by 33% temporarily.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> --
>>>>>>>>> Alex
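As a sanity check on the RF=3 / LOCAL_QUORUM point: quorum is floor(RF/2) + 1 = 2, so a single node down should still leave enough live replicas in the DC. Something along these lines (keyspace and table names are placeholders) can confirm that reads at LOCAL_QUORUM still succeed while one node is stopped:

    # Run against any live node while one replica is down; with RF=3,
    # LOCAL_QUORUM needs 2 live replicas, so this should still return normally.
    cqlsh some_live_node <<'EOF'
    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM my_keyspace.my_table LIMIT 1;
    EOF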