No connection timeouts? No TCP-level retries? I am sorry, truly sorry, but you have exceeded my capability. I have never seen a java.io timeout without either a half-open session failure (no response) or multiple retries.

I am out of my depth, so please feel free to ignore this, but did you see the packets that made the initial connection (which must have timed out)? Out of curiosity, netstat -arn should be showing bad packets, timeouts, etc. To see progress, create a simple shell script that dumps the date, dumps netstat, sleeps 100 seconds, and repeats. During that window, stop the remote node, wait 10 seconds, and restart it.
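Something like the following would do as a rough, untested sketch (the 100-second interval and the netstat -arn dump are what I described above; the /tmp/netstat-watch.log path and the netstat -s retransmit/timeout filter are just illustrative additions):

    #!/bin/sh
    # Snapshot netstat roughly every 100 seconds so the counters can be
    # lined up against the gossip timestamps in system.log.
    while true; do
        date
        netstat -arn                              # the dump suggested above
        netstat -s | grep -iE 'retrans|timeout'   # protocol counters worth watching (illustrative addition)
        echo '------------------------------------------------------------'
        sleep 100
    done >> /tmp/netstat-watch.log 2>&1

Run it on both the east and west nodes, then do the stop / wait 10 seconds / restart, and compare the snapshots on either side of the window where gossip reports the node down.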
<======>
Made weak by time and fate, but strong in will,
To strive, to seek, to find, and not to yield.
Ulysses - A. Lord Tennyson

Daemeon C.M. Reiydelle
email: daeme...@gmail.com
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle


On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:

> Thank you.
>
> I stopped the instance in east. I see that all the other instances can
> gossip with that instance; only one instance in west has trouble gossiping
> with it. When I enable debug mode I see the following on the west node.
>
> I see the messages below from 16:32 to 16:47:
>
> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>
> Later I see a timeout:
>
> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 OutboundTcpConnection.java:350 - Error writing to /eastip
> java.io.IOException: Connection timed out
>
> Then:
>
> INFO [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL
>
> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>
> I ran tcpdump during that window and I don't see any packet loss. I am
> still unsure why the east instance, which was stopped and started, was
> unreachable from the west node for almost 15 minutes.
>
> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daeme...@gmail.com> wrote:
>
>> 10 minutes is 600 seconds, and there are several timeouts that are set
>> to that, including the data center timeout as I recall.
>>
>> You may be forced to tcpdump the interface(s) to see where the chatter
>> is. Out of curiosity, when you restart the node, have you snapped the
>> JVM's memory to see if e.g. the heap is even in use?
>>
>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>
>>> Thanks Ben,
>>> Before stopping the EC2 instance I did run nodetool drain, so I ruled
>>> that out, and system.log also doesn't show commit logs being applied.
>>>
>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
>>>
>>>> The logs between first start and handshaking should give you a clue,
>>>> but my first guess would be replaying commit logs.
>>>>
>>>> Cheers
>>>> Ben
>>>>
>>>> ---
>>>> Ben Slater
>>>> Chief Product Officer, Instaclustr
>>>>
>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>
>>>>> I can reproduce the issue.
>>>>>
>>>>> I ran nodetool drain on the node, then stopped and started the
>>>>> Cassandra instance. The instance comes up, but other nodes are in DN
>>>>> state for around 10 minutes.
>>>>>
>>>>> I don't see errors in the system.log.
>>>>>
>>>>> DN xx.xx.xx.59  420.85 MiB 256 48.2% id 2
>>>>> UN xx.xx.xx.30  432.14 MiB 256 50.0% id 0
>>>>> UN xx.xx.xx.79  447.33 MiB 256 51.1% id 4
>>>>> DN xx.xx.xx.144 452.59 MiB 256 51.6% id 1
>>>>> DN xx.xx.xx.19  431.7 MiB  256 50.1% id 5
>>>>> UN xx.xx.xx.6   421.79 MiB 256 48.9%
>>>>>
>>>>> When I run nodetool status, 3 nodes still show as down, and I don't
>>>>> see errors in system.log. After about 10 minutes it shows the other
>>>>> node as up as well:
>>>>>
>>>>> INFO [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstartednode
>>>>> INFO [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichitwasshowingdown is now UP
>>>>>
>>>>> What is causing the 10-minute delay before it can say the node is
>>>>> reachable?
>>>>>
>>>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>
>>>>>> Also, an AWS EC2 stop and start comes back as a new instance with the
>>>>>> same IP, and all our file systems are on EBS and mounted fine. Does the
>>>>>> new instance coming up with the same IP cause any gossip issues?
>>>>>>
>>>>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL
>>>>>>> LOCAL_QUORUM, and we stopped and started only one instance at a time.
>>>>>>> Though nodetool status says all nodes are UN and system.log says
>>>>>>> Cassandra started and is listening, the JMX exporter shows the instance
>>>>>>> stayed down longer. How do we determine what made Cassandra unavailable
>>>>>>> even though the log says it started and is listening?
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We have our infrastructure on AWS and we use EBS storage, and AWS
>>>>>>>>> was retiring one of the nodes. Since our storage is persistent, we
>>>>>>>>> did nodetool drain and then stopped and started the instance. This
>>>>>>>>> caused 500 errors in the service. We have LOCAL_QUORUM and RF=3; why
>>>>>>>>> does stopping one instance cause the application to have issues?
>>>>>>>>
>>>>>>>> Can you still look up what the underlying error from the Cassandra
>>>>>>>> driver was in the application logs? Was it a request timeout or not
>>>>>>>> enough replicas?
>>>>>>>>
>>>>>>>> For example, if you only had 3 Cassandra nodes, restarting one of
>>>>>>>> them reduces your cluster capacity by 33% temporarily.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> --
>>>>>>>> Alex