Thanks Daemeon,

Will do that and post the results.
I also found an open JIRA with a similar issue:
https://issues.apache.org/jira/browse/CASSANDRA-13984

On Wed, Nov 6, 2019 at 1:49 PM daemeon reiydelle <daeme...@gmail.com> wrote:

> No connection timeouts? No TCP-level retries? I am truly sorry, but you
> have exceeded my capability. I have never seen a java.io timeout without
> either a half-open session failure (no response) or multiple retries.
>
> I am out of my depth, so please feel free to ignore this, but did you see
> the packets making the initial connection (which must have timed out)?
> Out of curiosity, a netstat -arn should be showing bad packets, timeouts,
> etc. To see progress, create a simple shell script that dumps the date,
> dumps netstat, sleeps 100 seconds, and repeats. During that window stop
> the remote node, wait 10 seconds, then restart it.
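>
> Something like this rough sketch (the log path is just an example):
>
> #!/bin/bash
> # Dump a timestamp plus the current netstat output every 100 seconds so
> # the counters can be lined up against the gossip log entries afterwards.
> while true; do
>     date
>     netstat -arn
>     sleep 100
> done >> /tmp/netstat-watch.log 2>&1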
>
> <======>
> Made weak by time and fate, but strong in will,
> To strive, to seek, to find, and not to yield.
> Ulysses - A. Lord Tennyson
>
> Daemeon C.M. Reiydelle
>
> email: daeme...@gmail.com
> San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle
>
>
>
> On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy <rahulreddy1...@gmail.com>
> wrote:
>
>> Thank you.
>>
>> I have stopped the instance in east. I see that all other instances can
>> gossip to that instance, and only one instance in west is having issues
>> gossiping to that node. When I enable debug mode on the west node, I see
>> the messages below from 16:32 to 16:47:
>> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>>
>> Later I see a timeout:
>>
>> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 OutboundTcpConnection.java:350 - Error writing to /eastip
>> java.io.IOException: Connection timed out
>>
>> Then:
>>
>> INFO  [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL
>>
>> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>>
>> I tried running tcpdump during that window and I don't see any packet
>> loss. I am still unsure why the east instance that was stopped and
>> started was unreachable from the west node for almost 15 minutes.
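>>
>> If it helps for the next attempt, a narrower capture against just the
>> gossip traffic could look roughly like this (interface name, output path
>> and the east IP are placeholders; 7000 is the default internode/gossip
>> port, 7001 when internode SSL is enabled):
>>
>> tcpdump -i eth0 -n -w /tmp/gossip.pcap host <east-ip> and port 7000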
>>
>>
>> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> 10 minutes is 600 seconds, and there are several timeouts that are set
>>> to that, including the data center timeout as I recall.
>>>
>>> You may be forced to tcpdump the interface(s) to see where the chatter
>>> is. Out of curiosity, when you restart the node, have you snapped the jvm's
>>> memory to see if e.g. heap is even in use?
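>>>
>>> For example, something as simple as this (pid being the Cassandra
>>> process id; both are standard tools rather than anything
>>> Cassandra-specific):
>>>
>>> nodetool info | grep -i heap    # heap used / max as Cassandra reports it
>>> jstat -gcutil <pid> 5000        # GC activity and heap occupancy every 5 s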
>>>
>>>
>>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ben,
>>>> Before stopping the EC2 instance I did run nodetool drain, so I ruled
>>>> that out, and system.log also doesn't show commit logs being applied.
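>>>>
>>>> Roughly this sequence, for reference (the systemd service name is an
>>>> assumption; adjust to however Cassandra is managed here):
>>>>
>>>> nodetool drain                   # flush memtables, stop accepting writes
>>>> sudo systemctl stop cassandra
>>>> # ... stop and then start the EC2 instance ...
>>>> sudo systemctl start cassandra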
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.sla...@instaclustr.com>
>>>> wrote:
>>>>
>>>>> The logs between first start and handshaking should give you a
>>>>> clue but my first guess would be replaying commit logs.
>>>>>
>>>>> Cheers
>>>>> Ben
>>>>>
>>>>> ---
>>>>>
>>>>>
>>>>> Ben Slater, Chief Product Officer
>>>>>
>>>>>
>>>>> Read our latest technical blog posts here
>>>>> <https://www.instaclustr.com/blog/>.
>>>>>
>>>>> This email has been sent on behalf of Instaclustr Pty. Limited
>>>>> (Australia) and Instaclustr Inc (USA).
>>>>>
>>>>> This email and any attachments may contain confidential and legally
>>>>> privileged information. If you are not the intended recipient, do not
>>>>> copy or disclose its content, but please reply to this email immediately
>>>>> and highlight the error to the sender and then immediately delete the
>>>>> message.
>>>>>
>>>>>
>>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I can reproduce the issue.
>>>>>>
>>>>>> I did drain the Cassandra node, then stopped and started the Cassandra
>>>>>> instance. The Cassandra instance comes up, but other nodes stay in DN
>>>>>> state for around 10 minutes.
>>>>>>
>>>>>> I don't see errors in system.log. nodetool status shows:
>>>>>>
>>>>>> --  Address       Load        Tokens       Owns (effective)  Host ID  Rack
>>>>>> DN  xx.xx.xx.59   420.85 MiB  256          48.2%             id  2
>>>>>> UN  xx.xx.xx.30   432.14 MiB  256          50.0%             id  0
>>>>>> UN  xx.xx.xx.79   447.33 MiB  256          51.1%             id  4
>>>>>> DN  xx.xx.xx.144  452.59 MiB  256          51.6%             id  1
>>>>>> DN  xx.xx.xx.19   431.7 MiB   256          50.1%             id  5
>>>>>> UN  xx.xx.xx.6    421.79 MiB  256          48.9%
>>>>>>
>>>>>> When I do nodetool status, 3 nodes are still showing down, and I don't
>>>>>> see errors in system.log.
>>>>>>
>>>>>> After about 10 minutes it shows the other nodes as up as well.
>>>>>>
>>>>>>
>>>>>> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted node
>>>>>>
>>>>>> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichitwasshowing down is now UP
>>>>>>
>>>>>> What is causing the 10-minute delay before it can say that node is
>>>>>> reachable?
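>>>>>>
>>>>>> For what it's worth, these are the kinds of checks that can be run from
>>>>>> a peer node while this happens, all stock nodetool (the IP is a
>>>>>> placeholder):
>>>>>>
>>>>>> nodetool status                                 # UN/DN view from this peer
>>>>>> nodetool gossipinfo | grep -A 3 <restarted-ip>  # generation/heartbeat seen for the restarted node
>>>>>> nodetool describecluster                        # schema agreement across the ring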
>>>>>>
>>>>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Also, an AWS EC2 stop and start brings up a new underlying instance
>>>>>>> with the same IP, and all our file systems are on EBS and mounted
>>>>>>> fine. Does a new instance coming up with the same IP cause any gossip
>>>>>>> issues?
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL
>>>>>>>> LOCAL_QUORUM, and we stopped and started only one instance at a
>>>>>>>> time. Though nodetool status says all nodes are UN and system.log
>>>>>>>> says Cassandra started and is listening, the JMX exporter shows the
>>>>>>>> instance stayed down for longer. How do we determine what made
>>>>>>>> Cassandra unavailable when the log says it started and is listening?
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
>>>>>>>> oleksandr.shul...@zalando.de> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <
>>>>>>>>> rahulreddy1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We have our infrastructure on AWS and we use EBS storage, and AWS
>>>>>>>>>> was retiring one of the nodes. Since our storage is persistent, we
>>>>>>>>>> did nodetool drain and then stopped and started the instance. This
>>>>>>>>>> caused 500 errors in the service. We have LOCAL_QUORUM and RF=3;
>>>>>>>>>> why does stopping one instance cause the application to have
>>>>>>>>>> issues?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can you still look up what the underlying error from the Cassandra
>>>>>>>>> driver was in the application logs? Was it a request timeout or not
>>>>>>>>> enough replicas?
>>>>>>>>>
>>>>>>>>> For example, if you only had 3 Cassandra nodes, restarting one of
>>>>>>>>> them reduces your cluster capacity by 33% temporarily.
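>>>>>>>>>
>>>>>>>>> (For the arithmetic: with RF=3, LOCAL_QUORUM needs floor(3/2) + 1 = 2
>>>>>>>>> replicas per partition in the local DC, so a single node being down
>>>>>>>>> should still leave quorum reachable, provided the two remaining
>>>>>>>>> replicas answer within the timeout.)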
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> --
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>
