No connection timeouts? No TCP-level retries? I am sorry, truly sorry, but you have exceeded my capability. I have never seen a java.io timeout without either a half-open session failure (no response) or multiple retries.

I am out of my depth, so please feel free to ignore this, but did you see the packets that made the initial connection (which must have timed out)? Out of curiosity, netstat -arn should be showing bad packets, timeouts, etc. To see progress, create a simple shell script that dumps the date, dumps netstat, sleeps 100 seconds, and repeats. During that window, stop the remote node, wait 10 seconds, and restart it.
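Something like the following would do as a rough, untested sketch (the 100-second interval and the netstat -arn dump are what I described above; the /tmp/netstat-watch.log path and the netstat -s retransmit/timeout filter are just illustrative additions):

    #!/bin/sh
    # Snapshot netstat roughly every 100 seconds so the counters can be
    # lined up against the gossip timestamps in system.log.
    while true; do
        date
        netstat -arn                              # the dump suggested above
        netstat -s | grep -iE 'retrans|timeout'   # protocol counters worth watching (illustrative addition)
        echo '------------------------------------------------------------'
        sleep 100
    done >> /tmp/netstat-watch.log 2>&1

Run it on both the east and west nodes, then do the stop / wait 10 seconds / restart, and compare the snapshots on either side of the window where gossip reports the node down.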
<======>
Made weak by time and fate, but strong in will,
To strive, to seek, to find, and not to yield.
Ulysses - A. Lord Tennyson

Daemeon C.M. Reiydelle
email: daeme...@gmail.com
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle


On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:

> Thank you.
>
> I stopped the instance in east. I see that all the other instances can
> gossip with that instance; only one instance in west has trouble gossiping
> with it. When I enable debug mode I see the following on the west node.
>
> I see the messages below from 16:32 to 16:47:
>
> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>
> Later I see a timeout:
>
> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 OutboundTcpConnection.java:350 - Error writing to /eastip
> java.io.IOException: Connection timed out
>
> Then:
>
> INFO [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL
>
> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>
> I ran tcpdump during that window and I don't see any packet loss. I am
> still unsure why the east instance, which was stopped and started, was
> unreachable from the west node for almost 15 minutes.
>
> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daeme...@gmail.com> wrote:
>
>> 10 minutes is 600 seconds, and there are several timeouts that are set
>> to that, including the data center timeout as I recall.
>>
>> You may be forced to tcpdump the interface(s) to see where the chatter
>> is. Out of curiosity, when you restart the node, have you snapped the
>> JVM's memory to see if e.g. the heap is even in use?
>>
>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>
>>> Thanks Ben,
>>> Before stopping the EC2 instance I did run nodetool drain, so I ruled
>>> that out, and system.log also doesn't show commit logs being applied.
>>>
>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
>>>
>>>> The logs between first start and handshaking should give you a clue,
>>>> but my first guess would be replaying commit logs.
>>>>
>>>> Cheers
>>>> Ben
>>>>
>>>> ---
>>>> Ben Slater
>>>> Chief Product Officer, Instaclustr
>>>>
>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>
>>>>> I can reproduce the issue.
>>>>>
>>>>> I ran nodetool drain on the node, then stopped and started the
>>>>> Cassandra instance. The instance comes up, but other nodes are in DN
>>>>> state for around 10 minutes.
>>>>>
>>>>> I don't see errors in the system.log.
>>>>>
>>>>> DN xx.xx.xx.59  420.85 MiB 256 48.2% id 2
>>>>> UN xx.xx.xx.30  432.14 MiB 256 50.0% id 0
>>>>> UN xx.xx.xx.79  447.33 MiB 256 51.1% id 4
>>>>> DN xx.xx.xx.144 452.59 MiB 256 51.6% id 1
>>>>> DN xx.xx.xx.19  431.7 MiB  256 50.1% id 5
>>>>> UN xx.xx.xx.6   421.79 MiB 256 48.9%
>>>>>
>>>>> When I run nodetool status, 3 nodes still show as down, and I don't
>>>>> see errors in system.log. After about 10 minutes it shows the other
>>>>> node as up as well:
>>>>>
>>>>> INFO [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstartednode
>>>>> INFO [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichitwasshowingdown is now UP
>>>>>
>>>>> What is causing the 10-minute delay before it can say the node is
>>>>> reachable?
>>>>>
>>>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>
>>>>>> Also, an AWS EC2 stop and start comes back as a new instance with the
>>>>>> same IP, and all our file systems are on EBS and mounted fine. Does the
>>>>>> new instance coming up with the same IP cause any gossip issues?
>>>>>>
>>>>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL
>>>>>>> LOCAL_QUORUM, and we stopped and started only one instance at a time.
>>>>>>> Though nodetool status says all nodes are UN and system.log says
>>>>>>> Cassandra started and is listening, the JMX exporter shows the instance
>>>>>>> stayed down longer. How do we determine what made Cassandra unavailable
>>>>>>> even though the log says it started and is listening?
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We have our infrastructure on AWS and we use EBS storage, and AWS
>>>>>>>>> was retiring one of the nodes. Since our storage is persistent, we
>>>>>>>>> did nodetool drain and then stopped and started the instance. This
>>>>>>>>> caused 500 errors in the service. We have LOCAL_QUORUM and RF=3; why
>>>>>>>>> does stopping one instance cause the application to have issues?
>>>>>>>>
>>>>>>>> Can you still look up what the underlying error from the Cassandra
>>>>>>>> driver was in the application logs? Was it a request timeout or not
>>>>>>>> enough replicas?
>>>>>>>>
>>>>>>>> For example, if you only had 3 Cassandra nodes, restarting one of
>>>>>>>> them reduces your cluster capacity by 33% temporarily.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> --
>>>>>>>> Alex