"Is it only one node at a time that goes down, and at widely dispersed times?"

It is a two-node cluster, so both nodes consider the other down at the same time.
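One thing worth checking with symptoms like these is whether the DOWN events follow a schedule (a cron job, backup, or compaction window) rather than striking at random. The gossip timestamps quoted below can be diffed with a short script; the times are copied from the log excerpt, and the script is purely illustrative, not part of anyone's tooling in this thread.

```python
from datetime import datetime

# "is now DOWN" timestamps copied from the gossip log excerpt below
down_events = [
    "2016-02-19 05:06:21", "2016-02-19 14:33:38",
    "2016-02-20 07:21:25", "2016-02-20 11:34:46",
    "2016-02-21 08:00:07", "2016-02-21 10:36:58",
    "2016-02-22 07:10:40", "2016-02-22 10:05:14",
    "2016-02-23 08:59:05", "2016-02-23 12:22:59",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in down_events]

# Hours between consecutive DOWN events
gaps_h = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
print(round(min(gaps_h), 1), round(max(gaps_h), 1))  # 2.6 22.9
```

The events land roughly twice a day, mostly in a morning-to-midday window, which may hint at a recurring job (snapshot, repair, backup) rather than random network noise.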
These are the times from the latest few days:

INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN

2016-02-23 18:01 GMT+01:00 daemeon reiydelle <daeme...@gmail.com>:

> If you can, do a few short runs of cassandra-stress against your
> production cluster (maybe 10M records, deleting the default schema between
> executions; replication=3, force quorum). Look for latency max in the tens
> of SECONDS. If your devops team runs a monitoring tool that watches the
> network, look for timeouts, retries, errors, lost packets, etc. during the
> run. Worst case, run netstat against the relevant NIC (e.g. every 10
> seconds on the cassandra-stress node) and look for jumps in those counts.
> If monitoring is enabled, look at the monitor's results for ALL of your
> nodes. At least one is having some issues.
>
> Daemeon C.M.
> Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> The reality of modern distributed systems is that connectivity between
>> nodes is never guaranteed and distributed software must be able to cope
>> with occasional absence of connectivity. GC and network connectivity are
>> the two issues most of us are most familiar with. There may be others,
>> but most technical problems on a node would be clearly logged on that
>> node. If you see a lapse of connectivity no more than once or twice a
>> day, consider yourselves lucky.
>>
>> Is it only one node at a time that goes down, and at widely dispersed
>> times?
>>
>> How many nodes?
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <samuelsson.j...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Version is 2.0.17.
>>> Yes, these are VMs in the cloud, though I'm fairly certain they are on
>>> a LAN rather than a WAN; they are both in the same data centre
>>> physically. The phi_convict_threshold is set to the default. I'd rather
>>> find the root cause of the problem than hide it by not convicting a node
>>> that isn't responding. If pings are <2 ms without a single ping missed
>>> in several days, I highly doubt that the network is the reason for the
>>> downtime.
>>>
>>> Best regards,
>>> Joel
>>>
>>> 2016-02-23 16:39 GMT+01:00 <sean_r_dur...@homedepot.com>:
>>>
>>>> You didn't mention the version, but I saw this kind of thing very
>>>> often in the 1.1 line. Often this is connected to network flakiness.
>>>> Are these VMs? In the cloud? Connected over a WAN? You mention that
>>>> ping seems fine. Take a look at phi_convict_threshold in
>>>> cassandra.yaml. You may need to increase it to reduce the UP/DOWN
>>>> flapping behavior.
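For context on the phi_convict_threshold advice: Cassandra's failure detector is an accrual detector, so instead of a fixed timeout each node computes a suspicion level phi from the time since the peer's last heartbeat and convicts when phi crosses the threshold. A minimal sketch, assuming a simple exponential heartbeat model (the real detector estimates the interval distribution from a sliding window of recent heartbeats, so this is illustrative only):

```python
import math

def phi(ms_since_last_heartbeat, mean_interval_ms):
    # Simplified accrual detector: with exponential inter-arrival times,
    # phi = -log10(P(the peer stays silent at least this long)).
    return ms_since_last_heartbeat / (mean_interval_ms * math.log(10))

# With ~1 s gossip heartbeats, the default threshold of 8 is crossed after
# about 18.4 s of silence; raising it to 12 stretches that to ~27.6 s.
for threshold in (8, 12):
    silence_s = threshold * math.log(10)  # mean interval of 1 s assumed
    print(threshold, round(silence_s, 1))
```

This is why raising the threshold reduces flapping: the node tolerates longer heartbeat gaps (GC pauses, brief network stalls) before declaring the peer DOWN.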
>>>> Sean Durity
>>>>
>>>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Nodes go down periodically
>>>>
>>>> Hi,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I have debug logging on and see no GC pauses that are that long. GC
>>>> pauses are all well below 1 s, and 99 times out of 100 below 100 ms.
>>>> Do I need to enable GC log options to see the pauses?
>>>>
>>>> I see plenty of these lines:
>>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections
>>>> as well as a few CMS GC log lines.
>>>>
>>>> Best regards,
>>>> Joel
>>>>
>>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger <hkro...@gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>>> whether the parameters you have already customised make sense.
>>>>
>>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>>
>>>> Hannu
>>>>
>>>> On 23 Feb 2016, at 16:08, Joel Samuelsson <samuelsson.j...@gmail.com> wrote:
>>>>
>>>> Our nodes go down periodically, around 1-2 times each day. Downtime
>>>> ranges from under 1 second to 30 or so seconds.
>>>>
>>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
>>>> INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP
>>>>
>>>> I find nothing odd in the logs around the same time. I logged a ping
>>>> with timestamps and checked during the same period and saw nothing
>>>> weird (ping is less than 2 ms at all times).
>>>>
>>>> Does anyone have any suggestions as to why this might happen?
>>>> Best regards,
>>>> Joel
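A note on the GCInspector lines quoted in the thread: at DEBUG level they report individual collections, so short ParNew times do not by themselves rule out long stop-the-world stalls from safepoints or swapping; full JVM GC logging (e.g. -XX:+PrintGCApplicationStoppedTime on HotSpot) shows total stopped time. To summarise what the Cassandra log already contains, a throwaway parser (illustrative only, matching the line format quoted above) could look like:

```python
import re
from collections import defaultdict

GC_RE = re.compile(r"GC for (\w+): (\d+) ms")

def max_pause_per_collector(lines):
    # Keep the longest pause GCInspector reported for each collector
    # (ParNew, ConcurrentMarkSweep, ...).
    worst = defaultdict(int)
    for line in lines:
        m = GC_RE.search(line)
        if m:
            worst[m.group(1)] = max(worst[m.group(1)], int(m.group(2)))
    return dict(worst)

sample = [
    "DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java "
    "(line 118) GC for ParNew: 24 ms for 1 collections",
]
print(max_pause_per_collector(sample))  # {'ParNew': 24}
```

If the worst reported pause is well under a second, as in this thread, GC alone is unlikely to explain DOWN periods of up to 30 seconds.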