"Is it only one node at a time that goes down, and at widely dispersed times?"

It is a two-node cluster, so both nodes consider the other down at the same time.
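One thing worth checking with symptoms like these is whether the DOWN events follow a schedule (a cron job, backup, or compaction window) rather than striking at random. The gossip timestamps quoted below can be diffed with a short script; the times are copied from the log excerpt, and the script is purely illustrative, not part of anyone's tooling in this thread.

```python
from datetime import datetime

# "is now DOWN" timestamps copied from the gossip log excerpt below
down_events = [
    "2016-02-19 05:06:21", "2016-02-19 14:33:38",
    "2016-02-20 07:21:25", "2016-02-20 11:34:46",
    "2016-02-21 08:00:07", "2016-02-21 10:36:58",
    "2016-02-22 07:10:40", "2016-02-22 10:05:14",
    "2016-02-23 08:59:05", "2016-02-23 12:22:59",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in down_events]

# Hours between consecutive DOWN events
gaps_h = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
print(round(min(gaps_h), 1), round(max(gaps_h), 1))  # 2.6 22.9
```

The events land roughly twice a day, mostly in a morning-to-midday window, which may hint at a recurring job (snapshot, repair, backup) rather than random network noise.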
These are the times from the latest few days:

INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN

2016-02-23 18:01 GMT+01:00 daemeon reiydelle <daeme...@gmail.com>:

> If you can, do a few short runs of cassandra-stress against your
> production cluster (maybe 10M records, deleting the default schema between
> executions; replication=3, force quorum). Look for latency max in the tens
> of SECONDS. If your devops team runs a monitoring tool that watches the
> network, look for timeouts, retries, errors, lost packets, etc. during the
> run. Worst case, run netstat against the relevant NIC (e.g. every 10
> seconds on the cassandra-stress node) and look for jumps in those counts.
> If monitoring is enabled, look at the monitor's results for ALL of your
> nodes. At least one is having some issues.
>
> Daemeon C.M.
> Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> The reality of modern distributed systems is that connectivity between
>> nodes is never guaranteed and distributed software must be able to cope
>> with occasional absence of connectivity. GC and network connectivity are
>> the two issues most of us are most familiar with. There may be others,
>> but most technical problems on a node would be clearly logged on that
>> node. If you see a lapse of connectivity no more than once or twice a
>> day, consider yourselves lucky.
>>
>> Is it only one node at a time that goes down, and at widely dispersed
>> times?
>>
>> How many nodes?
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <samuelsson.j...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Version is 2.0.17.
>>> Yes, these are VMs in the cloud, though I'm fairly certain they are on
>>> a LAN rather than a WAN; they are both in the same data centre
>>> physically. The phi_convict_threshold is set to the default. I'd rather
>>> find the root cause of the problem than hide it by not convicting a node
>>> that isn't responding. If pings are <2 ms without a single ping missed
>>> in several days, I highly doubt that the network is the reason for the
>>> downtime.
>>>
>>> Best regards,
>>> Joel
>>>
>>> 2016-02-23 16:39 GMT+01:00 <sean_r_dur...@homedepot.com>:
>>>
>>>> You didn't mention the version, but I saw this kind of thing very
>>>> often in the 1.1 line. Often this is connected to network flakiness.
>>>> Are these VMs? In the cloud? Connected over a WAN? You mention that
>>>> ping seems fine. Take a look at phi_convict_threshold in
>>>> cassandra.yaml. You may need to increase it to reduce the UP/DOWN
>>>> flapping behavior.
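For context on the phi_convict_threshold advice: Cassandra's failure detector is an accrual detector, so instead of a fixed timeout each node computes a suspicion level phi from the time since the peer's last heartbeat and convicts when phi crosses the threshold. A minimal sketch, assuming a simple exponential heartbeat model (the real detector estimates the interval distribution from a sliding window of recent heartbeats, so this is illustrative only):

```python
import math

def phi(ms_since_last_heartbeat, mean_interval_ms):
    # Simplified accrual detector: with exponential inter-arrival times,
    # phi = -log10(P(the peer stays silent at least this long)).
    return ms_since_last_heartbeat / (mean_interval_ms * math.log(10))

# With ~1 s gossip heartbeats, the default threshold of 8 is crossed after
# about 18.4 s of silence; raising it to 12 stretches that to ~27.6 s.
for threshold in (8, 12):
    silence_s = threshold * math.log(10)  # mean interval of 1 s assumed
    print(threshold, round(silence_s, 1))
```

This is why raising the threshold reduces flapping: the node tolerates longer heartbeat gaps (GC pauses, brief network stalls) before declaring the peer DOWN.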
>>>> Sean Durity
>>>>
>>>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Nodes go down periodically
>>>>
>>>> Hi,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I have debug logging on and see no GC pauses that are that long. GC
>>>> pauses are all well below 1 s, and 99 times out of 100 below 100 ms.
>>>> Do I need to enable GC log options to see the pauses?
>>>>
>>>> I see plenty of these lines:
>>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections
>>>> as well as a few CMS GC log lines.
>>>>
>>>> Best regards,
>>>> Joel
>>>>
>>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger <hkro...@gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>>> whether the parameters you have already customised make sense.
>>>>
>>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>>
>>>> Hannu
>>>>
>>>> On 23 Feb 2016, at 16:08, Joel Samuelsson <samuelsson.j...@gmail.com> wrote:
>>>>
>>>> Our nodes go down periodically, around 1-2 times each day. Downtime
>>>> ranges from under 1 second to 30 or so seconds.
>>>>
>>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
>>>> INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP
>>>>
>>>> I find nothing odd in the logs around the same time. I logged a ping
>>>> with timestamps and checked during the same period and saw nothing
>>>> weird (ping is less than 2 ms at all times).
>>>>
>>>> Does anyone have any suggestions as to why this might happen?
>>>> Best regards,
>>>> Joel
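A note on the GCInspector lines quoted in the thread: at DEBUG level they report individual collections, so short ParNew times do not by themselves rule out long stop-the-world stalls from safepoints or swapping; full JVM GC logging (e.g. -XX:+PrintGCApplicationStoppedTime on HotSpot) shows total stopped time. To summarise what the Cassandra log already contains, a throwaway parser (illustrative only, matching the line format quoted above) could look like:

```python
import re
from collections import defaultdict

GC_RE = re.compile(r"GC for (\w+): (\d+) ms")

def max_pause_per_collector(lines):
    # Keep the longest pause GCInspector reported for each collector
    # (ParNew, ConcurrentMarkSweep, ...).
    worst = defaultdict(int)
    for line in lines:
        m = GC_RE.search(line)
        if m:
            worst[m.group(1)] = max(worst[m.group(1)], int(m.group(2)))
    return dict(worst)

sample = [
    "DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java "
    "(line 118) GC for ParNew: 24 ms for 1 collections",
]
print(max_pause_per_collector(sample))  # {'ParNew': 24}
```

If the worst reported pause is well under a second, as in this thread, GC alone is unlikely to explain DOWN periods of up to 30 seconds.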