Hi, This issue is related to https://issues.apache.org/jira/browse/CASSANDRA-9630
Thanks and Regards A.SathishKumar On Sun, Dec 24, 2017 at 2:59 AM, sat <sathish.al...@gmail.com> wrote: > Hi, > > We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL > 7.2) all 3 nodes formed the cluster without any issues. However after > reboot, we noticed rebooted node (7 out of 10 times) takes more time to > join approximately 10 - 15min. > > Cassandra - 3.9 version > > While investigating the issue further we noticed > > - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both > the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and > 3 as "DN"" only in "Node 1" > - After 10 - 15min we noticed *"Connection Timeout"* exception in > Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311) > which triggers a state change event to "Node 1" and changes the state as > "UN". > > > >> *if (logger.isTraceEnabled())** logger.trace("error >> writing to {}", poolReference.endPoint(), e);* > > > Please let us know what triggers "Connection TimeOut" exception in "Node 2 > and 3" and ways to resolve this. > > > Thanks and Regards > A.SathishKumar > > On Fri, Dec 22, 2017 at 12:06 PM, sat <sathish.al...@gmail.com> wrote: > >> Hi Jeff, >> >> Thanks for your prompt response. Please find logs. Does Gossip has >> dependency with NTP, because we find the node we reboot takes time to sync >> with NTP. >> >> When will FailureDetector message kick in, we have this issue only when >> we see Gossiper and Failure Detector log message. >> >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 - >> Average for /10.63.114.158 is 9.977908204545455E8 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - / >> 10.63.114.158 local generation 1513967155, remote generation 1513967155 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - >> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158 >> ... >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending >> a EchoMessage to /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - >> /10.63.114.150 sending ECHO to 6794@/10.63.114.158 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My >> heartbeat is now 2258 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip >> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154 >> :14664 /10.63.114.158:1513967155:14666 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending >> a GossipDigestSyn to /10.63.114.158 ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending >> a GossipDigestSyn to /10.63.114.154 ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 - >> Performing status check ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - >> PHI for /10.63.114.158 : 0.09877714244264273 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 - >> PHI for /10.63.114.158 : 0.09877714244264273 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 - >> mean for /10.63.114.158 : 9.977908204545455E8 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - >> PHI for /10.63.114.154 : 0.27857029580273446 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 - >> PHI for /10.63.114.154 : 0.27857029580273446 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 - >> mean for /10.63.114.154 : 7.65247444583815E8 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage >> from /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0 >> states >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local >> heartbeat version 2258 greater than 2257 for /10.63.114.150 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message >> to /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage >> from /10.63.114.154 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1 >> states >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - / >> 10.63.114.154 local generation 1513967154, remote generation 1513967154 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 - >> Updating heartbeat state version to 14664 from 14664 for /10.63.114.154 >> ... >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending >> a EchoMessage to /10.63.114.154 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - >> /10.63.114.150 sending ECHO to 6798@/10.63.114.154 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local >> heartbeat version 2258 greater than 2257 for /10.63.114.150 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local >> heartbeat version 14666 greater than 14663 for /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding >> state SEVERITY: 0.0 >> >> Thanks and Regards >> A.SathishKumar >> >> >> >> On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa <jji...@gmail.com> wrote: >> >>> Not received is ambiguous. >>> >>> Can you paste the logs somewhere like pastebin? Sanitize/anonymize as >>> needed. >>> >>> >>> >>> -- >>> Jeff Jirsa >>> >>> >>> On Dec 21, 2017, at 9:15 PM, sat <sathish.al...@gmail.com> wrote: >>> >>> Hi, >>> >>> We observe gossip is not received by other nodes and we see "Convicted" >>> message in debug.log even though cassandra process is running in all 3 >>> nodes. >>> >>> Thanks and Regards >>> A.SathisKumar >>> >>> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa <jji...@gmail.com> wrote: >>> >>>> What’s it logging? >>>> >>>> -- >>>> Jeff Jirsa >>>> >>>> >>>> On Dec 21, 2017, at 8:56 PM, sat <sathish.al...@gmail.com> wrote: >>>> >>>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we >>>> are doing cluster testing. When we start cassandra in all 3 nodes they form >>>> a cluster and they work fine. >>>> >>>> But when we bring one node down using "init 6" or "reboot" command, the >>>> rebooted node takes more time to join the cluster, however if we manually >>>> kill and start cassandra process the nodes join cluster immediately without >>>> any issues. >>>> >>>> We have provided all 3 IPs as seed nodes and the cluster name is same >>>> for all 3 nodes and their respective IP as listen address. >>>> >>>> Selinux is also disabled. >>>> >>>> Please help us in resolving this issue. >>>> >>>> Thanks >>>> >>>> >>> >>> >>> -- >>> A.SathishKumar >>> 044-24735023 >>> >>> >> >> >> -- >> A.SathishKumar >> 044-24735023 >> > > > > -- > A.SathishKumar > 044-24735023 > -- A.SathishKumar 044-24735023