Re: Dead node still being pinged

Nicolas Lalevée Mon, 11 Jun 2012 08:35:29 -0700

In the end, thanks to the Groovy JMX builder, it was not that hard.


On 11 June 2012 at 12:12, Samuel CARRIERE wrote:

> If I were you, I would connect (through JMX, with jconsole) to one of the 
> nodes that is sending messages to an old node, and have a look at these 
> MBeans: 
>    - org.apache.cassandra.net.FailureDetector: does SimpleStates look good? 
> (or do you see the IP of an old node?)

SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, 
/10.10.0.25:UP, /10.10.0.27:UP]
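
For anyone repeating this check, here is a minimal Python sketch of how a SimpleStates value like the one above could be split into live and dead endpoints. The function name `parse_simple_states` is hypothetical, and the input is simply the attribute value pasted above; it is an illustration, not something the FailureDetector MBean provides.

```python
def parse_simple_states(text):
    """Parse a FailureDetector SimpleStates dump into {ip: state}."""
    body = text.strip()
    # Drop the optional "SimpleStates:" label, then the surrounding brackets.
    if body.startswith("SimpleStates:"):
        body = body[len("SimpleStates:"):]
    body = body.strip().strip("[]")
    states = {}
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Each entry looks like "/10.10.0.22:DOWN"; split on the last colon.
        address, state = entry.rsplit(":", 1)
        states[address.lstrip("/")] = state
    return states


# Attribute value copied from the output above.
dump = """SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP,
/10.10.0.25:UP, /10.10.0.27:UP]"""

states = parse_simple_states(dump)
down = sorted(ip for ip, state in states.items() if state == "DOWN")
print(down)  # ['10.10.0.22', '10.10.0.24']
```

The two endpoints reported DOWN are the ones that also show up in the MessagingService attributes.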

>    - org.apache.cassandra.net.MessagingService: do you see one of the old IPs 
> in one of the attributes?

data-5:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
ResponseCompletedTasks:
[10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
ResponsePendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]

data-6:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks:
[10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
ResponsePendingTasks:
[10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]

data-7:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks:
[10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
ResponsePendingTasks:
[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]

>    - org.apache.cassandra.net.StreamingService: do you see an old IP in 
> StreamSources or StreamDestinations?

Nothing is streaming on the 3 nodes.
nodetool netstats confirmed that.

>    - org.apache.cassandra.internal.HintedHandoff: are there non-zero 
> ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?

On the 3 nodes, all at 0.

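I don't really know what I'm looking at, but it seems that some 
ResponsePendingTasks still need to complete: data-6 and data-7 show non-zero 
pending responses for the old nodes 10.10.0.22 and 10.10.0.24.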
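
To make that reading easier to repeat, here is a small Python sketch that scans the ResponsePendingTasks values pasted above and reports non-zero counts for the removed nodes. The helper name `parse_task_counts` and the `removed` set are assumptions made for illustration; the data is copied from the MessagingService output earlier in this mail.

```python
import re


def parse_task_counts(line):
    """Turn '[10.10.0.22:4, 10.10.0.26:0]' into {'10.10.0.22': 4, ...}."""
    return {ip: int(count) for ip, count in re.findall(r"([\d.]+):(\d+)", line)}


# ResponsePendingTasks values copied from the MessagingService output above.
response_pending = {
    "data-5": "[10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]",
    "data-6": "[10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]",
    "data-7": "[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]",
}

# Assumption: the removed nodes are the ones SimpleStates reports as DOWN.
removed = {"10.10.0.22", "10.10.0.24"}

stuck = {}
for node, line in response_pending.items():
    counts = parse_task_counts(line)
    # Keep only removed endpoints that still have pending responses.
    pending = {ip: n for ip, n in counts.items() if ip in removed and n > 0}
    if pending:
        stuck[node] = pending

print(stuck)
# {'data-6': {'10.10.0.24': 2}, 'data-7': {'10.10.0.22': 4, '10.10.0.24': 1}}
```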

Nicolas

> 
> Samuel 
> 
> 
> 
> Nicolas Lalevée <nicolas.lale...@hibnet.org>
> 08/06/2012 21:03
> Please reply to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Cc:
> Subject: Re: Dead node still being pinged
> 
> 
> 
> 
> 
> 
> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
> 
> > I'm on the train, but just a guess: maybe it's hinted handoff. A look at 
> > the logs of the new nodes could confirm that: look for the IP of an old 
> > node and maybe you'll find hinted-handoff-related messages.
> 
> I grepped the logs on every node for every old node's IP and found nothing 
> since the "crash".
> 
> In case it helps, here are some grepped log lines from the crash:
> 
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
> HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 
> StorageService.java (line 1157) Removing token 
> 127605887595351923798765477786913079296 for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> 
> 
> Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission 
> command. For each node, I took the node down and then issued a remove token 
> command.
> Here is what I can find in the log from when I removed one of them:
> 
> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 
> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
> delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 
> HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 
> StorageService.java (line 1157) Removing token 
> 145835300108973619103103718265651724288 for /10.10.0.24
> 
> 
> Nicolas
> 
> 
> > 
> > 
> > ----- Original Message -----
> > From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
> > Sent: 08/06/2012 19:26 ZE2
> > To: user@cassandra.apache.org
> > Subject: Re: Dead node still being pinged
> > 
> > 
> > 
> > On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
> > 
> >> What does nodetool ring say? (Ask every node.)
> > 
> > Currently, each new node sees only the tokens of the new nodes.
> > 
> >> Have you checked that the list of seeds in every yaml is correct?
> > 
> > Yes, it is correct: each of my new nodes points to the first of my new nodes.
> > 
> >> What version of Cassandra are you using?
> > 
> > Sorry, I should have mentioned this in my first mail.
> > I am using 1.0.9.
> > 
> > Nicolas
> > 
> >> 
> >> Samuel
> >> 
> >> 
> >> 
> >> Nicolas Lalevée <nicolas.lale...@hibnet.org>
> >> 08/06/2012 14:10
> >> Please reply to: user@cassandra.apache.org
> >> To: user@cassandra.apache.org
> >> Cc:
> >> Subject: Dead node still being pinged
> >> 
> >> 
> >> 
> >> 
> >> 
> >> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3 
> >> bigger machines, data-5 to data-7, and moved all the data from the old 
> >> nodes to the new ones.
> >> To move all the data without interrupting service, I added one new node 
> >> at a time, and then removed the old machines one by one via a "remove 
> >> token".
> >> 
> >> Everything was working fine until there was an unexpected load on our 
> >> cluster: the machines started to swap and became unresponsive. We fixed the 
> >> unexpected load and the three new machines were restarted. After that, the 
> >> new Cassandra machines reported that some old tokens were not assigned, 
> >> namely those from data-2 and data-4. To fix this, I again issued some 
> >> "remove token" commands.
> >> 
> >> Everything seems to be back to normal, but on the network I still see some 
> >> packets going from the new cluster to the old machines, on port 7000.
> >> How can I tell Cassandra to completely forget about the old machines?
> >> 
> >> Nicolas
> >> 
> >> 
> > 
> 
>
