Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.

Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:

> Finally, thanks to the Groovy JMX builder, it was not that hard.
>
>
> On 11 June 2012, at 12:12, Samuel CARRIERE wrote:
>
>> If I were you, I would connect (through JMX, with jconsole) to one of the
>> nodes that is sending messages to an old node, and would have a look at
>> these MBeans:
>> - org.apache.cassandra.net.FailureDetector: does SimpleStates look good? (Or do
>> you see the IP of an old node?)
>
> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP,
> /10.10.0.25:UP, /10.10.0.27:UP]
>
>> - org.apache.cassandra.net.MessagingService: do you see one of the old IPs in one of
>> the attributes?
>
> data-5:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
> ResponseCompletedTasks:
> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
> ResponsePendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>
> data-6:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
> ResponseCompletedTasks:
> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
> ResponsePendingTasks:
> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>
> data-7:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
> ResponseCompletedTasks:
> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
> ResponsePendingTasks:
> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>
>> - org.apache.cassandra.net.StreamingService: do you see an old IP in StreamSources
>> or StreamDestinations?
>
> Nothing is streaming on the 3 nodes.
> nodetool netstats confirmed that.
>
>> - org.apache.cassandra.internal.HintedHandoff: are there non-zero ActiveCount,
>> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>
> On the 3 nodes, all at 0.
>
> I don't really know what I'm looking at, but it seems that some
> ResponsePendingTasks need to complete.
>
> Nicolas
>
>>
>> Samuel
>>
>>
>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>> 08/06/2012 21:03
>> Please reply to
>> user@cassandra.apache.org
>>
>> To
>> user@cassandra.apache.org
>> cc
>> Subject
>> Re: Dead node still being pinged
>>
>>
>> On 8 June 2012, at 20:02, Samuel CARRIERE wrote:
>>
>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look in
>>> the logs of the new nodes could confirm that: look for the IP of an old
>>> node and maybe you'll find hinted-handoff-related messages.
>>
>> I grepped every node for every old node and got nothing since the
>> "crash".
>>
>> If it can be of some help, here is some grepped log of the crash:
>>
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06
>> 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06
>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06
>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06
>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06
>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895
>> HintedHandOffManager.java (line 179) Deleting any stored hints for
>> /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895
>> StorageService.java (line 1157) Removing token
>> 127605887595351923798765477786913079296 for /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>>
>>
>> Maybe it's the way I removed the nodes? AFAIR I didn't use the
>> decommission command. For each node, I took the node down and then issued a
>> remove token command.
>> Here is what I can find in the log from when I removed one of them:
>>
>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint
>> delivery, aborting
>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281
>> HintedHandOffManager.java (line 179) Deleting any stored hints for
>> /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281
>> StorageService.java (line 1157) Removing token
>> 145835300108973619103103718265651724288 for /10.10.0.24
>>
>>
>> Nicolas
>>
>>
>>> ----- Original message -----
>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>> Sent: 08/06/2012 19:26 ZE2
>>> To: user@cassandra.apache.org
>>> Subject: Re: Dead node still being pinged
>>>
>>>
>>> On 8 June 2012, at 15:17, Samuel CARRIERE wrote:
>>>
>>>> What does nodetool ring say? (Ask every node.)
>>>
>>> Currently, each of the new nodes sees only the tokens of the new nodes.
>>>
>>>> Have you checked that the list of seeds in every yaml is correct?
>>>
>>> Yes, it is correct: every one of my new nodes points to the first of my new nodes.
>>>
>>>> What version of Cassandra are you using?
>>>
>>> Sorry, I should have written this in my first mail.
>>> I use 1.0.9.
>>>
>>> Nicolas
>>>
>>>>
>>>> Samuel
>>>>
>>>>
>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>> 08/06/2012 14:10
>>>> Please reply to
>>>> user@cassandra.apache.org
>>>>
>>>> To
>>>> user@cassandra.apache.org
>>>> cc
>>>> Subject
>>>> Dead node still being pinged
>>>>
>>>>
>>>> I had a configuration with 4 nodes, data-1 through data-4. We then bought 3
>>>> bigger machines, data-5 through data-7, and we moved all data from the old
>>>> nodes to the new ones.
>>>> To move all the data without interruption of service, I added one new node
>>>> at a time. Then I removed the old machines one by one via a "remove
>>>> token".
>>>>
>>>> Everything was working fine until there was an unexpected load on our
>>>> cluster; the machines started to swap and became unresponsive. We fixed the
>>>> unexpected load and the three new machines were restarted. After that, the
>>>> new Cassandra machines were stating that some old tokens were not assigned,
>>>> namely those from data-2 and data-4. To fix this, I issued some "remove
>>>> token" commands again.
>>>>
>>>> Everything seems to be back to normal, but on the network I still see some
>>>> packets from the new cluster to the old machines, on port 7000.
>>>> How can I tell Cassandra to completely forget about the old machines?
>>>>
>>>> Nicolas
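
The JMX attribute dumps quoted in this thread can also be cross-checked mechanically rather than by eye. The following is only a minimal sketch in Python: it assumes the attribute values are pasted in as plain strings in the format shown above, and the function names (parse_simple_states, parse_task_counts, stuck_endpoints) are hypothetical helpers for illustration, not part of Cassandra or any JMX library. It lists endpoints that the FailureDetector marks as DOWN but that still have pending tasks in MessagingService.

```python
import re


def parse_simple_states(text):
    """Parse a SimpleStates string such as '[/10.10.0.22:DOWN, /10.10.0.26:UP]'
    into a dict mapping IP address to 'UP' or 'DOWN'."""
    return dict(re.findall(r"/?([\d.]+):(UP|DOWN)", text))


def parse_task_counts(text):
    """Parse a task-count string such as '[10.10.0.22:0, 10.10.0.24:2]'
    into a dict mapping IP address to an integer count."""
    return {ip: int(n) for ip, n in re.findall(r"([\d.]+):(\d+)", text)}


def stuck_endpoints(states, pending):
    """Return {ip: count} for endpoints marked DOWN that still have pending tasks."""
    return {ip: n for ip, n in pending.items()
            if states.get(ip) == "DOWN" and n > 0}


# SimpleStates as quoted in the thread.
states = parse_simple_states(
    "[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, "
    "/10.10.0.25:UP, /10.10.0.27:UP]"
)
# ResponsePendingTasks as reported for data-7 in the thread.
pending = parse_task_counts(
    "[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]"
)

print(stuck_endpoints(states, pending))  # {'10.10.0.22': 4, '10.10.0.24': 1}
```

Run against the data-7 values, this flags the two removed nodes, 10.10.0.22 and 10.10.0.24, which matches the observation above that some ResponsePendingTasks for the old machines never complete.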