On 12 June 2012 at 11:03, aaron morton wrote:

> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
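A minimal plain-Java sketch of what such a purge can look like over JMX (the Groovy JMX builder mentioned further down does the same thing with less ceremony). The MBean name org.apache.cassandra.db:type=HintedHandoffManager and the default JMX port 7199 are assumptions for a stock 1.0.x node, so check both in jconsole first; the class name is only for illustration:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PurgeHints {
        public static void main(String[] args) throws Exception {
            String node = args[0];         // node to connect to, e.g. "10.10.0.26"
            String deadEndpoint = args[1]; // endpoint whose hints should be purged, e.g. "10.10.0.24"

            // 7199 is the default Cassandra JMX port; adjust if cassandra-env.sh changes it.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // MBean name as registered by 1.0.x nodes; verify it in jconsole.
                ObjectName hhm = new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager");
                // Ask the node to drop any stored hints destined for the dead endpoint.
                mbs.invoke(hhm, "deleteHintsForEndpoint",
                        new Object[] { deadEndpoint },
                        new String[] { "java.lang.String" });
                System.out.println("Requested hint purge for " + deadEndpoint + " on " + node);
            } finally {
                connector.close();
            }
        }
    }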
As far as I could tell, there were no hinted handoffs to be delivered. Nevertheless, I called "deleteHintsForEndpoint" on every node for the two nodes that are supposed to be out of the cluster. Nothing changed: I still see packets being sent to these old nodes.

I looked more closely at the ResponsePendingTasks of MessagingService. The numbers actually change, between 0 and about 4, so tasks do complete but new ones arrive right after.

Nicolas

>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:
>
>> Finally, thanks to the Groovy JMX builder, it was not that hard.
>>
>> On 11 June 2012 at 12:12, Samuel CARRIERE wrote:
>>
>>> If I were you, I would connect (through JMX, with jconsole) to one of the
>>> nodes that is sending messages to an old node, and would have a look at
>>> these MBeans:
>>> - org.apache.net.FailureDetector: does SimpleStates look good? (or do
>>> you see an IP of an old node)
>>
>> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP,
>> /10.10.0.25:UP, /10.10.0.27:UP]
>>
>>> - org.apache.net.MessagingService: do you see one of the old IPs in one
>>> of the attributes?
>>
>> data-5:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>
>> data-6:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>>
>> data-7:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
>> ResponsePendingTasks:
>> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>>
>>> - org.apache.net.StreamingService: do you see an old IP in StreamSources
>>> or StreamDestinations?
>>
>> Nothing is streaming on the 3 nodes.
>> nodetool netstats confirmed that.
>>
>>> - org.apache.internal.HintedHandoff: are there non-zero ActiveCount,
>>> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>>
>> On the 3 nodes, all at 0.
>>
>> I don't really know what I'm looking at, but it seems that some
>> ResponsePendingTasks need to end.
>>
>> Nicolas
>>
>>> Samuel
>>>
>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>> 08/06/2012 21:03
>>> Please reply to
>>> user@cassandra.apache.org
>>>
>>> To
>>> user@cassandra.apache.org
>>> cc
>>> Subject
>>> Re: Dead node still being pinged
>>>
>>> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
>>>
>>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look in
>>>> the logs of the new nodes could confirm that: look for the IP of an old
>>>> node and maybe you'll find hinted handoff related messages.
>>>
>>> I grepped on every node for every old node; I got nothing since the
>>> "crash".
>>>
>>> If it can be of any help, here is some grepped log from the crash:
>>>
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>
>>> Maybe it's the way I removed the nodes? AFAIR I didn't use the
>>> decommission command. For each node, I brought the node down and then
>>> issued a remove token command.
>>> Here is what I can find in the log about when I removed one of them:
>>>
>>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
>>>
>>> Nicolas
>>>
>>>> ----- Original message -----
>>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>>> Sent: 08/06/2012 19:26 ZE2
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Dead node still being pinged
>>>>
>>>> On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
>>>>
>>>>> What does nodetool ring say? (Ask every node)
>>>>
>>>> Currently, each new node sees only the tokens of the new nodes.
>>>>
>>>>> Have you checked that the list of seeds in every yaml is correct?
>>>>
>>>> Yes, it is correct; every one of my new nodes points to the first of my
>>>> new nodes.
>>>>
>>>>> What version of Cassandra are you using?
>>>>
>>>> Sorry, I should have written this in my first mail.
>>>> I use 1.0.9.
>>>>
>>>> Nicolas
>>>>
>>>>> Samuel
>>>>>
>>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>>> 08/06/2012 14:10
>>>>> Please reply to
>>>>> user@cassandra.apache.org
>>>>>
>>>>> To
>>>>> user@cassandra.apache.org
>>>>> cc
>>>>> Subject
>>>>> Dead node still being pinged
>>>>>
>>>>> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3
>>>>> bigger machines, data-5 to data-7, and we moved all the data from
>>>>> data-1..4 to data-5..7.
>>>>> To move all the data without interrupting service, I added one new node
>>>>> at a time, and then removed the old machines one by one via a
>>>>> "remove token".
>>>>>
>>>>> Everything was working fine until there was an unexpected load on our
>>>>> cluster: the machines started to swap and became unresponsive. We fixed
>>>>> the unexpected load and the three new machines were restarted. After
>>>>> that, the new Cassandra machines were stating that some old tokens were
>>>>> not assigned, namely those of data-2 and data-4. To fix this I issued
>>>>> some "remove token" commands again.
>>>>>
>>>>> Everything seems to be back to normal, but on the network I still see
>>>>> some packets going from the new cluster to the old machines, on port
>>>>> 7000. How can I tell Cassandra to completely forget about the old
>>>>> machines?
>>>>>
>>>>> Nicolas
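A rough plain-Java equivalent of the jconsole and Groovy JMX checks used throughout this thread; it polls the FailureDetector and MessagingService MBeans whose values were pasted above, so a removed endpoint that still shows up in gossip or keeps accumulating pending tasks is easy to spot. The object names, the attribute names, and the default JMX port 7199 are assumptions for a stock 1.0.x node and should be verified in jconsole; the class name is only illustrative:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class GossipAndMessagingCheck {
        public static void main(String[] args) throws Exception {
            String node = args[0]; // node to inspect, e.g. "10.10.0.26"

            // 7199 is the default Cassandra JMX port; adjust if cassandra-env.sh changes it.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Gossip's view of the cluster: removed endpoints should eventually vanish from here.
                ObjectName fd = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
                System.out.println("SimpleStates: " + mbs.getAttribute(fd, "SimpleStates"));

                // Per-endpoint message counters: non-zero pending tasks for a removed
                // node mean this node is still trying to talk to it.
                ObjectName ms = new ObjectName("org.apache.cassandra.net:type=MessagingService");
                System.out.println("ResponsePendingTasks:   " + mbs.getAttribute(ms, "ResponsePendingTasks"));
                System.out.println("ResponseCompletedTasks: " + mbs.getAttribute(ms, "ResponseCompletedTasks"));
            } finally {
                connector.close();
            }
        }
    }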