Finally, thanks to the Groovy JMX builder, it was not that hard.
On 11 June 2012 at 12:12, Samuel CARRIERE wrote:

> If I were you, I would connect (through JMX, with jconsole) to one of the
> nodes that is sending messages to an old node, and would have a look at these
> MBeans:
> - org.apache.net.FailureDetector: does SimpleStates look good? (or do
> you see an IP of an old node)

SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, /10.10.0.25:UP, /10.10.0.27:UP]

> - org.apache.net.MessagingService: do you see one of the old IPs in one of
> the attributes?

data-5:
CommandCompletedTasks: [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
CommandPendingTasks: [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
ResponseCompletedTasks: [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
ResponsePendingTasks: [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]

data-6:
CommandCompletedTasks: [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
CommandPendingTasks: [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks: [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
ResponsePendingTasks: [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]

data-7:
CommandCompletedTasks: [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
CommandPendingTasks: [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks: [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
ResponsePendingTasks: [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]

> - org.apache.net.StreamingService: do you see an old IP in StreamSources
> or StreamDestinations?

Nothing is streaming on the 3 nodes. nodetool netstats confirmed that.

> - org.apache.internal.HintedHandoff: are there non-zero ActiveCount,
> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTask?

On the 3 nodes, all at 0.
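For anyone reading the MessagingService numbers above, here is a minimal Python sketch of one way to cross-check them against SimpleStates: it lists the endpoints that the failure detector marks DOWN but that still have a non-zero pending count. The helper names (parse_pairs, pending_for_down_nodes) are made up for illustration, and the input is simply the attribute text pasted from this mail, not a live JMX connection.

```python
def parse_pairs(line):
    """Turn 'Name: [/10.10.0.22:DOWN, ...]' into {'10.10.0.22': 'DOWN', ...}."""
    body = line[line.index("[") + 1:line.rindex("]")]
    pairs = {}
    for item in body.split(","):
        item = item.strip().lstrip("/")
        if not item:
            continue  # tolerate an empty list such as 'Name: []'
        # Split on the last colon: endpoint on the left, value on the right.
        endpoint, value = item.rsplit(":", 1)
        pairs[endpoint] = value
    return pairs


def pending_for_down_nodes(states_line, pending_line):
    """Return {endpoint: pending count} for DOWN endpoints with a count > 0."""
    states = parse_pairs(states_line)
    pending = parse_pairs(pending_line)
    return {
        endpoint: int(count)
        for endpoint, count in pending.items()
        if states.get(endpoint) == "DOWN" and int(count) > 0
    }


# Attribute text copied from the data-7 output in this mail.
states = "SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, /10.10.0.25:UP, /10.10.0.27:UP]"
data7 = "ResponsePendingTasks: [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]"
print(pending_for_down_nodes(states, data7))
```

With the data-7 values this reports 10.10.0.22 and 10.10.0.24, the two old nodes, which matches the packets still seen on the network. It only reads the numbers; it says nothing about why those responses are still pending.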
I don't really know what I'm looking at, but it seems that some ResponsePendingTasks still need to complete.

Nicolas

> Samuel
>
> Nicolas Lalevée <nicolas.lale...@hibnet.org>
> 08/06/2012 21:03
> Please reply to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> cc:
> Subject: Re: Dead node still being pinged
>
> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
>
> > I'm in the train but just a guess: maybe it's hinted handoff. A look in
> > the logs of the new nodes could confirm that: look for the IP of an old
> > node and maybe you'll find hinted handoff related messages.
>
> I grepped every node for every old node, and I got nothing since the "crash".
>
> If it can be of some help, here is some grepped log of the crash:
>
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>
> Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission
> command. For each node, I took the node down and then issued a remove token
> command.
> Here is what I can find in the log about when I removed one of them:
>
> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
>
> Nicolas
>
> > ----- Original message -----
> > From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
> > Sent: 08/06/2012 19:26 ZE2
> > To: user@cassandra.apache.org
> > Subject: Re: Dead node still being pinged
> >
> > On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
> >
> >> What does nodetool ring say? (Ask every node)
> >
> > Currently, each of the new nodes sees only the tokens of the new nodes.
> >
> >> Have you checked that the list of seeds in every yaml is correct?
> >
> > Yes, it is correct: each of my new nodes points to the first of my new nodes.
> >
> >> What version of cassandra are you using?
> >
> > Sorry, I should have written this in my first mail.
> > I use 1.0.9.
> >
> > Nicolas
> >
> >> Samuel
> >>
> >> Nicolas Lalevée <nicolas.lale...@hibnet.org>
> >> 08/06/2012 14:10
> >> Please reply to: user@cassandra.apache.org
> >> To: user@cassandra.apache.org
> >> cc:
> >> Subject: Dead node still being pinged
> >>
> >> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3
> >> bigger machines, data-5 to data-7, and we moved all data from data-1..4
> >> to data-5..7. To move all the data without interruption of service, I
> >> added one new node at a time. Then I removed the old machines one by one
> >> via a "remove token".
> >>
> >> Everything was working fine until there was an unexpected load on our
> >> cluster: the machines started to swap and became unresponsive. We fixed the
> >> unexpected load and the three new machines were restarted. After that, the
> >> new cassandra machines were stating that some old tokens were not assigned,
> >> namely those from data-2 and data-4. To fix this, I again issued some
> >> "remove token" commands.
> >>
> >> Everything seems to be back to normal, but on the network I still see some
> >> packets from the new cluster to the old machines, on port 7000.
> >> How can I tell cassandra to completely forget about the old machines?
> >>
> >> Nicolas