Re: Dead node still being pinged

Nicolas Lalevée Fri, 08 Jun 2012 12:03:42 -0700

On 8 June 2012 at 20:02, Samuel CARRIERE wrote:

> I'm on the train, so this is just a guess: maybe it's hinted handoff. A look at 
> the logs of the new nodes could confirm that: search for the IP of an old node 
> and you may find hinted-handoff-related messages.


I grepped the logs on every node for the IP of every old node, and I found nothing since the "crash".
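
For anyone wanting to repeat this kind of check, here is a minimal sketch in Python. The IP list, the log location, and the helper names are assumptions for illustration, not something taken from the cluster's configuration. It matches whole addresses only, so a search for 10.10.0.24 does not also report a hypothetical 10.10.0.245.

```python
import glob
import re

# Hypothetical values: replace with the real old-node IPs and log location.
OLD_NODE_IPS = ["10.10.0.22", "10.10.0.24"]
LOG_GLOB = "/var/log/cassandra/system.log*"


def find_mentions(lines, ips):
    """Return (ip, line) pairs for every line that mentions one of the IPs."""
    # The lookarounds reject a match that is only part of a longer number.
    patterns = {
        ip: re.compile(r"(?<!\d)" + re.escape(ip) + r"(?!\d)") for ip in ips
    }
    hits = []
    for line in lines:
        for ip, pattern in patterns.items():
            if pattern.search(line):
                hits.append((ip, line.rstrip("\n")))
    return hits


def scan_logs(log_glob=LOG_GLOB, ips=OLD_NODE_IPS):
    """Print every log line under log_glob that mentions an old node."""
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as handle:
            for ip, line in find_mentions(handle, ips):
                print(f"{path}: [{ip}] {line}")


if __name__ == "__main__":
    scan_logs()
```

Run on each new node, this lists every line that names an old node; filtering by the timestamp at the start of each line would narrow it to entries after the crash.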

In case it helps, here are some grepped log lines from the crash:

system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
will not receive data for re-replication of /10.10.0.22
system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java 
(line 1157) Removing token 127605887595351923798765477786913079296 for 
/10.10.0.24
system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.


Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission 
command. For each node, I took the node down and then issued a "remove token" 
command.
Here is what I can find in the logs from when I removed one of them:

system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 
818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 
HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java 
(line 1157) Removing token 145835300108973619103103718265651724288 for 
/10.10.0.24
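
The token removals in excerpts like these can also be pulled out with a script instead of by eye. The following is only a sketch: the regular expression is modelled on the "Removing token" lines quoted above, the function name is made up for illustration, and it assumes each log entry sits on one physical line (the entries in this mail are wrapped by the mail client).

```python
import re
from collections import defaultdict

# Modelled on the StorageService "Removing token" entries quoted in this thread.
REMOVAL = re.compile(
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"StorageService\.java \(line \d+\) "
    r"Removing token (?P<token>\d+) for /(?P<endpoint>[\d.]+)"
)


def removed_tokens(lines):
    """Map each endpoint to the (timestamp, token) removals logged for it."""
    by_endpoint = defaultdict(list)
    for line in lines:
        match = REMOVAL.search(line)
        if match:
            by_endpoint[match["endpoint"]].append(
                (match["when"], match["token"])
            )
    return dict(by_endpoint)
```

Grouping by endpoint makes it easy to see whether the same address had more than one token removed, and when.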


Nicolas


> 
> 
> ----- Original message -----
> From: Nicolas Lalevée [[email protected]]
> Sent: 08/06/2012 19:26 ZE2
> To: [email protected]
> Subject: Re: Dead node still being pinged
> 
> 
> 
> On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
> 
>> What does nodetool ring say? (Ask every node.)
> 
> Currently, each new node sees only the tokens of the new nodes.
> 
>> Have you checked that the list of seeds in every yaml is correct?
> 
> Yes, it is correct: each of my new nodes points to the first of the new nodes.
> 
>> What version of Cassandra are you using?
> 
> Sorry, I should have written this in my first mail.
> I use 1.0.9.
> 
> Nicolas
> 
>> 
>> Samuel
>> 
>> 
>> 
>> Nicolas Lalevée <[email protected]>
>> 08/06/2012 14:10
>> Please reply to
>> [email protected]
>> 
>> To
>> [email protected]
>> cc
>> Subject
>> Dead node still being pinged
>> 
>> 
>> 
>> 
>> 
>> I had a cluster of 4 nodes, data-1 to data-4. We then bought 3 bigger 
>> machines, data-5 to data-7, and moved all the data from the old nodes to the 
>> new ones. To move the data without interrupting service, I added one new node 
>> at a time, and then removed the old machines one by one via a "remove token".
>> 
>> Everything was working fine until there was an unexpected load on our 
>> cluster: the machines started to swap and became unresponsive. We fixed the 
>> unexpected load and restarted the three new machines. After that, the new 
>> Cassandra machines reported that some old tokens were not assigned, namely 
>> those of data-2 and data-4. To fix this, I issued some "remove token" 
>> commands again.
>> 
>> Everything seems to be back to normal, but on the network I still see some 
>> packets going from the new cluster to the old machines, on port 7000.
>> How can I tell Cassandra to completely forget about the old machines?
>> 
>> Nicolas
>> 
>> 
>
