Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
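
If jconsole is not handy, the same purge can be sketched in Java over JMX. This is only a rough sketch under assumptions that do not come from this thread: that the MBean is registered as org.apache.cassandra.db:type=HintedHandoffManager, that it exposes a deleteHintsForEndpoint(String) operation, and that JMX listens on port 7199 without authentication. Check all three in jconsole for your version first. The class name PurgeHints and its helper methods are made up for illustration.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class PurgeHints {
    // Assumed MBean name; confirm it in jconsole for your Cassandra version.
    static final String HINTS_MBEAN = "org.apache.cassandra.db:type=HintedHandoffManager";

    /** Builds the usual RMI-based JMX service URL for a node. */
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Invokes the (assumed) deleteHintsForEndpoint(String) operation. */
    static void purge(MBeanServerConnection mbs, String endpoint) throws Exception {
        mbs.invoke(new ObjectName(HINTS_MBEAN),
                   "deleteHintsForEndpoint",
                   new Object[] { endpoint },
                   new String[] { String.class.getName() });
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: PurgeHints <live-node-host> <dead-endpoint-ip> [jmx-port]");
            return;
        }
        // 7199 is an assumed default; pass the real port as the third argument.
        int port = args.length > 2 ? Integer.parseInt(args[2]) : 7199;
        JMXServiceURL url = new JMXServiceURL(jmxUrl(args[0], port));
        // JMXConnector is Closeable, so the connection is released on exit.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            purge(connector.getMBeanServerConnection(), args[1]);
            System.out.println("Requested hint deletion for " + args[1] + " on " + args[0]);
        }
    }
}
```

It would be run once per live node, passing 10.10.0.24 as the dead endpoint, for example: java PurgeHints data-5 10.10.0.24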

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:

> Finally, thanks to the Groovy JMX builder, it was not that hard.
> 
> 
> On 11 June 2012, at 12:12, Samuel CARRIERE wrote:
> 
>> If I were you, I would connect (through JMX, with jconsole) to one of the 
>> nodes that is sending messages to an old node, and have a look at 
>> these MBeans: 
>>   - org.apache.cassandra.net:type=FailureDetector: does SimpleStates look 
>> good? (Or do you see the IP of an old node?)
> 
> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, 
> /10.10.0.25:UP, /10.10.0.27:UP]
> 
>>   - org.apache.cassandra.net:type=MessagingService: do you see one of the 
>> old IPs in any of the attributes?
> 
> data-5:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
> ResponseCompletedTasks:
> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
> ResponsePendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
> 
> data-6:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
> ResponseCompletedTasks:
> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
> ResponsePendingTasks:
> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
> 
> data-7:
> CommandCompletedTasks:
> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
> CommandPendingTasks:
> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
> ResponseCompletedTasks:
> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
> ResponsePendingTasks:
> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
> 
>>   - org.apache.cassandra.net:type=StreamingService: do you see an old IP in 
>> StreamSources or StreamDestinations?
> 
> nothing streaming on the 3 nodes.
> nodetool netstats confirmed that.
> 
>>   - org.apache.cassandra.internal:type=HintedHandoff: are there non-zero 
>> ActiveCount, CurrentlyBlockedTasks, PendingTasks, or TotalBlockedTasks?
> 
> On the 3 nodes, all at 0.
> 
> I don't really know what I'm looking at, but it seems that some 
> ResponsePendingTasks need to complete.
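
One way to read the output above is to cross-check the endpoints the failure detector reports as DOWN against the per-endpoint pending counts. The sketch below does that on values copied from the SimpleStates and data-7 output quoted here. The class and method names are hypothetical, and it works on plain maps rather than a live JMX connection, so treat it as an illustration of the check rather than a diagnostic tool.

```java
import java.util.Map;
import java.util.TreeMap;

final class StalePending {
    /**
     * Returns the endpoints reported DOWN that still have a non-zero pending count.
     * State keys look like "/10.10.0.24"; pending keys look like "10.10.0.24".
     */
    static Map<String, Integer> downWithPending(Map<String, String> states,
                                                Map<String, Integer> pending) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, String> e : states.entrySet()) {
            if (!"DOWN".equals(e.getValue())) {
                continue; // only dead endpoints are interesting here
            }
            // Strip the leading slash so both maps use the same key format.
            String ip = e.getKey().startsWith("/") ? e.getKey().substring(1) : e.getKey();
            Integer count = pending.get(ip);
            if (count != null && count > 0) {
                result.put(ip, count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // SimpleStates as quoted above.
        Map<String, String> states = new TreeMap<>();
        states.put("/10.10.0.22", "DOWN");
        states.put("/10.10.0.24", "DOWN");
        states.put("/10.10.0.25", "UP");
        states.put("/10.10.0.26", "UP");
        states.put("/10.10.0.27", "UP");

        // ResponsePendingTasks as quoted above for data-7.
        Map<String, Integer> pending = new TreeMap<>();
        pending.put("10.10.0.22", 4);
        pending.put("10.10.0.26", 0);
        pending.put("10.10.0.24", 1);
        pending.put("10.10.0.25", 0);

        System.out.println(downWithPending(states, pending));
        // prints {10.10.0.22=4, 10.10.0.24=1}
    }
}
```

On the quoted data-7 values, both removed nodes still have responses pending, which matches the observation that some ResponsePendingTasks never finish.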
> 
> Nicolas
> 
>> 
>> Samuel 
>> 
>> 
>> 
>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>> 08/06/2012 21:03
>> Please reply to
>> user@cassandra.apache.org
>> 
>> To
>> user@cassandra.apache.org
>> cc
>> Subject
>> Re: Dead node still being pinged
>> 
>> 
>> 
>> 
>> 
>> 
>> On 8 June 2012, at 20:02, Samuel CARRIERE wrote:
>> 
>>> I'm on the train, so this is just a guess: maybe it's hinted handoff. A look 
>>> in the logs of the new nodes could confirm that: look for the IP of an old 
>>> node and maybe you'll find messages related to hinted handoff.
>> 
>> I grepped the logs on every node for every old node's IP, and I found 
>> nothing since the "crash".
>> 
>> If it can be of some help, here are some grepped log lines from the crash:
>> 
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>> 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>> and will not receive data for re-replication of /10.10.0.22
>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
>> HintedHandOffManager.java (line 179) Deleting any stored hints for 
>> /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 
>> StorageService.java (line 1157) Removing token 
>> 127605887595351923798765477786913079296 for /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> 
>> 
>> Maybe it's the way I removed the nodes? AFAIR I didn't use the 
>> decommission command. For each node, I took the node down and then issued a 
>> remove token command.
>> Here is what I can find in the log from when I removed one of them:
>> 
>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 
>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>> delivery, aborting
>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java 
>> (line 818) InetAddress /10.10.0.24 is now dead.
>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 
>> HintedHandOffManager.java (line 179) Deleting any stored hints for 
>> /10.10.0.24
>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 
>> StorageService.java (line 1157) Removing token 
>> 145835300108973619103103718265651724288 for /10.10.0.24
>> 
>> 
>> Nicolas
>> 
>> 
>>> 
>>> 
>>> ----- Original message -----
>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>> Sent: 08/06/2012 19:26 ZE2
>>> To: user@cassandra.apache.org
>>> Subject: Re: Dead node still being pinged
>>> 
>>> 
>>> 
>>> On 8 June 2012, at 15:17, Samuel CARRIERE wrote:
>>> 
>>>> What does nodetool ring say? (Ask every node.)
>>> 
>>> Currently, each of the new nodes sees only the tokens of the new nodes.
>>> 
>>>> Have you checked that the list of seeds in every yaml is correct?
>>> 
>>> Yes, it is correct: each of my new nodes points to the first of my new nodes.
>>> 
>>>> What version of Cassandra are you using?
>>> 
>>> Sorry, I should have written this in my first mail.
>>> I use 1.0.9.
>>> 
>>> Nicolas
>>> 
>>>> 
>>>> Samuel
>>>> 
>>>> 
>>>> 
>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>> 08/06/2012 14:10
>>>> Please reply to
>>>> user@cassandra.apache.org
>>>> 
>>>> To
>>>> user@cassandra.apache.org
>>>> cc
>>>> Subject
>>>> Dead node still being pinged
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I had a configuration with 4 nodes, data-1 through data-4. We then bought 
>>>> 3 bigger machines, data-5 through data-7, and moved all the data from the 
>>>> old nodes to the new ones.
>>>> To move all the data without interruption of service, I added one new node 
>>>> at a time. Then I removed the old machines one by one via a "remove 
>>>> token".
>>>> 
>>>> Everything was working fine until there was an unexpected load on our 
>>>> cluster: the machines started to swap and became unresponsive. We fixed the 
>>>> unexpected load and the three new machines were restarted. After that, the 
>>>> new Cassandra machines were stating that some old tokens were not assigned, 
>>>> namely those from data-2 and data-4. To fix this I again issued some "remove 
>>>> token" commands.
>>>> 
>>>> Everything seems to be back to normal, but on the network I still see some 
>>>> packets going from the new cluster to the old machines, on port 7000.
>>>> How can I tell Cassandra to completely forget about the old machines?
>>>> 
>>>> Nicolas
>>>> 
>>>> 
>>> 
>> 
>> 
> 
