On 12 June 2012 at 11:03, aaron morton wrote:

> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
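A minimal plain-Java sketch of what such a purge can look like over JMX (the Groovy JMX builder mentioned further down does the same thing with less ceremony). The MBean name org.apache.cassandra.db:type=HintedHandoffManager and the default JMX port 7199 are assumptions for a stock 1.0.x node, so check both in jconsole first; the class name is only for illustration:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PurgeHints {
        public static void main(String[] args) throws Exception {
            String node = args[0];         // node to connect to, e.g. "10.10.0.26"
            String deadEndpoint = args[1]; // endpoint whose hints should be purged, e.g. "10.10.0.24"

            // 7199 is the default Cassandra JMX port; adjust if cassandra-env.sh changes it.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // MBean name as registered by 1.0.x nodes; verify it in jconsole.
                ObjectName hhm = new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager");
                // Ask the node to drop any stored hints destined for the dead endpoint.
                mbs.invoke(hhm, "deleteHintsForEndpoint",
                        new Object[] { deadEndpoint },
                        new String[] { "java.lang.String" });
                System.out.println("Requested hint purge for " + deadEndpoint + " on " + node);
            } finally {
                connector.close();
            }
        }
    }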
As far as I could tell, there were no hinted handoffs to be delivered. Nevertheless, I called "deleteHintsForEndpoint" on every node for the two nodes that are supposed to be out of the cluster. Nothing changed: I still see packets being sent to these old nodes.

I looked more closely at the ResponsePendingTasks of MessagingService. The numbers actually change, between 0 and about 4, so tasks do complete but new ones arrive right after.

Nicolas

>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:
>
>> Finally, thanks to the Groovy JMX builder, it was not that hard.
>>
>> On 11 June 2012 at 12:12, Samuel CARRIERE wrote:
>>
>>> If I were you, I would connect (through JMX, with jconsole) to one of the
>>> nodes that is sending messages to an old node, and would have a look at
>>> these MBeans:
>>> - org.apache.net.FailureDetector: does SimpleStates look good? (or do
>>> you see an IP of an old node)
>>
>> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP,
>> /10.10.0.25:UP, /10.10.0.27:UP]
>>
>>> - org.apache.net.MessagingService: do you see one of the old IPs in one
>>> of the attributes?
>>
>> data-5:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>
>> data-6:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>>
>> data-7:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
>> ResponsePendingTasks:
>> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>>
>>> - org.apache.net.StreamingService: do you see an old IP in StreamSources
>>> or StreamDestinations?
>>
>> Nothing is streaming on the 3 nodes.
>> nodetool netstats confirmed that.
>>
>>> - org.apache.internal.HintedHandoff: are there non-zero ActiveCount,
>>> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>>
>> On the 3 nodes, all at 0.
>>
>> I don't really know what I'm looking at, but it seems that some
>> ResponsePendingTasks need to end.
>>
>> Nicolas
>>
>>> Samuel
>>>
>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>> 08/06/2012 21:03
>>> Please reply to
>>> user@cassandra.apache.org
>>>
>>> To
>>> user@cassandra.apache.org
>>> cc
>>> Subject
>>> Re: Dead node still being pinged
>>>
>>> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
>>>
>>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look in
>>>> the logs of the new nodes could confirm that: look for the IP of an old
>>>> node and maybe you'll find hinted handoff related messages.
>>>
>>> I grepped on every node for every old node; I got nothing since the
>>> "crash".
>>>
>>> If it can be of any help, here is some grepped log from the crash:
>>>
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>
>>> Maybe it's the way I removed the nodes? AFAIR I didn't use the
>>> decommission command. For each node, I brought the node down and then
>>> issued a remove token command.
>>> Here is what I can find in the log about when I removed one of them:
>>>
>>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
>>>
>>> Nicolas
>>>
>>>> ----- Original message -----
>>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>>> Sent: 08/06/2012 19:26 ZE2
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Dead node still being pinged
>>>>
>>>> On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
>>>>
>>>>> What does nodetool ring say? (Ask every node)
>>>>
>>>> Currently, each new node sees only the tokens of the new nodes.
>>>>
>>>>> Have you checked that the list of seeds in every yaml is correct?
>>>>
>>>> Yes, it is correct; every one of my new nodes points to the first of my
>>>> new nodes.
>>>>
>>>>> What version of Cassandra are you using?
>>>>
>>>> Sorry, I should have written this in my first mail.
>>>> I use 1.0.9.
>>>>
>>>> Nicolas
>>>>
>>>>> Samuel
>>>>>
>>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>>> 08/06/2012 14:10
>>>>> Please reply to
>>>>> user@cassandra.apache.org
>>>>>
>>>>> To
>>>>> user@cassandra.apache.org
>>>>> cc
>>>>> Subject
>>>>> Dead node still being pinged
>>>>>
>>>>> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3
>>>>> bigger machines, data-5 to data-7, and we moved all the data from
>>>>> data-1..4 to data-5..7.
>>>>> To move all the data without interrupting service, I added one new node
>>>>> at a time, and then removed the old machines one by one via a
>>>>> "remove token".
>>>>>
>>>>> Everything was working fine until there was an unexpected load on our
>>>>> cluster: the machines started to swap and became unresponsive. We fixed
>>>>> the unexpected load and the three new machines were restarted. After
>>>>> that, the new Cassandra machines were stating that some old tokens were
>>>>> not assigned, namely those of data-2 and data-4. To fix this I issued
>>>>> some "remove token" commands again.
>>>>>
>>>>> Everything seems to be back to normal, but on the network I still see
>>>>> some packets going from the new cluster to the old machines, on port
>>>>> 7000. How can I tell Cassandra to completely forget about the old
>>>>> machines?
>>>>>
>>>>> Nicolas
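A rough plain-Java equivalent of the jconsole and Groovy JMX checks used throughout this thread; it polls the FailureDetector and MessagingService MBeans whose values were pasted above, so a removed endpoint that still shows up in gossip or keeps accumulating pending tasks is easy to spot. The object names, the attribute names, and the default JMX port 7199 are assumptions for a stock 1.0.x node and should be verified in jconsole; the class name is only illustrative:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class GossipAndMessagingCheck {
        public static void main(String[] args) throws Exception {
            String node = args[0]; // node to inspect, e.g. "10.10.0.26"

            // 7199 is the default Cassandra JMX port; adjust if cassandra-env.sh changes it.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Gossip's view of the cluster: removed endpoints should eventually vanish from here.
                ObjectName fd = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
                System.out.println("SimpleStates: " + mbs.getAttribute(fd, "SimpleStates"));

                // Per-endpoint message counters: non-zero pending tasks for a removed
                // node mean this node is still trying to talk to it.
                ObjectName ms = new ObjectName("org.apache.cassandra.net:type=MessagingService");
                System.out.println("ResponsePendingTasks:   " + mbs.getAttribute(ms, "ResponsePendingTasks"));
                System.out.println("ResponseCompletedTasks: " + mbs.getAttribute(ms, "ResponseCompletedTasks"));
            } finally {
                connector.close();
            }
        }
    }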