Re: Unable to remove dead node from cluster.

Jeff Jirsa Fri, 25 Sep 2015 07:10:14 -0700

The stack trace looks similar to one I recall seeing recently, though I don’t 
have it in front of me. What follows is a long shot, and I’m not at all certain 
it applies.


For EACH of the hundreds of nodes in your cluster, I suggest you run 

nodetool status | egrep "(^UN|^DN)" | wc -l 

and compare the counts to see whether every node really has every other node 
in its ring. 

I suspect, but am not at all sure, that you have inconsistencies you’re not yet 
aware of (for example, if you expect that you have 100 nodes in the cluster, 
I’m betting that the query above returns 99 on at least one of the nodes).  If 
this is the case, please reply so that you and I can submit a Jira and compare 
our stack traces and we can find the underlying root cause of this together. 
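
A minimal sketch of that per-node check, for anyone who wants to script it. 
The hosts file (one address per line), the expected node count, and both 
function names are assumptions made for illustration; they are not from this 
thread. It also assumes nodetool can reach each node's JMX port from the 
machine where the script runs.

```shell
#!/usr/bin/env bash

# Count the UN/DN lines in `nodetool status` output read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`.
count_ring_members() {
    grep -cE '^(UN|DN)' || true
}

# Hypothetical helper: ask every host in the file for its view of the ring
# and report the hosts whose count differs from the expected one.
#   $1 = expected node count (assumed known, e.g. 100)
#   $2 = file with one host address per line (assumed to exist)
check_ring_counts() {
    local expected="$1" hosts_file="$2" host count
    while read -r host; do
        [ -z "$host" ] && continue
        # </dev/null keeps nodetool from consuming the rest of the hosts file.
        count=$(nodetool -h "$host" status </dev/null | count_ring_members)
        if [ "$count" -ne "$expected" ]; then
            echo "MISMATCH: $host sees $count nodes, expected $expected"
        fi
    done < "$hosts_file"
}
```

Calling check_ring_counts 100 hosts.txt would print one line for each node 
whose view of the ring is off. A node that nodetool cannot reach is reported 
with a count of 0, so it shows up as a mismatch as well.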

- Jeff

From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, September 24, 2015 at 9:10 PM
To:  cassandra
Subject:  Re: Unable to remove dead node from cluster.

@Jeff, I just used JMX to connect to one node, ran unsafeAssassinateEndpoint, 
and passed in the "10.210.165.55" IP address.

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
When you run unsafeAssassinateEndpoint, to which host are you connected, and 
what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool 
status’ output?


From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 10:09 PM
To: cassandra
Cc: "d...@cassandra.apache.org"
Subject: Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <dikan...@gmail.com> wrote:
I have tried all of them, and none of them worked. 
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can we fix 
it?

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez 
<sebastian.este...@datastax.com> wrote:

Order is decommission, remove, assassinate.

Which have you tried?

On Sep 21, 2015 10:47 AM, "Dikang Gu" <dikan...@gmail.com> wrote:
Hi there, 

I have a dead node in our cluster, which is in a weird state right now and 
cannot be removed from the cluster.

The nodetool status output shows:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                          Load       Tokens  Owns    Host ID         Rack
DN  10.210.165.55                    ?          256     ?       null            r1

I tried unsafeAssassinateEndpoint, but got an exception like:
2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread Thread[GossipStage:1,5,main]
2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80670       at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80672       at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_45]
2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to local pause of 10852378435 > 5000000000

Any suggestions about how to remove it?
Thanks.

-- 
Dikang




