Re: Cassandra is not showing a node up hours after restart

Hiroyuki Yamada Wed, 11 Dec 2019 19:53:05 -0800

Hello Paul,

The behavior looks similar to what we experienced and reported.
https://issues.apache.org/jira/browse/CASSANDRA-15138


In our testing, "service cassandra stop" sometimes leaves the cluster in
a wrong state.
How about trying kill -9 instead?

Thanks,
Hiro

On Sun, Dec 8, 2019 at 7:47 PM Hossein Ghiyasi Mehr
<ghiyasim...@gmail.com> wrote:
>
> How did you install Cassandra: from the deb package or the tarball?
> If it's deb, use its service script to start and stop.
> If it's tar, kill the Cassandra PID to stop and use bin/cassandra to start.
>
> Stopping doesn't need any other actions (drain, disable gossip, etc.).
>
> Where do you use Cassandra?
> -------------------------------------------------------
> VafaTech : A Total Solution for Data Gathering & Analysis
> -------------------------------------------------------
>
>
> On Fri, Dec 6, 2019 at 11:20 PM Paul Mena <pm...@whoi.edu> wrote:
>>
>> As we are still without a functional Cassandra cluster in our development 
>> environment, I thought I’d try restarting the same node (one of 4 in the 
>> cluster) with the following command:
>>
>>
>>
>> ip=$(cat /etc/hostname); nodetool disablethrift && nodetool disablebinary && 
>> sleep 5 && nodetool disablegossip && nodetool drain && sleep 10 && sudo 
>> service cassandra restart && until echo "SELECT * FROM system.peers LIMIT 
>> 1;" | cqlsh $ip > /dev/null 2>&1; do echo "Node $ip is still DOWN"; sleep 
>> 10; done && echo "Node $ip is now UP"
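The one-liner above loops forever if the node never answers, and it only shows that the restarted node accepts CQL locally, not that its peers see it as up. A minimal sketch of the same probe with a deadline follows; the function name `wait_for_cql`, the 300-second default timeout, and the 10-second interval are illustrative choices, not anything from the thread.

```shell
#!/usr/bin/env bash
# Sketch: poll a node over CQL until it answers or a deadline passes.
# wait_for_cql and its default timings are hypothetical, chosen for illustration.

wait_for_cql() {
    local host="$1" timeout="${2:-300}" interval="${3:-10}" waited=0
    # Same probe as the one-liner: any successful query means CQL is up.
    until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh "$host" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Node $host did not answer CQL within ${timeout}s" >&2
            return 1
        fi
        echo "Node $host is still DOWN"
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "Node $host is now UP"
}
```

Usage would be something like `wait_for_cql "$(cat /etc/hostname)" 300 10`, followed by a `nodetool status` on one of the other nodes to confirm their view.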
>>
>>
>>
>> The above command returned “Node is now UP” after about 40 seconds, 
>> confirmed on “node001” via “nodetool status”:
>>
>>
>>
>> user@node001=> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address          Load       Tokens  Owns    Host ID                               Rack
>> UN  192.168.187.121  539.43 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
>> UN  192.168.187.122  633.92 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
>> UN  192.168.187.123  576.31 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
>> UN  192.168.187.124  628.5 GB   256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
>>
>>
>>
>> As was the case before, running “nodetool status” on any of the other nodes 
>> shows that “node001” is still down:
>>
>>
>>
>> user@node002=> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address          Load       Tokens  Owns    Host ID                               Rack
>> DN  192.168.187.121  538.94 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
>> UN  192.168.187.122  634.04 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
>> UN  192.168.187.123  576.42 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
>> UN  192.168.187.124  628.56 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
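Since the two views disagree, it can help to collect every node's view in one pass. The following is a rough sketch that pulls the down nodes out of `nodetool status` output and repeats that for each host; `down_nodes` and `compare_views` are hypothetical helper names, and it assumes passwordless ssh to each node and that `nodetool` is on the remote PATH.

```shell
#!/usr/bin/env bash
# Sketch: compare each node's view of which peers are down.
# down_nodes and compare_views are hypothetical helpers, not Cassandra tools.

# Read `nodetool status` output on stdin and print the address of every
# row whose status letter is D (DN, DL, DJ, DM).
down_nodes() {
    awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}

# For each host given, ask that host for its own view over ssh
# (assumption: ssh access works) and summarize it on one line.
compare_views() {
    for viewer in "$@"; do
        down=$(ssh "$viewer" nodetool status | down_nodes | paste -sd ' ' -)
        echo "$viewer sees down: ${down:-none}"
    done
}
```

Calling `compare_views node001 node002 node003 node004` would print one line per viewer, which makes a one-sided disagreement like the one above easy to spot.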
>>
>>
>>
>> Is it inadvisable to continue with the rolling restart?
>>
>>
>>
>> Paul Mena
>>
>> Senior Application Administrator
>>
>> WHOI - Information Services
>>
>> 508-289-3539
>>
>>
>>
>> From: Shalom Sagges <shalomsag...@gmail.com>
>> Sent: Tuesday, November 26, 2019 12:59 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra is not showing a node up hours after restart
>>
>>
>>
>> Hi Paul,
>>
>>
>>
>> From the gossipinfo output, it looks like the node's IP address and 
>> rpc_address are different.
>>
>> /192.168.187.121 vs RPC_ADDRESS:192.168.185.121
>>
>> You can also see that there's a schema disagreement between nodes, e.g. 
>> schema_id on node001 is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801 and on node002 
>> it is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801.
>>
>> You can run nodetool describecluster to see it as well.
>>
>> So I suggest changing the rpc_address to the node's IP address, or setting 
>> it to 0.0.0.0, which should resolve the issue.
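Both observations can be checked mechanically from `nodetool gossipinfo` output. Below is a minimal sketch, assuming the output layout shown elsewhere in this thread (an endpoint line starting with `/`, then indented `KEY:value` lines); `rpc_mismatches` and `schema_versions` are hypothetical helper names. Note that a differing RPC address is not necessarily an error, since some setups put client traffic on a separate network on purpose.

```shell
#!/usr/bin/env bash
# Sketch: inspect `nodetool gossipinfo` output read from stdin.
# Assumes endpoint lines look like "/192.168.187.121" and fields like
# "  RPC_ADDRESS:192.168.185.121", as in the output posted in this thread.

# Print endpoints whose advertised RPC_ADDRESS differs from the gossip address.
rpc_mismatches() {
    awk '
        /^\// { endpoint = substr($0, 2); next }
        $1 ~ /^RPC_ADDRESS:/ {
            rpc = substr($1, 13)          # strip the "RPC_ADDRESS:" prefix
            if (rpc != endpoint)
                print endpoint " advertises RPC_ADDRESS " rpc
        }
    '
}

# Print the distinct schema versions seen in gossip; more than one line
# of output would indicate a schema disagreement.
schema_versions() {
    awk '$1 ~ /^SCHEMA:/ { print substr($1, 8) }' | sort -u
}
```

For example, `nodetool gossipinfo | rpc_mismatches` and `nodetool gossipinfo | schema_versions | wc -l` on each node.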
>>
>>
>>
>> Hope this helps!
>>
>>
>>
>>
>>
>> On Tue, Nov 26, 2019 at 4:05 AM Inquistive allen <inquial...@gmail.com> 
>> wrote:
>>
>> Hello ,
>>
>>
>>
>> Check and compare all of the following:
>>
>>
>>
>> 1. Java version should ideally match across all nodes in the cluster
>>
>> 2. Check if port 7000 is open between the nodes. Use telnet or nc commands
>>
>> 3. You should see some clues in the system logs as to why gossip is failing.
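For item 2, a quick reachability check could look like the sketch below. It assumes an `nc` that supports `-z` (connect without sending data) and `-w` (timeout in seconds), as the OpenBSD and traditional netcat variants do; `check_port` is a hypothetical helper name, and 7000 is the default `storage_port` (7001 if inter-node SSL is in use).

```shell
#!/usr/bin/env bash
# Sketch: test whether the inter-node (gossip/storage) port is reachable.
# check_port is a hypothetical helper; adjust the port if storage_port
# has been changed in cassandra.yaml.

check_port() {
    local host="$1" port="${2:-7000}"
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
        return 1
    fi
}
```

Run it from every node against every other node, e.g. `for h in 192.168.187.122 192.168.187.123 192.168.187.124; do check_port "$h"; done`, since a firewall rule may block only one direction.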
>>
>>
>>
>> Please confirm the above.
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> On Tue, 26 Nov, 2019, 2:50 AM Paul Mena, <pm...@whoi.edu> wrote:
>>
>> NTP was restarted on the Cassandra nodes, but unfortunately I’m still 
>> getting the same result: the restarted node does not appear to be rejoining 
>> the cluster.
>>
>>
>>
>> Here’s another data point: “nodetool gossipinfo”, when run from the 
>> restarted node (“node001”) shows a status of “normal”:
>>
>>
>>
>> user@node001=> nodetool -u gossipinfo
>> /192.168.187.121
>>   generation:1574364410
>>   heartbeat:209150
>>   NET_VERSION:8
>>   RACK:rack1
>>   STATUS:NORMAL,-104847506331695918
>>   RELEASE_VERSION:2.1.9
>>   SEVERITY:0.0
>>   LOAD:5.78684155614E11
>>   HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
>>   SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
>>   DC:datacenter1
>>   RPC_ADDRESS:192.168.185.121
>>
>>
>>
>> When run from one of the other nodes, however, node001’s status is shown as 
>> “shutdown”:
>>
>>
>>
>> user@node002=> nodetool gossipinfo
>> /192.168.187.121
>>   generation:1491825076
>>   heartbeat:2147483647
>>   STATUS:shutdown,true
>>   RACK:rack1
>>   NET_VERSION:8
>>   LOAD:5.78679987693E11
>>   RELEASE_VERSION:2.1.9
>>   DC:datacenter1
>>   SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
>>   HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
>>   RPC_ADDRESS:192.168.185.121
>>   SEVERITY:0.0
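The two views differ in more than STATUS: the generation is 1574364410 on node001 but 1491825076 as seen from node002. To compare such fields across nodes without reading the whole dump, something like the following sketch could be used; `gossip_field` is a hypothetical helper name, and it assumes the `gossipinfo` layout shown above.

```shell
#!/usr/bin/env bash
# Sketch: extract one field for one endpoint from `nodetool gossipinfo`
# output read on stdin. gossip_field is a hypothetical helper.

gossip_field() {
    local endpoint="$1" field="$2"
    awk -v ep="/$endpoint" -v key="$field" '
        # An endpoint header starts a new block; remember whether it is ours.
        /^\// { in_block = ($0 == ep); next }
        # Inside our block, print the value after "KEY:".
        in_block && index($1, key ":") == 1 { print substr($1, length(key) + 2) }
    '
}
```

For instance, `nodetool gossipinfo | gossip_field 192.168.187.121 generation` on each node. As far as I know the generation is normally derived from the node's start time in epoch seconds, so with GNU date `date -u -d @<generation>` gives a rough idea of which start each view refers to.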
>>
>>
>>
>>
>>
>> Paul Mena
>>
>> Senior Application Administrator
>>
>> WHOI - Information Services
>>
>> 508-289-3539
>>
>>
>>
>> From: Paul Mena
>> Sent: Monday, November 25, 2019 9:29 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Cassandra is not showing a node up hours after restart
>>
>>
>>
>> I’ve just discovered that NTP is not running on any of these Cassandra 
>> nodes, and that the timestamps are all over the map. Could this be causing 
>> my issue?
>>
>>
>>
>> user@remote=> ansible pre-prod-cassandra -a date
>> node001.intra.myorg.org | CHANGED | rc=0 >>
>> Mon Nov 25 13:58:17 UTC 2019
>>
>> node004.intra.myorg.org | CHANGED | rc=0 >>
>> Mon Nov 25 14:07:20 UTC 2019
>>
>> node003.intra.myorg.org | CHANGED | rc=0 >>
>> Mon Nov 25 13:57:06 UTC 2019
>>
>> node001.intra.myorg.org | CHANGED | rc=0 >>
>> Mon Nov 25 14:07:22 UTC 2019
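The readings above span 13:57:06 to 14:07:22, i.e. roughly 616 seconds between the slowest and fastest clock. A small sketch for putting a number on the skew follows; `max_skew` and `collect_clocks` are hypothetical helper names, and it assumes ssh access to each node. The hosts are queried one after another, so the result includes a little ssh latency and is only approximate.

```shell
#!/usr/bin/env bash
# Sketch: estimate clock skew across nodes.
# max_skew and collect_clocks are hypothetical helpers.

# Read "host epoch_seconds" lines on stdin and report the largest gap.
max_skew() {
    awk '
        NR == 1 || $2 < min { min = $2; min_host = $1 }
        NR == 1 || $2 > max { max = $2; max_host = $1 }
        END { if (NR) print (max - min) " s between " min_host " and " max_host }
    '
}

# Print "host epoch_seconds" for each host given (assumption: ssh works).
collect_clocks() {
    for h in "$@"; do
        echo "$h $(ssh "$h" date +%s)"
    done
}
```

For example, `collect_clocks node001 node002 node003 node004 | max_skew`.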
>>
>>
>>
>> Paul Mena
>>
>> Senior Application Administrator
>>
>> WHOI - Information Services
>>
>> 508-289-3539
>>
>>
>>
>> From: Inquistive allen <inquial...@gmail.com>
>> Sent: Monday, November 25, 2019 2:46 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra is not showing a node up hours after restart
>>
>>
>>
>> Hello team,
>>
>>
>>
>> Just to add to the discussion, one may run nodetool disablebinary, followed 
>> by nodetool disablethrift, followed by nodetool drain.
>>
>> nodetool drain also does the work of nodetool flush, and announces to the 
>> cluster that the node is down and not accepting traffic.
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> On Mon, 25 Nov, 2019, 12:55 AM Surbhi Gupta, <surbhi.gupt...@gmail.com> 
>> wrote:
>>
>> Before shutting Cassandra down, run nodetool drain first. As soon as you 
>> run nodetool drain, the other nodes will see this node as down and no new 
>> traffic will come to it.
>>
>> I generally leave a 10-second gap between nodetool drain and stopping Cassandra.
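Put together, the shutdown sequence suggested in these replies could be scripted roughly as below. This is only a sketch: `graceful_stop` is a hypothetical name, the commands are the ones mentioned in this thread, and it assumes a package install where `service cassandra stop` is the right way to stop the process.

```shell
#!/usr/bin/env bash
# Sketch: drain a node before stopping it, with a pause in between.
# graceful_stop is a hypothetical helper; the default 10 s gap follows
# the suggestion in the thread.

graceful_stop() {
    local gap="${1:-10}"
    # Stop accepting client traffic (native protocol, then Thrift),
    # flush and announce shutdown via drain, wait, then stop the service.
    # Each step runs only if the previous one succeeded.
    nodetool disablebinary &&
    nodetool disablethrift &&
    nodetool drain &&
    sleep "$gap" &&
    sudo service cassandra stop
}
```

Because the steps are chained with `&&`, a failed drain leaves the service running so the error can be looked at before anything is stopped.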
>>
>>
>>
>> On Sun, Nov 24, 2019 at 9:52 AM Paul Mena <pm...@whoi.edu> wrote:
>>
>> Thank you for the replies. I had made no changes to the config before the 
>> rolling restart.
>>
>>
>>
>> I can try another restart but was wondering if I should do it differently. I 
>> had simply done "service cassandra stop" followed by "service cassandra 
>> start".  Since then I've seen some suggestions to precede the shutdown with 
>> "nodetool disablegossip" and/or "nodetool drain". Are these commands 
>> advisable? Are any other commands recommended either before the shutdown or 
>> after the startup?
>>
>>
>>
>> Thanks again!
>>
>>
>>
>> Paul
>>
>> ________________________________
>>
>> From: Naman Gupta <naman.gu...@girnarsoft.com>
>> Sent: Sunday, November 24, 2019 11:18:14 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra is not showing a node up hours after restart
>>
>>
>>
>> Did you change the datacenter name or make any other config changes before the 
>> rolling restart?
>>
>>
>>
>> On Sun, Nov 24, 2019 at 8:49 PM Paul Mena <pm...@whoi.edu> wrote:
>>
>> I am in the process of doing a rolling restart on a 4-node cluster running 
>> Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service 
>> cassandra stop/start", and noted nothing unusual in either system.log or 
>> cassandra.log. Doing a "nodetool status" from node 1 shows all four nodes up:
>>
>>
>>
>> user@node001=> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address          Load       Tokens  Owns    Host ID                               Rack
>> UN  192.168.187.121  538.95 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
>> UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
>> UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
>> UN  192.168.187.124  625.05 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
>>
>> But running the same command from any of the other 3 nodes shows node 1 still 
>> down.
>>
>>
>>
>> user@node002=> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address          Load       Tokens  Owns    Host ID                               Rack
>> DN  192.168.187.121  538.94 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
>> UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
>> UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
>> UN  192.168.187.124  625.04 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
>>
>> Is there something I can do to remedy this current situation - so that I can 
>> continue with the rolling restart?
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
