Was more than one node added to the cluster at the same time? I.e. did you start a new node that joined the cluster without waiting for the previous node to finish joining? This can happen if you don't have "serial: 1" in your Ansible play, or don't have a proper wait between nodes.
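
As a rough illustration only (the host group, service name, and the way of detecting "finished joining" below are my assumptions, not a drop-in play), something along these lines starts one node at a time and waits for each to show as UN before moving on:

    - hosts: cassandra
      serial: 1                      # roll out one node at a time
      tasks:
        - name: Start the Cassandra service
          service:
            name: cassandra
            state: started

        - name: Wait until this node shows as UN (Up/Normal) in nodetool status
          shell: "nodetool status | grep -w {{ ansible_default_ipv4.address }} | grep -q '^UN'"
          register: join_check
          until: join_check.rc == 0
          retries: 60
          delay: 10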

Removing the data directory will remove the node ID, so the node will join the cluster as a brand-new node, which resolves the duplicate node ID issue.
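
For example, something like this on each affected node (the paths and service name are assumptions based on a default package install, so adjust them to your layout):

    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data \
                /var/lib/cassandra/commitlog \
                /var/lib/cassandra/saved_caches
    sudo systemctl start cassandra
    nodetool status    # the node should come back with a new Host ID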

Regarding the seed nodes, you should not make all nodes seed nodes, with the only exception being a very small cluster (<= 3 nodes). Assuming you have more than 1 DC, using 1 or 2 seed nodes per DC is fairly reasonable. Even just 1 per DC is still reliable in a multi-DC setup, as the following 3 things must all happen at the same time for it to fail: 1. a network partition affecting the DC; 2. a seed node failure in the same DC; and 3. starting a node in that DC. Even then, the problem goes away automatically once the network connectivity between DCs is restored or the seed node comes back. If you have multiple racks per DC, you can also consider 1 seed node per rack as an alternative.
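
For illustration, with 2 seeds per DC the cassandra.yaml on every node would contain something like this (the addresses are just two arbitrarily chosen nodes from each of your DCs in the `nodetool status` output below):

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          # 2 seeds in BA and 2 in DR1, same list on every node
          - seeds: "10.1.146.197,10.1.146.186,10.1.146.137,10.1.146.141"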

Determining the DC & rack depends on your server provider, and tokens per node depends on the hardware differences between nodes (is it just the disk? or RAM and CPU too?). There's no one-size-fits-all solution; use your own judgement.
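
For example, if you use GossipingPropertyFileSnitch, the DC and rack are read from cassandra-rackdc.properties on each node, and the token count from num_tokens in cassandra.yaml, so your automation can template both per host (the values below just mirror your BA/SSW09 example and 16 tokens -- adjust as needed):

    # cassandra-rackdc.properties
    dc=BA
    rack=SSW09

    # cassandra.yaml
    num_tokens: 16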

On 06/06/2022 09:54, Marc Hoppins wrote:
No. It transpires that, after seeing errors when running a start.yml for Ansible, I decided to start all nodes again, and when starting, some assumed the same ID as others.

I resolved this by shutting down the service on the affected nodes, removing the data dirs (these are all new nodes: no data) and restarting the service, one by one, making sure that the new node appeared before starting another.

All are now alive and kicking (copyright Simple Minds).

Vis. seeds: given my setup only has a small number of nodes, I used 1 node out of 4 as a seed. I have seen folk suggesting every node (sounds excessive), 1 per datacentre (seems unreliable), and also 3 seeds per datacentre... which could be adequate if they are not all in the same rack (which mine currently are). What suggestions/best practice? 2 per switch/rack for failover, or just go with a set number per datacentre?

For automated install: how do you go about resolving dc & rack, and tokens per 
node (if the hardware varies)?

Marc

-----Original Message-----
From: Bowen Song <bo...@bso.ng>
Sent: Saturday, June 4, 2022 3:10 PM
To: user@cassandra.apache.org
Subject: Re: Cluster & Nodetool



That sounds like something caused by duplicated node IDs (the Host ID column in `nodetool status`). Did you by any chance copy the Cassandra data directory between nodes? (e.g. spinning up a new node from a VM snapshot that contains a non-empty data directory)

On 03/06/2022 12:38, Marc Hoppins wrote:
Hi all,

Am new to Cassandra.  Just finished installing on 22 nodes across 2 datacentres.

If I run nodetool describecluster  I get

Stats for all nodes:
          Live: 22
          Joining: 0
          Moving: 0
          Leaving: 0
          Unreachable: 0

Data Centers:
          BA #Nodes: 9 #Down: 0
          DR1 #Nodes: 8 #Down: 0

There should be 12 in BA and 10 in DR1.  The service is running on these other 
nodes...yet nodetool status also only shows the above numbers.

Datacenter: BA
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.197  304.72 KiB  16      11.4%             26d5a89c-aa8f-4249-b2b5-82341cc214bc  SSW09
UN  10.1.146.186  245.02 KiB  16      9.0%              29f20519-51f9-493c-b891-930762d82231  SSW09
UN  10.1.146.20   129.53 KiB  16      12.5%             f90dd318-1357-46ca-9870-807d988658b3  SSW09
UN  10.1.146.200  150.31 KiB  16      11.1%             c544e85a-c2c5-4afd-aca8-1854a1723c2f  SSW09
UN  10.1.146.17   185.9 KiB   16      11.7%             db9d9856-3082-44a8-b292-156da1a17d0a  SSW09
UN  10.1.146.174  288.64 KiB  16      12.1%             03126eba-8b58-4a96-80ca-10cec2e18e69  SSW09
UN  10.1.146.199  146.71 KiB  16      13.7%             860d6549-94ab-4a07-b665-70ea7e53f41a  SSW09
UN  10.1.146.78   69.05 KiB   16      11.5%             7d9fdbab-40b0-4a9e-b0c9-4ffa822c42fd  SSW09
UN  10.1.146.67   304.5 KiB   16      13.6%             48e9eba2-9112-4d91-8f26-8272cb5ce7bc  SSW09

Datacenter: DR1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.137  209.33 KiB  16      12.6%             f65c685f-048c-41de-85e4-308c4b84d047  SSW02
UN  10.1.146.141  237.21 KiB  16      9.8%              847ad921-fceb-4cef-acec-1c918d2a6517  SSW02
UN  10.1.146.131  311.05 KiB  16      11.7%             7263f6c6-c4d6-438e-8ee7-d07666242ba0  SSW02
UN  10.1.146.139  283.33 KiB  16      11.5%             264cbe47-acb4-49cc-97d0-6f9e2cee6844  SSW02
UN  10.1.146.140  258.46 KiB  16      11.6%             43dbbe91-5dac-4c3a-9df5-2f5ccf268eb6  SSW02
UN  10.1.146.132  157.03 KiB  16      12.3%             1c0cb23c-af78-4fa2-bd92-20fa7d39ec30  SSW02
UN  10.1.146.135  301.13 KiB  16      11.2%             26159fbe-cf78-4c94-88e0-54773bcf7bed  SSW02
UN  10.1.146.130  305.16 KiB  16      12.5%             d6d6c490-551d-4a97-a93c-3b772b750d7d  SSW02

So I restarted the service on one of the missing addresses. It appeared in the list, but one other dropped off. I tried this several times. It seems I can only get 9 and 8, not 12 and 10.

Anyone have an idea why this may be so?

Thanks

Marc
