Was more than one node added to the cluster at the same time? I.e., did
you start a new node, which would then begin joining the cluster, without
waiting for the previous node to finish joining? This can happen if you
don't have "serial: 1" in your Ansible playbook, or don't have a proper wait
between nodes.
Removing the data directory removes the node ID, so the node will then
join the cluster as a brand new node, which will resolve the duplicate
node ID issue.
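For the "serial: 1" part, a rough sketch of what that play could look like
(the host group name, the service name and the netstats-based wait are my
assumptions, adjust to your own playbook):

    - hosts: cassandra
      serial: 1                      # start/add one node at a time
      tasks:
        - name: Start Cassandra
          ansible.builtin.service:
            name: cassandra
            state: started
        - name: Wait until the node has finished joining (Mode NORMAL in nodetool netstats)
          ansible.builtin.shell: nodetool netstats | grep Mode
          register: mode
          until: "'NORMAL' in mode.stdout"
          retries: 60
          delay: 10

Anything equivalent that blocks until the new node shows up as UN in
`nodetool status` works just as well.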
Regarding the seed nodes, you should not make all nodes seed nodes, the
only exception being a very small cluster (<= 3 nodes). Assuming you have
more than 1 DC, using 1 or 2 seed nodes per DC is fairly reasonable. Even
1 per DC is still reliable in a multi-DC setup, because the following 3
things must all happen at the same time for it to fail: 1. a network
partition affecting the DC; 2. a seed node failure in the same DC; and
3. starting a node in the same DC. Even then, the problem goes away
automatically once the network connectivity between DCs is restored or
the seed node comes back. If you have multiple racks per DC, you can
also consider 1 seed node per rack as an alternative.
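For reference, a 1-seed-per-DC setup in your 2-DC cluster would mean every
node's cassandra.yaml carries the same two-entry seed list, roughly like
this (the IPs are placeholders; newer Cassandra versions also expect the
storage port after each address):

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "<seed-ip-in-BA>,<seed-ip-in-DR1>"

Keep the list identical on every node, including the seed nodes themselves.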
Determining the DC & rack depends on your server provider, and tokens
per node depends on the hardware differences between nodes (is it just
the disk? or RAM and CPU too?). There's no one-size-fits-all solution;
use your own judgement.
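If you are using GossipingPropertyFileSnitch, the DC and rack are simply
whatever your automation writes into cassandra-rackdc.properties on each
node, and the token count is num_tokens in cassandra.yaml, so both can be
templated per host. A sketch, reusing your BA/SSW09 values purely as an
example:

    # cassandra-rackdc.properties
    dc=BA
    rack=SSW09

    # cassandra.yaml - e.g. give nodes with larger disks proportionally more tokens
    num_tokens: 16

Note that num_tokens only takes effect when a node first joins the cluster;
changing it later means removing and re-adding the node.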
On 06/06/2022 09:54, Marc Hoppins wrote:
No. It transpires that, after seeing errors when running a start.yml for
Ansible, I decided to start all nodes again, and on starting, some assumed the
same ID as others.
I resolved this by shutting down the service on the affected nodes, removing
the data dirs (these are all new nodes: no data) and restarting the service,
one by one, making sure that the new node appeared before starting another.
All are now alive and kicking (copyright Simple Minds).
Viz. seeds: given my setup only has a small number of nodes, I used 1 node out
of 4 as a seed. I have seen folk suggesting every node (sounds excessive), 1
per datacentre (seems unreliable), and also 3 seeds per datacentre...which
could be adequate if they are not all in the same rack (which mine currently are).
What are the suggestions/best practices? 2 per switch/rack for failover, or just
go with a set number per datacentre?
For automated install: how do you go about resolving dc & rack, and tokens per
node (if the hardware varies)?
Marc
-----Original Message-----
From: Bowen Song <bo...@bso.ng>
Sent: Saturday, June 4, 2022 3:10 PM
To: user@cassandra.apache.org
Subject: Re: Cluster & Nodetool
That sounds like something caused by duplicated node IDs (the Host ID column in
`nodetool status`). Did you by any chance copy the Cassandra data directory
between nodes? (e.g. spinning up a new node from a VM snapshot that contains a
non-empty data directory)
On 03/06/2022 12:38, Marc Hoppins wrote:
Hi all,
Am new to Cassandra. Just finished installing on 22 nodes across 2 datacentres.
If I run nodetool describecluster I get
Stats for all nodes:
Live: 22
Joining: 0
Moving: 0
Leaving: 0
Unreachable: 0
Data Centers:
BA #Nodes: 9 #Down: 0
DR1 #Nodes: 8 #Down: 0
There should be 12 in BA and 10 in DR1. The service is running on these other
nodes...yet nodetool status also only shows the above numbers.
Datacenter: BA
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.197  304.72 KiB  16      11.4%             26d5a89c-aa8f-4249-b2b5-82341cc214bc  SSW09
UN  10.1.146.186  245.02 KiB  16      9.0%              29f20519-51f9-493c-b891-930762d82231  SSW09
UN  10.1.146.20   129.53 KiB  16      12.5%             f90dd318-1357-46ca-9870-807d988658b3  SSW09
UN  10.1.146.200  150.31 KiB  16      11.1%             c544e85a-c2c5-4afd-aca8-1854a1723c2f  SSW09
UN  10.1.146.17   185.9 KiB   16      11.7%             db9d9856-3082-44a8-b292-156da1a17d0a  SSW09
UN  10.1.146.174  288.64 KiB  16      12.1%             03126eba-8b58-4a96-80ca-10cec2e18e69  SSW09
UN  10.1.146.199  146.71 KiB  16      13.7%             860d6549-94ab-4a07-b665-70ea7e53f41a  SSW09
UN  10.1.146.78   69.05 KiB   16      11.5%             7d9fdbab-40b0-4a9e-b0c9-4ffa822c42fd  SSW09
UN  10.1.146.67   304.5 KiB   16      13.6%             48e9eba2-9112-4d91-8f26-8272cb5ce7bc  SSW09
Datacenter: DR1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.137  209.33 KiB  16      12.6%             f65c685f-048c-41de-85e4-308c4b84d047  SSW02
UN  10.1.146.141  237.21 KiB  16      9.8%              847ad921-fceb-4cef-acec-1c918d2a6517  SSW02
UN  10.1.146.131  311.05 KiB  16      11.7%             7263f6c6-c4d6-438e-8ee7-d07666242ba0  SSW02
UN  10.1.146.139  283.33 KiB  16      11.5%             264cbe47-acb4-49cc-97d0-6f9e2cee6844  SSW02
UN  10.1.146.140  258.46 KiB  16      11.6%             43dbbe91-5dac-4c3a-9df5-2f5ccf268eb6  SSW02
UN  10.1.146.132  157.03 KiB  16      12.3%             1c0cb23c-af78-4fa2-bd92-20fa7d39ec30  SSW02
UN  10.1.146.135  301.13 KiB  16      11.2%             26159fbe-cf78-4c94-88e0-54773bcf7bed  SSW02
UN  10.1.146.130  305.16 KiB  16      12.5%             d6d6c490-551d-4a97-a93c-3b772b750d7d  SSW02
So I restarted the service on one of the missing addresses. It appeared in the
list, but one other dropped off. I tried this several times. It seems I can
only get 9 and 8, not 12 and 10.
Anyone have an idea why this may be so?
Thanks
Marc