I had a 5-node cluster, then increased it to 6, then to 7, then to 8, then
went back to 7. I installed 3.11.6 back when num_tokens defaulted to 256,
so, as far as I understand, it should scale out to new nodes very well
(at the cost of longer repairs), but I get this status:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load        Tokens  Owns
UN  node1    1.08 TiB    256     46.4%
UN  node2    1.06 TiB    256     45.8%
UN  node3    1.02 TiB    256     45.1%
UN  node4    1.01 TiB    256     46.6%
UN  node5    994.92 GiB  256     44.0%
UN  node7    1.04 TiB    256     38.1%
UN  node8    882.03 GiB  256     33.9%
(I renamed the nodes and sorted them by the date they joined the cluster;
note that node6 was decommissioned and later replaced by node8.)
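(The 256 in the Tokens column is the vnode setting I mentioned, i.e.
cassandra.yaml on every node still has the old default; just to recap,
something like this, with the path depending on how Cassandra was
installed:)

% grep '^num_tokens' /etc/cassandra/cassandra.yaml   # path varies by install
num_tokens: 256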
This is a Prometheus+Grafana graph of a new table being populated (the
table was created when the cluster was already stable with node8):
https://i.imgur.com/CLDLENU.png
I don't understand why node7 (in blue) and node8 (in red) end up with far
less data than the others (as consistently reported both by "Owns" and by
the graph).
PS: the purple node at the top is the disaster recovery node in a remote
location; it is a single node rather than a full cluster, so it is
expected to have much more load than the others.
I tried summing all the token ranges from "nodetool ring" and they are
quite balanced (as expected with 256 virtual tokens, I guess):
% nodetool ring | awk '
    /^=/     { prev = -1 }                      # new datacenter section: reset
    /^[0-9]/ { ip = $1; pos = $8                # $8 = the token
               if (prev != -1) host[ip] += pos - prev
               prev = pos }
    END      { tot = 0; for (ip in host) if (ip != "nodeDR") tot += host[ip]
               for (ip in host) print host[ip] / tot, ip }'
0.992797 nodeDR
0.146039 node1
0.148853 node2
0.139175 node3
0.135932 node4
0.140542 node5
0.143875 node7
0.145583 node8
(yes, I know it has a slight bias because it doesn't handle the first
line of each section correctly, but that's less than 0.8%)
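For what it's worth, here is an untested sketch of the same computation
without that bias; it assumes the default Murmur3Partitioner (tokens
spanning -2^63 .. 2^63-1) and credits the wrap-around range of each
datacenter section to that section's first node:

% nodetool ring | awk '
    BEGIN    { span = 2^64 }       # full Murmur3 token space (assumption: default partitioner)
    function wrap() { if (n) host[first] += span - (prev - firsttok) }
    /^=/     { wrap(); n = 0 }     # new datacenter section: close the previous one
    /^[0-9]/ { if (n == 0) { first = $1; firsttok = $8 }
               if (n > 0)  host[$1] += $8 - prev
               prev = $8; n++ }
    END      { wrap()
               for (ip in host) printf "%.6f %s\n", host[ip] / span, ip }'

With this version each datacenter sums to exactly 1, so nodeDR alone
would simply read 1.000000.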
It's true that node8, being newer, probably has less "extra data", but
after adding it, and after waiting for Reaper to repair all tables, I ran
"nodetool cleanup" on all the other nodes, so that shouldn't be it.
Oh, and the tables that account for 99.9% of the used space (including
the one in the graph above) have millions of records and a timeuuid
inside the partition key, so they should distribute perfectly well across
all tokens.
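To be concrete, they all look more or less like this (keyspace, table and
column names are made up; only the shape of the key matters):

% cqlsh -e "
    CREATE TABLE demo_ks.demo_events (
        source  text,
        id      timeuuid,
        payload blob,
        -- the timeuuid is part of the composite partition key,
        -- so rows hash uniformly over the whole token ring
        PRIMARY KEY ((source, id))
    );"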
Is there any other reason for the load imbalance that I haven't thought of?
Is there a way to force things back to normal?
--
Lapo Luchini
l...@lapo.it