The load calculation always has issues, so I wouldn't count on it, although in this case it does seem to roughly line up. Are you sure your ring calculation is accurate? It doesn't really line up with the "Owns" percentage for the 33% node, and it is feasible (although unlikely) that you rolled a node with a bunch of useless tokens and ended up in this scenario.
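One way to cross-check both numbers: "nodetool status <your_keyspace>" (the keyspace name is a placeholder for whichever keyspace holds the bulk of the data) reports effective, RF-and-rack-weighted ownership, while your awk reports raw token ownership; you can also remove the wrap-around bias by closing the ring at each datacenter boundary. Rough, untested sketch along the lines of your one-liner:

  # effective ownership, taking replication strategy and racks into account
  # (<your_keyspace> is a placeholder)
  nodetool status <your_keyspace>

  # raw token ownership, attributing the wrap-around range to the first node of each DC
  # (awk doubles lose a few bits on 64-bit tokens, which is fine for rough percentages)
  nodetool ring | awk '
    function wrap() { if (prev != "") host[firstip] += (first - prev) + 2^64 }
    /^=/     { wrap(); prev = ""; first = "" }
    /^[0-9]/ { ip = $1; pos = $8
               if (first == "") { first = pos; firstip = ip }
               if (prev != "") host[ip] += pos - prev
               prev = pos }
    END      { wrap()
               for (ip in host) if (ip != "nodeDR") tot += host[ip]
               for (ip in host) printf "%.6f %s\n", host[ip] / tot, ip }'

If the raw numbers stay flat but the effective ownership doesn't, the skew is coming from replica placement (RF and racks) rather than from the tokens themselves.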
As a sanity check, what is the RF of the keyspace, and are all these nodes in a single rack? If you want to fix it you'd be looking at decommissioning and reprovisioning, potentially using the token allocation algorithm on the new node to force C* to select tokens that will balance the ring (a rough sketch of those steps is at the bottom of this mail). However, if it's not a huge issue you could just live with it for the moment - if you expand the cluster in the future you can probably expect the tokens to get a slightly better balance as you add more nodes.

raft.so - Cassandra consulting, support, and managed services

On Wed., 3 Mar. 2021, 18:27 Lapo Luchini, <l...@lapo.it> wrote:
> I had a 5 nodes cluster, then increased to 6, then to 7, then to 8, then
> back to 7. I installed 3.11.6 back when node_tokens defaulted to 256, so
> as far as I understand at the expense of long repairs it should have an
> excellent capacity to scale to new nodes, but I get this status:
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load        Tokens  Owns
> UN  node1    1.08 TiB    256     46.4%
> UN  node2    1.06 TiB    256     45.8%
> UN  node3    1.02 TiB    256     45.1%
> UN  node4    1.01 TiB    256     46.6%
> UN  node5    994.92 GiB  256     44.0%
> UN  node7    1.04 TiB    256     38.1%
> UN  node8    882.03 GiB  256     33.9%
>
> (I renamed nodes and sorted them to represent the date they entered the
> cluster; notice node6 was decommissioned and later replaced by node8)
>
> This is a Prometheus+Grafana graph of the process of population of a new
> table (created when the cluster was already stable with node8):
>
> https://i.imgur.com/CLDLENU.png
>
> I don't understand why node7 (in blue) and node8 (in red) are way less
> loaded with data than the others.
> (as correctly reported both by "owns" and the graph)
> PS: the purple node at the top is the disaster recovery node in a remote
> location, and is alone instead of being a cluster, so it's right that it
> has way more load than the others.
>
> I tried summing all the token ranges from "nodetool ring" and they are
> quite balanced (as expected with 256 virtual tokens, I guess):
>
> % nodetool ring | awk '/^=/ { prev = -1 } /^[0-9]/ { ip = $1; pos = $8;
> if (prev != -1) host[ip] += pos - prev; prev= pos; } END { tot = 0; for
> (ip in host) if (ip != "nodeDR") tot += host[ip]; for (ip in host) print
> host[ip] / tot, ip; }'
> 0.992797 nodeDR
> 0.146039 node1
> 0.148853 node2
> 0.139175 node3
> 0.135932 node4
> 0.140542 node5
> 0.143875 node7
> 0.145583 node8
> (yes I know it has a slight bias because it doesn't manage correctly the
> first line, but that's less than 0.8%)
>
> It's true that node8 being newer probably has less "extra data", but
> after adding it and after waiting for Reaper to repair all tables, I did
> "nodetool cleanup" on all other nodes, so that shouldn't be it.
>
> Oh, the tables that account for 99.9% of the used space (included the
> one in the graph above) have millions of records and have a timeuuid
> inside the partition key, so they should distribute perfectly well among
> all tokens.
>
> Is there any other reason for the load unbalance I didn't think of?
> Is there a way to force things back to normal?
>
> --
> Lapo Luchini
> l...@lapo.it
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
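P.S. if you do go down the decommission-and-reprovision route, the moving parts would look roughly like the sketch below; untested, assumes 3.11.x, and <your_keyspace> is a placeholder for the keyspace whose replication settings should drive token selection:

  # 1. on the node being rebuilt: stream its data off and leave the ring
  nodetool decommission

  # 2. stop Cassandra on it, wipe the data/commitlog/saved_caches directories, and
  #    set this in cassandra.yaml before re-bootstrapping, so tokens are chosen for
  #    balance rather than randomly:
  num_tokens: 256
  allocate_tokens_for_keyspace: <your_keyspace>   # placeholder keyspace name

  # 3. once it has finished bootstrapping, reclaim space on every *other* node
  nodetool cleanup

The allocation algorithm only optimises against the replication of that one keyspace, so pick the keyspace that owns most of the data.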