The Load figure in nodetool status always has issues, so I wouldn't count
on it, although in this case it does seem to roughly line up with the
Owns % here. Are you sure your ring calculation is accurate? It doesn't
really agree with the Owns % for the 33% node, and it is feasible
(although unlikely) to end up with a node whose randomly generated tokens
own very little of the ring, which would produce exactly this scenario.
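If you want a quick cross-check, asking nodetool for a specific keyspace
makes Owns % an effective-ownership figure you can compare directly (the
keyspace name below is just a placeholder):

    # effective ownership (replication taken into account) for one keyspace
    nodetool status my_keyspace

    # primary token ranges and their endpoints, if you want to redo the sum
    nodetool describering my_keyspace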

As a sanity check, what is the RF of the keyspace, and are all these nodes
in a single rack? Owns % is effective ownership (i.e. it includes
replicas), whereas your awk sum only counts primary ranges, so replication
settings and rack placement can make the two diverge.
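A couple of quick ways to check those (keyspace name is again a
placeholder):

    # replication class and RF of the keyspace
    cqlsh -e "DESCRIBE KEYSPACE my_keyspace" | grep replication

    # DC and rack of each node (Rack column / Rack line)
    nodetool status
    nodetool info | grep -i rack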

If you want to fix it you'd be looking at decommissioning and
re-provisioning the node, ideally using the token allocation algorithm on
the new node to force C* to pick tokens that balance the ring. However, if
it's not a huge issue you could just live with it for the moment - if you
expand the cluster in the future the balance will probably improve
slightly as you add more nodes anyway.
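Roughly, the rebuild would look something like the below - the keyspace
name is a placeholder, and on 3.11 the token allocation algorithm is
driven by allocate_tokens_for_keyspace:

    # 1. remove the unbalanced node from the ring
    nodetool decommission        # run on the node being rebuilt

    # 2. wipe its data/commitlog/saved_caches directories, then in its
    #    cassandra.yaml set (num_tokens can stay as it is):
    #      allocate_tokens_for_keyspace: my_keyspace

    # 3. start it again and let it bootstrap, then on the other nodes:
    nodetool cleanup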


raft.so - Cassandra consulting, support, and managed services

On Wed., 3 Mar. 2021, 18:27 Lapo Luchini, <l...@lapo.it> wrote:

> I had a 5-node cluster, then increased it to 6, then 7, then 8, and then
> went back to 7. I installed 3.11.6 back when num_tokens defaulted to 256,
> so as far as I understand it should (at the expense of long repairs) have
> an excellent capacity to scale to new nodes, but I get this status:
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address            Load         Tokens  Owns
> UN  node1              1.08 TiB     256     46.4%
> UN  node2              1.06 TiB     256     45.8%
> UN  node3              1.02 TiB     256     45.1%
> UN  node4              1.01 TiB     256     46.6%
> UN  node5              994.92 GiB   256     44.0%
> UN  node7              1.04 TiB     256     38.1%
> UN  node8              882.03 GiB   256     33.9%
>
> (I renamed nodes and sorted them to represent the date they entered the
> cluster; notice node6 was decommissioned and later replaced by node8)
>
> This is a Prometheus+Grafana graph of the process of population of a new
> table (created when the cluster was already stable with node8):
>
> https://i.imgur.com/CLDLENU.png
>
> I don't understand why node7 (in blue) and node8 (in red) carry far less
> data than the others (as consistently reported both by "owns" and by the
> graph).
> PS: the purple node at the top is the disaster-recovery node in a remote
> location; it is a single node rather than a cluster, so it's expected
> that it has far more load than the others.
>
> I tried summing all the token ranges from "nodetool ring" and they are
> quite balanced (as expected with 256 vnodes per node, I guess):
>
> % nodetool ring | awk '/^=/ { prev = -1 } /^[0-9]/ { ip = $1; pos = $8;
> if (prev != -1) host[ip] +=  pos - prev; prev= pos; } END { tot = 0; for
> (ip in host) if (ip != "nodeDR") tot += host[ip]; for (ip in host) print
> host[ip] / tot, ip; }'
> 0.992797 nodeDR
> 0.146039 node1
> 0.148853 node2
> 0.139175 node3
> 0.135932 node4
> 0.140542 node5
> 0.143875 node7
> 0.145583 node8
> (yes, I know it has a slight bias because it doesn't handle the first
> line correctly, but that's less than 0.8%)
>
> It's true that node8, being newer, probably has less "extra data", but
> after adding it and waiting for Reaper to repair all tables I ran
> "nodetool cleanup" on all the other nodes, so that shouldn't be it.
>
> Oh, the tables that account for 99.9% of the used space (including the
> one in the graph above) have millions of records and have a timeuuid
> inside the partition key, so they should distribute perfectly well
> across all tokens.
>
> Is there any other reason for the load imbalance that I didn't think of?
> Is there a way to force things back to normal?
>
> --
> Lapo Luchini
> l...@lapo.it
>
>
