On 14 April 2013 00:56, Rustam Aliyev <rustam.li...@code.az> wrote:

>  Just a follow-up on this issue. Due to the cost of shuffle, we decided
> not to do it. Recently, we added a new node and ended up with a not very
> well balanced cluster:
>
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load      Tokens  Owns   Host ID                               Rack
> UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
> UN  10.0.1.11  55.21 GB  256     9.4%   7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
> UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
> UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
> UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
> UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1
>
> It looks like the new node took an equal number of vnodes from each other
> node - which is good. However, it's not clear why it ended up owning only
> half as much as the other nodes.
>

I think this is expected behaviour when adding a node to a cluster that has
been upgraded to vnodes without shuffling.  The old nodes have equally
spaced contiguous tokens.  The new node will choose 256 random new tokens,
which will on average bisect the old ranges.  This means each token on the
new node will, on average, cover only half the range that the old tokens do.
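
To illustrate, here is a rough back-of-the-envelope simulation in Python
(not anything from the Cassandra code base - the ring size, node count and
token assignment are all assumed): five old nodes each hold 256 equally
spaced, contiguous tokens, and a new node picks 256 random tokens.

    import random

    RING = 2**64                  # token space size (assumed for illustration)
    VNODES = 256
    OLD_NODES = 5

    tokens = {}                   # token -> node name
    step = RING // (OLD_NODES * VNODES)
    for i in range(OLD_NODES * VNODES):
        # old node 0 holds the first 256 tokens, node 1 the next 256, etc.
        tokens[i * step] = "old-%d" % (i // VNODES)

    for _ in range(VNODES):       # the new node picks random tokens
        tokens[random.randrange(RING)] = "new"

    ring = sorted(tokens)
    owned = {}
    for prev, tok in zip([ring[-1] - RING] + ring, ring):
        # each token owns the range from the previous token up to itself
        owned[tokens[tok]] = owned.get(tokens[tok], 0) + (tok - prev)

    for node, size in sorted(owned.items()):
        print("%-6s owns %5.1f%%" % (node, 100.0 * size / RING))

The old nodes come out around 18-19% each and the new node around 9-10%,
which is roughly what nodetool reports above.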

However, the thing that really matters is the load, which is surprisingly
balanced at 55 GB.  This isn't guaranteed though - it could be about half
or it could be significantly more.  The problem with not doing the shuffle
is that the vnode that follows all the contiguous vnodes of a given node
becomes the target for the second replica of *all* of that node's vnodes.
E.g. if node A has tokens 10, 20, 30, 40, node B has tokens 50, 60, 70, 80
and node C (the new node) chooses token 45, it will store a replica of all
data stored in A's tokens.  This is exactly the same reason why tokens in a
multi-DC deployment need to be interleaved rather than contiguous.
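
To make that concrete, a small sketch (assuming SimpleStrategy-style
placement with RF=2, which is an assumption about your setup): walk the
ring clockwise from each token and place replicas on the next distinct
nodes.

    tokens = {10: "A", 20: "A", 30: "A", 40: "A",
              45: "C",                    # the new node's single token here
              50: "B", 60: "B", 70: "B", 80: "B"}
    ring = sorted(tokens)

    def replicas(token, rf=2):
        """Owner of the range ending at `token`, plus the next distinct nodes."""
        nodes, i = [], ring.index(token)
        while len(nodes) < rf:
            node = tokens[ring[i % len(ring)]]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

    for t in ring:
        print(t, replicas(t))

Every range owned by A (tokens 10-40) lists C as its second replica, so C
receives a copy of all of A's data, while B's ranges replicate onto A as
before.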

If shuffle isn't going to work, you could instead decommission each node
then bootstrap it in again.  In principle that copies your data twice as
much as strictly required (shuffle is optimal in terms of data transfer),
but some implementation details might make it more efficient.

Richard.
