On 14 April 2013 00:56, Rustam Aliyev <rustam.li...@code.az> wrote:
> Just a followup on this issue. Due to the cost of shuffle, we decided
> not to do it. Recently, we added new node and ended up in not well balanced
> cluster:
>
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load      Tokens  Owns   Host ID                               Rack
> UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
> UN  10.0.1.11  55.21 GB  256     9.4%   7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
> UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
> UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
> UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
> UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1
>
> It looks like new node took from each other node equal amount of vnodes -
> which is good. However, it's not clear why it decided to have twice less
> than other nodes.
>
I think this is expected behaviour when adding a node to a cluster that has
been upgraded to vnodes without shuffling. The old nodes have equally spaced,
contiguous tokens. The new node chooses 256 random new tokens, which will on
average bisect the old ranges. This means each of the new node's tokens will
on average cover only half the range of the old ones. However, the thing that
really matters is the load, which is surprisingly well balanced at 55 GB.
That isn't guaranteed though - it could be about half the others' load, or it
could be significantly more.

The problem with not doing the shuffle is that the vnode immediately after
all the contiguous vnodes of a given node becomes the target for the second
replica of *all* of that node's vnodes. E.g. if node A has tokens 10, 20, 30,
40, node B has tokens 50, 60, 70, 80 and node C (the new node) chooses token
45, C will store a replica of all data stored in A's tokens (see the sketch
at the end of this mail). This is exactly the same reason why tokens in a
multi-DC deployment need to be interleaved rather than contiguous.

If shuffle isn't going to work, you could instead decommission each node and
then bootstrap it back in. In principle that would copy roughly twice as much
data as necessary (shuffle is optimal in terms of data transfer), but some
implementation details might make it more efficient.

Richard.
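
PS: here is a rough Python sketch of the replica-placement point above, using
the made-up A/B/C tokens from my example. It models SimpleStrategy-style
"walk the ring to the next distinct node" placement; it is not Cassandra's
actual code, just an illustration of why C ends up holding the second replica
of every one of A's ranges.

# Toy model of SimpleStrategy-style replica placement (RF=2): the replicas
# for the range ending at a token are that token's owner plus the next
# distinct node found walking clockwise around the ring.
# Tokens and node names are the made-up values from the example above.

ring = {10: 'A', 20: 'A', 30: 'A', 40: 'A',   # A's contiguous vnodes
        45: 'C',                              # the new node's token
        50: 'B', 60: 'B', 70: 'B', 80: 'B'}   # B's contiguous vnodes
tokens = sorted(ring)

def replicas(token, rf=2):
    # Walk clockwise from this token, collecting rf distinct nodes.
    nodes = []
    i = tokens.index(token)
    while len(nodes) < rf:
        owner = ring[tokens[i % len(tokens)]]
        if owner not in nodes:
            nodes.append(owner)
        i += 1
    return nodes

for t in tokens:
    print(t, replicas(t))

# Every one of A's ranges (tokens 10-40) prints ['A', 'C'], because 45 is
# the first non-A token after them on the ring.

So C receives a second copy of everything A owns, which is the kind of
imbalance the shuffle (or a decommission and re-bootstrap) is meant to avoid.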