> TLDR, based on availability concerns, skew concerns, operational 
> concerns, and based on the fact that the new allocation algorithm can 
> be configured fairly simply now, this is a proposal to go with 4 as the 
> new default and the allocate_tokens_for_local_replication_factor set to 
> 3.  


I'm uncomfortable going with the default of `num_tokens: 4`.
I would rather see a default of `num_tokens: 16` based on the following…

a) 4 num_tokens does not provide a good out-of-the-box experience.
b) 4 num_tokens doesn't provide any significant streaming benefits over 16.
c) Edge-case availability doesn't trump (a) & (b).


For (a)…
 The first node in each rack, up to RF racks, in each datacenter can't use the 
allocation strategy. With 4 num_tokens, 3 racks and RF=3, the first three nodes 
will be poorly balanced. If three poorly balanced nodes in a cluster are an 
issue (because the cluster is small enough), then 4 is the wrong default. 
From our own experience, we have had to bootstrap these nodes multiple times 
until they generate something ok. In practice, 4 num_tokens (rather than 16) 
has caused more headaches with clients than it has gained.

Elaborating, 256 was originally chosen because the token randomness over that 
many tokens always averaged out. With a default of 
`allocate_tokens_for_local_replication_factor: 3` this issue is largely solved, 
but you will still have those initial nodes with randomly generated tokens. 
Ref: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/tokenallocator/ReplicationAwareTokenAllocator.java#L80
And to be precise: tokens are randomly generated until there is a node in each 
rack, up to RF racks. So with RF=3, in theory (or if you are a newbie) you 
could boot 100 nodes in only the first two racks, and they would all get random 
tokens regardless of the allocate_tokens_for_local_replication_factor setting.
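
To make that rule concrete, here is a toy model of it in Python. It is only a 
sketch of the behaviour as I describe it above, not the actual allocator code; 
the function name `random_token_nodes` and the rack labels are made up for 
illustration.

```python
def random_token_nodes(boot_order, rf):
    """Toy model: return the indices of nodes that get random tokens.

    boot_order is the rack name of each node, in the order the nodes join.
    Assumption (taken from the description above, not from the Cassandra
    source): a node gets random tokens if, at the time it joins, fewer than
    rf distinct racks already contain a node.
    """
    racks_seen = set()
    random_nodes = []
    for index, rack in enumerate(boot_order):
        if len(racks_seen) < rf:
            random_nodes.append(index)
        racks_seen.add(rack)
    return random_nodes


# Three racks, RF=3: only the first three nodes are random.
print(random_token_nodes(["r1", "r2", "r3", "r1", "r2", "r3"], rf=3))

# 100 nodes spread over only two racks, RF=3: every node is random.
print(len(random_token_nodes(["r1", "r2"] * 50, rf=3)))
```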

For example, using 4 num_tokens, 3 racks and RF=3…
 - in a 6 node cluster, there's a total of 24 tokens, half of which are random,
 - in a 9 node cluster, there's a total of 36 tokens, a third of which are 
random,
 - etc
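
The arithmetic behind those figures can be sketched as below. It assumes the 
simple case from the list (one datacenter, the first RF nodes land in distinct 
racks and are the only ones with random tokens); `random_token_share` is a 
hypothetical helper name, not anything in Cassandra.

```python
from fractions import Fraction


def random_token_share(nodes, num_tokens=4, rf=3):
    """Return (total tokens, random tokens, random share) for one datacenter.

    Assumption: exactly the first min(nodes, rf) nodes have randomly
    generated tokens, and every later node uses the allocation algorithm.
    """
    total = nodes * num_tokens
    random_tokens = min(nodes, rf) * num_tokens
    return total, random_tokens, Fraction(random_tokens, total)


for n in (6, 9):
    total, random_tokens, share = random_token_share(n)
    print(f"{n} nodes: {total} tokens, {random_tokens} random ({share})")
# prints:
# 6 nodes: 24 tokens, 12 random (1/2)
# 9 nodes: 36 tokens, 12 random (1/3)
```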

Following this logic, I would not be willing to use 4 unless you know there 
will be more than 36 nodes in each data centre, i.e. less than ~8% of your 
tokens are randomly generated. Many clusters aren't that size, and IMHO that's 
why 4 is a bad default. 

A default of 16, by the same logic, needs only 9 nodes in each DC to overcome 
that degree of randomness.

The workaround to all this is having to manually define `initial_token: …` on 
those initial nodes. I'm really not keen on imposing that on new users.

For (b)…
 there have already been a number of streaming improvements that remove much 
of whatever difference there is between 4 and 16 num_tokens. And 4 num_tokens 
means bigger token ranges, so it could well be disadvantageous due to 
over-streaming.

For (c)…
 we are trying to optimise availability in situations where we can never 
guarantee availability. I understand it's a nice operational advantage to have 
in a shit-show, but it's not something you can design a system around and rely 
upon. There's also the question of availability vs the size of the token range 
that becomes unavailable.



regards,
Mick

