While I (mostly) understand the maths behind using 4 vnodes as a default
(which really is a question of extreme availability), I don't think they
provide noticeable performance improvements over using 16, while 16 vnodes
will protect folks from imbalances. Unbalanced clusters are very hard to
deal with, and people often only start addressing the imbalance once some
nodes are already close to full. Operationally, it's far from trivial.
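To illustrate why balance is a concern, here is a minimal Python sketch (my own simplification, not Cassandra's actual allocator) that simulates purely random token assignment and measures the worst node's ownership relative to ideal, for 4 versus 16 vnodes:

```python
import random

RING = 2 ** 64  # size of the Murmur3 token ring (simplified)

def max_ownership_ratio(n_nodes, vnodes, rng):
    """Assign `vnodes` random tokens per node and return
    max node ownership / ideal share (1.0 = perfect balance)."""
    tokens = sorted(
        (rng.randrange(RING), node)
        for node in range(n_nodes)
        for _ in range(vnodes)
    )
    owned = [0] * n_nodes
    for i, (tok, node) in enumerate(tokens):
        prev = tokens[i - 1][0]  # i == 0 wraps to the last token
        owned[node] += (tok - prev) % RING
    return max(owned) / (RING / n_nodes)

def avg_imbalance(vnodes, n_nodes=12, trials=200, seed=42):
    rng = random.Random(seed)
    return sum(max_ownership_ratio(n_nodes, vnodes, rng)
               for _ in range(trials)) / trials

print(avg_imbalance(4), avg_imbalance(16))
```

With purely random placement, fewer vnodes leave a noticeably higher worst-case ownership, while more vnodes average the variance out — which is the imbalance-protection argument for 16.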
We're going to run some experiments bootstrapping clusters with 4 tokens
on the latest alpha to see how much balance we can expect, and how
removing one node would impact it.

If we're talking about repairs, using 4 vnodes will generate overstreaming,
which can create serious performance issues. Even on clusters with a node
density of 500GB, we never use fewer than ~15 segments per node with
Reaper.
Not everyone uses Reaper, obviously, and there will be no protection
against overstreaming with such a low default for folks not using subrange
repairs.
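The subrange idea itself is simple to sketch. The snippet below is an illustrative Python sketch (not Reaper's actual implementation) of splitting one token range on the ring into N contiguous segments, so each repair session only touches a bounded slice:

```python
RING = 2 ** 64  # Murmur3 token ring size (simplified)

def split_range(start, end, segments):
    """Split the ring range (start, end] into `segments`
    contiguous subranges, handling ring wrap-around."""
    width = (end - start) % RING or RING  # full ring if start == end
    step = width // segments
    bounds = [(start + i * step) % RING for i in range(segments)]
    bounds.append(end)  # last segment absorbs any rounding remainder
    return list(zip(bounds, bounds[1:]))

# e.g. split a range that wraps past the ring's end into 4 segments
segs = split_range(RING - 100, 100, 4)
```

Repairing each subrange separately is what bounds overstreaming: a mismatch only triggers streaming for that small slice, rather than for a whole token range — which, with only 4 vnodes, is a large one.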
On small clusters, even with 256 vnodes, using Cassandra 3.0/3.x and Reaper
already makes it possible to get good repair performance, because token
ranges sharing the exact same replicas will be processed in a single repair
session. On large clusters, I reckon it's good to have far fewer vnodes to
speed up repairs.

Cassandra 4.0 is supposed to provide a rock-solid, stable release of
Cassandra, fixing past instabilities, and I think lowering the default to 4
tokens defeats that purpose.
16 tokens is a reasonable compromise for clusters of all sizes, without
being too aggressive. Those with enough C* experience can still lower that
number for their clusters.
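For reference, the two cassandra.yaml settings being debated, shown with the values argued for here (a sketch of the relevant fragment, not a statement of the project's final defaults):

```yaml
# cassandra.yaml (fragment)
num_tokens: 16
allocate_tokens_for_local_replication_factor: 3
```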

Cheers,

-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Fri, Jan 31, 2020 at 1:41 PM Mick Semb Wever <m...@apache.org> wrote:

>
> > TLDR, based on availability concerns, skew concerns, operational
> > concerns, and based on the fact that the new allocation algorithm can
> > be configured fairly simply now, this is a proposal to go with 4 as the
> > new default and the allocate_tokens_for_local_replication_factor set to
> > 3.
>
>
> I'm uncomfortable going with the default of `num_tokens: 4`.
> I would rather see a default of `num_tokens: 16` based on the following…
>
> a) 4 num_tokens does not provide a good out-of-the-box experience.
> b) 4 num_tokens doesn't provide any significant streaming benefits over 16.
> c)  edge-case availability doesn't trump (a) & (b)
>
>
> For (a)…
>  The first node in each rack, up to RF racks, in each datacenter can't use
> the allocation strategy. With 4 num_tokens, 3 racks and RF=3, the first
> three nodes will be poorly balanced. If three poorly balanced nodes in a
> cluster are an issue (because the cluster is small enough), then 4 is the
> wrong default. From our own experience, we have had to bootstrap these
> nodes multiple times until they generated something OK. In practice, 4
> num_tokens (over 16) has caused clients more headaches than it has
> provided gains.
>
> Elaborating, 256 was originally chosen because the token randomness over
> that many always averaged out. With a default of
> `allocate_tokens_for_local_replication_factor: 3` this issue is largely
> solved, but you will still have those initial nodes with randomly generated
> tokens. Ref:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/tokenallocator/ReplicationAwareTokenAllocator.java#L80
> And to be precise: tokens are randomly generated until there is a node in
> each rack, up to RF racks. So, if you have RF=3, in theory (or if you are
> a newbie) you could bootstrap 100 nodes in only the first two racks, and
> they would all get random tokens regardless of the
> allocate_tokens_for_local_replication_factor setting.
>
> For example, using 4 num_tokens, 3 racks and RF=3…
>  - in a 6 node cluster, there's a total of 24 tokens, half of which are
> random,
>  - in a 9 node cluster, there's a total of 36 tokens, a third of which are
> random,
>  - etc
>
> Following this logic, I would not be willing to apply 4 unless you know
> there will be more than 36 nodes in each data centre, i.e. less than ~8%
> of your tokens are randomly generated. Many clusters don't reach that
> size, and imho that's why 4 is a bad default.
>
> A default of 16, by the same logic, only needs 9 nodes in each dc to
> overcome that degree of randomness.
>
> The workaround to all this is having to manually define `initial_token: …`
> on those initial nodes. I'm really not enthusiastic about imposing that on
> new users.
>
> For (b)…
>  there have already been a number of streaming improvements that resolve
> much of whatever difference there is between 4 and 16 num_tokens. And 4
> num_tokens means bigger token ranges, so it could well be disadvantageous
> due to over-streaming.
>
> For (c)…
>  we are trying to optimise availability in situations where we can never
> guarantee availability. I understand it's a nice operational advantage to
> have in a shit-show, but it's not a property you can design for and rely
> upon. There's also the question of availability vs the size of the
> token-range that becomes unavailable.
>
>
>
> regards,
> Mick
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>
