Do vnodes address anything besides relieving cluster planners of manual
token range management on nodes? Do we have a centralized list of the
advantages they provide beyond that?

There seem to be lots of downsides: secondary index (2i) performance, the
availability issues described above, etc.

I also wonder whether, with vnodes (and with manually managed tokens...
I'll return to this), node recovery scenarios are hampered by sstables
having the hash ranges of the vnodes intermingled in the same set of
sstables. I wondered in another thread why, with vnodes, sstables are not
separated into sets by the vnode ranges they represent. For a manually
managed contiguous token range, you could separate the sstables into a
fixed number of sets, a kind of vnode-light.
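
To make that concrete, here's a minimal sketch in Python (NUM_BUCKETS,
bucket_for_token, and the half-open range convention are all made up for
illustration, not anything Cassandra does today) of routing a partition
token to one of a fixed number of sstable sets:

    NUM_BUCKETS = 60  # lots of even divisors: 2, 3, 4, 5, 6, ...

    def bucket_for_token(token, range_start, range_end):
        """Index of the sstable set this token's data belongs in."""
        span = range_end - range_start
        offset = token - range_start
        if not 0 <= offset < span:
            raise ValueError("token outside this node's range")
        return offset * NUM_BUCKETS // span

    # A node owning the half-open range [0, 6000) gets buckets of width 100:
    assert bucket_for_token(123, 0, 6000) == 1
    assert bucket_for_token(5999, 0, 6000) == 59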

So if there were a rebalance or reconstruction, you could sneakernet (or
reliably stream) entire sstable sets that belong to a given range.

I also think this would improve compactions and repairs. Compactions
would be naturally parallelizable under all compaction schemes, and
repairs would have natural subsets for the merkle tree calculations.
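
A rough sketch of the repair half of that claim (merkle_root below is a
toy stand-in for the real tree computation, not a Cassandra API): since
the buckets' hash ranges never overlap, each bucket is an independent
unit of work:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    def merkle_root(bucket_sstables):
        # Toy stand-in: hash one bucket's sstable contents together.
        h = hashlib.sha256()
        for data in bucket_sstables:
            h.update(data)
        return h.hexdigest()

    def repair_roots(buckets):
        # One independent unit of work per non-overlapping bucket, so
        # the merkle calculations parallelize naturally.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(merkle_root, buckets))

    roots = repair_roots([[b"bucket-0 data"], [b"bucket-1 data"]])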

Granted, sending whole sstables might result in "overstreaming" due to
data replication across the sstables, but you wouldn't spend CPU and
random I/O looking up the data, just sequential transfers.

For manually managed tokens with subdivided sstables, if there were a
rebalance, the "fringe" edges of the hash range would already be
subdivided; you would only need to deal with the data in the border areas
of the token range, and again could sneakernet / dumb-transfer the tables
and then let the new node remove the unneeded data in future repairs.
(Compaction does not remove data that is no longer managed by a node;
only repair does? Or does only nodetool cleanup do that?)
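
Here's a sketch of that fringe idea (bucket boundaries and function names
are hypothetical): buckets wholly inside the transferred range can be
dumb-copied, and only the buckets straddling the new boundaries need
per-partition attention:

    def classify_buckets(bucket_bounds, new_start, new_end):
        # bucket_bounds: list of half-open (lo, hi) token sub-ranges.
        whole, fringe = [], []
        for i, (lo, hi) in enumerate(bucket_bounds):
            if hi <= new_start or lo >= new_end:
                continue          # entirely outside the transferred range
            if new_start <= lo and hi <= new_end:
                whole.append(i)   # dumb-transfer the entire sstable set
            else:
                fringe.append(i)  # border area: filter or over-transfer
        return whole, fringe

    # Six buckets over [0, 600); the new node takes over [150, 450):
    bounds = [(i * 100, (i + 1) * 100) for i in range(6)]
    assert classify_buckets(bounds, 150, 450) == ([2, 3], [1, 4])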

Pre-subdivided sstables for manually managed tokens would REALLY pay big
dividends in large-scale cluster expansion. Say you wanted to double or
triple the cluster. Since the sstables are already split by some numeric
factor with lots of even divisors (60 works for RF 2, 3, 4, and 5), you
could simply bulk-copy the already-subdivided sstables for the new nodes'
hash ranges and be basically done. With AWS EBS volumes, that could just
be a volume detach / attach.
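
Just to spell out the divisibility arithmetic (illustrative numbers
only): 60 divides evenly for growth factors of 2, 3, 4, 5, and 6, so
every node hands off whole buckets and never has to rewrite an sstable:

    NUM_BUCKETS = 60

    def buckets_kept(expansion_factor):
        # Growing the cluster by this factor splits each node's range
        # evenly; e.g. doubling means a node keeps 30 of its 60 buckets
        # and bulk-copies the other 30 to its new neighbor.
        assert NUM_BUCKETS % expansion_factor == 0
        return NUM_BUCKETS // expansion_factor

    for factor in (2, 3, 4, 5, 6):
        print(f"{factor}x growth: keep {buckets_kept(factor)} of 60 buckets")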




On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <k...@instaclustr.com> wrote:

> Great write-up. Glad someone finally did the math for us. I don't think
> this will come as a surprise to many of the developers. Availability is
> only one issue raised by vnodes. Load distribution and performance are
> also pretty big concerns.
>
> I'm always a proponent of fixing vnodes, and of removing them as a
> default until we do. Happy to help on this, and we have ideas in mind
> that at some point I'll create tickets for...
>
> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.ly...@gmail.com> wrote:
>
> > If the blob link on github doesn't work for the pdf (looks like mobile
> > might not like it), try:
> >
> >
> > https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >
> > -Joey
> >
> > On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.ly...@gmail.com>
> > wrote:
> >
> > > Josh Snyder and I have been working on evaluating virtual nodes for
> large
> > > scale deployments and while it seems like there is a lot of anecdotal
> > > support for reducing the vnode count [1], we couldn't find any concrete
> > > math on the topic, so we had some fun and took a whack at quantifying
> how
> > > different choices of num_tokens impact a Cassandra cluster.
> > >
> > > According to the model we developed [2] it seems that at small cluster
> > > sizes there isn't much of a negative impact on availability, but when
> > > clusters scale up to hundreds of hosts, vnodes have a major impact on
> > > availability. In particular, the probability of outage during short
> > > failures (e.g. process restarts or failures) or permanent failure (e.g.
> > > disk or machine failure) appears to be orders of magnitude higher for
> > large
> > > clusters.
> > >
> > > The model attempts to explain why we may care about this and advances a
> > > few existing/new ideas for how to fix the scalability problems that
> > vnodes
> > > fix without the availability (and consistency—due to the effects on
> > repair)
> > > problems high num_tokens create. We would of course be very interested
> in
> > > any feedback. The model source code is on github [3], PRs are welcome
> or
> > > feel free to play around with the jupyter notebook to match your
> > > environment and see what the graphs look like. I didn't attach the pdf
> > here
> > > because it's too large apparently (lots of pretty graphs).
> > >
> > > I know that users can always just pick whichever number they prefer,
> but
> > I
> > > think the current default was chosen when token placement was random,
> > and I
> > > wonder whether it's still the right default.
> > >
> > > Thank you,
> > > -Joey Lynch
> > >
> > > [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > > [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > >
> > > [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability
> > >
> >
>
