I've posted a bunch of things relevant to commitlog --> sstable and associated compaction / sstable metadata changes on here. I really need to learn that section of the code.
On Tue, Apr 17, 2018 at 10:29 AM, Jeff Jirsa <jji...@gmail.com> wrote: > There are two huge advantages > > 1) during expansion / replacement / decom, you stream from far more > ranges. Since streaming is single threaded per stream, this enables you to > max out machines during streaming where single token doesn’t > > 2) when adjusting the size of a cluster, you can often grow incrementally > without rebalancing > > Streaming entire wholly covered/contained/owned sstables during range > movements is probably a huge benefit in many use cases that may make the > single threaded streaming implementation less of a concern, and likely > works reasonably well without major changes to LCS in particular - I’m > fairly confident there’s a JIRA for this, if not it’s been discussed in > person among various operators for years as an obvious future improvement. > > -- > Jeff Jirsa > > > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <carl.muel...@smartthings.com> > wrote: > > > > Do Vnodes address anything besides alleviating cluster planners from > doing > > token range management on nodes manually? Do we have a centralized list > of > > advantages they provide beyond that? > > > > There seem to be lots of downsides. 2i index performance, the above > > availability, etc. > > > > I also wonder if in vnodes (and manually managed tokens... I'll return to > > this) the node recovery scenarios are being hampered by sstables having > the > > hash ranges of the vnodes intermingled in the same set of sstables. I > > wondered in another thread in vnodes why sstables are separated into sets > > by the vnode ranges they represent. For a manually managed contiguous > token > > range, you could separate the sstables into a fixed number of sets, kind > of > > vnode-light. > > > > So if there was rebalancing or reconstruction, you could sneakernet or > > reliably send entire sstable sets that would belong in a range. > > > > I also thing this would improve compactions and repairs too. Compactions > > would be naturally parallelizable in all compaction schemes, and repairs > > would have natural subsets to do merkle tree calculations. > > > > Granted sending sstables might result in "overstreaming" due to data > > replication across the sstables, but you wouldn't have CPU and random I/O > > to look up the data. Just sequential transfers. > > > > For manually managed tokens with subdivided sstables, if there was > > rebalancing, you would have the "fringe" edges of the hash range > subdivided > > already, and you would only need to deal with the data in the border > areas > > of the token range, and again could sneakernet / dumb transfer the tables > > and then let the new node remove the unneeded in future repairs. > > (Compaction does not remove data that is not longer managed by a node, > only > > repair does? Or does only nodetool clean do that?) > > > > Pre-subdivided sstables for manually maanged tokens would REALLY pay big > > dividends in large-scale cluster expansion. Say you wanted to double or > > triple the cluster. Since the sstables are already split by some numeric > > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply > bulk > > copy the already-subdivided sstables for the new nodes' hash ranges and > > you'd basically be done. In AWS EBS volumes, that could just be a drive > > detach / drive attach. > > > > > > > > > >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <k...@instaclustr.com> > wrote: > >> > >> Great write up. Glad someone finally did the math for us. I don't think > >> this will come as a surprise for many of the developers. Availability is > >> only one issue raised by vnodes. Load distribution and performance are > also > >> pretty big concerns. > >> > >> I'm always a proponent for fixing vnodes, and removing them as a default > >> until we do. Happy to help on this and we have ideas in mind that at > some > >> point I'll create tickets for... > >> > >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.ly...@gmail.com> > wrote: > >>> > >>> If the blob link on github doesn't work for the pdf (looks like mobile > >>> might not like it), try: > >>> > >>> > >>> https://github.com/jolynch/python_performance_toolkit/ > >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra- > >> availability-virtual.pdf > >>> > >>> -Joey > >>> < > >>> https://github.com/jolynch/python_performance_toolkit/ > >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra- > >> availability-virtual.pdf > >>>> > >>> > >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.ly...@gmail.com> > >>> wrote: > >>> > >>>> Josh Snyder and I have been working on evaluating virtual nodes for > >> large > >>>> scale deployments and while it seems like there is a lot of anecdotal > >>>> support for reducing the vnode count [1], we couldn't find any > concrete > >>>> math on the topic, so we had some fun and took a whack at quantifying > >> how > >>>> different choices of num_tokens impact a Cassandra cluster. > >>>> > >>>> According to the model we developed [2] it seems that at small cluster > >>>> sizes there isn't much of a negative impact on availability, but when > >>>> clusters scale up to hundreds of hosts, vnodes have a major impact on > >>>> availability. In particular, the probability of outage during short > >>>> failures (e.g. process restarts or failures) or permanent failure > (e.g. > >>>> disk or machine failure) appears to be orders of magnitude higher for > >>> large > >>>> clusters. > >>>> > >>>> The model attempts to explain why we may care about this and advances > a > >>>> few existing/new ideas for how to fix the scalability problems that > >>> vnodes > >>>> fix without the availability (and consistency—due to the effects on > >>> repair) > >>>> problems high num_tokens create. We would of course be very interested > >> in > >>>> any feedback. The model source code is on github [3], PRs are welcome > >> or > >>>> feel free to play around with the jupyter notebook to match your > >>>> environment and see what the graphs look like. I didn't attach the pdf > >>> here > >>>> because it's too large apparently (lots of pretty graphs). > >>>> > >>>> I know that users can always just pick whichever number they prefer, > >> but > >>> I > >>>> think the current default was chosen when token placement was random, > >>> and I > >>>> wonder whether it's still the right default. > >>>> > >>>> Thank you, > >>>> -Joey Lynch > >>>> > >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701 > >>>> [2] https://github.com/jolynch/python_performance_toolkit/ > >>>> raw/master/notebooks/cassandra_availability/whitepaper/cassandra- > >>>> availability-virtual.pdf > >>>> > >>>> < > >>> https://github.com/jolynch/python_performance_toolkit/ > >> blob/master/notebooks/cassandra_availability/whitepaper/cassandra- > >> availability-virtual.pdf > >>>> > >>>> [3] https://github.com/jolynch/python_performance_toolkit/tree/m > >>>> aster/notebooks/cassandra_availability > >>>> > >>> > >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org > For additional commands, e-mail: dev-h...@cassandra.apache.org > >