There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119
Comments and contributions are still welcome!
Cheers,
Sam
On 16 March 2012 23:38, Sam Overton wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller wrote:
> > No I don't think you did, in fact, depending on the size of your SSTable a
> > contiguous range (or the entire SSTable) may or may not be affected by a
> > cleanup/move or any type of topology change. There is lots of room for
> > optimization here. After loading the indexes we actually know start/end
> No I don't think you did, in fact, depending on the size of your SSTable a
> contiguous range (or the entire SSTable) may or may not be affected by a
> cleanup/move or any type of topology change. There is lots of room for
> optimization here. After loading the indexes we actually know start/end
> The SSTable indices should still be scanned for size-tiered compaction.
> Am I missing anything here?
>
>
No I don't think you did, in fact, depending on the size of your SSTable a
contiguous range (or the entire SSTable) may or may not be affected by a
cleanup/move or any type of topology change. There is lots of room for
optimization here. After loading the indexes we actually know start/end
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller wrote:
> > You would have to iterate through all sstables on the system to repair one
> > vnode, yes: but building the tree for just one range of the data means that
> > huge portions of the sstable files can be skipped. It should scale down
> > linearly as the number of vnodes increases (i.e., with 100 vnodes, it
>
> You would have to iterate through all sstables on the system to repair one
> vnode, yes: but building the tree for just one range of the data means that
> huge portions of the sstable files can be skipped. It should scale down
> linearly as the number of vnodes increases (i.e., with 100 vnodes, it
>
> Does the new scheme still require the node to re-iterate all sstables to
> build the merkle tree or stream data for partition-level
> repair and move?
You would have to iterate through all sstables on the system to repair one
vnode, yes: but building the tree for just one range of the data means that
huge portions of the sstable files can be skipped. It should scale down
linearly as the number of vnodes increases.
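A minimal sketch of the skipping argument above, in Java with invented types
(this is not Cassandra's actual repair code): once an sstable's index is
loaded its min/max tokens are known, so whole sstables outside the vnode's
range can be passed over when building the validation tree.

    // Illustrative only: invented types, not Cassandra's real repair code.
    // Build a validation tree for a single vnode range [lo, hi], skipping any
    // sstable whose token bounds (known once its index is loaded) miss the range.
    import java.util.List;

    class VnodeRepairSketch {
        interface Row { long token(); byte[] digest(); }
        interface SSTable {
            long minToken();                       // from the loaded index
            long maxToken();
            Iterable<Row> rows(long lo, long hi);  // index-limited scan
        }
        interface MerkleTree { void add(long token, byte[] digest); }

        static void buildTree(List<SSTable> sstables, long lo, long hi, MerkleTree tree) {
            for (SSTable s : sstables) {
                if (s.maxToken() < lo || s.minToken() > hi)
                    continue; // whole sstable lies outside this vnode's range
                for (Row r : s.rows(lo, hi))
                    tree.add(r.token(), r.digest()); // only rows inside the range
            }
        }
    }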
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low wrote:
> On 22 March 2012 05:48, Zhu Han wrote:
>
> > I second it.
> >
> > Are there any goals we missed which cannot be achieved by assigning
> > multiple tokens to a single node?
>
> This is exactly the proposed solution. The discussion is about how to
> implement this, and the methods of choosing tokens and replica placement.
On 22 March 2012 05:48, Zhu Han wrote:
> I second it.
>
> Are there any goals we missed which cannot be achieved by assigning
> multiple tokens to a single node?
This is exactly the proposed solution. The discussion is about how to
implement this, and the methods of choosing tokens and replica placement.
Are there any goals we missed which cannot be achieved by assigning
multiple tokens to a single node?
>
> -Jeremiah Jordan
>
>
> From: Rick Branson [rbran...@datastax.com]
> Sent: Monday, March 19, 2012 5:16 PM
> To: dev@cassandra.apache.org
> Subject: Re: RFC: Cassandra Virtual Nodes
>
>
A friend pointed out to me privately that I came across pretty harsh
in this thread. While I stand by my technical concerns, I do want to
acknowledge that Sam's proposal here indicates a strong grasp of the
principles involved, and a deeper level of thought into the issues
than I think anyone else
>>> I envision vnodes as Cassandra master being a shared cache, memtables,
>>> and manager for what we today consider a Cassandra instance.
It might be kind of problematic: when you are moving the nodes, you want the
data associated with the node to move too, otherwise it will be a pain to
clean up after.
I just see vnodes as a way to make the problem smaller, and by making the
problem smaller the overall system is more agile. Aka, rather than 1 node
streaming 100 GB, 4 nodes stream 25 GB each. Moves by hand are not so bad
because they take 1/4th the time.
The simplest vnode implementation is VMware.
> Software-wise it is the same deal. Each node streams off only disk 4
> to the new node.
I think an implication on software is that if you want to make
specific selections of partitions to move, you are effectively
incompatible with deterministically generating the mapping of
partition to responsible nodes.
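Peter's point can be made concrete with a toy example (Java, hypothetical
names): a deterministic mapping is a pure function of the shared token ring,
so every node computes the same owner; hand-picked partition moves replace
that function with a mutable lookup table that must itself be kept consistent.

    // Deterministic mapping: any node with the same sorted token set computes
    // the same owner for a key, with no extra state to keep consistent.
    import java.util.Map;
    import java.util.NavigableMap;

    class OwnershipSketch {
        static String ownerOf(long keyToken, NavigableMap<Long, String> ring) {
            Map.Entry<Long, String> e = ring.ceilingEntry(keyToken);
            return e != null ? e.getValue() : ring.firstEntry().getValue(); // wrap
        }
    }
    // Hand-picked moves break this property: ownership degenerates into a
    // (partition -> node) table that must itself be replicated and repaired,
    // instead of something every node derives from gossip alone.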
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie wrote:
> Hi Edward
>
>> 1) No more raid 0. If a machine is responsible for 4 vnodes they
>> should correspond to four JBOD disks.
>
> So each vnode corresponds to a disk? I suppose we could have a
> separate data directory per disk, but I think this should be a
> separate, subsequent change.
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans wrote:
> I must admit I find this a little disheartening. The discussion has
> barely started. No one has had a chance to discuss implementation
> specifics so that the rest of us could understand *how* disruptive it
> would be (a necessary requirement
Hi Edward
> 1) No more raid 0. If a machine is responsible for 4 vnodes they
> should correspond to four JBOD disks.
So each vnode corresponds to a disk? I suppose we could have a
separate data directory per disk, but I think this should be a
separate, subsequent change.
However, do note that making t
I'm going to agree with Eric on this one. Twitter has wanted some sort of vnode
support for quite some time. We even were willing to do all the work. I have
reservations about that now. We have been silent due to the community and how
this is more like an exclusive DataStax project than an Apache one.
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans wrote:
> On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis wrote:
>> It's reasonable that we can attach different levels of importance to
>> these things. Taking a step back, I have two main points:
>>
>> 1) vnodes add enormous complexity to *many* parts of Cassandra. I'm
>> skeptical of the cost:benefit ratio here.
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis wrote:
> It's reasonable that we can attach different levels of importance to
> these things. Taking a step back, I have two main points:
>
> 1) vnodes add enormous complexity to *many* parts of Cassandra. I'm
> skeptical of the cost:benefit ratio here.
It's reasonable that we can attach different levels of importance to
these things. Taking a step back, I have two main points:
1) vnodes add enormous complexity to *many* parts of Cassandra. I'm
skeptical of the cost:benefit ratio here.
1a) The benefit is lower in my mind because many of the pr
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
>
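A rough sketch of the check this proposal implies, with invented names rather
than anything from Rick's mail: each node gossips a (lower, upper] pair, and a
coordinator treats it as a replica for a key only when the key's token falls
inside that active range.

    // Invented sketch of an "active token range" check; not actual code from
    // the proposal. A node gossips its range alongside its other state.
    class ActiveRange {
        final long lower, upper;

        ActiveRange(long lower, long upper) { this.lower = lower; this.upper = upper; }

        // true when a key's token falls in (lower, upper], handling ring wrap
        boolean owns(long token) {
            return lower < upper
                 ? token > lower && token <= upper
                 : token > lower || token <= upper;
        }
    }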
On 20 March 2012 14:55, Jonathan Ellis wrote:
> Here's how I see Sam's list:
>
> * Even load balancing when growing and shrinking the cluster
>
> Nice to have, but post-bootstrap load balancing works well in practice
> (and is improved by TRP).
Post-bootstrap load balancing without vnodes necessa
On 20 March 2012 14:50, Rick Branson wrote:
> To support a form of DF, I think some tweaking of the replica placement could
> achieve this effect quite well. We could introduce a variable into replica
> placement, which I'm going to incorrectly call DF for the purposes of
> illustration. The k
On 19 March 2012 23:41, Peter Schuller wrote:
>>> Using this ring bucket in the CRUSH topology (with the hash function
>>> being the identity function) would give the exact same distribution
>>> properties as the virtual node strategy that I suggested previously,
>>> but of course with much better topology awareness.
whole cluster, not just your neighbor. etc. etc.
-Jeremiah Jordan
From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning.
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans wrote:
> On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote:
>> I like this idea. It feels like a good 80/20 solution -- 80% of the
>> benefits, 20% of the effort. More like 5% of the effort. I can't
>> even enumerate all the places full vnode support would change, but an
>> "active token range" concept would be relatively limited in scope.
On 20 March 2012 13:37, Eric Evans wrote:
> On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote:
>> On 20 March 2012 04:35, Vijay wrote:
>>> Maybe; what I mean is a little simpler than that... We can consider
>>> every node having multiple conservative ranges and moving those ranges
>>> for bootstrap etc., instead of moving the whole node.
> > I like this idea. It feels like a good 80/20 solution -- 80% of the
> > benefits, 20% of the effort. More like 5% of the effort. I can't
> > even enumerate all the places full vnode support would change, but an
> > "active token range" concept would be relatively limited in scope.
>
>
> It on
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote:
> I like this idea. It feels like a good 80/20 solution -- 80% of the
> benefits, 20% of the effort. More like 5% of the effort. I can't
> even enumerate all the places full vnode support would change, but an
> "active token range" concept
I like this idea. It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort. More like 5% of the effort. I can't
even enumerate all the places full vnode support would change, but an
"active token range" concept would be relatively limited in scope.
Full vnodes feels a lot m
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote:
> On 20 March 2012 04:35, Vijay wrote:
>> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote:
>>
>>> I'm guessing you're referring to Rick's proposal about ranges per node?
>>>
>>
>> Maybe; what I mean is a little simpler than that... We can consider
>> every node having multiple conservative ranges and moving those ranges
On 20 March 2012 04:35, Vijay wrote:
> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote:
>
>> I'm guessing you're referring to Rick's proposal about ranges per node?
>>
>
> Maybe; what I mean is a little simpler than that... We can consider
> every node having multiple conservative ranges and moving those ranges
> for bootstrap etc., instead of moving the whole node.
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote:
> I'm guessing you're referring to Rick's proposal about ranges per node?
>
Maybe; what I mean is a little simpler than that... We can consider
every node having multiple conservative ranges and moving those ranges
for bootstrap etc., instead of moving the whole node.
On Mon, Mar 19, 2012 at 9:37 PM, Vijay wrote:
> I also did create a ticket
> https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
> reasons why I would like to see vnodes in cassandra.
> It can also potentially reduce the SSTable seeks which a node has to do to
> query data in SizeTieredCompaction if extended to the filesystem.
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller wrote:
> > As a side note: vnodes fail to provide solutions to node-based limitations
> > that seem to me to cause a substantial portion of operational issues such
> > as impact of node restarts / upgrades, GC and compaction induced latency. I
>
> Actually, it does.
I also did create a ticket
https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
reasons why I would like to see vnodes in cassandra.
It can also potentially reduce the SSTable seeks which a node has to do to
query data in SizeTieredCompaction if extended to the filesystem.
But 110% a
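The seek-reduction idea can be illustrated with a toy layout (Java, invented
names; not how Cassandra organizes sstables today): if sstables were
segregated per vnode on disk, a read would consult only the sstables of the
key's vnode instead of every size-tiered sstable on the host.

    // Toy layout, invented names: sstables bucketed by the vnode that owns them.
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class PerVnodeSSTables {
        // vnode upper token -> sstables holding only that vnode's data
        final NavigableMap<Long, List<String>> byVnode = new TreeMap<>();

        List<String> candidatesFor(long keyToken) {
            // owning vnode = first vnode token at or above the key, wrapping
            Map.Entry<Long, List<String>> e = byVnode.ceilingEntry(keyToken);
            if (e == null) e = byVnode.firstEntry();
            return e == null ? Collections.<String>emptyList() : e.getValue();
        }
    }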
(I may comment on other things more later)
> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I
Actually, it does. At least
>> Using this ring bucket in the CRUSH topology (with the hash function
>> being the identity function) would give the exact same distribution
>> properties as the virtual node strategy that I suggested previously,
>> but of course with much better topology awareness.
>
> I will have to re-read yo
> a) a virtual node partitioning scheme (to support heterogeneity and
> management simplicity)
> b) topology aware replication
> c) topology aware routing
I would add (d) limiting the distribution factor to decrease the
probability of data loss/multiple failures within a replica set.
> First of a
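Point (d) can be made concrete with a back-of-envelope calculation, under
assumptions not spelled out in the thread (RF=3, independent failures, and
"data loss" meaning all three replicas of some range are down at once):

    // If S distinct 3-node replica sets exist among n nodes, the chance that
    // three simultaneous failures destroy some range is roughly S / C(n,3).
    class DistributionFactorSketch {
        static double pLoss(long n, long s) {
            double triples = (double) n * (n - 1) * (n - 2) / 6.0; // C(n,3)
            return Math.min(1.0, s / triples);
        }
    }
    // Fully random vnode placement pushes S toward C(n,3), so any 3 concurrent
    // failures likely lose some range. Capping the distribution factor means a
    // node shares ranges with at most DF-1 peers, so S <= n * C(DF-1, 2) / 3.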
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and introduce major potential stability issues into what
i
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton wrote:
>> For OPP the problem of load balancing is more profound. Now you need
>> vnodes per keyspace because you cannot expect each keyspace to have
>> the same distribution. With three keyspaces you are now unsure as to
>> which one is causing the hotness.
> For OPP the problem of load balancing is more profound. Now you need
> vnodes per keyspace because you cannot expect each keyspace to have
> the same distribution. With three keyspaces you are now unsure as to
> which one is causing the hotness. I think OPP should just go away.
That's a good point.
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton wrote:
> On 19 March 2012 09:23, Radim Kolar wrote:
>>
>>>
>>> Hi Radim,
>>>
>>> The number of virtual nodes for each host would be configurable by the
>>> user, in much the same way that initial_token is configurable now. A host
>>> taking a larger number of virtual nodes (tokens) would have
>>> proportionately more of the data.
On 19 March 2012 09:23, Radim Kolar wrote:
>
>>
>> Hi Radim,
>>
>> The number of virtual nodes for each host would be configurable by the
>> user, in much the same way that initial_token is configurable now. A host
>> taking a larger number of virtual nodes (tokens) would have
>> proportionately more of the data.
>
Hi Peter,
It's great to hear that others have come to some of the same conclusions!
I think a CRUSH-like strategy for topologically aware
replication/routing/locality is a great idea. I think I can see three
mostly orthogonal sets of functionality that we're concerned with:
a) a virtual node partitioning scheme (to support heterogeneity and
management simplicity)
b) topology aware replication
c) topology aware routing
Hi Radim,
The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a larger number of virtual nodes (tokens) would have proportionately
more of the data. This is how we anticipate support for heterogeneity.
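A minimal sketch of that knob in Java (illustrative only; the actual
allocation strategy was exactly what this thread was still debating): a host
configured with twice as many tokens claims, in expectation, twice the data.

    // Illustrative only: pick num_tokens uniform-random tokens for a host.
    import java.util.Random;
    import java.util.SortedSet;
    import java.util.TreeSet;

    class TokenAllocationSketch {
        static SortedSet<Long> randomTokens(int numTokens, Random rng) {
            SortedSet<Long> tokens = new TreeSet<>();
            while (tokens.size() < numTokens)
                tokens.add(rng.nextLong()); // uniform over the token ring
            return tokens;
        }
    }
    // e.g. a double-capacity host: randomTokens(512, rng) vs randomTokens(256, rng)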
Point of clarification: My use of the term "bucket" is completely
unrelated to the term "bucket" used in the CRUSH paper.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
I am very happy to see some momentum on this, and I would like to go
even further than what you propose. The main reaso
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han wrote:
> On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
>> This is a long email. It concerns a significant change to Cassandra, so
>> deserves a thorough introduction.
>>
>> *The summary is*: we believe virtual nodes are the way forward. We would
>> like to add virtual nodes to Cassandra and we are asking for comments,
>> criticism and collaboration!
I agree having smaller regions would help the rebalancing situation both
with RP and BOP. However I am not sure if dividing tables across disks
will give any better performance. You will have more seeking spindles and
can possibly subdivide token ranges into separate files. But fs cache will
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks)
> 2. especially with the ordered partitioner there tend to be hotspots and you
> must assign a smaller portion of data to nodes holding hotspots
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
On 17 March 2012 11:15, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks)
> 2. especially with the ordered partitioner there tend to be hotspots and you
> must assign a smaller portion of data to nodes holding hotspots
I don't like that every node will have the same portion of data.
1. We are using nodes with different HW sizes (number of disks)
2. especially with the ordered partitioner there tend to be hotspots and
you must assign a smaller portion of data to nodes holding hotspots