On 20 March 2012 14:50, Rick Branson <rbran...@datastax.com> wrote:
> To support a form of DF, I think some tweaking of the replica placement could
> achieve this effect quite well. We could introduce a variable into replica
> placement, which I'm going to incorrectly call DF for the purposes of
> illustration. The key range for a node would be sub-divided by DF (1 by
> default) and this would be used to further distribute replica selection
> based on this "sub-partition".
>
> Currently, the offset formula works out to be something like this:
>
> offset = replica
>
> For RandomPartitioner, DF placement might look something like:
>
> offset = replica + (token % DF)
>
> Now, I realize replica selection is actually much more complicated than this,
> but these formulas are for illustration purposes.
>
> Modifying replica placement & the partitioners to support this seems
> straightforward, but I'm unsure of what's required to get it working for ring
> management operations. On the surface, it does seem like this could be added
> without any kind of difficult migration support.
>
> Thoughts?
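To make that illustration concrete, here's a toy Python sketch of how the DF offset could drive replica selection. Everything in it - the ring model, the names, the numbers - is invented for the example; it's not the real placement code:

    def placement_offsets(token, rf, df=1):
        # Rick's formula: offset = replica + (token % DF).
        # With df == 1 this reduces to offset = replica, the current behaviour.
        sub_partition = token % df
        return [replica + sub_partition for replica in range(rf)]

    def replica_nodes(token, ring_size, rf, df=1):
        # Toy ring of ring_size equally spaced nodes, numbered 0..ring_size-1;
        # the "primary" is just token % ring_size here. Real selection
        # (SimpleStrategy / NetworkTopologyStrategy) also handles racks,
        # datacenters and unequal ranges, all ignored in this sketch.
        primary = token % ring_size
        return [(primary + off) % ring_size
                for off in placement_offsets(token, rf, df)]

    print(replica_nodes(token=12345, ring_size=10, rf=3))        # df=1 -> [5, 6, 7]
    print(replica_nodes(token=12345, ring_size=10, rf=3, df=4))  # -> [6, 7, 8]

The effect to notice is that with DF > 1, keys in a single node's range no longer share one fixed replica set: they're spread over ~DF successor sets, which is where the balancing I mention below comes from.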
This solution increases the DF, which has the advantage of providing some balancing when a node is down temporarily: the reads and writes it would have served are now distributed across ~DF nodes. However, it doesn't give you distributed rebuild. In fact, any distribution mechanism with one token per node cannot have distributed rebuild: should a node fail, the next node in the ring takes over its token range, so it ends up with twice the range and must hold twice the data. That node limits the rebuild time, since 'nodetool removetoken' has to replicate all of the failed node's data onto it.

Increasing the distribution factor without speeding up rebuild increases the failure probability - both of losing data outright and of being unable to reach required consistency levels. The failure probability is a trade-off between rebuild time and distribution factor: lower rebuild time helps, and lower distribution factor helps.

Cassandra as it stands has the longest rebuild time and the lowest possible distribution factor. The original vnodes scheme is the other extreme - the shortest rebuild time and the largest possible distribution factor. It turns out that rebuild time matters more, so vnodes decrease the failure probability (with some assumptions you can show it decreases by a factor of RF! - I'll spare you the math but can send it if you're interested).

This scheme has the longest rebuild time and a (tuneable) distribution factor larger than the minimum. That necessarily increases the failure probability over both Cassandra as it is now and the vnodes scheme, so I'd be very careful about choosing it.

Richard.
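P.S. For anyone who wants the shape of the factorial argument without the full math, here's a crude back-of-envelope model in Python. It's my own simplification - independent failures, a single failure rate, and the assumption that a failed node's surviving replicas sit in a pool of ~DF candidate nodes - so treat the constants as purely illustrative:

    from math import comb

    def p_loss(lam, T, rf, df):
        # After one node fails, data is lost if the RF-1 surviving
        # replicas of some range also fail during the rebuild window T.
        # Those replicas live somewhere in a pool of ~DF nodes, each
        # failing independently with probability ~lam*T:
        #   P(loss) ~ C(DF, RF-1) * (lam*T)^(RF-1)
        return comb(df, rf - 1) * (lam * T) ** (rf - 1)

    N, rf, lam, T0 = 100, 3, 1e-4, 1.0  # invented cluster size and failure rate

    p_single = p_loss(lam, T0, rf, df=rf)     # one token per node: minimal DF, full-length rebuild
    p_vnodes = p_loss(lam, T0 / N, rf, df=N)  # vnodes: DF ~ N, but rebuild parallelised ~N ways
    p_dfonly = p_loss(lam, T0, rf, df=10)     # this proposal: DF > RF, same full-length rebuild

    print(p_single, p_vnodes, p_dfonly)  # ~3.0e-08, ~5.0e-09, ~4.5e-07
    print(p_single / p_vnodes)           # ~6, i.e. ~RF! for RF=3

The exact numbers are meaningless, but the ordering is the point: for vnodes the N-fold faster rebuild more than cancels the larger DF (and under these assumptions the ratio comes out at ~RF!, matching the factor above), while raising DF with no faster rebuild only grows the exposure.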