So my understanding of how Cassandra saves data is incorrect.

I was/am under the impression that a node owns a particular token
range, and does not save any data that falls outside of that range
(with the exception of any data that might be replicated to it). Based
on what you are saying, each node owns a token range, but also
maintains copies of data outside of that range. If this is correct,
then I can understand why all of my previous questions seemed "wrong."
Cassandra already does what I want, provided that I use the correct RF
and CL values.
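
As a note to myself, here is a rough sketch of what that looks like in
practice, using the DataStax Python driver for illustration (the
contact point and the "demo" keyspace name are just placeholders). The
point is that RF is a property of the keyspace, not of the cluster
size:

    from cassandra.cluster import Cluster

    # Placeholder contact point; any live node in the ring will do.
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # RF=3 means every row has 3 replicas, regardless of how many nodes
    # are in the ring; the extra copies live on nodes outside the row's
    # "primary" token range, which is the behavior described above.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy',
                            'replication_factor': 3}
    """)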

Thank you Peter.

On Fri, Jul 15, 2011 at 9:53 AM, Peter Schuller
<peter.schul...@infidyne.com> wrote:
>> The goal is to configure a cluster in which reads and writes can
>> complete successfully even if only 1 node is online. For this to work,
>
> Why? You should be designing for "only 1 out of N nodes" where N is
> RF. If you happen to have 3 machines now and you want 3 copies in
> total, that's fine. But why would you want RF=10 just because you add
> 10 nodes? It seems really off. You add nodes to increase total
> capacity. The number of copies of each piece of data, for redundancy
> purposes, is usually a completely separate concern from your cluster
> size.
>
>> each node would need the entire dataset. Your example of a 3-node ring
>> with RF=3 would satisfy this requirement. However, if two nodes are
>> offline, CL.QUORUM would not work; I would need to use CL.ONE. If all
>> 3 nodes are online, CL.ONE is undershooting; I would want to use
>> CL.QUORUM (or maybe CL.ALL). Or does CL.ONE actually function this
>> way, somewhat?
>
> Writes ALWAYS go to all replicas eventually; usually immediately if
> all nodes are up. Consistency Level ONLY affects what is *required* to
> be successfully *acked* before a write (or a read) returns to the
> client. Using CL.ONE never means that data won't be replicated to all
> nodes eventually.
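
A minimal sketch of what that means at the client level, again using
the DataStax Python driver for illustration (keyspace, table, and
contact point are made up): the consistency level is a per-request ack
requirement, not a replica count.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('demo')

    # With RF=3, this write is still sent to all 3 replicas; CL.ONE only
    # means the call returns as soon as a single replica has acked it.
    stmt = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(stmt, (42, 'alice'))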
>
> The reason to use QUORUM is to get strong consistency in the sense
> that a write followed by a subsequent read is guaranteed to see that
> write.
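
Spelling out the arithmetic behind that guarantee for my own reference
(plain Python, with RF=3 as in the example above):

    rf = 3
    quorum = rf // 2 + 1   # floor(RF/2) + 1 = 2 when RF = 3

    # A write acked by QUORUM replicas and a later read answered by
    # QUORUM replicas must overlap on at least one replica, so the read
    # is guaranteed to see the write: 2 + 2 > 3.
    assert quorum + quorum > rf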
>
> Another reason is to guarantee that no write is lost if a single node
> suddenly evaporates/explodes/kernel panics.
>
> If you don't have strong consistency/durability demands, probably just
> use CL.ONE. Data will still be replicated at whatever replication
> factor (RF) you have chosen.
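
Which matches my original goal: with RF=3 and CL.ONE, reads and writes
can still complete even if two of the three replicas are down. Sketched
with the same hypothetical Python driver setup as above:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('demo')

    # With RF=3, this read succeeds as long as any one of the row's
    # three replicas is reachable.
    query = SimpleStatement(
        "SELECT name FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE)
    for row in session.execute(query, (42,)):
        print(row.name)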
>
>> A complication occurs when you want to add another node. Now there's a
>> 4-node ring, but only 3 replicas, so each node isn't guaranteed to
>> have all of the data, and the cluster can't completely function when
>> N-1 nodes are offline. This is why I would like the RF to scale with
>> the size of the cluster. Am I mistaken?
>
> It seems like your requirements are mistaken. Even if for some strange
> reason you really want to tie RF to number of nodes, you can just add
> a node FIRST, and *then* increase RF. But be advised that increasing
> RF implies cluster downtime (if you want to get correct data on reads)
> because you have to run a rotating 'nodetool repair' after changing
> the replication level.
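
Noted. If I ever did need to raise RF, my understanding is that the
sequence would be roughly the following (sketched in CQL via the Python
driver; the "demo" keyspace is a placeholder, and on a 0.8 cluster the
same change would go through the CLI instead):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()

    # Raise the keyspace from RF=3 to RF=4 AFTER the fourth node has
    # joined the ring.
    session.execute("""
        ALTER KEYSPACE demo
        WITH replication = {'class': 'SimpleStrategy',
                            'replication_factor': 4}
    """)

    # Existing rows are not copied to their new replicas automatically;
    # run a rolling repair, one node at a time (e.g. "nodetool repair
    # demo" on each node), before relying on reads at the higher RF.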
>
> But I repeat: You almost certainly don't want to be changing RF all
> the time. You most likely just want to settle on one particular level
> of redundancy, which is going to be the RF, and also implies the
> *minimum* number of nodes in the cluster. Then you add more nodes for
> *capacity* reasons, but not for redundancy reasons, and there's no
> reason to increase RF.
>
> If you *really* know what you're doing and why you want RF to track
> total node count, I'm sure there are *some* cases where this makes
> sense. But nothing you've said so far really indicates you're in such
> a position.
>
> --
> / Peter Schuller (@scode on twitter)
>
