After sleeping on this, I'm sure my original conclusions are wrong.  In all
of the referenced cases/threads, I took "rack awareness" and "hotspots" to
mean something they don't.  A hotspot doesn't mean multiple replicas in the
same rack (as I had been thinking); it means the replica placement process
might land on the same vnode disproportionately often, due to the random
association of vnodes <-> {dc,rack}.

To not lead people astray, I think everything in my email below is correct
until: "Which means a rack failure (3 nodes) has a non-zero chance of data
failure (right?)."  And again, my flaw was thinking that when Cassandra
selected replicas for token "X" in a vnode world, that it would possibly
pick vnodes that happened to be on the same rack due to random placements
of the tokens.  That is wrong (looking at the source for NTS), as NTS does
skip over the same rack (though, it will allow multiple in the same rack if
you "fill up"... I guess if someone did DC:4 with 3 racks they'll always
get one rack with two copies of the data, for example).
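
To make that concrete, here is a minimal sketch of the behavior as I read it,
for a single DC only.  It is not the actual NetworkTopologyStrategy code: the
names place_replicas and build_ring, the node/rack labels, and the 16 vnodes
per node are all made up for illustration.  The walk goes clockwise around
the ring, skips nodes on racks that already hold a replica, and only "fills
up" from the skipped nodes once every rack has one.

```python
import random
from collections import Counter

def build_ring(rack_of, vnodes_per_node, seed=0):
    """Assign random tokens to each node's vnodes (illustrative only)."""
    rng = random.Random(seed)
    ring = [(rng.getrandbits(64), node)
            for node in rack_of
            for _ in range(vnodes_per_node)]
    ring.sort()
    return ring

def place_replicas(ring, rack_of, start, rf):
    """Simplified single-DC sketch of rack-aware replica placement.

    ring    : list of (token, node) sorted by token, one entry per vnode
    rack_of : dict mapping node -> rack
    start   : ring index where the clockwise walk begins
    rf      : replication factor for this DC
    """
    all_racks = set(rack_of.values())
    rf = min(rf, len(rack_of))  # cannot place more replicas than nodes
    replicas, seen_racks, skipped = [], set(), []
    for i in range(len(ring)):
        if len(replicas) == rf:
            break
        _, node = ring[(start + i) % len(ring)]
        if node in replicas or node in skipped:
            continue
        rack = rack_of[node]
        if rack in seen_racks and seen_racks != all_racks:
            # Rack already has a copy and other racks are still empty: skip.
            skipped.append(node)
            continue
        replicas.append(node)
        seen_racks.add(rack)
        if seen_racks == all_racks:
            # Every rack holds a copy, so "fill up" from the skipped nodes.
            while skipped and len(replicas) < rf:
                replicas.append(skipped.pop(0))
    return replicas

# Hypothetical layout from my email: 3 racks, 3 nodes per rack.
rack_of = {f"node{i}": f"rack{i // 3}" for i in range(9)}
ring = build_ring(rack_of, vnodes_per_node=16)
for rf in (3, 4):
    worst = max(
        max(Counter(rack_of[n]
                    for n in place_replicas(ring, rack_of, s, rf)).values())
        for s in range(len(ring))
    )
    print(f"RF={rf}: most replicas of any range on one rack = {worst}")
```

Under these assumptions, DC:3 never puts two copies of a range on one rack,
no matter how the tokens fall, while DC:4 with 3 racks always doubles up on
exactly one rack.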

will

On Tue, May 13, 2014 at 1:41 PM, William Oberman
<ober...@civicscience.com> wrote:

> I found this:
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201404.mbox/%3ccaeduwd1erq-1m-kfj6ubzsbeser8dwh+g-kgdpstnbgqsqc...@mail.gmail.com%3E
>
> I read the three referenced cases.  In addition, case 4123 references:
> http://www.mail-archive.com/dev@cassandra.apache.org/msg03844.html
>
> And even though I *think* I understand all of the issues now, I still want
> to double check...
>
> Assumptions:
> -A cluster using NTS with options [DC:3]
> -Physical layout = In DC, 3 nodes/rack for a total of 9 nodes
>
> No vnodes: I could do token selection using ideas from case 3810 such that
> each rack has one replica.  At this point, my "0% chance of data loss"
> scenarios are:
> 1.) Failure of two nodes at random
> 2.) Failure of 2 racks (6 nodes!)
>
> Vnodes: my "0% chance of data loss" scenarios are:
> 1.) Failure of two nodes at random
> Which means a rack failure (3 nodes) has a non-zero chance of data failure
> (right?).
>
> To get specific, I'm in AWS, so racks ~= "availability zones".  In the
> years I've been in AWS, I've seen several occasions of "single zone
> downtimes", and one time of "single zone catastrophic loss".  E.g. for AWS
> I feel like you *have* to plan for a single zone failure, and in terms of
> "safety first" you *should* plan for two zone failures.
>
> To mitigate this data loss risk seems rough for vnodes, again if I'm
> understanding everything correctly:
> -To ensure 0% data loss for one zone => I need RF=4
> -To ensure 0% data loss for two zones => I need RF=7
>
> I'd really like to use vnodes, but RF=7 is crazy.
>
> To reiterate what I think is the core idea of this message:
> 1.) for vnodes 0% data loss => RF=(# of allowed failures at once)+1
> 2.) racks don't change the above equation at all
>
> will
>
