After sleeping on this, I'm sure my original conclusions are wrong. In all of the referenced cases/threads, I had taken "rack awareness" and "hotspots" to mean something other than what they actually mean. A hotspot didn't mean multiple replicas in the same rack (as I had been thinking); it meant that the process of finding replica placement might hit the same vnode disproportionately often due to the random association of vnodes <-> {dc,rack}.

To not lead people astray, I think everything in my email below is correct until: "Which means a rack failure (3 nodes) has a non-zero chance of data failure (right?)." Again, my flaw was thinking that when Cassandra selected replicas for token "X" in a vnode world, it could pick vnodes that happened to be on the same rack due to the random placement of the tokens. That is wrong (looking at the source for NTS), as NTS does skip over nodes in a rack that already holds a replica (though it will allow multiple replicas in the same rack if you "fill up"... I guess if someone did DC:4 with 3 racks they'll always get one rack with two copies of the data, for example).

will

On Tue, May 13, 2014 at 1:41 PM, William Oberman <ober...@civicscience.com> wrote:

> I found this:
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201404.mbox/%3ccaeduwd1erq-1m-kfj6ubzsbeser8dwh+g-kgdpstnbgqsqc...@mail.gmail.com%3E
>
> I read the three referenced cases. In addition, case 4123 references:
> http://www.mail-archive.com/dev@cassandra.apache.org/msg03844.html
>
> And even though I *think* I understand all of the issues now, I still want
> to double check...
>
> Assumptions:
> -A cluster using NTS with options [DC:3]
> -Physical layout = In DC, 3 nodes/rack for a total of 9 nodes
>
> No vnodes: I could do token selection using ideas from case 3810 such that
> each rack has one replica. At this point, my "0% chance of data loss"
> scenarios are:
> 1.) Failure of two nodes at random
> 2.) Failure of 2 racks (6 nodes!)
>
> Vnodes: my "0% chance of data loss" scenarios are:
> 1.) Failure of two nodes at random
> Which means a rack failure (3 nodes) has a non-zero chance of data failure
> (right?).
>
> To get specific, I'm in AWS, so racks ~= "availability zones". In the
> years I've been in AWS, I've seen several occasions of "single zone
> downtimes", and one time of "single zone catastrophic loss". E.g. for AWS
> I feel like you *have* to plan for a single zone failure, and in terms of
> "safety first" you *should* plan for two zone failures.
>
> To mitigate this data loss risk seems rough for vnodes, again if I'm
> understanding everything correctly:
> -To ensure 0% data loss for one zone => I need RF=4
> -To ensure 0% data loss for two zones => I need RF=7
>
> I'd really like to use vnodes, but RF=7 is crazy.
>
> To reiterate what I think is the core idea of this message:
> 1.) for vnodes 0% data loss => RF=(# of allowed failures at once)+1
> 2.) racks don't change the above equation at all
>
> will
>
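
For anyone who wants to convince themselves of the correction at the top of this message, here is a rough Python sketch of rack-aware placement on a ring of randomly placed vnode tokens. It is only a toy model of the idea (walk the ring, skip nodes whose rack already holds a replica, fall back to the skipped nodes once every rack is used), not the actual NTS code. The names `build_ring`, `pick_replicas`, `min_survivors`, the rack labels, and the 16 vnodes per node are all made up for illustration, and it models a single DC only.

```python
import random
from collections import Counter


def build_ring(racks, nodes_per_rack, vnodes_per_node, seed=0):
    """Return a token-sorted list of (token, node, rack) with random vnode tokens."""
    rng = random.Random(seed)
    total = len(racks) * nodes_per_rack * vnodes_per_node
    tokens = rng.sample(range(2**32), total)  # unique random tokens
    ring, i = [], 0
    for rack in racks:
        for n in range(nodes_per_rack):
            node = f"{rack}-node{n}"
            for _ in range(vnodes_per_node):
                ring.append((tokens[i], node, rack))
                i += 1
    ring.sort()
    return ring


def pick_replicas(ring, start, rf):
    """Toy single-DC rack-aware placement (an assumption, not the real NTS source).

    Walk the ring clockwise from position `start`. Prefer nodes in racks that
    do not hold a replica yet; once every rack is used, "fill up" from the
    nodes that were skipped, then from whatever comes next on the ring.
    """
    all_racks = {rack for _, _, rack in ring}
    all_nodes = {node for _, node, _ in ring}
    rf = min(rf, len(all_nodes))
    replicas, skipped, seen_racks = [], [], set()
    for step in range(len(ring)):
        if len(replicas) >= rf:
            break
        _, node, rack = ring[(start + step) % len(ring)]
        entry = (node, rack)
        if entry in replicas or entry in skipped:
            continue  # another vnode of a node we already considered
        if seen_racks == all_racks:
            replicas.append(entry)  # every rack is used: any new node will do
        elif rack in seen_racks:
            skipped.append(entry)  # rack already has a replica: skip for now
        else:
            replicas.append(entry)
            seen_racks.add(rack)
            if seen_racks == all_racks:
                for s in skipped:  # fill up from the nodes skipped earlier
                    if len(replicas) >= rf:
                        break
                    replicas.append(s)
    return replicas


def min_survivors(ring, rf, failed_racks):
    """Smallest number of replicas left for any ring position if these racks die."""
    return min(
        sum(1 for _, rack in pick_replicas(ring, i, rf) if rack not in failed_racks)
        for i in range(len(ring))
    )


if __name__ == "__main__":
    # 3 racks x 3 nodes = 9 nodes, matching the layout in the quoted email.
    ring = build_ring(["az-a", "az-b", "az-c"], 3, 16)
    print("RF=3 racks per range:",
          {len({r for _, r in pick_replicas(ring, i, 3)}) for i in range(len(ring))})
    print("RF=4 rack counts:",
          {tuple(sorted(Counter(r for _, r in pick_replicas(ring, i, 4)).values()))
           for i in range(len(ring))})
    print("RF=3, two racks lost, min replicas left:",
          min_survivors(ring, 3, {"az-a", "az-b"}))
```

Under this toy model, DC:3 with 3 racks puts one replica in each rack for every token range regardless of where the vnode tokens land, so a whole-rack failure behaves the same as in the no-vnodes case, and DC:4 with 3 racks always doubles up in exactly one rack.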