Modifying NTS in place would not be possible if it changes rack placement in a way that breaks existing clusters on upgrade. A strategy that changes placement like this would need a new name; a new strategy would be fine in trunk.
Logging a warning seems appropriate if RF > rack count. A discuss thread seems fine for this rather than a CEP to me.

— Scott

> On Mar 6, 2023, at 2:51 AM, Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
>
> Hi all,
>
> some time ago we identified an issue with NetworkTopologyStrategy. The problem is that when RF > number of racks, it may happen that NTS places replicas in such a way that when a whole rack is lost, we lose QUORUM and data are not available anymore if QUORUM CL is used.
>
> To illustrate this problem, let's have this setup:
>
> 9 nodes in 1 DC, 3 racks, 3 nodes per rack, RF = 5. Then NTS could place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in rack3. Hence, when rack1 is lost, we do not have QUORUM.
>
> It seems to us that there is already some logic around this scenario (1), but the implementation is not entirely correct: it does not compute the replica placement in a way that would address the above problem.
>
> We created a draft here (2, 3) which fixes it.
>
> There is also a test which simulates this scenario. When I assign 256 tokens to each node randomly (by the same means as the generatetokens command uses), compute natural replicas for 1 billion random tokens, and count the cases where 3 replicas out of 5 land in the same rack (so that losing it would lose QUORUM), for the above setup I get around 6%.
>
> For 12 nodes, 3 racks, 4 nodes per rack, RF = 5, this happens in 10% of cases.
>
> To interpret this number: with such a topology, RF and CL, when a random rack fails completely, a random read has a 6% chance (or 10%, respectively) that data will not be available.
>
> One caveat here is that NTS is not compatible with this new strategy anymore because it will place replicas differently. So I guess that fixing this in NTS will not be possible because of upgrades.
> I think people would need to set up a completely new keyspace and somehow migrate data if they wish, or they just start from scratch with this strategy.
>
> Questions:
>
> 1) Do you think this is meaningful to fix, and might it end up in trunk?
>
> 2) Should we not just ban this scenario entirely? It might be possible to check the configuration upon keyspace creation (rf > num of racks) and, if we see it is problematic, just fail that query. A guardrail, maybe?
>
> 3) People in the ticket mention writing a "CEP" for this, but I do not see any reason to do so. It is just a strategy like any other. What would that CEP even be about? Is this necessary?
>
> Regards
>
> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
> (2) https://github.com/apache/cassandra/pull/2191
> (3) https://issues.apache.org/jira/browse/CASSANDRA-16203
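For anyone skimming the thread, the quorum arithmetic behind the 9-node example can be sketched in a few lines of Java. This is a toy illustration only, not Cassandra's actual NTS placement code; the class and method names here are made up for the sketch. It shows why the skewed placement {3, 1, 1} loses QUORUM on a single rack failure while an even placement {2, 2, 1} does not:

```java
import java.util.Arrays;

// Toy model of rack-level quorum availability (not Cassandra source code).
// An int[] holds the number of replicas placed in each rack.
public class RackQuorumDemo {

    // QUORUM for a replication factor rf is floor(rf / 2) + 1.
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    // Replicas still reachable after losing the rack that holds the most replicas
    // (the worst case for a single-rack outage).
    static int survivorsAfterWorstRackLoss(int[] replicasPerRack) {
        int total = Arrays.stream(replicasPerRack).sum();
        int worst = Arrays.stream(replicasPerRack).max().orElse(0);
        return total - worst;
    }

    // True if QUORUM reads/writes can still succeed after losing any single rack.
    static boolean quorumSurvivesRackLoss(int[] replicasPerRack) {
        int rf = Arrays.stream(replicasPerRack).sum();
        return survivorsAfterWorstRackLoss(replicasPerRack) >= quorum(rf);
    }

    public static void main(String[] args) {
        int[] skewed = {3, 1, 1};  // placement NTS can produce when RF (5) > rack count (3)
        int[] even   = {2, 2, 1};  // placement a rack-aware strategy would choose
        System.out.println("quorum(5) = " + quorum(5));                           // 3
        System.out.println("skewed survives: " + quorumSurvivesRackLoss(skewed)); // false
        System.out.println("even survives:   " + quorumSurvivesRackLoss(even));   // true
    }
}
```

With the skewed placement, losing rack1 leaves only 2 of 5 replicas, below the quorum of 3; the even placement leaves at least 3 after any single rack loss, which is exactly the property the proposed strategy guarantees.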