Right, which is why I said we should make NTS do the right thing rather than throwing a warning. Doing the right thing, and not getting a warning, is the best behavior.
> On Mar 7, 2023, at 11:12 AM, Derek Chen-Becker <de...@chen-becker.org> wrote:
>
> I think that the warning would only be thrown in the case where a potentially QUORUM-busting configuration is used. I think it would be a worse experience to not warn and let the user discover later when they can't write at QUORUM.
>
> Cheers,
>
> Derek
>
> On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:
>> I agree with Paulo, it would be nice if we could figure out some way to make new NTS work correctly, with a parameter to fall back to the “bad” behavior, so that people restoring backups to a new cluster can get the right behavior to match their backups.
>> The problem with only fixing this in a new strategy is that we have a ton of tutorials and docs out there which tell people to use NTS, so it would be great if we could keep “use NTS” as the recommendation. Throwing a warning when someone uses NTS is kind of user hostile. If someone just read some tutorial or doc which told them “make your keyspace this way”, and the database then yells at them that they did it wrong, that is not a great experience.
>>
>> -Jeremiah
>>
>> > On Mar 7, 2023, at 10:16 AM, Benedict <bened...@apache.org> wrote:
>> >
>> > My view is that this is a pretty serious bug. I wonder if transactional metadata will make it possible to safely fix this for users without rebuilding (only via opt-in, of course).
>> >
>> >> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
>> >>
>> >> Thanks everybody for the feedback.
>> >>
>> >> I think that emitting a warning upon keyspace creation (and alteration) should be enough for starters. If somebody cannot live without a 100% bullet-proof solution, over time we might choose some approach from the offered ones; as the saying goes, there is no silver bullet. If we decide to implement that new strategy, we would probably emit warnings on NTS anyway, but that would already be done, so only the new strategy would need to be provided.
>> >>
>> >> ________________________________________
>> >> From: Paulo Motta <pauloricard...@gmail.com>
>> >> Sent: Monday, March 6, 2023 17:48
>> >> To: dev@cassandra.apache.org
>> >> Subject: Re: Degradation of availability when using NTS and RF > number of racks
>> >>
>> >> It's a bit unfortunate that NTS does not maintain the ability to lose a rack without loss of quorum for RF > #racks > 2, since this can easily be achieved by evenly placing replicas across all racks.
>> >>
>> >> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, can't we just use the new, correct placement logic for newly created keyspaces instead of having a new strategy?
>> >>
>> >> The placement logic would be backwards-compatible for RF <= #racks. On upgrade, we could mark existing keyspaces with RF > #racks with use_legacy_replica_placement=true to maintain backwards compatibility and log a warning that the rack loss guarantee is not maintained for keyspaces created before the fix.
>> >> Old keyspaces with RF <= #racks would still work with the new replica placement. The downside is that we would need to keep the old NTS logic around, or we could eventually deprecate it and require users to migrate keyspaces using the legacy placement strategy.
>> >>
>> >> Alternatively, we could have RackAwareTopologyStrategy and fail NTS keyspace creation for RF > #racks, directing users to use RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss, or to set an override flag "support_quorum_on_rack_loss=false". This feels a bit iffy though, since it could potentially confuse users about when to use each strategy.
>> >>
>> >> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
>> >> Hi all,
>> >>
>> >> some time ago we identified an issue with NetworkTopologyStrategy. The problem is that when RF > number of racks, NTS may place replicas in such a way that when a whole rack is lost, we lose QUORUM and data are no longer available if QUORUM CL is used.
>> >>
>> >> To illustrate this problem, let's take this setup:
>> >>
>> >> 9 nodes in 1 DC, 3 racks, 3 nodes per rack, RF = 5. NTS could then place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in rack3. Hence, when rack1 is lost, we do not have QUORUM.
>> >>
>> >> It seems to us that there is already some logic around this scenario (1), but the implementation is not entirely correct: it does not compute the replica placement in a way that addresses the above problem.
>> >>
>> >> We created a draft here (2, 3) which fixes it.
>> >>
>> >> There is also a test which simulates this scenario. When I assign 256 tokens to each node randomly (by the same means the generatetokens command uses), compute natural replicas for 1 billion random tokens, and count the cases where 3 replicas out of 5 are placed in the same rack (so that losing it would lose quorum), for the above setup I get around 6%.
>> >>
>> >> For 12 nodes, 3 racks, 4 nodes per rack, RF = 5, this happens in 10% of cases.
>> >>
>> >> To interpret this number: with such a topology, RF and CL, when a random rack fails completely, a random read has a 6% (or 10%, respectively) chance that the data will not be available.
>> >>
>> >> One caveat here is that NTS is not compatible with this new strategy because it will place replicas differently. So I guess that fixing this in NTS itself will not be possible because of upgrades. I think people would need to set up a completely new keyspace and somehow migrate data if they wish, or they just start from scratch with this strategy.
>> >>
>> >> Questions:
>> >>
>> >> 1) do you think this is meaningful to fix and it might end up in trunk?
>> >>
>> >> 2) should we not just ban this scenario entirely? It might be possible to check the configuration upon keyspace creation (rf > num of racks) and if we see it is problematic we would just fail that query. A guardrail, maybe?
>> >>
>> >> 3) people in the ticket mention writing a "CEP" for this but I do not see any reason to do so. It is just a strategy like any other. What would that CEP even be about? Is this necessary?
>> >>
>> >> Regards
>> >>
>> >> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
>> >> (2) https://github.com/apache/cassandra/pull/2191
>> >> (3) https://issues.apache.org/jira/browse/CASSANDRA-16203
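
To make the quorum arithmetic in the thread concrete, below is a minimal, self-contained Java sketch contrasting the skewed 3/1/1 placement NTS can currently produce with the even 2/2/1 placement Paulo describes. With RF = 5, QUORUM = 3, so the even split survives the loss of any single rack while the skewed one does not. The class and method names (EvenRackPlacementSketch, evenPlacement, etc.) are hypothetical illustrations, not the code from the linked PR.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EvenRackPlacementSketch
{
    /** Spread rf replicas round-robin across racks, so no rack gets more than ceil(rf / #racks). */
    static Map<String, Integer> evenPlacement(List<String> racks, int rf)
    {
        Map<String, Integer> perRack = new LinkedHashMap<>();
        for (String rack : racks)
            perRack.put(rack, 0);
        for (int i = 0; i < rf; i++)
            perRack.merge(racks.get(i % racks.size()), 1, Integer::sum);
        return perRack;
    }

    /** Replicas still reachable after losing the rack that holds the most replicas. */
    static int survivorsAfterWorstRackLoss(Map<String, Integer> perRack, int rf)
    {
        int worst = perRack.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        return rf - worst;
    }

    public static void main(String[] args)
    {
        int rf = 5;
        int quorum = rf / 2 + 1; // QUORUM for RF=5 is 3

        // Problematic placement NTS can produce today: 3 + 1 + 1
        Map<String, Integer> skewed = Map.of("rack1", 3, "rack2", 1, "rack3", 1);
        System.out.println("skewed placement survivors after worst rack loss: "
                           + survivorsAfterWorstRackLoss(skewed, rf) + " (quorum = " + quorum + ")");

        // Even placement: 2 + 2 + 1, so any single rack loss still leaves 3 replicas
        Map<String, Integer> even = evenPlacement(List.of("rack1", "rack2", "rack3"), rf);
        System.out.println("even placement: " + even + ", survivors after worst rack loss: "
                           + survivorsAfterWorstRackLoss(even, rf));
    }
}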
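The kind of simulation Stefan describes can be sketched along the following lines. This is a toy model only: 256 random tokens per node, a simplified ring walk that greedily accepts up to RF − #racks rack "repeats" (a rough approximation of the behaviour discussed in the thread, not the actual NetworkTopologyStrategy code), and a count of how often at least 3 of the 5 replicas land in a single rack. The class name and trial count are made up, and the printed percentage is illustrative; it will not exactly reproduce the 6% / 10% figures above.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

public class RackQuorumLossSimulation
{
    record Node(int id, String rack) {}

    public static void main(String[] args)
    {
        int racks = 3, nodesPerRack = 3, rf = 5, vnodes = 256;
        int trials = 100_000; // kept small so the sketch runs quickly; the thread's test used 1 billion
        Random random = new Random(42);

        // Build the token ring: 256 random tokens per node.
        TreeMap<Long, Node> ring = new TreeMap<>();
        for (int r = 0; r < racks; r++)
            for (int n = 0; n < nodesPerRack; n++)
            {
                Node node = new Node(r * nodesPerRack + n, "rack" + r);
                for (int v = 0; v < vnodes; v++)
                    ring.put(random.nextLong(), node);
            }

        int quorumLoss = 0;
        for (int t = 0; t < trials; t++)
        {
            List<Node> replicas = place(ring, random.nextLong(), rf, racks);
            Map<String, Integer> perRack = new HashMap<>();
            for (Node node : replicas)
                perRack.merge(node.rack(), 1, Integer::sum);
            // RF=5 => QUORUM=3: losing a rack holding >= 3 replicas loses quorum.
            if (perRack.values().stream().anyMatch(c -> c >= 3))
                quorumLoss++;
        }
        System.out.printf("quorum lost on worst single-rack failure: %.2f%%%n",
                          100.0 * quorumLoss / trials);
    }

    /** Walk the ring from the token, taking distinct nodes in ring order. A node in an
     *  already-used rack is still accepted while fewer than (rf - #racks) such repeats
     *  have been taken - a rough model of the greedy behaviour described in the thread. */
    static List<Node> place(TreeMap<Long, Node> ring, long token, int rf, int rackCount)
    {
        List<Node> replicas = new ArrayList<>();
        Set<Integer> seenNodes = new HashSet<>();
        Set<String> racksUsed = new HashSet<>();
        int acceptableRackRepeats = rf - rackCount;

        for (Node node : iterateFrom(ring, token))
        {
            if (replicas.size() == rf)
                break;
            if (!seenNodes.add(node.id()))
                continue; // another vnode of a node we already considered
            if (racksUsed.add(node.rack()))
            {
                replicas.add(node); // first replica in this rack
            }
            else if (acceptableRackRepeats > 0)
            {
                acceptableRackRepeats--;
                replicas.add(node); // rack repeat accepted greedily
            }
        }
        return replicas;
    }

    /** Ring entries starting at the first token >= the given one, wrapping around. */
    static List<Node> iterateFrom(TreeMap<Long, Node> ring, long token)
    {
        List<Node> order = new ArrayList<>(ring.tailMap(token).values());
        order.addAll(ring.headMap(token).values());
        return order;
    }
}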