I think it's worth exploring where the disconnect is just a *bit* more, though I agree with you, Benedict, that there appears to be a clear consensus, so the thread has evolved more into talking about our principles than about what to do in this specific scenario.
Mick - this patch doesn't fix things 100%. It can't. BUT - it does take us from "in all cases where this occurs you will silently lose data" to "in some cases where this occurs you will have a rejected write, in others you'll have coordinator-level logging, and in the worst split-brain case you can still have data loss (split-brain ownership where the quorum agrees on incorrect state)". (There's a toy sketch at the end of this mail illustrating that split-brain case.)

Whether we're fixing 100% of the data loss or 10%, at the end of the day my position is that we should fix all the data loss we can (this statement is specific to gossip's architectural limitations) *even if that means loss of availability in edge cases where it is not strictly required*.

I think where we might be misaligned:

> It can prevent data mislocation in some scenarios but it offers no guarantees
> about that, and can also degrade availability unnecessarily.

It *can't* prevent mislocation in all scenarios. That's a shortcoming of Gossip's architecture. If the availability degradation is part and parcel of the mitigation, and the mitigation approach is the *best worst option* in an eventually consistent topology-sync environment, *it's not unnecessary*. It's the price we pay for mitigating a flaw in the architecture as best we can.

An analogy: if we have 1% data loss, and the only way to mitigate it is to take a 1% hit in availability instead, the former is clearly worse for our users than the latter. Recovering from data loss is multiple orders of magnitude harder than retrying a query that failed due to unavailability.

On Fri, Sep 13, 2024, at 8:15 AM, Benedict wrote:
>
> I think everyone has made their case, the costs and benefits are fairly well understood, and there appears to be a strong quorum that favours an approach that prioritises avoiding data loss.
>
> So, I propose we either recognise that this is the clear position of the community, or move to a vote to formalise it one way or the other.
>
>
>> On 13 Sep 2024, at 13:02, Alex Petrov <al...@coffeenco.de> wrote:
>>
>> I agree with folks saying that we absolutely need to reject misplaced writes. It may not preclude the coordinator making a local write, or making a write to a local replica, but even reducing the probability of a misplaced write being shown as a success to the client is a substantial win.
>>
>> On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:
>>> Replies below (to Scott, Josh and Jeremiah).
>>>
>>> tl;dr all four of my points remain undisputed when the patch is applied. This is a messy situation, but there's no denying the value of rejecting writes in various known popular scenarios. Point (1) remains important to highlight IMHO.
>>>
>>>
>>> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>> Since that time, I've received several urgent messages from major users of Apache Cassandra and even customers of Cassandra ecosystem vendors asking about this bug. Some were able to verify the presence of lost data in SSTables on nodes where it didn't belong, demonstrate empty read responses for data that is known proof-positive to exist (think content-addressable stores), or reproduce this behavior in a local cluster after forcing disagreement.
>>>>
>>>
>>>
>>> Having been privy to the background of those "urgent messages" I can say the information you received wasn't correct (or complete).
>>>
>>> My challenge on this thread is about understanding where this might unexpectedly bite users, which should be part of our due diligence when applying such patches to stable branches. I ask you to run through my four points, which AFAIK still stand true.
>>>
>>>
>>>> But I **don't** think it's healthy for us to repeatedly re-litigate whether data loss is acceptable based on how long it's been around, or how frequently some of us on the project have observed some given phenomenon.
>>>
>>>
>>> Josh, that's true, but talking through these things helps open up the discussion; see my point above wrt being aware of second-hand evidence that was inaccurate.
>>>
>>>
>>>> The severity and frequency of this issue combined with the business risk to Apache Cassandra users changed my mind about fixing it in earlier branches despite TCM having been merged to fix it for good on trunk.
>>>>
>>>
>>>
>>> That shouldn't prevent us from investigating known edge cases, collateral damage, and unexpected behavioural changes in patch versions.
>>>
>>>
>>>>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>
>>>>>> 1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possible mislocated data is still required.
>>>>>
>>>>> All nodes which reject a write prevent mislocated data. There is still the possibility of some node having the same wrong view of the ring as the coordinator (including if they are the same node) accepting data. Unless there are multiple nodes with the same wrong view of the ring, data loss is prevented for CL > ONE.
>>>
>>>
>>> (1) stands true, for all CLs. I think this is pretty important here.
>>>
>>> With write rejection enabled, we can tell people it may have prevented a lot of data mislocation and is of great benefit and safety, but there's no guarantee that it has prevented all data mislocation. If an operator encounters writes rejected in this manner, they must still go investigate a possible data-loss situation.
>>>
>>> We are aware of our own situations where we have been hit by this, and they come in a number of variants, but we can't speak to every situation users will find themselves in. We're making a trade-off here of reduced availability against more forceful alerting and an alleviation of data mislocation.
>>>
>>>
>>>>>> 2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.
>>>>>
>>>>> Without this patch no one is aware of any issues at all. Maybe you are referring to a situation where the patch is applied, but the default behavior is to still accept the "bad" data? In that case yes, turning on rejection makes it "louder" in that your queries can fail if too many nodes are wrong.
>>>
>>> (2) stands true. Rejecting is a louder alert, but it is not complete; see the next point. (All four points are made with the patch applied.)
>>>
>>>
>>>>>> 3. Rejecting writes does not capture all places where the problem is occurring. Only logging/metrics fully captures everywhere the problem is occurring.
>>>>>
>>>>> Not sure what you are saying here.
>>>
>>> Rejected writes can be swallowed by a coordinator sending background writes to other nodes when it has already ack'd the response to the client.
>>> If the operator wants a complete and accurate overview of out-of-range writes, they have to look at the logs/metrics.
>>>
>>> (3) stands true.
>>>
>>>
>>>>>> 4. … nodes can be rejecting writes when they are in fact correct, hence causing "over-eager unavailability".
>>>>>
>>>>> When would this occur? I guess when the node with the bad ring information is a replica sent data from a coordinator with the correct ring state? There would be no "unavailability" here unless there were multiple nodes in such a state. I also again would not call this over-eager, because the node with the bad ring state is f'ed up and needs to be fixed. So it being considered unavailable doesn't seem over-eager to me.
>>>
>>> This fails in a quorum write. And the node need not be f'ed up, just delayed in its view.
>>>
>>> (4) stands true.
>>>
>>>
>>>>> Given the fact that a user can read NEWS.txt and turn off this rejection of writes, I see no reason not to err on the side of "the setting which gives better protection even if it is not perfect". We should not let the desire to solve everything prevent incremental improvements, especially when we actually do have the solution coming in TCM.
>>>
>>>
>>> Yes, this is what I'm aiming at: to be truthful that it's a best effort at alerting and at alleviating some very serious scenarios. It can prevent data mislocation in some scenarios but it offers no guarantees about that, and can also degrade availability unnecessarily. Through production experience from a number of large cluster operators we know it does significantly and importantly improve the consistency and durability of data, but ultimately an operator finding themselves in this situation must still assume possible eventual data loss and investigate accordingly. Due diligence and accuracy.
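
PS: to make the split-brain caveat above concrete, here's a toy sketch. The node names, token ranges, and function names are all hypothetical - this is not the patch or Cassandra internals - it just models why replica-side rejection catches a coordinator with a stale ring view but cannot catch the case where the replica shares that stale view.

```python
# Toy model only: hypothetical nodes and contiguous token ranges, not Cassandra code.
RING_CORRECT = {"n1": (0, 100), "n2": (100, 200), "n3": (200, 300)}
RING_STALE   = {"n1": (0, 150), "n2": (150, 200), "n3": (200, 300)}  # n1's old range, pre-move

def owns(ring, node, token):
    lo, hi = ring[node]
    return lo <= token < hi

def replica_write(node, token, node_view, reject_out_of_range):
    # A replica accepts a write only if, per its *own* ring view, it owns the token.
    if reject_out_of_range and not owns(node_view, node, token):
        return "REJECTED"   # surfaced to the coordinator; the client may see a failed write
    return "ACCEPTED"       # mislocated data if the replica's view is wrong

# Token 120 truly belongs to n2, but a coordinator with the stale view sends it to n1.
# Case A: n1 has the correct view -> it rejects, so the client sees a failure
# instead of silently mislocated data.
print(replica_write("n1", 120, RING_CORRECT, reject_out_of_range=True))  # REJECTED

# Case B: split brain -- n1 shares the coordinator's stale view, believes it owns 120,
# and accepts. Rejection can't catch this, which is why investigation is still required.
print(replica_write("n1", 120, RING_STALE, reject_out_of_range=True))    # ACCEPTED (mislocated)
```

With rejection off, both cases silently accept; with it on, case A becomes a visible failure instead of lost data - which is exactly the trade I'm arguing is worth making.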