I think it's worth exploring where the disconnect is just a *bit* more, though I agree with you, Benedict, that there appears to be a clear consensus, so the thread has evolved more into talking about our principles than about what to do in this specific scenario.
Mick - this patch doesn't fix things 100%. It can't. BUT - it does take us from "in all cases where this occurs you will silently lose data" to "in some cases where this occurs you will have a rejected write, in others you'll have coordinator-level logging, and in the worst split-brain case you can still have data loss (split-brain ownership where the quorum agrees on incorrect state)". (There's a toy sketch at the end of this mail illustrating that split-brain case.)

Whether we're fixing 100% of the data loss or 10%, at the end of the day my position is that we should fix all the data loss we can (this statement is specific to gossip's architectural limitations) *even if that means loss of availability in edge cases where it is not strictly required*.

I think where we might be misaligned:

> It can prevent data mislocation in some scenarios but it offers no guarantees
> about that, and can also degrade availability unnecessarily.

It *can't* prevent mislocation in all scenarios. That's a shortcoming of Gossip's architecture. If the availability degradation is part and parcel of the mitigation, and the mitigation approach is the *best worst option* in an eventually consistent topology-sync environment, *it's not unnecessary*. It's the price we pay for mitigating a flaw in the architecture as best we can.

An analogy: if we have 1% data loss, and the only way to mitigate it is to take a 1% hit in availability instead, the former is clearly worse for our users than the latter. Recovering from data loss is multiple orders of magnitude harder than retrying a query that failed due to unavailability.

On Fri, Sep 13, 2024, at 8:15 AM, Benedict wrote:
>
> I think everyone has made their case, the costs and benefits are fairly well understood, and there appears to be a strong quorum that favours an approach that prioritises avoiding data loss.
>
> So, I propose we either recognise that this is the clear position of the community, or move to a vote to formalise it one way or the other.
>
>
>> On 13 Sep 2024, at 13:02, Alex Petrov <al...@coffeenco.de> wrote:
>>
>> I agree with folks saying that we absolutely need to reject misplaced writes. It may not preclude the coordinator making a local write, or making a write to a local replica, but even reducing the probability of a misplaced write being shown as a success to the client is a substantial win.
>>
>> On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:
>>> Replies below (to Scott, Josh and Jeremiah).
>>>
>>> tl;dr all four of my points remain undisputed when the patch is applied. This is a messy situation, but there's no denying the value of rejecting writes in various known popular scenarios. Point (1) remains important to highlight IMHO.
>>>
>>>
>>> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>> Since that time, I've received several urgent messages from major users of Apache Cassandra and even customers of Cassandra ecosystem vendors asking about this bug. Some were able to verify the presence of lost data in SSTables on nodes where it didn't belong, demonstrate empty read responses for data that is known proof-positive to exist (think content-addressable stores), or reproduce this behavior in a local cluster after forcing disagreement.
>>>>
>>>
>>>
>>> Having been privy to the background of those "urgent messages" I can say the information you received wasn't correct (or complete).
>>>
>>> My challenge on this thread is about understanding where this might unexpectedly bite users, which should be part of our due diligence when applying such patches to stable branches. I ask you to run through my four points, which AFAIK still stand true.
>>>
>>>
>>>> But I **don't** think it's healthy for us to repeatedly re-litigate whether data loss is acceptable based on how long it's been around, or how frequently some of us on the project have observed some given phenomenon.
>>>
>>>
>>> Josh, that's true, but talking through these things helps open up the discussion; see my point above wrt being aware of second-hand evidence that was inaccurate.
>>>
>>>
>>>> The severity and frequency of this issue combined with the business risk to Apache Cassandra users changed my mind about fixing it in earlier branches despite TCM having been merged to fix it for good on trunk.
>>>>
>>>
>>>
>>> That shouldn't prevent us from investigating known edge cases, collateral damage, and unexpected behavioural changes in patch versions.
>>>
>>>
>>>>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>
>>>>>> 1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possible mislocated data is still required.
>>>>>
>>>>> All nodes which reject a write prevent mislocated data. There is still the possibility of some node having the same wrong view of the ring as the coordinator (including if they are the same node) accepting data. Unless there are multiple nodes with the same wrong view of the ring, data loss is prevented for CL > ONE.
>>>
>>>
>>> (1) stands true, for all CLs. I think this is pretty important here.
>>>
>>> With write rejection enabled, we can tell people it may have prevented a lot of data mislocation and is of great benefit and safety, but there's no guarantee that it has prevented all data mislocation. If an operator encounters writes rejected in this manner, they must still go investigate a possible data-loss situation.
>>>
>>> We are aware of our own situations where we have been hit by this, and they come in a number of variants, but we can't speak to every situation users will find themselves in. We're making a trade-off here of reduced availability against more forceful alerting and an alleviation of data mislocation.
>>>
>>>
>>>>>> 2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.
>>>>>
>>>>> Without this patch no one is aware of any issues at all. Maybe you are referring to a situation where the patch is applied, but the default behavior is to still accept the "bad" data? In that case yes, turning on rejection makes it "louder" in that your queries can fail if too many nodes are wrong.
>>>
>>> (2) stands true. Rejecting is a louder alert, but it is not complete; see the next point. (All four points are made with the patch applied.)
>>>
>>>
>>>>>> 3. Rejecting writes does not capture all places where the problem is occurring. Only logging/metrics fully captures everywhere the problem is occurring.
>>>>>
>>>>> Not sure what you are saying here.
>>>
>>> Rejected writes can be swallowed by a coordinator sending background writes to other nodes when it has already ack'd the response to the client.
>>> If the operator wants a complete and accurate overview of out-of-range writes, they have to look at the logs/metrics.
>>>
>>> (3) stands true.
>>>
>>>
>>>>>> 4. … nodes can be rejecting writes when they are in fact correct, hence causing "over-eager unavailability".
>>>>>
>>>>> When would this occur? I guess when the node with the bad ring information is a replica sent data from a coordinator with the correct ring state? There would be no "unavailability" here unless there were multiple nodes in such a state. I also again would not call this over-eager, because the node with the bad ring state is f'ed up and needs to be fixed. So it being considered unavailable doesn't seem over-eager to me.
>>>
>>> This fails in a quorum write. And the node need not be f'ed up, just delayed in its view.
>>>
>>> (4) stands true.
>>>
>>>
>>>>> Given the fact that a user can read NEWS.txt and turn off this rejection of writes, I see no reason not to err on the side of "the setting which gives better protection even if it is not perfect". We should not let the desire to solve everything prevent incremental improvements, especially when we actually do have the solution coming in TCM.
>>>
>>>
>>> Yes, this is what I'm aiming at: to be truthful that it's a best effort at alerting and at alleviating some very serious scenarios. It can prevent data mislocation in some scenarios but it offers no guarantees about that, and can also degrade availability unnecessarily. Through production experience from a number of large cluster operators we know it does significantly and importantly improve the consistency and durability of data, but ultimately an operator finding themselves in this situation must still assume possible eventual data loss and investigate accordingly. Due diligence and accuracy.
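
PS: to make the split-brain caveat above concrete, here's a toy sketch. The node names, token ranges, and function names are all hypothetical - this is not the patch or Cassandra internals - it just models why replica-side rejection catches a coordinator with a stale ring view but cannot catch the case where the replica shares that stale view.

```python
# Toy model only: hypothetical nodes and contiguous token ranges, not Cassandra code.
RING_CORRECT = {"n1": (0, 100), "n2": (100, 200), "n3": (200, 300)}
RING_STALE   = {"n1": (0, 150), "n2": (150, 200), "n3": (200, 300)}  # n1's old range, pre-move

def owns(ring, node, token):
    lo, hi = ring[node]
    return lo <= token < hi

def replica_write(node, token, node_view, reject_out_of_range):
    # A replica accepts a write only if, per its *own* ring view, it owns the token.
    if reject_out_of_range and not owns(node_view, node, token):
        return "REJECTED"   # surfaced to the coordinator; the client may see a failed write
    return "ACCEPTED"       # mislocated data if the replica's view is wrong

# Token 120 truly belongs to n2, but a coordinator with the stale view sends it to n1.
# Case A: n1 has the correct view -> it rejects, so the client sees a failure
# instead of silently mislocated data.
print(replica_write("n1", 120, RING_CORRECT, reject_out_of_range=True))  # REJECTED

# Case B: split brain -- n1 shares the coordinator's stale view, believes it owns 120,
# and accepts. Rejection can't catch this, which is why investigation is still required.
print(replica_write("n1", 120, RING_STALE, reject_out_of_range=True))    # ACCEPTED (mislocated)
```

With rejection off, both cases silently accept; with it on, case A becomes a visible failure instead of lost data - which is exactly the trade I'm arguing is worth making.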