My 2c are below –

We have a patch that prevents a known data loss issue. People may or may
not know they're suffering from this issue, so it should go into all
supported versions of Cassandra, enabled by default. Will this cause
issues for operators? Sure. Is it worth keeping this feature off to avoid
issues for operators? No. We can mitigate any upgrade-related issues by
putting a warning in the release notes.
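
Concretely, I'd expect the NEWS.txt entry to point operators at the yaml
toggles the patch introduces, something along these lines (the option
names below are from memory and purely illustrative; the exact names and
defaults are whatever the committed patch uses):

  # illustrative cassandra.yaml excerpt - names/defaults per the final patch
  log_out_of_token_range_requests: true
  reject_out_of_token_range_requests: true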

I'm +1 on this patch landing in all supported branches of Cassandra and
the feature being on by default, with adequate warnings for the operator.

Thanks,

Dinesh



On Thu, Sep 12, 2024 at 12:56 PM Josh McKenzie <jmcken...@apache.org> wrote:

> I'd like to propose we treat all data loss bugs as "fix by default on all
> supported branches even if that might introduce user-facing changes".
>
> Even if only N of M people on a thread have experienced it.
> Even if we only uncover it through testing (looking at you Harry).
>
> My gut tells me this is something we should have a clear cultural value
> system around as a project, and that value system should be "Above all
> else, we don't lose data". Just because users aren't aware it might be
> happening doesn't mean it's not a *massive* problem.
>
> I would bet good money that there are *a lot* of user-felt pains using
> this project that we're all unfortunately insulated from.
>
> On Thu, Sep 12, 2024, at 3:35 PM, Mick Semb Wever wrote:
>
> Great that the discussion explores the issue as well.
>
> So far we've heard of three* companies being impacted, and four times in
> total…?  Info here is helpful.
>
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon, I'm
> assuming the company you refer to doesn't overlap.  JD, we know it had
> nothing to do with range movements and could/should have been prevented
> far more simply with operational correctness/checks.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened?  Either this was CL.ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> WRT the remedy, is it not to either run repair (when 1+ replica has the
> data), or to load the flushed and recompacted sstables (from the period in
> question) onto their correct nodes?  This is not difficult, but it
> understandably costs lost sleep and time.
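>
> Concretely, remediation might look something like the below (the host,
> keyspace, table and path names are placeholders, not from any real
> incident):
>
>   # when at least one replica still has the data: repair the affected table
>   nodetool repair my_keyspace my_table
>
>   # otherwise: stream the mislocated sstables from the wrongly-written node
>   # back onto the correct replicas
>   sstableloader -d correct_replica_host /path/to/data/my_keyspace/my_table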
>
> I don't feel either of the above two points is that material to the
> outcome, but I think they help keep the discussion on track and
> informative.  We also know there are many competent operators out there
> who do detect data loss.
>
>
>
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe <calebrackli...@gmail.com>
> wrote:
>
> If we don’t reject by default, but only log by default, my fear is that
> we’ll simply be alerting the operator to something that has already gone
> very wrong and that they may not be in any position to ever address.
>
> On Sep 12, 2024, at 12:44 PM, Jordan West <jw...@apache.org> wrote:
>
>
> I’m +1 on enabling rejection by default on all branches. We have been
> bitten by silent data loss (due to other bugs, like the schema issues in
> 4.1) from the lack of rejection on several occasions, and short of writing
> extremely specialized tooling it’s unrecoverable. While both lack of
> availability and data loss are critical, I will always pick lack of
> availability over data loss. It’s better to fail a write that would be
> lost than to silently lose it.
>
> Of course, a change like this requires very good communication in NEWS.txt
> and elsewhere, but I think it’s well worth it. While it may surprise some
> users, I think they would be more surprised to learn they were silently
> losing data.
>
> Jordan
>
> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever <m...@apache.org> wrote:
>
> Thanks for starting the thread, Caleb - it is a big and impactful patch.
>
> I appreciate the criticality; in a new major release, rejection by default
> is obvious.  Otherwise, the logging and metrics are an important addition
> to help users validate the existence and degree of any problem.
>
> Also worth mentioning that rejecting writes can cause degraded
> availability in situations that pose no problem.  This is a coordination
> problem in a probabilistic design, so it's choose your evil: unnecessary
> degraded availability, or mislocated data (eventual data loss).  Logging
> and metrics make alerting on and handling the data mislocation possible,
> i.e. they avoid data loss via manual intervention.  (Logging and metrics
> also face the same false-positive problem.)
>
> I'm +0 on rejection by default in 5.0.1, and +1 on logging-only by default
> in 4.x.
>
>
> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa <jji...@gmail.com> wrote:
>
> This patch is so hard for me.
>
> The safety it adds is critical and should have been added a decade ago.
> Also it’s a huge patch, and touches “everything”.
>
> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>
> 4.0 / 4.1 - if we treat this as a fix for a latent opportunity for data
> loss (which it implicitly is), I guess?
>
>
>
> > On Sep 12, 2024, at 9:46 AM, Brandon Williams <dri...@gmail.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> > <calebrackli...@gmail.com> wrote:
> >>
> >> Are you opposed to the patch in its entirety, or just rejecting unsafe
> >> operations by default?
> >
> > I had the latter in mind.  Changing any default in a patch release is
> > a potential surprise for operators and one of this nature especially
> > so.
> >
> > Kind Regards,
> > Brandon
>
>
>
