Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread guo Maxwell
Congratulations! On Fri, Sep 13, 2024 at 08:11, James Hartig wrote: > Thanks everyone! Excited to contribute. > > On Thu, Sep 12, 2024, at 4:59 PM, Francisco Guerrero wrote: > > Congratulations! > > On 2024/09/12 11:39:40 Mick Semb Wever wrote: > > The PMC's members are pleased to announce that Chris Bannister

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread guo Maxwell
Sorry for sending a stray email. Please ignore the one above; my mistake cannot be undone. For this DISCUSS, I am personally +1 on enabling rejection by default. We did something similar to address this issue. The default behavior may change, but what we solve is the corre

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread guo Maxwell
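+1 to enabling rejection by default on all branches. We did something similar to address this issue. The default behavior may change, but what we are solving is the correctness of data storage, which I think should be the most important thing for a database, so other things may matter less. On Fri, Sep 13, 2024 at 09:34, Josh McKenzie wrote: > Even when the fix is only partial, so really it's more about more forcefully alerting the operator to the problem via over-eager unavailability …? > > Sometimes a principled stance can take us away from the important details in the discussions. > > My understanding of the ticket (having not dug deeply into the code, just looked at the JIRA > and this thread) is that this is the most effective solution we can find in a non-deterministic, non-epoch-based, non-transactional metadata system. i.e. > Gos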

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
> Even when the fix is only partial, so really it's more about more forcefully > alerting the operator to the problem via over-eager unavailability …? > > Sometimes a principled stance can take us away from the important details in > the discussions. My understanding of the ticket (having not d

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread C. Scott Andreas
Thanks all for discussion on this. It’s hard to describe the sinking feeling that hit me when it became clear to me how common this problem is - and how horribly difficult it is to prove one has encountered this bug. Two years ago, my understanding was that this is an exceptionally rare and transient

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread James Hartig
Thanks everyone! Excited to contribute. On Thu, Sep 12, 2024, at 4:59 PM, Francisco Guerrero wrote: > Congratulations! > > On 2024/09/12 11:39:40 Mick Semb Wever wrote: > > The PMC's members are pleased to announce that Chris Bannister, James > > Hartig, Jackson Flemming and João Reis have accept

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeremiah Jordan
> > 1. Rejecting writes does not prevent data loss in this situation. It only > reduces it. The investigation and remediation of possible mislocated data > is still required. > All nodes which reject a write prevent mislocated data. There is still the possibility of some node having the same wr

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
I'm less concerned with what the defaults are in each branch, and more the accuracy of what we say, e.g. in NEWS.txt This is my understanding so far, and where I hoped to be corrected. 1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and re

Re: Welcome Jordan West and Stefan Miklosovic as Cassandra PMC members!

2024-09-12 Thread Francisco Guerrero
Great news! Well deserved and congratulations to both of you! On 2024/08/30 20:18:43 Jon Haddad wrote: > The PMC's members are pleased to announce that Jordan West and Stefan > Miklosovic have accepted invitations to become PMC members. > > Thanks a lot, Jordan and Stefan, for everything you have

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Francisco Guerrero
Congratulations! On 2024/09/12 11:39:40 Mick Semb Wever wrote: > The PMC's members are pleased to announce that Chris Bannister, James > Hartig, Jackson Flemming and João Reis have accepted invitations to become > committers on the Drivers subproject. > > Thanks a lot for everything you have done

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Jordan West
Congrats, welcome! On Thu, Sep 12, 2024 at 13:16 Dinesh Joshi wrote: > Congratulations, everyone! > > On Thu, Sep 12, 2024 at 4:40 AM Mick Semb Wever wrote: > >> The PMC's members are pleased to announce that Chris Bannister, James >> Hartig, Jackson Flemming and João Reis have accepted invitat

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeremiah Jordan
> > JD we know it had nothing to do with range movements and could/should have > been prevented far simpler with operational correctness/checks. > “Be better” is not the answer. Also I think you are confusing our incidents, the out of range token issue we saw was not because of an operational “oop

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
To clarify my response: We didn’t hit a bug “like it”. We hit a bug that resulted in an improper view of the ring (on my phone so can’t dig up the JIRA but it was a ring issue introduced in 4.1.0 and fixed in 4.1.4 iirc). There have been several bugs of this form in the past. So it wasn’t that we

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread David Capwell
> if we are counting on users to read NEWS.txt, can we not count on them to > enable rejection if this is important to them? I think we can make the inverse statement… if accepting data loss is a tradeoff they want then disabling is there for them? So we could default for safety and let you op

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
reply below On Thu, 12 Sept 2024 at 21:56, Josh McKenzie wrote: > I'd like to propose we treat all data loss bugs as "fix by default on all > supported branches even if that might introduce user-facing changes". > > Even if only N of M people on a thread have experienced it. > Even if we only un

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Dinesh Joshi
Congratulations, everyone! On Thu, Sep 12, 2024 at 4:40 AM Mick Semb Wever wrote: > The PMC's members are pleased to announce that Chris Bannister, James > Hartig, Jackson Flemming and João Reis have accepted invitations to > become committers on the Drivers subproject. > > Thanks a lot for ever

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Dinesh Joshi
My 2c are below – We have a patch that is preventing a known data loss issue. People may or may not know they're suffering from this issue so this should go in all supported versions of Cassandra with it enabled by default. Will this cause issues for operators? Sure. Is it worth keeping this featu

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
I'd like to propose we treat all data loss bugs as "fix by default on all supported branches even if that might introduce user-facing changes". Even if only N of M people on a thread have experienced it. Even if we only uncover it through testing (looking at you Harry). My gut tells me this is s

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Ekaterina Dimitrova
I agree we should be precise in the docs after people share their opinion and experience in this thread and the ticket work gets settled. Thank you Caleb for opening it! It is important On Thu, 12 Sep 2024 at 15:49, Mick Semb Wever wrote: > Yes, and my usage of CL.*ONE wasn't so correct (as writ

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread J. D. Jordan
Welcome to the project! > On Sep 12, 2024, at 2:42 PM, Brandon Williams wrote: > > Congratulations! > > Kind Regards, > Brandon > >> On Thu, Sep 12, 2024 at 6:40 AM Mick Semb Wever wrote: >> >> The PMC's members are pleased to announce that Chris Bannister, James >> Hartig, Jackson Flemmin

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Yes, and my usage of CL.*ONE wasn't so correct (as writes are backgrounded), but… The point is we should be accurate and precise in talking about this (folk will come back and read this thread), both in how it can manifest, the limitations of both logging and rejection, what possible remedies are,

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Brandon Williams
Congratulations! Kind Regards, Brandon On Thu, Sep 12, 2024 at 6:40 AM Mick Semb Wever wrote: > > The PMC's members are pleased to announce that Chris Bannister, James Hartig, > Jackson Flemming and João Reis have accepted invitations to become committers > on the Drivers subproject. > > Thank

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
First, any violation of consistency should be treated as data loss, because we can’t tell what people are doing as a result of the missing data downstream (it may trigger an action outside of the database that is unrecoverable). Second, if you have a coordinator with a broken snitch configured, it ma

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Jon Haddad
Congratulations!! On Thu, Sep 12, 2024 at 12:11 PM Ekaterina Dimitrova wrote: > Congratulations! 👏🏻 > > On Thu, 12 Sep 2024 at 12:43, Patrick McFadin wrote: > >> And the gocql users everywhere are high-fiving and taking the rest of the >> day off to celebrate. Congrats everyone! >> >> On Thu,

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Great that the discussion explores the issue as well. So far we've heard three* companies being impacted, and four times in total…? Info is helpful here. *) Jordan, you say you've been hit by _other_ bugs _like_ it. Jon, I'm assuming the company you refer to doesn't overlap. JD we know it had no

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
On Sep 12, 2024, at 12:22 PM, J. D. Jordan wrote: I have lost sleep (and data) over this multiple times in the past few months, that was only recently tracked down to this exact scenario. +1 for including it in all active releases and enabling the failure of the writes on “wrong” nodes by default. I

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread J. D. Jordan
I have lost sleep (and data) over this multiple times in the past few months, that was only recently tracked down to this exact scenario. +1 for including it in all active releases and enabling the failure of the writes on “wrong” nodes by default. I haven’t looked at the patch, but as long as only o

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Ekaterina Dimitrova
Congratulations! 👏🏻 On Thu, 12 Sep 2024 at 12:43, Patrick McFadin wrote: > And the gocql users everywhere are high-fiving and taking the rest of the > day off to celebrate. Congrats everyone! > > On Thu, Sep 12, 2024 at 8:18 AM Bernardo Botella < > conta...@bernardobotella.com> wrote: > >> It is

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
I think folks not losing sleep over this are only in that position because they don’t know it’s happening. Like Brandon said, ignorance is bliss (but it’s a false bliss). Very few users do the work necessary to detect data loss outside the obvious paths. I agree with Caleb, if we log and give them

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
We aren’t counting on users to read NEWS.txt. That’s the point. We’re saying we’re going to make things safer, as they should always have been, and if someone out there has tooling that somehow allows them to avoid the risks, they can disable rejection. > On Sep 12, 2024, at 1:21 PM, Brandon Wi

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Doug Rohrer
+1 on rejection-by-default, for several reasons: 1) Jordan’s point on the fact that recovery from this kind of data misplacement is very difficult. 2) Without any sort of warning or error in existing Cassandra installations, how many operators/users would actually know that they have been hit by

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe wrote: > > I think I can count at least 4 people on this thread who literally have lost > sleep over this. Probably good examples of not being the majority though, heh. If we are counting on users to read NEWS.txt, can we not count on them to enab

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Abe Ratnofsky
Expressing another vote in favor of rejection-by-default. If a user doesn't want to lose sleep for data loss while on-call, they can read NEWS.txt and disable rejection.

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Cheng Wang via dev
> If we don’t reject by default, but log by default, my fear is that we’ll simply be alerting the operator to something that has already gone very wrong that they may not be in any position to ever address. Yes, logging and alerting is not enough here. We have seen the same issue before that we hav

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
I think I can count at least 4 people on this thread who literally have lost sleep over this. > On Sep 12, 2024, at 1:07 PM, Brandon Williams wrote: > > On Thu, Sep 12, 2024 at 11:52 AM Josh McKenzie > wrote: >> >> More or less surprising than learning that they've been at risk of or >> ac

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jon Haddad
I have worked with teams that have lost weeks of sleep, and customers, due to these issues. It cost a Fortune 500 company that I work with millions in revenue. I think "ignorance is bliss" may also apply to us, if we are unaware of the number of times C* users have been bit by this issue but not ra

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:52 AM Josh McKenzie wrote: > > More or less surprising than learning that they've been at risk of or > actively losing data for years? Ignorance is bliss though and I think the majority of users haven't lost sleep over this since it's been present since the beginning o

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
If we don’t reject by default, but log by default, my fear is that we’ll simply be alerting the operator to something that has already gone very wrong that they may not be in any position to ever address. On Sep 12, 2024, at 12:44 PM, Jordan West wrote: I’m +1 on enabling rejection by default on al

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Cheng Wang via dev
I am +1 with enabling rejection by default. We had encountered similar situations before that we lost data in silence, which made us create a patch to trade availability with data loss. While I agree that it might be a surprise to operators, I think it's worth having good communication in the NEWS.

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
I’m +1 on enabling rejection by default on all branches. We have been bit by silent data loss (due to other bugs like the schema issues in 4.1) from lack of rejection on several occasions and, short of writing extremely specialized tooling, it's unrecoverable. While both lack of availability and data

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Thanks for starting the thread Caleb, it is a big and impacting patch. Appreciate the criticality, in a new major release rejection by default is obvious. Otherwise the logging and metrics is an important addition to help users validate the existence and degree of any problem. Also worth mentio

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss > (which it implicitly is), I guess? If we have known data loss scenarios we should fix them in all supported branches even if that fix can potentially modify user-facing behavior. We should definitely try and priorit

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
This patch is so hard for me. The safety it adds is critical and should have been added a decade ago. Also it’s a huge patch, and touches “everything”. It definitely belongs in 5.0. I’d probably reject by default in 5.0.1. 4.0 / 4.1 - if we treat this like a fix for latent opportunity for da

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
More or less surprising than learning that they've been at risk of or actively losing data for years? On Thu, Sep 12, 2024, at 12:46 PM, Brandon Williams wrote: > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe > wrote: > > > > Are you opposed to the patch in its entirety, or just rejecting uns

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
Potentially losing thousands of records while cluster metadata is changing is also a surprise, and one that comes with no explanation. Which is worse? I know that CASSANDRA-12126 was about correctness vs performance, not about correctness vs

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe wrote: > > Are you opposed to the patch in its entirety, or just rejecting unsafe > operations by default? I had the latter in mind. Changing any default in a patch release is a potential surprise for operators and one of this nature especially s

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Patrick McFadin
And the gocql users everywhere are high-fiving and taking the rest of the day off to celebrate. Congrats everyone! On Thu, Sep 12, 2024 at 8:18 AM Bernardo Botella < conta...@bernardobotella.com> wrote: > It is great to see the project growing like this. Congratulations!! > > On Sep 12, 2024, at

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
> I don't think this should be done in a patch release. Are you opposed to the patch in its entirety, or just rejecting unsafe operations by default? On Thu, Sep 12, 2024 at 11:37 AM Chris Lohfink wrote: > While the code touches quite a few places the change itself is > pretty innocuous but is

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Chris Lohfink
While the code touches quite a few places the change itself is pretty innocuous but is massively impactful in bad scenarios. I am in favor of this patch myself as this protects the database from data loss that occurs in many different ways. An example I have seen recently (in 4.1) is when using GPF

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:07 AM Caleb Rackliffe wrote: > > The one consequence of that we might discuss here is that if gossip is behind > in notifying a node with a pending range, local rejection as it receives > writes for that range may cause a small issue of availability. I don't think thi

[DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
Until we release TCM, it will continue to be possible for nodes to have a divergent view of the ring, and this means operations can still be sent to the wrong nodes. For example, writes may be sent to nodes that do not and never will own that data, and this opens us up to rather devious silent data

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Bernardo Botella
It is great to see the project growing like this. Congratulations!! > On Sep 12, 2024, at 6:27 AM, Tolbert, Andy wrote: > > Congratulations everyone! 🎉 > > On Thu, Sep 12, 2024 at 6:41 AM Mick Semb Wever > wrote: >> The PMC's members are pleased to announce that Chris B

Re: Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Tolbert, Andy
Congratulations everyone! 🎉 On Thu, Sep 12, 2024 at 6:41 AM Mick Semb Wever wrote: > The PMC's members are pleased to announce that Chris Bannister, James > Hartig, Jackson Flemming and João Reis have accepted invitations to > become committers on the Drivers subproject. > > Thanks a lot for eve

Welcome Chris Bannister, James Hartig, Jackson Flemming and João Reis, as cassandra-gocql-driver committers

2024-09-12 Thread Mick Semb Wever
The PMC's members are pleased to announce that Chris Bannister, James Hartig, Jackson Flemming and João Reis have accepted invitations to become committers on the Drivers subproject. Thanks a lot for everything you have done with the gocql driver all these years. We are very excited to see the dr