Re: [DISCUSS] Gossip Protocol Change

2024-05-16 Thread Cheng Wang via dev
Hi Cameron, I just want to make sure I understood correctly. So your observation was that at the end of the shadow round, some nodes have an empty endpointStateMap? But reading the code at GossipDigestSynVerbHandler::createShadowReply, it seems that the receiving node will reply with a full stateMap? return

Re: [DISCUSS] Gossip Protocol Change

2024-05-16 Thread David Capwell
> Sounds like the request was to hit the pause button until TCM merged rather than skipping the work entirely so that's promising.
Correct, I was only asked to wait a few days and to rebase after TCM merged. The issue was that I had to time-box this work and the fact it hit issues kinda beca

Re: [DISCUSS] Gossip Protocol Change

2024-05-16 Thread Josh McKenzie
I'm +1 to continuing work on CASSANDRA-18917 for all the reasons Jordan listed. Sounds like the request was to hit the pause button until TCM merged rather than skipping the work entirely so that's promising. On Thu, May 16, 2024, at 1:43 PM, Jon Haddad wrote: > I have also recently worked with

Re: [DISCUSS] Gossip Protocol Change

2024-05-16 Thread Jon Haddad
I have also recently worked with a team that lost critical data as a result of gossip issues combined with a collision in our token allocation. I haven’t filed a JIRA yet as it slipped my mind, but I’ve seen it in my own testing as well. I’ll get a JIRA in describing it in detail. It’s severe enough

Re: [DISCUSS] Gossip Protocol Change

2024-05-16 Thread Jordan West
I’m a big +1 on 18917 or more testing of gossip. While I appreciate that it makes TCM more complicated, gossip and schema propagation bugs have been the source of our two worst data loss events in the last 3 years. Data loss should immediately cause us to evaluate what we can do better. We will li

Re: [DISCUSS] Gossip Protocol Change

2024-05-13 Thread David Capwell
So, I created https://issues.apache.org/jira/browse/CASSANDRA-18917 which lets you do deterministic gossip simulation testing across large clusters within seconds… I stopped this work as it conflicted with TCM (they were trying to merge that week) and it hit issues where some nodes never converge

[DISCUSS] Gossip Protocol Change

2024-05-13 Thread Zemek, Cameron via dev
While looking into CASSANDRA-19580 I noticed something that raises a question. Gossip SYN handling doesn't check for missing digests. If the digest list is empty, as in the shadow round, it will add everything from endpointStateMap to the reply. But why not include missing entries in normal replies? The branching for rep
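
To make the question above concrete, here is a minimal Python model of the behaviour as described in this message. It is not Cassandra's code: `build_syn_reply`, the `include_missing` flag, and the `(generation, version)` tuples are hypothetical simplifications, and the real handler also sends digests back to request state it lacks, which this sketch leaves out.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified model: endpoint -> (generation, version).
EndpointStateMap = Dict[str, Tuple[int, int]]
Digest = Tuple[str, int, int]  # (endpoint, generation, max_version)


def build_syn_reply(local: EndpointStateMap, digests: List[Digest],
                    include_missing: bool = False) -> EndpointStateMap:
    """Return the endpoint states a receiver would send back for a SYN."""
    # Shadow round: an empty digest list means the sender knows nothing,
    # so reply with everything we have.
    if not digests:
        return dict(local)

    reply: EndpointStateMap = {}
    seen = set()
    for endpoint, generation, version in digests:
        seen.add(endpoint)
        state = local.get(endpoint)
        # Send our state only when it is newer than what the sender advertised.
        if state is not None and state > (generation, version):
            reply[endpoint] = state

    # The variation the message asks about: also send endpoints the sender
    # did not list at all. Off by default to mirror the described behaviour.
    if include_missing:
        for endpoint, state in local.items():
            if endpoint not in seen:
                reply[endpoint] = state
    return reply
```

With `include_missing=False`, a sender that lists only some endpoints never learns about the others from this reply. Turning the flag on models the alternative raised in the question.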