Re: [DISCUSS] Gossip Protocol Change

David Capwell Mon, 13 May 2024 14:38:10 -0700

So, I created https://issues.apache.org/jira/browse/CASSANDRA-18917 which lets 
you do deterministic gossip simulation testing across large clusters within 
seconds… I stopped this work as it conflicted with TCM (they were trying to 
merge that week) and it hit issues where some nodes never converged… I didn’t 
have time to debug so I had to drop the patch…


This type of change would be a good reason to resurrect that patch as testing 
gossip is super dangerous right now… its behavior lives only in a few people's 
heads, and even then it's just bits and pieces scattered across multiple people 
(and likely missing pieces)… 

My brain is far too fried right now to say whether your idea is safe or not, but 
I honestly feel that we would need to improve our tests (we have 0) before making 
such a change… 

I do welcome the patch though...

> On May 12, 2024, at 8:05 PM, Zemek, Cameron via dev 
> <dev@cassandra.apache.org> wrote:
> 
> In looking into CASSANDRA-19580 I noticed something that raises a question. 
> When handling a Gossip SYN, the receiver doesn't check for missing digests. If 
> the digest list is empty (shadow round), it adds everything from 
> endpointStateMap to the reply. But why not include missing entries in normal 
> replies as well? The branching for reply handling of SYN requests could then 
> be merged into a single code path (though the shadow round handles empty state 
> differently since CASSANDRA-16213). A potential downside is the performance 
> impact, as this requires doing a set difference.
> 
> For example, something along the lines of:
> 
> ```
>         Set<InetAddressAndPort> missing = new HashSet<>(endpointStateMap.keySet());
>         missing.removeAll(gDigestList.stream()
>                                      .map(GossipDigest::getEndpoint)
>                                      .collect(Collectors.toSet()));
>         for (InetAddressAndPort endpoint : missing)
>         {
>             gDigestList.add(new GossipDigest(endpoint, 0, 0));
>         }
> ```
> 
> It seems odd to me that after the shadow round a new node has an 
> endpointStateMap with only itself as an entry. The only way it then gets the 
> gossip state is by another node choosing to send the new node a gossip SYN, 
> and that choice is random. Yes, this happens every second, so eventually 
> it's going to receive one (outside the issue of CASSANDRA-19580, where it 
> doesn't if it's in a dead state like hibernate), but doesn't this open up 
> bootstrapping to failures on very large clusters, as it can take longer before 
> it's sent a SYN (as the odds of being chosen for SYN get lower)? For years 
> I have been seeing bootstrap failures with 'Unable to contact any seeds', but 
> they are infrequent and I have never been able to figure out how to reproduce 
> them in order to open a ticket. I wonder if some of them have been due to not 
> receiving a SYN message before it does the seenAnySeed check.
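
For anyone who wants to poke at the set-difference idea outside the Cassandra 
tree, here is a minimal standalone sketch. Everything in it is an assumption 
made for illustration: Digest is a hypothetical stand-in for GossipDigest, 
endpoints are plain strings instead of InetAddressAndPort, and 
withMissingDigests is a made-up helper name. Unlike the quoted snippet it 
returns a padded copy rather than mutating the incoming list, and it is not 
the actual Gossiper code path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissingDigestSketch
{
    // Hypothetical stand-in for GossipDigest: endpoint, generation, max version.
    static final class Digest
    {
        final String endpoint;
        final int generation;
        final int maxVersion;

        Digest(String endpoint, int generation, int maxVersion)
        {
            this.endpoint = endpoint;
            this.generation = generation;
            this.maxVersion = maxVersion;
        }
    }

    // Returns the received digests plus a (generation 0, version 0) digest for
    // every locally known endpoint that the sender did not mention.
    static List<Digest> withMissingDigests(Set<String> knownEndpoints, List<Digest> received)
    {
        // Set difference: known endpoints minus the endpoints named in the SYN.
        Set<String> missing = new HashSet<>(knownEndpoints);
        for (Digest digest : received)
            missing.remove(digest.endpoint);

        List<Digest> padded = new ArrayList<>(received);
        for (String endpoint : missing)
            padded.add(new Digest(endpoint, 0, 0));
        return padded;
    }

    public static void main(String[] args)
    {
        // A receiver that knows three endpoints gets a SYN naming only one of them.
        Set<String> known = new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        List<Digest> received = new ArrayList<>();
        received.add(new Digest("10.0.0.1", 5, 12));

        for (Digest digest : withMissingDigests(known, received))
            System.out.println(digest.endpoint + " " + digest.generation + ":" + digest.maxVersion);
    }
}
```

The extra cost per SYN is one copy of the known-endpoint set plus one removal 
per received digest, which is the set-difference overhead the quoted message 
mentions.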
