[ https://issues.apache.org/jira/browse/CASSANDRA-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Capwell updated CASSANDRA-20659:
--------------------------------------
    Attachment: ci_summary-cassandra-4.0-7e04a922c3203c0970e220643b117f5e3f1f8f5f.html
                ci_summary-cassandra-4.1-8cfc452b9a77e89ad06563cfdf25f150af524d9c.html
                ci_summary-cassandra-5.0-c2ad8e703375af6e7848c8a52592cb3df5f7a7b3.html
                ci_summary-trunk-f2fdc52c5b8c900b350ff5f4c81dd8a33df7530b.html
                result_details-cassandra-4.0-7e04a922c3203c0970e220643b117f5e3f1f8f5f.tar.gz
                result_details-cassandra-5.0-c2ad8e703375af6e7848c8a52592cb3df5f7a7b3.tar.gz
                result_details-trunk-f2fdc52c5b8c900b350ff5f4c81dd8a33df7530b.tar.gz

> Gossip doesn't converge due to race condition when updating EndpointStates 
> multiple fields
> ------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20659
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20659
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0.18, 4.1.10, 5.0.5, 5.1
>
>         Attachments: 
> ci_summary-cassandra-4.0-7e04a922c3203c0970e220643b117f5e3f1f8f5f.html, 
> ci_summary-cassandra-4.1-8cfc452b9a77e89ad06563cfdf25f150af524d9c.html, 
> ci_summary-cassandra-5.0-c2ad8e703375af6e7848c8a52592cb3df5f7a7b3.html, 
> ci_summary-trunk-f2fdc52c5b8c900b350ff5f4c81dd8a33df7530b.html, 
> result_details-cassandra-4.0-7e04a922c3203c0970e220643b117f5e3f1f8f5f.tar.gz, 
> result_details-cassandra-5.0-c2ad8e703375af6e7848c8a52592cb3df5f7a7b3.tar.gz, 
> result_details-trunk-f2fdc52c5b8c900b350ff5f4c81dd8a33df7530b.tar.gz
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> During shrinks or token moves, the cluster can get into a state where some 
> of the nodes never converge and never see the latest STATUS state for the 
> changed peers.
> In testing this it was found that:
> 1) org.apache.cassandra.gms.Gossiper#applyStateLocally expects to run in a 
> single thread, so it doesn't take any locks
> 2) org.apache.cassandra.gms.Gossiper.GossipTask runs in another thread and 
> uses a taskLock to avoid sending partial state
> 3) org.apache.cassandra.gms.Gossiper#applyNewStates gets called when the 
> generation matches, and tries to apply the state sequentially.
> The theory (and test) is:
> 1) localState.setHeartBeatState(remoteState.getHeartBeatState()); runs
> 2) something (gossip or paxos) reads the state
> 3) localState.addApplicationStates(updatedStates); updates the state
> The "something" in step 2 sends around the heartbeat, which causes others to 
> see a higher max version, so the delta logic won't see the mutations done in 
> step 3.
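
The interleaving described above can be sketched as a small, deterministic
simulation. This is only an illustration under stated assumptions, not
Cassandra's code: GossipRaceSketch, EndpointView, Versioned, gossipRound and
statusSeenByC are hypothetical names, the version numbers are made up, and the
"delta" rule is reduced to "send only what is newer than the receiver's max
seen version". Node B holds a view of some peer and gossips it to node C. The
applyAtomically flag only models "no reader runs between step 1 and step 3";
it is not a claim about how the actual patch works.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified model of gossip versioning; not Cassandra's classes. */
public class GossipRaceSketch {

    /** An application state value tagged with the version it was written at. */
    static final class Versioned {
        final String value;
        final int version;
        Versioned(String value, int version) { this.value = value; this.version = version; }
    }

    /** One node's view of a peer: a heartbeat version plus versioned app states. */
    static final class EndpointView {
        int heartbeatVersion;
        final Map<String, Versioned> appStates = new TreeMap<>();

        EndpointView(int heartbeatVersion, String status, int statusVersion) {
            this.heartbeatVersion = heartbeatVersion;
            appStates.put("STATUS", new Versioned(status, statusVersion));
        }

        /** Highest version across the heartbeat and every application state. */
        int maxVersion() {
            int max = heartbeatVersion;
            for (Versioned v : appStates.values()) max = Math.max(max, v.version);
            return max;
        }
    }

    /** Sender ships only what is strictly newer than the receiver's max seen version. */
    static void gossipRound(EndpointView sender, EndpointView receiver) {
        int seen = receiver.maxVersion();
        for (Map.Entry<String, Versioned> e : sender.appStates.entrySet())
            if (e.getValue().version > seen) receiver.appStates.put(e.getKey(), e.getValue());
        if (sender.heartbeatVersion > seen) receiver.heartbeatVersion = sender.heartbeatVersion;
    }

    /** Returns the STATUS that node C ends up with for the changed peer. */
    public static String statusSeenByC(boolean applyAtomically) {
        EndpointView b = new EndpointView(6, "NORMAL", 5); // B's view of the peer
        EndpointView c = new EndpointView(6, "NORMAL", 5); // C's view of the peer

        b.heartbeatVersion = 10;                              // step 1: heartbeat applied
        if (!applyAtomically) gossipRound(b, c);              // step 2: a reader sees partial state
        b.appStates.put("STATUS", new Versioned("LEFT", 9));  // step 3: app state applied

        gossipRound(b, c); // later rounds; in the racy case these never deliver STATUS
        gossipRound(b, c);
        return c.appStates.get("STATUS").value;
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + statusSeenByC(false));
        System.out.println("atomic: " + statusSeenByC(true));
    }
}
```

In the racy run, C records max version 10 from the heartbeat alone, so the
STATUS written at version 9 is never newer than what C has "seen" and is
filtered out of every later delta. When both updates become visible together,
the first round carries the STATUS change.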



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

