[ 
https://issues.apache.org/jira/browse/CASSANDRA-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773642#comment-17773642
 ] 

Brandon Williams commented on CASSANDRA-18913:
----------------------------------------------

||Branch||CI||
|[4.0|https://github.com/driftx/cassandra/tree/CASSANDRA-18913-4.0]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/1329/workflows/9615e832-31da-4751-b8ed-859f3752925d], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1329/workflows/e3a4dc3a-0208-45e1-b0b6-67ccf1c9e37d]|
|[4.1|https://github.com/driftx/cassandra/tree/CASSANDRA-18913-4.1]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/1328/workflows/6620e30c-3462-4f55-96b6-1e5ab7d0d91d], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1328/workflows/e9be2633-010b-4d04-b2e7-4353c6ec028e]|
|[5.0|https://github.com/driftx/cassandra/tree/CASSANDRA-18913-5.0]|[j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1330/workflows/8a0a955a-8033-4c19-848c-4ce0e55c0b51], [j17|https://app.circleci.com/pipelines/github/driftx/cassandra/1330/workflows/7ce35fa6-4421-4fb6-8879-e34400c0778e]|

> Gossip NPE due to shutdown event corrupting empty statuses
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-18913
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18913
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When an instance either disables gossip or shuts down, it sends a gossip 
> shutdown message. Peers ignore the message if the endpoint isn’t known; 
> otherwise each peer mutates its local copy of the state to mark the endpoint 
> as shut down.
> When an instance restarts, it populates gossip with the endpoints found in 
> peers, but the state of each endpoint is empty (not null).
> This leads to a timing bug:
> * stop node1
> * start node1; at this point all previously known endpoints exist in gossip 
> but are empty
> * shut down node2 (gossip shutdown or node shutdown, it doesn’t matter)
> * node1 sees the shutdown before any gossip messages, and gets corrupted
> * node3 tries to join the cluster and fails because node1 is corrupted
> The NPE can occur in 2 different patterns; in this example node1 
> and node3 will have different stack traces:
> {code}
> org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
>       Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
>               at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
>               at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
>               at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
>               at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
>               at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
>               at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
>               at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>       Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
>               at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
>               at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
>               at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
>               at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
>               at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
>               at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
>               at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
>               at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
>               at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
>               at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
> {code}
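
For illustration only, the race in the quoted description can be modelled with a minimal, self-contained Java sketch. Everything in it is hypothetical: the class name, the map-based state, and both handler methods are simplified stand-ins, not Cassandra's Gossiper code and not necessarily the fix taken in the linked branches. It assumes that a restarted node registers saved endpoints with an empty state, and that applying a shutdown to such a placeholder leaves STATUS set without HOST_ID, which is the shape of the EndpointState printed in the traces.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified model of the race; not Cassandra's Gossiper.
class GossipShutdownSketch {
    // endpoint -> (application state name -> value)
    static final Map<String, Map<String, String>> endpointStates = new HashMap<>();

    // After a restart, previously known peers are registered with an empty (not null) state.
    static void registerSavedEndpoint(String endpoint) {
        endpointStates.put(endpoint, new HashMap<>());
    }

    // Stand-in for the failing lookup: HOST_ID is required but absent in a placeholder state.
    static String getHostId(String endpoint) {
        Map<String, String> state = endpointStates.get(endpoint);
        return Objects.requireNonNull(state.get("HOST_ID"),
                "Unable to get HOST_ID; HOST_ID is not defined");
    }

    // Unguarded handling: only unknown endpoints are ignored, so an empty
    // placeholder is mutated and then handed to code that needs HOST_ID.
    static void markAsShutdownUnguarded(String endpoint) {
        Map<String, String> state = endpointStates.get(endpoint);
        if (state == null)
            return; // unknown endpoint: ignore the shutdown message
        state.put("STATUS_WITH_PORT", "shutdown");
        state.put("STATUS", "shutdown");
        state.put("RPC_READY", "false");
        getHostId(endpoint); // throws NullPointerException; the state stays corrupted
    }

    // One possible guard (an assumption, not the committed patch): treat an
    // empty placeholder like an unknown endpoint and leave it untouched.
    static boolean markAsShutdownGuarded(String endpoint) {
        Map<String, String> state = endpointStates.get(endpoint);
        if (state == null || state.isEmpty())
            return false; // nothing real to mark as shut down yet
        state.put("STATUS_WITH_PORT", "shutdown");
        state.put("STATUS", "shutdown");
        state.put("RPC_READY", "false");
        return true;
    }

    public static void main(String[] args) {
        registerSavedEndpoint("node2");
        try {
            markAsShutdownUnguarded("node2");
        } catch (NullPointerException e) {
            System.out.println("unguarded: " + e.getMessage()
                    + ", state = " + endpointStates.get("node2"));
        }

        registerSavedEndpoint("node2"); // reset to a fresh empty placeholder
        System.out.println("guarded applied: " + markAsShutdownGuarded("node2")
                + ", state = " + endpointStates.get("node2"));
    }
}
```

In this model the unguarded path leaves a state that carries a shutdown status but no HOST_ID, so any later reader that requires HOST_ID fails as well. That would be consistent with node1 and node3 failing through different call paths, though the sketch does not reproduce the real code paths.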



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

