[ 
https://issues.apache.org/jira/browse/CASSANDRA-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Han updated CASSANDRA-18897:
-------------------------------
    Description: 
When I start up cassandra 3.11.16 cluster (Fresh Start, 3 nodes), I noticed the 
following NPE in the system log (Seed node).
{code:java}
INFO  [HANDSHAKE-/192.168.77.3] 2023-10-02 12:56:43,180 
OutboundTcpConnection.java:561 - Handshaking version with /192.168.77.3
ERROR [GossipStage:1] 2023-10-02 12:56:43,185 CassandraDaemon.java:244 - 
Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:998)
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:993)
        at 
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2632)
        at 
org.apache.cassandra.service.StorageService.onChange(StorageService.java:2203)
        at 
org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3020)
        at 
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1246)
        at 
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1358)
        at 
org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
        at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
        at java.lang.Thread.run(Thread.java:750) {code}
 
h1. Reproduce

It's non-deterministic and manifest for 0.15% of cases (6 out of 4000 run).
h1. Root Cause

 
{code:java}
public UUID getHostId(InetAddress endpoint, Map<InetAddress, EndpointState> 
epStates)
{
    return 
UUID.fromString(epStates.get(endpoint).getApplicationState(ApplicationState.HOST_ID).value);
} {code}
It happens at a {*}fresh start{*}, so my assumption is that StorageService is 
trying to get the epState of a specific endpoint before the state is created. 
But I am not sure how it's caused exactly by thread interleaving. I assume this 
problem also exists in 4.x version since the logic for 
StorageService.handleStateNormal is almost the same.

Consequences: This NPE won't prevent the cluster from starting up/executing 
commands. It's probably benign.

I have attached the logs of my cluster. (system.log contains logs from the seed 
node, and persistent.tar.gz contains logs from all three nodes and the 
configuration).

 

 

 

  was:
When I start up cassandra 3.11.16 cluster (Fresh Start), I noticed the 
following NPE in the system log (Seed node).

 
{code:java}
INFO  [HANDSHAKE-/192.168.77.3] 2023-10-02 12:56:43,180 
OutboundTcpConnection.java:561 - Handshaking version with /192.168.77.3
ERROR [GossipStage:1] 2023-10-02 12:56:43,185 CassandraDaemon.java:244 - 
Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:998)
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:993)
        at 
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2632)
        at 
org.apache.cassandra.service.StorageService.onChange(StorageService.java:2203)
        at 
org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3020)
        at 
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1246)
        at 
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1358)
        at 
org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
        at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
        at java.lang.Thread.run(Thread.java:750) {code}
 
h1. Reproduce

It's non-deterministic and manifest for 0.15% of cases (6 out of 4000 run).
h1. Root Cause

 
{code:java}
public UUID getHostId(InetAddress endpoint, Map<InetAddress, EndpointState> 
epStates)
{
    return 
UUID.fromString(epStates.get(endpoint).getApplicationState(ApplicationState.HOST_ID).value);
} {code}
It happens at a {*}fresh start{*}, so my assumption is that StorageService is 
trying to get the epState of a specific endpoint before the state is created. 
But I am not sure how it's caused exactly by thread interleaving. I assume this 
problem also exists in 4.x version since the logic for 
StorageService.handleStateNormal is almost the same.

Consequences: This NPE won't prevent the cluster from starting up/executing 
commands. It's probably benign.

I have attached the logs of my cluster. (system.log contains logs from the seed 
node, and persistent.tar.gz contains logs from all three nodes and the 
configuration).

 

 

 


> NPE in getHostId when fresh starting up the 3.11.16 cluster
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-18897
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18897
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Ke Han
>            Priority: Normal
>         Attachments: persistent.tar.gz, system.log
>
>
> When I start up cassandra 3.11.16 cluster (Fresh Start, 3 nodes), I noticed 
> the following NPE in the system log (Seed node).
> {code:java}
> INFO  [HANDSHAKE-/192.168.77.3] 2023-10-02 12:56:43,180 
> OutboundTcpConnection.java:561 - Handshaking version with /192.168.77.3
> ERROR [GossipStage:1] 2023-10-02 12:56:43,185 CassandraDaemon.java:244 - 
> Exception in thread Thread[GossipStage:1,5,main]
> java.lang.NullPointerException: null
>         at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:998)
>         at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:993)
>         at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2632)
>         at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:2203)
>         at 
> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3020)
>         at 
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1246)
>         at 
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1358)
>         at 
> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
>         at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
>         at java.lang.Thread.run(Thread.java:750) {code}
>  
> h1. Reproduce
> It's non-deterministic and manifest for 0.15% of cases (6 out of 4000 run).
> h1. Root Cause
>  
> {code:java}
> public UUID getHostId(InetAddress endpoint, Map<InetAddress, EndpointState> 
> epStates)
> {
>     return 
> UUID.fromString(epStates.get(endpoint).getApplicationState(ApplicationState.HOST_ID).value);
> } {code}
> It happens at a {*}fresh start{*}, so my assumption is that StorageService is 
> trying to get the epState of a specific endpoint before the state is created. 
> But I am not sure how it's caused exactly by thread interleaving. I assume 
> this problem also exists in 4.x version since the logic for 
> StorageService.handleStateNormal is almost the same.
> Consequences: This NPE won't prevent the cluster from starting up/executing 
> commands. It's probably benign.
> I have attached the logs of my cluster. (system.log contains logs from the 
> seed node, and persistent.tar.gz contains logs from all three nodes and the 
> configuration).
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to