Re: Rolling upgrade from 1.1.12 to 1.2.5 visibility issue

aaron morton Thu, 20 Jun 2013 02:50:30 -0700

I once had something like this, looking at your logs I donot think it's the 
same thing but here is a post on it 
http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition/


It's a little different in 1.2 but the GossipDigestAckVerbHandler (and ACK2) 
should be calling Gossiper.instance.notifyFailureDetector which will result in 
the FailureDetector being called. This will keep the remote node marked as up. 
it looks like this is happening. 


>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 
> 189) PHI for /54.254.xxx.xxx : 8.05616263930532

The default phi_convict_threshold is 8, so this node thinks the other is just 
sick enough to be marked as down. 

As a work around try increasing the phi_convict_threshold to 12. Not sure why 
the 1.2 node thinks this, not sure if anything has changed. 

I used to think there was a way to dump the phi values for nodes, but I cannot 
find it. If you call dumpInterArrivalTimes on the 
org.apache.cassandra.net:type=FailureDetector MBean it will dump a file in the 
temp dir called "failuredetector-*" with the arrival times for messages from 
the other nodes. That may help. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/06/2013, at 8:34 PM, Polytron Feng <liqpolyt...@gmail.com> wrote:

> 
> Hi, 
> 
> We are trying to roll upgrade from 1.0.12 to 1.2.5, but we found that the 
> 1.2.5 node cannot see other old nodes. 
> Therefore, we tried to upgrade to 1.1.12 first, and it works.
> However, we still saw the same issue when rolling upgrade from 1.1.12 to 
> 1.2.5. 
> This seems to be the fixed issue as 
> https://issues.apache.org/jira/browse/CASSANDRA-5332 but we still saw it in 
> 1.2.5.
> 
> Enviroment:
>    OS: CentOS 6
>    JDK: 6u31
>    cluster:3 nodes for testing, in EC2
>    Snitch: Ec2MultiRegionSnitch
>    NetworkTopologyStrategy: strategy_options = { ap-southeast:3 } 
> 
> We have 3 nodes and we upgraded 122.248.xxx.xxx to 1.2.5 first, the other 2 
> nodes are still 1.1.12.
> When we restarted the upgraded node, it will see the other 2 old nodes as UP 
> in the log. 
> However, after a few seconds, these 2 nodes will be marked as DOWN.
> This is the ring info from 1.2.5 node - 122.248.xxx.xxx 
> 
>     Note: Ownership information does not include topology; for complete 
> information, specify a keyspace
> 
>     Datacenter: ap-southeast
>     ==========
>     Address          Rack        Status State   Load            Owns          
>       Token
>                                                                               
>       113427455640312821154458202477256070486
>     122.248.xxx.xxx  1b          Up     Normal  69.74 GB        33.33%        
>       1
>     54.251.xxx.xxx   1b          Down   Normal  69.77 GB        33.33%        
>       56713727820156410577229101238628035243
>     54.254.xxx.xxx   1b          Down   Normal  70.28 GB        33.33%        
>       113427455640312821154458202477256070486
> 
> 
> but Old 1.1.12 nodes can see new node:
> 
>     Note: Ownership information does not include topology, please specify a 
> keyspace.
>     Address         DC          Rack        Status State   Load            
> Owns                Token
>                                                                               
>                  113427455640312821154458202477256070486
>     122.248.xxx.xxx ap-southeast1b          Up     Normal  69.74 GB        
> 33.33%              1
>     54.251.xxx.xxx  ap-southeast1b          Up     Normal  69.77 GB        
> 33.33%              56713727820156410577229101238628035243
>     54.254.xxx.xxx  ap-southeast1b          Up     Normal  70.28 GB        
> 33.33%              113427455640312821154458202477256070486
> 
> 
> We enabled trace log level to check gossip related logs. The log below from 
> 1.2.5 node shows that the 
> other 2 nodes are UP in the beginning. They seem to complete SYN/ACK/ACK2 
> handshake cycle.
> 
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,047 
> GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage 
> from /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,047 
> GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : 
> /54.254.xxx.xxx:1371617084:10967 /54.251.    xxx.xxx:1371625851:2055 
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,048 Gossiper.java (line 945) 
> requestAll for /54.254.xxx.xxx
>     .....
>     
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,080 
> GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to 
> /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,080 MessagingService.java (line 
> 601) /122.248.216.142 sending GOSSIP_DIGEST_ACK to 19@/54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,080 
> GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage 
> from /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,080 
> GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : 
> /54.254.xxx.xxx:1371617084:10978 /54.251.    xxx.xxx:1371625851:2066 
>     
>     .....
>     
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,083 
> GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to 
> /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,083 MessagingService.java (line 
> 601) /122.248.216.142 sending GOSSIP_DIGEST_ACK to 22@/54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,084 
> GossipDigestAck2VerbHandler.java (line 38) Received a GossipDigestAck2Message 
> from /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,084 FailureDetector.java (line 
> 168) reporting /54.254.xxx.xxx
>      INFO [GossipStage:1] 2013-06-19 07:44:43,093 Gossiper.java (line 805) 
> Node /54.254.xxx.xxx is now part of the cluster
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,094 Gossiper.java (line 808) 
> Adding endpoint state for /54.254.xxx.xxx
>     DEBUG [GossipStage:1] 2013-06-19 07:44:43,094 MessagingService.java (line 
> 377) Resetting pool for /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:43,094 Gossiper.java (line 764) 
> marking as alive /54.254.xxx.xxx
>     DEBUG [GossipStage:1] 2013-06-19 07:44:43,095 Gossiper.java (line 770) 
> removing expire time for endpoint : /54.254.xxx.xxx
>      INFO [GossipStage:1] 2013-06-19 07:44:43,095 Gossiper.java (line 771) 
> InetAddress /54.254.xxx.xxx is now UP
> 
> 
> 
> After a few seconds, the log shows that the old nodes did not respond ACK2 
> and are marked as DOWN:
> 
>     TRACE [GossipStage:1] 2013-06-19 07:44:52,121 
> GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage 
> from /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:52,123 
> GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : 
> /54.254.xxx.xxx:1371617084:10991 /54.251.    xxx.xxx:1371625851:2078 
>     TRACE [GossipStage:1] 2013-06-19 07:44:52,124 
> GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to 
> /54.254.xxx.xxx
>     TRACE [GossipStage:1] 2013-06-19 07:44:52,124 MessagingService.java (line 
> 601) /122.248.xxx.xxx sending GOSSIP_DIGEST_ACK to 52@/54.254.xxx.xxx
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 116) My 
> heartbeat is now 25
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 350) 
> Gossip Digests are : /54.251.xxx.xxx:1371625851:2069 /122.248.xxx.    
> xxx:1371627882:25 /54.254.xxx.xxx:1371617084:10980 
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 505) 
> Sending a GossipDigestSyn to /54.254.xxx.xxx ...
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 MessagingService.java (line 
> 601) /122.248.xxx.xxx sending GOSSIP_DIGEST_SYN to 54@/54.254.xxx.xxx
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 Gossiper.java (line 154) 
> Performing status check ...
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 
> 189) PHI for /54.254.xxx.xxx : 8.05616263930532
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 
> 193) notifying listeners that /54.254.xxx.xxx is down
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 
> 194) intervals: 500.0 mean: 500.0
>     TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 Gossiper.java (line 781) 
> marking as down /54.254.xxx.xxx
>      INFO [GossipTasks:1] 2013-06-19 07:44:52,360 Gossiper.java (line 785) 
> InetAddress /54.254.xxx.xxx is now DOWN
>     DEBUG [GossipTasks:1] 2013-06-19 07:44:52,360 MessagingService.java (line 
> 377) Resetting pool for /54.254.xxx.xxx
> 
> 
> We check that old node try to send SYN to 1.2.5 node but does not receive any 
> GossipDigestAckMessage from it.
> 
> Any ideas how to solve this issue?
> 
> Thank you!
>

Re: Rolling upgrade from 1.1.12 to 1.2.5 visibility issue

Reply via email to