And by that last statement, I mean: given the information in my response, are 
there any further things I should look for? I'll definitely look at 
implementing your suggestions and see what I can find.

On Aug 7, 2013, at 7:31 PM, "Faraaz Sareshwala" <fsareshw...@quantcast.com> 
wrote:

> Thanks Aaron. The node that was behaving this way was a production node so I 
> had to take some drastic measures to get it back to doing the right thing. 
> It's no longer behaving this way after wiping the system tables and having 
> cassandra resync the schema from other nodes. In hindsight, maybe I could 
> have gotten away with a nodetool resetlocalschema. Since the node has been 
> restored to a working state, I sadly can't run commands on it to investigate 
> any longer.
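> 
> For what it's worth, the recovery was roughly the following. This is only a 
> sketch: the data directory shown is the default and depends on 
> data_file_directories in cassandra.yaml, and the stop/start commands depend 
> on how the service is managed.
> 
> $ sudo service cassandra stop
> $ rm -rf /var/lib/cassandra/data/system/schema_keyspaces
> $ rm -rf /var/lib/cassandra/data/system/schema_columnfamilies
> $ rm -rf /var/lib/cassandra/data/system/schema_columns
> $ sudo service cassandra start    # node pulls the schema back from its peers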
> 
> When the node was in this hosed state, I did check nodetool gossipinfo. The 
> bad node had the correct schema hash; the same as the rest of the nodes in 
> the cluster. However, it thought every other node in the cluster had another 
> schema hash, most likely the older one everyone migrated from.
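> 
> This is roughly how I was comparing schema versions, assuming the usual 
> gossipinfo layout where each endpoint line is followed by its application 
> states (including SCHEMA):
> 
> $ nodetool gossipinfo | grep -E '^/|SCHEMA'
> 
> On a healthy node, every endpoint should show the same SCHEMA UUID.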
> 
> This issue occurred again today on three machines, so I expect it will occur 
> again. Typically I see it when our entire datacenter updates its 
> configuration and restarts over the course of an hour. All nodes point to the 
> same list of seeds, but the restart order across that hour is random. I'm not 
> sure if this information helps at all.
> 
> Are there any specific things I should look for when it does occur again?
> 
> Thank you,
> Faraaz
> 
> On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aa...@thelastpickle.com> wrote:
> 
>>> When looking at nodetool gossipinfo, I notice that this node has updated to 
>>> the latest schema hash, but that it thinks other nodes in the cluster are 
>>> on the older version.
>> What does describe cluster in cassandra-cli say ? It will let you know if 
>> there are multiple schema versions in the cluster. 
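>> 
>> For example (a sketch; the addresses and UUIDs below are placeholders, and 
>> the exact output format varies a little between versions):
>> 
>> $ cassandra-cli -h localhost
>> [default@unknown] describe cluster;
>> Cluster Information:
>>    Snitch: org.apache.cassandra.locator.PropertyFileSnitch
>>    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>>    Schema versions:
>>         75eece10-bf48-11e2-a60d-0de70de0b8e1: [10.0.0.1, 10.0.0.2, 10.0.0.3]
>>         5a54ebd0-bd90-11e2-a60d-0de70de0b8e1: [10.0.0.4]
>> 
>> More than one entry under Schema versions means the cluster has a schema 
>> disagreement.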
>> 
>> Can you include the output from nodetool gossipinfo ? 
>> 
>> You may also get some value from increasing the log level for 
>> org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on. 
>> It's unusual for only the gossip pool to back up; if GC were eating CPU, we 
>> would expect to see it across the board. 
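>> 
>> On 1.2 that is a log4j setting, something like the line below in 
>> conf/log4j-server.properties (the exact path depends on the install, and 
>> depending on how logging is set up it may need a restart to take effect):
>> 
>> # verbose gossip logging in system.log
>> log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG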
>> 
>> Cheers
>> 
>> 
>> 
>> -----------------
>> Aaron Morton
>> Cassandra Consultant
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 7/08/2013, at 7:52 AM, Faraaz Sareshwala <fsareshw...@quantcast.com> 
>> wrote:
>> 
>>> I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks. 
>>> All nodes are well behaved except one. Whenever I start this node, it starts 
>>> churning CPU. Running nodetool tpstats, I notice that the number of pending 
>>> gossip stage tasks is constantly increasing [1]. When looking at nodetool 
>>> gossipinfo, I notice that this node has updated to the latest schema hash, 
>>> but that it thinks other nodes in the cluster are on the older version. I've 
>>> tried to drain, decommission, wipe node data, bootstrap, and repair the 
>>> node. However, the node just started doing the same thing again.
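>>> 
>>> The individual steps were roughly these, not necessarily in one clean pass 
>>> (a sketch; the paths are the 1.2 defaults and depend on cassandra.yaml):
>>> 
>>> $ nodetool drain                # flush memtables, stop accepting writes
>>> $ nodetool decommission         # hand this node's ranges back to the ring
>>> $ sudo service cassandra stop
>>> $ rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
>>> $ sudo service cassandra start  # re-bootstrap into the ring
>>> $ nodetool repair               # after the node has rejoined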
>>> 
>>> Has anyone run into this issue before? Can anyone provide any insight into 
>>> why this node is the only one in the cluster having problems? Are there any 
>>> easy fixes?
>>> 
>>> Thank you,
>>> Faraaz
>>> 
>>> [1] $ /cassandra/bin/nodetool tpstats
>>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>> ReadStage                         0         0              8         0                 0
>>> RequestResponseStage              0         0          49198         0                 0
>>> MutationStage                     0         0         224286         0                 0
>>> ReadRepairStage                   0         0              0         0                 0
>>> ReplicateOnWriteStage             0         0              0         0                 0
>>> GossipStage                       1      2213             18         0                 0
>>> AntiEntropyStage                  0         0              0         0                 0
>>> MigrationStage                    0         0             72         0                 0
>>> MemtablePostFlusher               0         0            102         0                 0
>>> FlushWriter                       0         0             99         0                 0
>>> MiscStage                         0         0              0         0                 0
>>> commitlog_archiver                0         0              0         0                 0
>>> InternalResponseStage             0         0             19         0                 0
>>> HintedHandoff                     0         0              2         0                 0
>>> 
>>> Message type           Dropped
>>> RANGE_SLICE                  0
>>> READ_REPAIR                  0
>>> BINARY                       0
>>> READ                         0
>>> MUTATION                     0
>>> _TRACE                       0
>>> REQUEST_RESPONSE             0
>> 
