And by that last statement, I mean: are there any further things I should look for, given the information in my reply below? I'll definitely implement your suggestions and see what I can find.
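For my own notes, here is roughly what I plan to capture the next time a node gets into this state. It's only a sketch: the HOST value is a placeholder rather than a real host from this thread, and the nodetool/cassandra-cli paths assume our /cassandra layout. (A sketch of the DEBUG logging change is at the very end of this mail.)

$ HOST=bad-node.example.com                    # placeholder for the affected node
$ /cassandra/bin/nodetool -h $HOST tpstats     # watch whether GossipStage "Pending" keeps climbing
$ /cassandra/bin/nodetool -h $HOST gossipinfo  # compare the SCHEMA line listed under each endpoint
$ /cassandra/bin/cassandra-cli -h $HOST        # then run "describe cluster;" at the prompt to check schema agreement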
On Aug 7, 2013, at 7:31 PM, "Faraaz Sareshwala" <fsareshw...@quantcast.com> wrote:

> Thanks Aaron. The node that was behaving this way was a production node, so I
> had to take some drastic measures to get it back to doing the right thing.
> It's no longer behaving this way after wiping the system tables and having
> cassandra resync the schema from other nodes. In hindsight, maybe I could
> have gotten away with a nodetool resetlocalschema. Since the node has been
> restored to a working state, I sadly can't run commands on it to investigate
> any longer.
>
> When the node was in this hosed state, I did check nodetool gossipinfo. The
> bad node had the correct schema hash, the same as the rest of the nodes in
> the cluster. However, it thought every other node in the cluster had another
> schema hash, most likely the older one everyone migrated from.
>
> This issue occurred again today on three machines, so I feel it may occur
> again. Typically I see it when our entire datacenter updates its
> configuration and restarts over the course of an hour. All nodes point to the
> same list of seeds, but the restart order is random across that hour. I'm not
> sure if this information helps at all.
>
> Are there any specific things I should look for when it does occur again?
>
> Thank you,
> Faraaz
>
> On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aa...@thelastpickle.com> wrote:
>
>>> When looking at nodetool gossipinfo, I notice that this node has updated
>>> to the latest schema hash, but that it thinks other nodes in the cluster
>>> are on the older version.
>>
>> What does describe cluster in cassandra-cli say? It will let you know if
>> there are multiple schema versions in the cluster.
>>
>> Can you include the output from nodetool gossipinfo?
>>
>> You may also get some value from increasing the log level for
>> org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on.
>> It's unusual for only the gossip pool to back up. If there were issues with
>> GC taking CPU we would expect to see it across the board.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 7/08/2013, at 7:52 AM, Faraaz Sareshwala <fsareshw...@quantcast.com> wrote:
>>
>>> I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks.
>>> All nodes are well behaved except one. Whenever I start this node, it starts
>>> churning CPU. Running nodetool tpstats, I notice that the number of pending
>>> gossip stage tasks is constantly increasing [1]. When looking at nodetool
>>> gossipinfo, I notice that this node has updated to the latest schema hash,
>>> but that it thinks other nodes in the cluster are on the older version.
>>> I've tried to drain, decommission, wipe node data, bootstrap, and repair
>>> the node. However, the node just started doing the same thing again.
>>>
>>> Has anyone run into this issue before? Can anyone provide any insight into
>>> why this node is the only one in the cluster having problems? Are there any
>>> easy fixes?
>>>
>>> Thank you,
>>> Faraaz
>>>
>>> [1] $ /cassandra/bin/nodetool tpstats
>>> Pool Name                 Active   Pending   Completed   Blocked   All time blocked
>>> ReadStage                      0         0           8         0                  0
>>> RequestResponseStage           0         0       49198         0                  0
>>> MutationStage                  0         0      224286         0                  0
>>> ReadRepairStage                0         0           0         0                  0
>>> ReplicateOnWriteStage          0         0           0         0                  0
>>> GossipStage                    1      2213          18         0                  0
>>> AntiEntropyStage               0         0           0         0                  0
>>> MigrationStage                 0         0          72         0                  0
>>> MemtablePostFlusher            0         0         102         0                  0
>>> FlushWriter                    0         0          99         0                  0
>>> MiscStage                      0         0           0         0                  0
>>> commitlog_archiver             0         0           0         0                  0
>>> InternalResponseStage          0         0          19         0                  0
>>> HintedHandoff                  0         0           2         0                  0
>>>
>>> Message type           Dropped
>>> RANGE_SLICE                  0
>>> READ_REPAIR                  0
>>> BINARY                       0
>>> READ                         0
>>> MUTATION                     0
>>> _TRACE                       0
>>> REQUEST_RESPONSE             0
>>
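P.S. For the Gossiper DEBUG logging suggestion, a minimal sketch of what I intend to do, assuming our conf directory sits alongside bin and that 1.2's log4j setup applies; the path below is a guess for our layout:

$ echo "log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG" >> /cassandra/conf/log4j-server.properties
$ # then restart the node (or wait for log4j to re-read the file, if the config is set to watch it)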