Thanks Aaron. The node that was behaving this way was a production node, so I had to take some drastic measures to get it back to doing the right thing. It's no longer behaving this way after I wiped the system tables and had Cassandra resync the schema from the other nodes. In hindsight, maybe I could have gotten away with a nodetool resetlocalschema. Since the node has been restored to a working state, I sadly can't run any more commands on it to investigate.
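In case anyone else hits this, the recovery was roughly the following. The data directory path is a placeholder and the exact steps are from memory, so treat it as a sketch rather than an exact recipe:

    $ /cassandra/bin/nodetool drain          # flush memtables before stopping the node
    $ <stop the cassandra service>
    $ rm -rf /path/to/data/system/schema_keyspaces \
             /path/to/data/system/schema_columnfamilies \
             /path/to/data/system/schema_columns       # locally stored schema only; user data untouched
    $ <start the cassandra service>          # node pulls the schema back from its peers on startup

    # the lighter-weight alternative I mentioned, which drops and re-requests
    # the local schema without a restart:
    $ /cassandra/bin/nodetool resetlocalschema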
When the node was in this hosed state, I did check nodetool gossipinfo. The bad node had the correct schema hash, the same as the rest of the nodes in the cluster. However, it thought every other node in the cluster was on a different schema hash, most likely the older one everyone had migrated from.

This issue occurred again today on three machines, so I expect it will happen again. Typically I see it when our entire datacenter updates its configuration and restarts over the course of an hour. All nodes point to the same list of seeds, but the restart order is random across that hour. I'm not sure if this information helps at all. Are there any specific things I should look for when it does occur again? (I've put a rough sketch of the commands I plan to capture at the bottom of this message.)

Thank you,
Faraaz

On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aa...@thelastpickle.com> wrote:

>> When looking at nodetool gossipinfo, I notice that this node has updated to
>> the latest schema hash, but that it thinks other nodes in the cluster are on
>> the older version.
> What does describe cluster in cassandra-cli say? It will let you know if
> there are multiple schema versions in the cluster.
>
> Can you include the output from nodetool gossipinfo?
>
> You may also get some value from increasing the log level for
> org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on.
> It's unusual for only the gossip pool to back up. If there were issues with GC
> taking CPU we would expect to see it across the board.
>
> Cheers
>
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/08/2013, at 7:52 AM, Faraaz Sareshwala <fsareshw...@quantcast.com> wrote:
>
>> I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks.
>> All nodes are well behaved except one. Whenever I start this node, it starts
>> churning CPU. Running nodetool tpstats, I notice that the number of pending
>> gossip stage tasks is constantly increasing [1]. When looking at nodetool
>> gossipinfo, I notice that this node has updated to the latest schema hash,
>> but that it thinks other nodes in the cluster are on the older version. I've
>> tried to drain, decommission, wipe node data, bootstrap, and repair the node.
>> However, the node just started doing the same thing again.
>>
>> Has anyone run into this issue before? Can anyone provide any insight into
>> why this node is the only one in the cluster having problems? Are there any
>> easy fixes?
>>
>> Thank you,
>> Faraaz
>>
>> [1] $ /cassandra/bin/nodetool tpstats
>> Pool Name                 Active   Pending   Completed   Blocked   All time blocked
>> ReadStage                      0         0           8         0                  0
>> RequestResponseStage           0         0       49198         0                  0
>> MutationStage                  0         0      224286         0                  0
>> ReadRepairStage                0         0           0         0                  0
>> ReplicateOnWriteStage          0         0           0         0                  0
>> GossipStage                    1      2213          18         0                  0
>> AntiEntropyStage               0         0           0         0                  0
>> MigrationStage                 0         0          72         0                  0
>> MemtablePostFlusher            0         0         102         0                  0
>> FlushWriter                    0         0          99         0                  0
>> MiscStage                      0         0           0         0                  0
>> commitlog_archiver             0         0           0         0                  0
>> InternalResponseStage          0         0          19         0                  0
>> HintedHandoff                  0         0           2         0                  0
>>
>> Message type       Dropped
>> RANGE_SLICE              0
>> READ_REPAIR              0
>> BINARY                   0
>> READ                     0
>> MUTATION                 0
>> _TRACE                   0
>> REQUEST_RESPONSE         0
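P.S. For the archives, this is roughly what I plan to capture the next time a node gets into this state, based on Aaron's suggestions above. The paths are from our install and the log4j property is my assumption based on the stock 1.2 log4j-server.properties, so treat this as a sketch rather than a verified recipe:

    # does the cluster see one schema version or several?
    $ echo "describe cluster;" | /cassandra/bin/cassandra-cli -h localhost

    # which schema hash this node believes each peer is on
    $ /cassandra/bin/nodetool gossipinfo > gossipinfo.$(hostname).txt

    # watch for GossipStage pending climbing again
    $ /cassandra/bin/nodetool tpstats > tpstats.$(hostname).txt

    # gossip DEBUG logging Aaron mentioned (config path assumed; restart the
    # node if the change isn't picked up automatically)
    $ echo "log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG" >> /cassandra/conf/log4j-server.properties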