Before we start.. what version of Cassandra?

On Fri, Dec 21, 2012 at 4:25 PM, Keith Wright <kwri...@nanigans.com> wrote:
> This behavior seems to occur if I do a large
> amount of data loading using that node as the coordinator node.
In general you want to use all nodes to coordinate, not a single one.

> Nodetool netstats never seems to show
> any streaming data. With past nodes it seemed like the node eventually
> fixed itself.

That node is storing hints for other nodes it believes are, or were at some point, in DOWN state. The first step to preventing this problem from recurring is to understand why it believes (or believed) other nodes are down. My conjecture is that you are overloading the coordinating node and/or other nodes with the large volume of writes.

> Note that I am seeing severely degraded performance on this node when it
> attempts to compact the HintsColumnFamily to the point where I had to set
> setcompactionthroughput to 999 to ensure it doesn't run again (after which
> the node started serving requests much faster).

Depending on version, your 40 GB of hints could be in one 40 GB wide row. Look at nodetool cfstats for HintsColumnFamily to determine if this is the case. Do you see "Timed out replaying hint" messages, or are the hints being successfully delivered?

You have two broad options:

1) purge your hints and then either reload the data (if reloading it is idempotent) or run "repair -pr" on every node in the cluster.
2) reduce load enough that hints will be successfully delivered, reduce gc_grace_seconds on the hints cf to 0 and then do a major compaction.

If I were you, I would probably do 1). The easiest way is to stop the node and remove all sstables in the HintsColumnFamily.

=Rob

--
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
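On the point about using all nodes to coordinate: a minimal sketch of what that could look like in a bulk loader, assuming the loader can choose which node each batch is sent to. The node addresses and the function names (assign_coordinators, load_per_node) are hypothetical and only illustrate the rotation; no real client library is used here.

```python
from itertools import cycle

# Hypothetical node addresses; substitute the real cluster members.
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def assign_coordinators(batches, nodes):
    """Pair each write batch with a coordinator, rotating through all nodes."""
    if not nodes:
        raise ValueError("at least one node is required")
    rotation = cycle(nodes)
    return [(next(rotation), batch) for batch in batches]


def load_per_node(assignments):
    """Count how many batches each coordinator would receive."""
    counts = {}
    for node, _batch in assignments:
        counts[node] = counts.get(node, 0) + 1
    return counts


if __name__ == "__main__":
    batches = [f"batch-{i}" for i in range(9)]
    for node, batch in assign_coordinators(batches, NODES):
        # A real loader would send `batch` through a client connected to `node`.
        print(node, batch)
```

With this kind of rotation no single node coordinates the whole load, so no single node ends up holding the hints for every write that times out.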