1.1.7

Rob Coli <rc...@palominodb.com> wrote:
>Before we start.. what version of cassandra?
>
>On Fri, Dec 21, 2012 at 4:25 PM, Keith Wright <kwri...@nanigans.com> wrote:
>> This behavior seems to occur if I do a large
>> amount of data loading using that node as the coordinator node.
>
>In general you want to use all nodes to coordinate, not a single one.
>
>> Nodetool netstats never seems to show
>> any streaming data. With past nodes it seemed like the node eventually
>> fixed itself.
>
>That node is storing hints for other nodes it believes are or were at
>some point in DOWN state. The first step to preventing this problem
>from recurring is to understand why it believes/d other nodes are
>down. My conjecture is that you are overloading the coordinating node
>and/or other nodes with the large amount of writes.
>
>> Note that I am seeing severely degraded performance on this node when it
>> attempts to compact the HintsColumnFamily to the point where I had to set
>> setcompactionthroughput to 999 to ensure it doesn't run again (after which
>> the node started serving requests much faster).
>
>Depending on version, your 40gb of hints could be in one 40gb wide
>row. Look at nodetool cfstats for HintsColumnFamily to determine if
>this is the case.
>
>Do you see "Timed out replaying hint" messages, or are the hints being
>successfully delivered?
>
>You have two broad options:
>
>1) purge your hints and then either reload the data (if reloading it
>will be idempotent) or "repair -pr" on every node in the cluster.
>2) reduce load enough that hints will be successfully delivered,
>reduce gc_grace_seconds on the hints cf to 0 and then do a major
>compaction.
>
>If I were you, I would probably do 1). The easiest way is to stop the
>node and remove all sstables in the HintsColumnFamily.
>
>=Rob
>
>--
>=Robert Coli
>AIM&GTALK - rc...@palominodb.com
>YAHOO - rcoli.palominob
>SKYPE - rcoli_palominodb
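
On the advice to use all nodes to coordinate rather than a single one: a minimal Python sketch of spreading a bulk load round-robin across the cluster. The node addresses and the send_write callback are hypothetical placeholders, not part of any Cassandra client API; a real client library may already offer its own load balancing, in which case that is likely the better choice.

```python
from collections import Counter
from itertools import cycle

# Hypothetical node addresses; replace with the cluster's real hosts.
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def load_rows(rows, nodes, send_write):
    """Send each row through the next node in round-robin order.

    send_write(node, row) is a placeholder for whatever client call
    actually performs the write using `node` as the coordinator.
    Returns a Counter of how many rows each node coordinated.
    """
    per_node = Counter()
    for node, row in zip(cycle(nodes), rows):
        send_write(node, row)
        per_node[node] += 1
    return per_node


if __name__ == "__main__":
    # Stand-in writer that only records which node would coordinate each row.
    sent = []
    counts = load_rows(range(9), NODES, lambda node, row: sent.append((node, row)))
    print(dict(counts))
```

The point is only that no single node ends up coordinating the whole load, which is what seems to have produced the pile of hints here.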