Non-zero Pending counts are too transient to alert on directly. Try monitoring
tpstats at a (much) higher frequency and look for a threshold that is sustained
over a duration.

Then, using a percentage of the configured maximum - 75% of
memtable_flush_queue_size in this case - alert when Pending has been higher
than 3 for more than N seconds. (Start with N=60 and go from there.)

Also, that is a very high 'All time blocked' to 'Completed' ratio for
FlushWriter. If iostat is happy, I'd do as Aaron suggested and turn up
memtable_flush_queue_size, and play around with turning up
memtable_flush_writers (incrementally and separately for both, of course, so
you can see the effect of each).
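The sustained-threshold check could look something like the sketch below - a
minimal example, not production monitoring code. It assumes nodetool is on the
PATH, that Pending is the third column of tpstats output, and that
memtable_flush_queue_size is at its 1.2 default of 4 (so 75% gives a threshold
of 3); adjust all three for your setup.

    #!/usr/bin/env python
    # Poll tpstats frequently and alert only when FlushWriter Pending stays
    # above the threshold for a sustained duration.
    import subprocess
    import time

    THRESHOLD = 3        # ~75% of memtable_flush_queue_size (assumed default: 4)
    DURATION = 60        # seconds the breach must be sustained; start with N=60
    POLL_INTERVAL = 5    # poll far more often than every 5 minutes

    def flushwriter_pending():
        """Return the FlushWriter Pending count from `nodetool tpstats`."""
        out = subprocess.check_output(["nodetool", "tpstats"]).decode()
        for line in out.splitlines():
            cols = line.split()
            if cols and cols[0] == "FlushWriter":
                return int(cols[2])  # columns: pool, active, pending, ...
        return 0

    breach_start = None
    while True:
        if flushwriter_pending() > THRESHOLD:
            breach_start = breach_start or time.time()
            if time.time() - breach_start >= DURATION:
                print("ALERT: FlushWriter Pending > %d for %ds"
                      % (THRESHOLD, DURATION))
                breach_start = None  # reset so the alert doesn't re-fire each poll
        else:
            breach_start = None
        time.sleep(POLL_INTERVAL)

The same pattern works for MutationStage or any other pool; the point is that a
single non-zero sample is noise, while a breach that survives a full minute of
frequent polling is signal.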
On Thu, Jun 27, 2013 at 2:27 AM, Arindam Barua <aba...@247-inc.com> wrote:

> In our performance tests, we are seeing similar FlushWriter, MutationStage,
> and MemtablePostFlusher pending tasks become non-zero. We collect tpstats
> snapshots every 5 minutes, and they seem to clear after ~10-15 minutes
> though. (FlushWriter has an 'All time blocked' count of 540 in the example
> below.)
>
> We do not use secondary indexes or snapshots. We do not use SSDs. We have a
> 4-node cluster with around 30-40 GB of data on each node. Each node has
> three 1-TB disks in a RAID 0 setup.
>
> Currently we monitor the tpstats every 5 minutes, and alert if FlushWriter
> or MutationStage has a non-zero Pending count. Any suggestions on whether
> this is already cause for concern, or whether we should alert only if the
> count exceeds some larger number, say 10, or stays non-zero for longer than
> a specified time?
>
> Pool Name               Active  Pending   Completed  Blocked  All time blocked
> ReadStage                    0        0    15685133        0                 0
> RequestResponseStage         0        0    29880863        0                 0
> MutationStage                0        0    40457340        0                 0
> ReadRepairStage              0        0      704322        0                 0
> ReplicateOnWriteStage        0        0           0        0                 0
> GossipStage                  0        0     2283062        0                 0
> AntiEntropyStage             0        0           0        0                 0
> MigrationStage               0        0          70        0                 0
> MemtablePostFlusher          1        1        1837        0                 0
> StreamStage                  0        0           0        0                 0
> FlushWriter                  1        1        1446        0               540
> MiscStage                    0        0           0        0                 0
> commitlog_archiver           0        0           0        0                 0
> InternalResponseStage        0        0          43        0                 0
> HintedHandoff                0        0           3        0                 0
>
> Thanks,
> Arindam
>
> -----Original Message-----
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Tuesday, June 25, 2013 10:29 PM
> To: user@cassandra.apache.org
> Subject: Re: about FlushWriter "All time blocked"
>
>> FlushWriter                  0        0         191        0                12
>
> This means there were 12 times the code wanted to put a memtable in the
> queue to be flushed to disk, but the queue was full.
>
> The length of this queue is controlled by memtable_flush_queue_size
> https://github.com/apache/cassandra/blob/cassandra-1.2/conf/cassandra.yaml#L299
> and memtable_flush_writers.
>
> When this happens, an internal lock around the commit log is held, which
> prevents writes from being processed.
>
> In general it means the IO system cannot keep up. It can sometimes happen
> when snapshot is used, as all the CFs are flushed to disk at once. I also
> suspect it happens sometimes when a commit log segment is flushed and there
> are a lot of dirty CFs, but I've never proved it.
>
> Increase memtable_flush_queue_size following the help in the yaml file. If
> you do not use secondary indexes, are you using snapshots?
>
> Hope that helps.
> A
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
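For reference, the two settings Aaron mentions live next to each other in
cassandra.yaml; an illustrative excerpt (values shown are the 1.2 defaults,
IIRC - check the comments in your own version's file before changing anything):

    # Number of memtable flush writer threads. Usually defaults to the
    # number of data directories; raise it only if iostat shows the disks
    # have headroom for more parallel flushes.
    memtable_flush_writers: 1

    # Number of full memtables allowed to wait in the flush queue. When the
    # queue is full, writes block - raising this trades heap for headroom.
    memtable_flush_queue_size: 4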
> On 24/06/2013, at 3:41 PM, yue.zhang <yue.zh...@chinacache.com> wrote:
>
>> 3 nodes
>> CentOS
>> CPU: 8 cores, memory: 32 GB
>> Cassandra 1.2.5
>>
>> My scenario: many counter increments; every node has one client program,
>> and performance is 400 writes/sec per client (it's very slow).
>>
>> My question - nodetool tpstats:
>> ---------------------------------
>> Pool Name               Active  Pending   Completed  Blocked  All time blocked
>> ReadStage                    0        0        8453        0                 0
>> RequestResponseStage         0        0   138303982        0                 0
>> MutationStage                0        0   172002988        0                 0
>> ReadRepairStage              0        0           0        0                 0
>> ReplicateOnWriteStage        0        0    82246354        0                 0
>> GossipStage                  0        0     1052389        0                 0
>> AntiEntropyStage             0        0           0        0                 0
>> MigrationStage               0        0           0        0                 0
>> MemtablePostFlusher          0        0         670        0                 0
>> FlushWriter                  0        0         191        0                12
>> MiscStage                    0        0           0        0                 0
>> commitlog_archiver           0        0           0        0                 0
>> InternalResponseStage        0        0           0        0                 0
>> HintedHandoff                0        0          56        0                 0
>> -----------------------------------
>> FlushWriter "All time blocked" = 12. I restarted the node, but it didn't
>> help. Is this normal?
>>
>> Thanks,
>> -heipark