The number of Completed HH tasks is interesting. AFAIK a hinted handoff task is started when the node detects that another node in the cluster has returned. Were you doing other restarts around the cluster?
I don't want to divert from the GC issue, just wondering if something else is going on as well, like the node being asked to record a lot of hints.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13 May 2011, at 03:51, Gabriel Tataranu wrote:

>>> What does the TPStats look like on the nodes under pressure? And how many
>>> nodes are delivering hints to the nodes when they restart?
>
> $ nodetool -h 127.0.0.1 tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         1         1        1992475
> RequestResponseStage              0         0        2247486
> MutationStage                     0         0        1631349
> ReadRepairStage                   0         0         583432
> GossipStage                       0         0         241324
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              0
> MemtablePostFlusher               0         0             46
> StreamStage                       0         0              0
> FlushWriter                       0         0             46
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              0
> HintedHandoff                     1         5            152
>
> dstat -cmdln during the event:
>
> ----total-cpu-usage---- ------memory-usage----- ---load-avg--- -dsk/total- -net/total-
> usr sys idl wai hiq siq| used  buff  cach  free|  1m   5m  15m | read  writ| recv  send
>  87   6   6   0   0   1|6890M 32.1M 1001M 42.8M|2.36 2.87 1.73|   0     0 |  75k  144k
>  88  10   2   0   0   0|6889M 32.2M 1002M 41.6M|3.05 3.00 1.78|   0     0 |  60k  102k
>  89   9   2   0   0   0|6890M 32.2M 1003M 41.0M|3.05 3.00 1.78|   0     0 |  38k   70k
>  89  10   1   0   0   0|6890M 32.2M 1003M 40.7M|3.05 3.00 1.78|   0     0 |  26k   24k
>  93   6   2   0   0   0|6890M 32.2M 1003M 40.9M|3.05 3.00 1.78|   0     0 |  37k   31k
>  90   8   2   0   0   0|6890M 32.2M 1003M 39.9M|3.05 3.00 1.78|   0     0 |  67k   69k
>  87   8   4   0   0   1|6890M 32.2M 1004M 38.7M|4.09 3.22 1.85|   0     0 | 123k  262k
>  83  13   2   0   0   2|6890M 32.2M 1004M 38.3M|4.09 3.22 1.85|   0     0 | 445k   18M
>  90   6   3   0   0   0|6890M 32.2M 1005M 38.2M|4.09 3.22 1.85|   0     0 |  72k   91k
>  40   7  25  27   0   0|6890M 32.2M 1005M 37.8M|4.09 3.22 1.85|   0     0 | 246k 8034k
>   0   0  59  41   0   0|6890M 32.2M 1005M 37.7M|4.09 3.22 1.85|   0     0 |  19k 6490B
>   1   2  45  52   0   0|6891M 32.2M  999M 43.1M|4.00 3.21 1.86|   0     0 |  29k   18k
>  72   8  15   3   0   1|6892M 32.2M  999M 41.6M|4.00 3.21 1.86|   0     0 | 431k   11M
>  88   9   2   0   0   1|6907M 32.0M  985M 41.1M|4.00 3.21 1.86|   0     0 |  99k   77k
>  88  10   1   0   0   1|6913M 31.9M  977M 44.1M|4.00 3.21 1.86|   0     0 | 112k  619k
>  89   9   1   0   0   1|6892M 31.9M  977M 64.4M|4.00 3.21 1.86|   0     0 | 109k  369k
>  90   8   1   0   0   0|6892M 31.9M  979M 62.5M|4.80 3.39 1.92|   0     0 | 130k   97k
>  83  13   1   0   0   3|6893M 32.0M  981M 59.8M|4.80 3.39 1.92|   0     0 | 503k   18M
>  78  11  10   0   0   0|6893M 32.0M  981M 59.5M|4.80 3.39 1.92|   0     0 | 102k  110k
>
> The low CPU periods are due to major GC (JVM frozen).
>
>> TPStats do show activity on the HH. I'll have some examples later if
>> the nodes decide to do this again.
>>
>>> Finally, hinted_handoff_throttle_delay_in_ms in conf/cassandra.yaml will let
>>> you slow down the delivery rate if HH is indeed the problem.
>
> Best,
>
> Gabriel
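For anyone following up on the throttle mentioned above: it is a setting in conf/cassandra.yaml. A minimal sketch of the relevant entries, with purely illustrative values (check the defaults that ship with your Cassandra version):

    # conf/cassandra.yaml (excerpt -- illustrative values only)
    # Turn hinted handoff on or off.
    hinted_handoff_enabled: true
    # Pause this many milliseconds after each hint is delivered to a
    # recovered node; a larger value slows the delivery rate and eases
    # the load on the node doing the handoff.
    hinted_handoff_throttle_delay_in_ms: 50

Note that this only throttles delivery of hints already stored; it does not stop a node from recording new hints while its peers are down.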