>> What does the TPStats look like on the nodes under pressure? And how many
>> nodes are delivering hints to the nodes when they restart?
$nodetool -h 127.0.0.1 tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         1         1        1992475
RequestResponseStage              0         0        2247486
MutationStage                     0         0        1631349
ReadRepairStage                   0         0         583432
GossipStage                       0         0         241324
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              0
MemtablePostFlusher               0         0             46
StreamStage                       0         0              0
FlushWriter                       0         0             46
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         5            152

dstat -cmdln during the event:

----total-cpu-usage---- ------memory-usage----- ---load-avg--- -dsk/total- -net/total-
usr sys idl wai hiq siq| used  buff  cach  free|  1m   5m  15m| read  writ| recv  send
 87   6   6   0   0   1|6890M 32.1M 1001M 42.8M|2.36 2.87 1.73|   0     0 |  75k  144k
 88  10   2   0   0   0|6889M 32.2M 1002M 41.6M|3.05 3.00 1.78|   0     0 |  60k  102k
 89   9   2   0   0   0|6890M 32.2M 1003M 41.0M|3.05 3.00 1.78|   0     0 |  38k   70k
 89  10   1   0   0   0|6890M 32.2M 1003M 40.7M|3.05 3.00 1.78|   0     0 |  26k   24k
 93   6   2   0   0   0|6890M 32.2M 1003M 40.9M|3.05 3.00 1.78|   0     0 |  37k   31k
 90   8   2   0   0   0|6890M 32.2M 1003M 39.9M|3.05 3.00 1.78|   0     0 |  67k   69k
 87   8   4   0   0   1|6890M 32.2M 1004M 38.7M|4.09 3.22 1.85|   0     0 | 123k  262k
 83  13   2   0   0   2|6890M 32.2M 1004M 38.3M|4.09 3.22 1.85|   0     0 | 445k   18M
 90   6   3   0   0   0|6890M 32.2M 1005M 38.2M|4.09 3.22 1.85|   0     0 |  72k   91k
 40   7  25  27   0   0|6890M 32.2M 1005M 37.8M|4.09 3.22 1.85|   0     0 | 246k 8034k
  0   0  59  41   0   0|6890M 32.2M 1005M 37.7M|4.09 3.22 1.85|   0     0 |  19k 6490B
  1   2  45  52   0   0|6891M 32.2M  999M 43.1M|4.00 3.21 1.86|   0     0 |  29k   18k
 72   8  15   3   0   1|6892M 32.2M  999M 41.6M|4.00 3.21 1.86|   0     0 | 431k   11M
 88   9   2   0   0   1|6907M 32.0M  985M 41.1M|4.00 3.21 1.86|   0     0 |  99k   77k
 88  10   1   0   0   1|6913M 31.9M  977M 44.1M|4.00 3.21 1.86|   0     0 | 112k  619k
 89   9   1   0   0   1|6892M 31.9M  977M 64.4M|4.00 3.21 1.86|   0     0 | 109k  369k
 90   8   1   0   0   0|6892M 31.9M  979M 62.5M|4.80 3.39 1.92|   0     0 | 130k   97k
 83  13   1   0   0   3|6893M 32.0M  981M 59.8M|4.80 3.39 1.92|   0     0 | 503k   18M
 78  11  10   0   0   0|6893M 32.0M  981M 59.5M|4.80 3.39 1.92|   0     0 | 102k  110k

The low CPU periods are due to major GC (the JVM is frozen during those windows).

> TPStats do show activity on the HH. I'll have some examples later if
> the nodes decide to do this again.
>
>> Finally hinted_handoff_throttle_delay_in_ms in conf/cassandra.yaml will let
>> you slow down the delivery rate if HH is indeed the problem.

Best,
Gabriel
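P.S. For anyone following along, the throttle mentioned above is set in
conf/cassandra.yaml. A minimal sketch of what that might look like; the
value below is only an illustration, not a recommendation, so check the
default shipped with your Cassandra version:

  # conf/cassandra.yaml
  hinted_handoff_enabled: true
  # delay (in ms) inserted between hint deliveries during replay;
  # a larger value slows down hinted handoff to a restarted node
  hinted_handoff_throttle_delay_in_ms: 50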