James, Thanks for sharing. Anyway, good to know there's one more thing to add to the checklist.
On Sun, Jan 17, 2016 at 12:23 PM, James Griffin <james.grif...@idioplatform.com> wrote:

Hi all,

Just to let you know, we finally figured this out on Friday. It turns out the new nodes had an older version of the kernel installed, and upgrading the kernel solved our issues. For reference, the "bad" kernel was 3.2.0-75-virtual; upgrading to 3.2.0-86-virtual resolved the issue. We still don't fully understand why this kernel bug didn't affect *all* our nodes (in the end we had three nodes with that kernel, and only two of them exhibited this issue), but there we go.

Thanks everyone for your help.

Cheers,
Griff
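For anyone hitting a similar mystery, a minimal sketch of how to spot this kind of kernel drift across a ring - the hostnames are placeholders, SSH access is assumed, and the package name is assumed for Ubuntu's -virtual kernel flavour:

$ # compare running kernels on every node
$ for h in node1 node2 node3; do echo -n "$h: "; ssh "$h" uname -r; done
$ # then on any node that lags behind, pull in a newer kernel and reboot, e.g.
$ sudo apt-get update && sudo apt-get install linux-image-virtual && sudo reboot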
On 14 January 2016 at 15:14, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Kai,

Well observed - running `nodetool status` without specifying a keyspace does report ~33% on each node. We have two keyspaces on this cluster; if I specify either of them, the ownership reported by each node is 100%, so I believe the repair completed successfully.

Best wishes,
Griff

On 14 January 2016 at 15:08, Kai Wang <dep...@gmail.com> wrote:

James,

I may be missing something. You mentioned your cluster had RF=3, so why does "nodetool status" show each node owning 1/3 of the data, especially after a full repair?
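For reference, these are the two commands being compared in this exchange; the keyspace name below is a placeholder:

$ nodetool status                # without a keyspace, "Owns" reflects token ranges only (~33% each on a 3-node ring)
$ nodetool status my_keyspace    # with a keyspace, "Owns" is effective ownership including replication (100% each at RF=3)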
On Thu, Jan 14, 2016 at 9:56 AM, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Kai,

Below - nothing going on that I can see.

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name               Active   Pending   Completed
Commands                   n/a         0        6326
Responses                  n/a         0      219356

Best wishes,
Griff

On 14 January 2016 at 14:22, Kai Wang <dep...@gmail.com> wrote:

James,

Can you post the result of "nodetool netstats" on the bad node?

On Thu, Jan 14, 2016 at 9:09 AM, James Griffin <james.grif...@idioplatform.com> wrote:

A summary of what we've done this morning:

- Noted that there are no GCInspector lines in system.log on the bad node (there are GCInspector logs on the other, healthy nodes).
- Turned on GC logging and noted log lines stating that the total time for which application threads were stopped was high - around 10s.
- Not seeing failures of any kind (promotion or concurrent mode).
- Attached VisualVM: noted that heap usage was very low (~5% and stable) and it didn't display the hallmarks of GC activity. PermGen was also very stable.
- Downloaded the GC logs and examined them in GCViewer. Noted that:
  - We had lots of pauses (again around 10s), but no full GC.
  - From a 2,300s sample, just over 2,000s were spent with threads paused.
- Spotted many small GCs in the new space and realised that the Xmn value was very low (200M against a heap size of 3750M). Increased Xmn to 937M - no change in server behaviour (high load, high reads/s on disk, high CPU wait).

Current output of jstat:

      S0     S1      E      O      P    YGC    YGCT  FGC   FGCT     GCT
2   0.00  45.20  12.82  26.84  76.21   2333  63.684    2  0.039  63.724
3  63.58   0.00  33.68   8.04  75.19     14   1.812    2  0.103   1.915

Correct me if I'm wrong, but it seems node 3 is a lot healthier GC-wise than node 2 (which has normal load statistics).

Anywhere else you can recommend we look?

Griff
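As a rough sketch of how numbers like the above can be gathered - the pid lookup is approximate and the log path is an assumption, not something from the thread:

$ jstat -gcutil $(pgrep -f CassandraDaemon) 5s    # sample GC counters every 5 seconds

# The "total time for which application threads were stopped" lines come from JVM
# flags along these lines, typically set in cassandra-env.sh:
#   -Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
#   -XX:+PrintGCApplicationStoppedTime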
On 14 January 2016 at 01:25, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

OK. I saw dropped mutations on your cluster, and full GC is a common cause of that. Can you search for the word GCInspector in system.log and share the frequency of minor and full GC? Also, are you printing promotion failures in the GC logs? Why is full GC getting triggered - promotion failures or concurrent mode failures?

If you are on CMS, you need to fine-tune your heap options to address full GC.

Thanks
Anuj
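A quick way to get the frequency Anuj is asking for - the log path and the 2.0-era GCInspector message wording are assumptions, so adjust for your install:

$ grep -c "GCInspector" /var/log/cassandra/system.log                                # all GC reports
$ grep "GCInspector" /var/log/cassandra/system.log | grep -c "ParNew"                # minor (young) collections
$ grep "GCInspector" /var/log/cassandra/system.log | grep -c "ConcurrentMarkSweep"   # full (CMS) collections
# Promotion failures only appear in the GC log itself if -XX:+PrintPromotionFailure is set.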
On Thu, 14 Jan, 2016 at 12:57 am, James Griffin <james.grif...@idioplatform.com> wrote:

I think I was incorrect in assuming GC wasn't an issue due to the lack of logs. Comparing jstat output on nodes 2 & 3 shows some fairly marked differences, though comparing the startup flags on the two machines shows the GC config is identical:

$ jstat -gcutil
      S0     S1      E      O      P     YGC       YGCT  FGC    FGCT        GCT
2   5.08   0.00  55.72  18.24  59.90   25986    619.827   28   1.597    621.424
3   0.00   0.00  22.79  17.87  59.99  422600  11225.979  668  57.383  11283.361

Here's typical output for iostat on nodes 2 & 3 as well:

$ iostat -dmx md0
  Device:  rrqm/s  wrqm/s      r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
2 md0        0.00    0.00   339.00  0.00   9.77   0.00     59.00      0.00   0.00     0.00     0.00   0.00   0.00
3 md0        0.00    0.00  2069.00  1.00  85.85   0.00     84.94      0.00   0.00     0.00     0.00   0.00   0.00

Griff

On 13 January 2016 at 18:36, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Node 2 has slightly higher data, but that should be OK. I'm not sure how read ops can be so high when no IO-intensive activity such as repair or compaction is running on node 3. Maybe you can try investigating the logs to see what's happening.

Others on the mailing list could also share their views on the situation.

Thanks
Anuj

On Wed, 13 Jan, 2016 at 11:46 pm, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Anuj,

Below is the output of nodetool status. The nodes were replaced following the instructions in the DataStax documentation for replacing running nodes, since the nodes were running fine; it was just that the servers had been incorrectly initialised and thus had less disk space. The status below shows node 2 has significantly higher load; however, as I say, node 2 is operating normally and is running compactions, so I guess that's not an issue?

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load       Tokens  Owns   Host ID                               Rack
UN  1        253.59 GB  256     31.7%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  2        302.23 GB  256     35.3%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1
UN  3        265.02 GB  256     33.1%  74b15507-db5c-45df-81db-6e5bcb7438a3  rack1

Griff

On 13 January 2016 at 18:12, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi,

Revisiting the thread, I can see that nodetool status had both good and bad nodes at the same time. How do you replace nodes? When you say "bad node", do I understand correctly that the node is no longer usable even though Cassandra is up?

If a node is in bad shape and not working, adding a new node may trigger streaming of huge amounts of data from the bad node too. Have you considered using the procedure for replacing a dead node?

Please share the latest nodetool status. The nodetool output shared earlier:

`nodetool status` output:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID                               Rack
UN  A (Good)  252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
UN  B (Good)  245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  C (Good)  254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
UN  D (Bad)   163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

Thanks
Anuj
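For completeness, the dead-node replacement procedure Anuj refers to boils down to starting the replacement with a replace flag rather than bootstrapping it as an extra member; this is only a sketch - check the exact option name against the documentation for your Cassandra version, and the IP below is a placeholder:

# In cassandra-env.sh on the *replacement* node, before its first start:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip_of_node_being_replaced>"
# Then start Cassandra; the node streams the dead node's ranges instead of taking new tokens.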
On Wed, 13 Jan, 2016 at 10:34 pm, James Griffin <james.grif...@idioplatform.com> wrote:

Hi all,

We’ve spent a few days running things but are in the same position. To add some more flavour:

- We have a 3-node ring, replication factor = 3. We’ve been running in this configuration for a few years without any real issues.
- Nodes 2 & 3 are much newer than node 1. These two nodes were brought in to replace two other nodes whose RAID0 configuration had failed and which were thus lacking in disk space.
- When node 2 was brought into the ring, it exhibited high CPU wait, IO and load metrics.
- We subsequently brought node 3 into the ring: as soon as node 3 was fully bootstrapped, the load, CPU wait and IO stats on node 2 dropped to normal levels. Those same stats on node 3, however, skyrocketed.
- We’ve confirmed the configuration across all three nodes is identical and in line with the recommended production settings.
- We’ve run a full repair.
- Node 2 is currently running compactions; nodes 1 & 3 aren’t and have none pending.
- There is no GC happening from what I can see. Node 1 has a GC log, but that hasn’t been written to since May last year.

What we’re seeing at the moment is similar and normal stats on nodes 1 & 2, but high CPU wait, IO and load stats on node 3. As a snapshot:

1. Load: 3.96, CPU wait: 30.8%, Disk Read Ops: 408/s
2. Load: 5.88, CPU wait: 14.6%, Disk Read Ops: 275/s
3. Load: 58.15, CPU wait: 87.0%, Disk Read Ops: 2,408/s

Can you recommend any next steps?

Griff

On 6 January 2016 at 17:31, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Vickrum,

I would have proceeded with the diagnosis as follows:

1. Analyse the sar report to check system health - CPU, memory, swap, disk, etc. The system seems to be overloaded; this is evident from the mutation drops.
2. Make sure that all the recommended Cassandra production settings available on the DataStax site are applied; disable zone reclaim and THP (see the sketch after this list).
3. Run a full repair on the bad node and check the data size. The node owns the largest token range but has significantly less data; I doubt that bootstrapping happened properly.
4. compactionstats shows 22 pending compactions. Try throttling compactions by reducing concurrent compactors or the compaction throughput (also shown in the sketch below).
5. Analyse the logs to make sure bootstrapping happened without errors.
6. Look for other common performance problems, such as GC pauses, to make sure the dropped mutations are not caused by GC pauses.

Thanks
Anuj
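A rough sketch of the checks behind steps 2 and 4 - the sysfs/sysctl paths are the usual Linux locations and the throughput value is just an example, so treat this as illustrative rather than the official procedure:

$ cat /sys/kernel/mm/transparent_hugepage/enabled       # THP: should read [never] for Cassandra
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ sysctl vm.zone_reclaim_mode                           # NUMA zone reclaim: should be 0
$ sudo sysctl -w vm.zone_reclaim_mode=0
$ nodetool setcompactionthroughput 16                   # throttle compaction to 16 MB/s
# concurrent_compactors is a cassandra.yaml setting and requires a restart to change.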
On Wed, 6 Jan, 2016 at 10:12 pm, Vickrum Loi <vickrum....@idioplatform.com> wrote:

# nodetool compactionstats
pending tasks: 22
  compaction type   keyspace               table                       completed    total         unit   progress
  Compaction        production_analytics   interactions                240410213    161172668724  bytes  0.15%
  Compaction        production_decisions   decisions.decisions_q_idx   120815385    226295183     bytes  53.39%
Active compaction remaining time : 2h39m58s

Worth mentioning that compactions haven't been running on this node particularly often. The node's been performing badly regardless of whether it's compacting or not.

On 6 January 2016 at 16:35, Jeff Ferland <j...@tubularlabs.com> wrote:

What’s your output of `nodetool compactionstats`?

On Jan 6, 2016, at 7:26 AM, Vickrum Loi <vickrum....@idioplatform.com> wrote:

Hi,

We recently added a new node to our cluster in order to replace a node that died (hardware failure, we believe). For the next two weeks it had high disk and network activity. We replaced the server, but it's happened again. We've looked into memory allowances, disk performance, number of connections, and all the nodetool stats, but can't find the cause of the issue.

`nodetool tpstats` [0] shows a lot of active and pending threads in comparison to the rest of the cluster, but that's likely a symptom, not a cause.

`nodetool status` [1] shows the cluster isn't quite balanced. The bad node (D) has less data.

Disk activity [2] and network activity [3] on this node are far higher than on the rest.

The only other difference this node has from the rest of the cluster is that it's on the ext4 filesystem, whereas the rest are on ext3, but we've done plenty of testing there and can't see how that would affect performance on this node so much.

Nothing of note in system.log.

What should our next step be in trying to diagnose this issue?

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:
Pool Name                  Active   Pending    Completed   Blocked  All time blocked
ReadStage                       0         0     46311521         0                 0
RequestResponseStage            0         0     23817366         0                 0
MutationStage                   0         0     47389269         0                 0
ReadRepairStage                 0         0        11108         0                 0
ReplicateOnWriteStage           0         0            0         0                 0
GossipStage                     0         0      5259908         0                 0
CacheCleanupExecutor            0         0            0         0                 0
MigrationStage                  0         0           30         0                 0
MemoryMeter                     0         0        16563         0                 0
FlushWriter                     0         0        39637         0                26
ValidationExecutor              0         0        19013         0                 0
InternalResponseStage           0         0            9         0                 0
AntiEntropyStage                0         0        38026         0                 0
MemtablePostFlusher             0         0        81740         0                 0
MiscStage                       0         0        19196         0                 0
PendingRangeCalculator          0         0           23         0                 0
CompactionExecutor              0         0        61629         0                 0
commitlog_archiver              0         0            0         0                 0
HintedHandoff                   0         0           63         0                 0

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR              0
PAGED_RANGE              0
BINARY                   0
READ                   640
MUTATION                 0
_TRACE                   0
REQUEST_RESPONSE         0
COUNTER_MUTATION         0

Bad node:
Pool Name                  Active   Pending    Completed   Blocked  All time blocked
ReadStage                      32       113        52216         0                 0
RequestResponseStage            0         0         4167         0                 0
MutationStage                   0         0       127559         0                 0
ReadRepairStage                 0         0          125         0                 0
ReplicateOnWriteStage           0         0            0         0                 0
GossipStage                     0         0         9965         0                 0
CacheCleanupExecutor            0         0            0         0                 0
MigrationStage                  0         0            0         0                 0
MemoryMeter                     0         0           24         0                 0
FlushWriter                     0         0           27         0                 1
ValidationExecutor              0         0            0         0                 0
InternalResponseStage           0         0            0         0                 0
AntiEntropyStage                0         0            0         0                 0
MemtablePostFlusher             0         0           96         0                 0
MiscStage                       0         0            0         0                 0
PendingRangeCalculator          0         0           10         0                 0
CompactionExecutor              1         1           73         0                 0
commitlog_archiver              0         0            0         0                 0
HintedHandoff                   0         0           15         0                 0

Message type       Dropped
RANGE_SLICE            130
READ_REPAIR              1
PAGED_RANGE              0
BINARY                   0
READ                 31032
MUTATION               865
_TRACE                   0
REQUEST_RESPONSE         7
COUNTER_MUTATION         0

[1] `nodetool status` output:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID                               Rack
UN  A (Good)  252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
UN  B (Good)  245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  C (Good)  254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
UN  D (Bad)   163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

[2] Disk read/write ops:

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png

[3] Network in/out:

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
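Since the ext3/ext4 difference and the dropped READs were both open questions at this point, a couple of quick per-node comparisons that could have been run - the data directory path below is the default and may differ on your install:

$ df -T /var/lib/cassandra                       # filesystem type (ext3 vs ext4) under the data directory
$ mount | grep cassandra                         # mount options in effect (may be empty if not a dedicated mount)
$ nodetool tpstats | sed -n '/Message type/,$p'  # just the dropped-message section, for easy diffing across nodes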