Hi,

We recently added a new node to our cluster to replace a node that died (hardware failure, we believe). For the next two weeks the new node had high disk and network activity. We have since replaced that server, but the same thing has happened again. We've looked into memory allowances, disk performance, number of connections, and all the nodetool stats, but can't find the cause of the issue.
`nodetool tpstats`[0] shows a lot of active and pending threads in comparison to the rest of the cluster, but that is likely a symptom rather than a cause. `nodetool status`[1] shows the cluster isn't quite balanced; the bad node (D) holds noticeably less data. Disk activity[2] and network activity[3] on this node are far higher than on the rest. The only other difference between this node and the rest of the cluster is that it is on an ext4 filesystem, whereas the rest are on ext3, but we've done plenty of testing there and can't see how that would affect performance on this node so much. There is nothing of note in system.log.

What should our next step be in trying to diagnose this issue? (Two rough sketches of the extra data we could capture are at the end of this mail, after the footnotes.)

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:

Pool Name                    Active   Pending   Completed   Blocked  All time blocked
ReadStage                         0         0    46311521         0                 0
RequestResponseStage              0         0    23817366         0                 0
MutationStage                     0         0    47389269         0                 0
ReadRepairStage                   0         0       11108         0                 0
ReplicateOnWriteStage             0         0           0         0                 0
GossipStage                       0         0     5259908         0                 0
CacheCleanupExecutor              0         0           0         0                 0
MigrationStage                    0         0          30         0                 0
MemoryMeter                       0         0       16563         0                 0
FlushWriter                       0         0       39637         0                26
ValidationExecutor                0         0       19013         0                 0
InternalResponseStage             0         0           9         0                 0
AntiEntropyStage                  0         0       38026         0                 0
MemtablePostFlusher               0         0       81740         0                 0
MiscStage                         0         0       19196         0                 0
PendingRangeCalculator            0         0          23         0                 0
CompactionExecutor                0         0       61629         0                 0
commitlog_archiver                0         0           0         0                 0
HintedHandoff                     0         0          63         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                       640
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0

Bad node:

Pool Name                    Active   Pending   Completed   Blocked  All time blocked
ReadStage                        32       113       52216         0                 0
RequestResponseStage              0         0        4167         0                 0
MutationStage                     0         0      127559         0                 0
ReadRepairStage                   0         0         125         0                 0
ReplicateOnWriteStage             0         0           0         0                 0
GossipStage                       0         0        9965         0                 0
CacheCleanupExecutor              0         0           0         0                 0
MigrationStage                    0         0           0         0                 0
MemoryMeter                       0         0          24         0                 0
FlushWriter                       0         0          27         0                 1
ValidationExecutor                0         0           0         0                 0
InternalResponseStage             0         0           0         0                 0
AntiEntropyStage                  0         0           0         0                 0
MemtablePostFlusher               0         0          96         0                 0
MiscStage                         0         0           0         0                 0
PendingRangeCalculator            0         0          10         0                 0
CompactionExecutor                1         1          73         0                 0
commitlog_archiver                0         0           0         0                 0
HintedHandoff                     0         0          15         0                 0

Message type           Dropped
RANGE_SLICE                130
READ_REPAIR                  1
PAGED_RANGE                  0
BINARY                       0
READ                     31032
MUTATION                   865
_TRACE                       0
REQUEST_RESPONSE             7
COUNTER_MUTATION             0

[1] `nodetool status` output:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns    Host ID                               Rack
UN  A (Good)  252.37 GB  256     23.0%   9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
UN  B (Good)  245.91 GB  256     24.4%   6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  C (Good)  254.79 GB  256     23.7%   f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
UN  D (Bad)   163.85 GB  256     28.8%   faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

[2] Disk read/write ops:
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png

[3] Network in/out:
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
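P.S. If capturing more data would help, this is a rough sketch of what we had in mind for the bad node: a loop that samples tpstats, compactionstats, netstats and iostat every minute, so the spikes in [2] and [3] can be lined up against what Cassandra is doing at the time. The output directory is a placeholder and the loop would run directly on node D.

#!/usr/bin/env bash
# Rough sketch: periodically sample thread pool, compaction, streaming
# and disk stats on the bad node. OUT is a placeholder path.
OUT=/var/tmp/cass-diag
mkdir -p "$OUT"
while true; do
    ts=$(date +%Y%m%dT%H%M%S)
    nodetool tpstats         > "$OUT/tpstats-$ts.txt"
    nodetool compactionstats > "$OUT/compactionstats-$ts.txt"
    nodetool netstats        > "$OUT/netstats-$ts.txt"
    iostat -x 1 5            > "$OUT/iostat-$ts.txt"
    sleep 60
done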
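And a second rough sketch for the cluster-wide comparison: pulling just the "Message type / Dropped" section of tpstats from each node, to watch whether node D keeps dropping READ messages while the other nodes stay at zero. The host names below are placeholders for the real addresses of nodes A-D.

#!/usr/bin/env bash
# Rough sketch: compare dropped message counts across the cluster.
# node-a..node-d are placeholders for the real node addresses.
for host in node-a node-b node-c node-d; do
    echo "== $host =="
    nodetool -h "$host" tpstats | sed -n '/Message type/,$p'
done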