I should probably have mentioned that we're on Cassandra 2.0.10.

On 6 January 2016 at 15:26, Vickrum Loi <vickrum....@idioplatform.com> wrote:
> Hi,
>
> We recently added a new node to our cluster in order to replace a node
> that died (hardware failure, we believe). For the next two weeks it had
> high disk and network activity. We replaced the server, but it's happened
> again. We've looked into memory allowances, disk performance, number of
> connections, and all the nodetool stats, but can't find the cause of the
> issue.
>
> `nodetool tpstats`[0] shows a lot of active and pending threads in
> comparison to the rest of the cluster, but that's likely a symptom, not a
> cause.
>
> `nodetool status`[1] shows the cluster isn't quite balanced. The bad node
> (D) has less data.
>
> Disk activity[2] and network activity[3] on this node are far higher than
> on the rest.
>
> The only other difference between this node and the rest of the cluster is
> that it's on the ext4 filesystem, whereas the rest are on ext3, but we've
> done plenty of testing there and can't see how that would affect
> performance on this node so much.
>
> Nothing of note in system.log.
>
> What should our next step be in trying to diagnose this issue?
>
> Best wishes,
> Vic
>
> [0] `nodetool tpstats` output:
>
> Good node:
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0       46311521         0                 0
> RequestResponseStage              0         0       23817366         0                 0
> MutationStage                     0         0       47389269         0                 0
> ReadRepairStage                   0         0          11108         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0        5259908         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0             30         0                 0
> MemoryMeter                       0         0          16563         0                 0
> FlushWriter                       0         0          39637         0                26
> ValidationExecutor                0         0          19013         0                 0
> InternalResponseStage             0         0              9         0                 0
> AntiEntropyStage                  0         0          38026         0                 0
> MemtablePostFlusher               0         0          81740         0                 0
> MiscStage                         0         0          19196         0                 0
> PendingRangeCalculator            0         0             23         0                 0
> CompactionExecutor                0         0          61629         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0             63         0                 0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> PAGED_RANGE                  0
> BINARY                       0
> READ                       640
> MUTATION                     0
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
>
> Bad node:
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                        32       113          52216         0                 0
> RequestResponseStage              0         0           4167         0                 0
> MutationStage                     0         0         127559         0                 0
> ReadRepairStage                   0         0            125         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0           9965         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0              0         0                 0
> MemoryMeter                       0         0             24         0                 0
> FlushWriter                       0         0             27         0                 1
> ValidationExecutor                0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MemtablePostFlusher               0         0             96         0                 0
> MiscStage                         0         0              0         0                 0
> PendingRangeCalculator            0         0             10         0                 0
> CompactionExecutor                1         1             73         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0             15         0                 0
>
> Message type           Dropped
> RANGE_SLICE                130
> READ_REPAIR                  1
> PAGED_RANGE                  0
> BINARY                       0
> READ                     31032
> MUTATION                   865
> _TRACE                       0
> REQUEST_RESPONSE             7
> COUNTER_MUTATION             0
>
> [1] `nodetool status` output:
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address   Load       Tokens  Owns   Host ID                               Rack
> UN  A (Good)  252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
> UN  B (Good)  245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
> UN  C (Good)  254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
> UN  D (Bad)   163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1
>
> [2] Disk read/write ops:
>
> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png
>
> [3] Network in/out:
>
> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
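
In case it helps anyone reproduce the comparison above, here's a rough sketch of how the ReadStage backlog and dropped READ counts could be sampled across all four nodes. The host names are placeholders, and it assumes nodetool can reach each node's JMX port from where it runs:

    #!/bin/sh
    # Rough sketch: every 30s, print ReadStage active/pending and the dropped
    # READ count for each node, so a backlogged node stands out.
    # HOSTS is a placeholder -- substitute the real node addresses.
    HOSTS="nodeA nodeB nodeC nodeD"

    while true; do
        for h in $HOSTS; do
            printf '%s %-6s ' "$(date '+%H:%M:%S')" "$h"
            nodetool -h "$h" tpstats | awk '
                $1 == "ReadStage" { printf "ReadStage active=%s pending=%s  ", $2, $3 }
                $1 == "READ"      { printf "dropped READ=%s", $2 }
                END               { print "" }'
        done
        sleep 30
    done

(The same loop works with `nodetool compactionstats` or `nodetool netstats` if the backlog turns out to be compaction or streaming rather than reads.)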