Hi,

We recently added a new node to our cluster to replace one that died (a
hardware failure, we believe). For the next two weeks the new node showed
high disk and network activity. We replaced the server, but the same thing
has happened again. We've looked into memory allowances, disk performance,
number of connections, and all the nodetool stats, but can't find the
cause.

`nodetool tpstats`[0] shows a lot of active and pending threads, in
comparison to the rest of the cluster, but that's likely a symptom, not a
cause.

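`nodetool status`[1] shows the cluster isn't quite balanced. The bad node
(D) holds less data (163.85 GB versus roughly 250 GB on each of the others)
even though it owns the largest share of the ring (28.8%).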

Disk activity[2] and network activity[3] on this node are far higher than
on the rest.

The only other difference between this node and the rest of the cluster is
that it's on the ext4 filesystem, whereas the rest are on ext3, but we've
done plenty of testing there and can't see how that would affect
performance on this node so much.

Nothing of note in system.log.

What should our next step be in trying to diagnose this issue?

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    ReadStage                         0         0       46311521         0                 0
    RequestResponseStage              0         0       23817366         0                 0
    MutationStage                     0         0       47389269         0                 0
    ReadRepairStage                   0         0          11108         0                 0
    ReplicateOnWriteStage             0         0              0         0                 0
    GossipStage                       0         0        5259908         0                 0
    CacheCleanupExecutor              0         0              0         0                 0
    MigrationStage                    0         0             30         0                 0
    MemoryMeter                       0         0          16563         0                 0
    FlushWriter                       0         0          39637         0                26
    ValidationExecutor                0         0          19013         0                 0
    InternalResponseStage             0         0              9         0                 0
    AntiEntropyStage                  0         0          38026         0                 0
    MemtablePostFlusher               0         0          81740         0                 0
    MiscStage                         0         0          19196         0                 0
    PendingRangeCalculator            0         0             23         0                 0
    CompactionExecutor                0         0          61629         0                 0
    commitlog_archiver                0         0              0         0                 0
    HintedHandoff                     0         0             63         0                 0

    Message type           Dropped
    RANGE_SLICE                  0
    READ_REPAIR                  0
    PAGED_RANGE                  0
    BINARY                       0
    READ                       640
    MUTATION                     0
    _TRACE                       0
    REQUEST_RESPONSE             0
    COUNTER_MUTATION             0

Bad node:
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    ReadStage                        32       113          52216         0                 0
    RequestResponseStage              0         0           4167         0                 0
    MutationStage                     0         0         127559         0                 0
    ReadRepairStage                   0         0            125         0                 0
    ReplicateOnWriteStage             0         0              0         0                 0
    GossipStage                       0         0           9965         0                 0
    CacheCleanupExecutor              0         0              0         0                 0
    MigrationStage                    0         0              0         0                 0
    MemoryMeter                       0         0             24         0                 0
    FlushWriter                       0         0             27         0                 1
    ValidationExecutor                0         0              0         0                 0
    InternalResponseStage             0         0              0         0                 0
    AntiEntropyStage                  0         0              0         0                 0
    MemtablePostFlusher               0         0             96         0                 0
    MiscStage                         0         0              0         0                 0
    PendingRangeCalculator            0         0             10         0                 0
    CompactionExecutor                1         1             73         0                 0
    commitlog_archiver                0         0              0         0                 0
    HintedHandoff                     0         0             15         0                 0

    Message type           Dropped
    RANGE_SLICE                130
    READ_REPAIR                  1
    PAGED_RANGE                  0
    BINARY                       0
    READ                     31032
    MUTATION                   865
    _TRACE                       0
    REQUEST_RESPONSE             7
    COUNTER_MUTATION             0
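
To make the good/bad comparison above repeatable, here is a minimal Python
sketch that pulls the thread pool and dropped-message counts out of
`nodetool tpstats` text. It assumes the output is laid out as pasted above,
with each pool on a single unwrapped line; `parse_tpstats` and `backlogged`
are hypothetical helper names, not part of nodetool, and the sample is just
a few rows copied from the bad node.

```python
import re

# A pool row: name followed by five integer columns
# (Active, Pending, Completed, Blocked, All time blocked).
POOL_ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
# A dropped-message row: upper-case message type followed by one integer.
DROP_ROW = re.compile(r"^\s*([A-Z_]+)\s+(\d+)\s*$")
FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Return (pools, dropped) dictionaries parsed from tpstats text."""
    pools, dropped = {}, {}
    for line in text.splitlines():
        m = POOL_ROW.match(line)
        if m:
            pools[m.group(1)] = dict(zip(FIELDS, map(int, m.groups()[1:])))
            continue
        m = DROP_ROW.match(line)
        if m:
            dropped[m.group(1)] = int(m.group(2))
    return pools, dropped


def backlogged(pools):
    """Names of pools that currently have pending tasks, in input order."""
    return [name for name, stats in pools.items() if stats["pending"] > 0]


# A few rows copied from the bad node's output above.
BAD_NODE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32       113          52216         0                 0
MutationStage                     0         0         127559         0                 0
CompactionExecutor                1         1             73         0                 0

Message type           Dropped
READ                     31032
MUTATION                   865
"""

pools, dropped = parse_tpstats(BAD_NODE)
print("Pools with pending tasks:", backlogged(pools))
print("Dropped messages:", {k: v for k, v in dropped.items() if v})
```

Running the same parse against each node's output and comparing the
results would show which pools and message types differ, without reading
the tables by eye.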


[1] `nodetool status` output:

    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address         Load       Tokens  Owns   Host ID                               Rack
    UN  A (Good)        252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
    UN  B (Good)        245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
    UN  C (Good)        254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
    UN  D (Bad)         163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1
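
A rough way to put a number on the imbalance is to divide each node's load
by its ownership percentage. The sketch below does that with the figures
from the table above. It assumes data is spread roughly in proportion to
the "Owns" column, which may not hold exactly, so treat the result as a
hint rather than a measurement; `load_per_percent` is a hypothetical helper
name.

```python
# Load (GB) and ownership (%) copied from the `nodetool status` output above.
nodes = {
    "A": (252.37, 23.0),
    "B": (245.91, 24.4),
    "C": (254.79, 23.7),
    "D": (163.85, 28.8),
}


def load_per_percent(nodes):
    """GB stored per 1% of the ring owned, for each node."""
    return {name: load / owns for name, (load, owns) in nodes.items()}


density = load_per_percent(nodes)

# Average density of the three good nodes, used as a baseline (assumption:
# the good nodes are representative of a fully populated node).
good = [density[name] for name in ("A", "B", "C")]
mean_good = sum(good) / len(good)

# Load D would hold if it stored data at the good nodes' average density.
expected_d = mean_good * nodes["D"][1]

for name in sorted(density):
    print(f"{name}: {density[name]:.2f} GB per 1% owned")
print(f"D would hold about {expected_d:.0f} GB at the good nodes' density")
```

If D's figure comes out well below the others, that suggests the node is
missing data for ranges it owns, rather than simply owning fewer ranges.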

[2] Disk read/write ops:


https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png

[3] Network in/out:


https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
