Hi all,

We’ve spent a few days investigating but are in the same position. To add
some more flavour:


   - We have a 3-node ring with replication factor = 3. We’ve been running
   in this configuration for a few years without any real issues.
   - Nodes 2 & 3 are much newer than node 1. These two nodes were brought
   in to replace two other nodes whose RAID0 configurations had failed and
   which were consequently short on disk space.
   - When node 2 was brought into the ring, it exhibited high CPU wait, IO
   and load metrics.
   - We subsequently brought node 3 into the ring: as soon as node 3 was
   fully bootstrapped, the load, CPU wait and IO stats on node 2 dropped to
   normal levels. Those same stats on node 3, however, skyrocketed.
   - We’ve confirmed that the configuration is identical across all three
   nodes and in line with the recommended production settings.
   - We’ve run a full repair.
   - Node 2 is currently running compactions; nodes 1 & 3 aren’t and have
   none pending.
   - There is no GC pressure that I can see. Node 1 has a GC log, but it
   hasn’t been written to since May last year. (The checks behind the last
   two points are sketched below.)
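
Roughly, those per-node checks boil down to the following (host names and
the GC log path are placeholders for our own):

    for h in node1 node2 node3; do
        echo "== $h =="
        nodetool -h "$h" compactionstats   # pending/active compactions per node
    done

    # GC log on node 1; last modified May last year
    ls -l /var/log/cassandra/gc.log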


What we’re seeing at the moment is similar, normal stats on nodes 1 & 2,
but high CPU wait, IO and load on node 3. As a snapshot (how these numbers
can be reproduced is sketched after the list):


   1. Load: 3.96, CPU wait: 30.8%, Disk Read Ops: 408/s
   2. Load: 5.88, CPU wait: 14.6%, Disk Read Ops: 275/s
   3. Load: 58.15, CPU wait: 87.0%, Disk Read Ops: 2,408/s
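
The figures above come from our monitoring, but roughly the same numbers
can be reproduced on each node with standard tools:

    uptime           # load averages
    iostat -x 1 3    # %iowait (avg-cpu section) and r/s (disk read ops) per device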


Can you recommend any next steps?

Griff

On 6 January 2016 at 17:31, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

> Hi Vickrum,
>
> I would have proceeded with diagnosis as follows:
>
> 1. Analyze the sar report to check system health: CPU, memory, swap, disk,
> etc. The system seems to be overloaded; this is evident from the mutation
> drops.
>
> 2. Make sure that all recommended Cassandra production settings available
> on the DataStax site are applied; disable zone reclaim and THP.
>
> 3. Run a full repair on the bad node and check its data size. The node owns
> the largest share of the token range but holds significantly less data. I
> doubt that bootstrapping happened properly.
>
> 4. nodetool compactionstats shows 22 pending compactions. Try throttling
> compactions by reducing concurrent compactors or compaction throughput (see
> the sketch after this list).
>
> 5. Analyze the logs to make sure bootstrapping happened without errors.
>
> 6. Look for other common performance problems, such as GC pauses, to make
> sure the dropped mutations are not caused by them.
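>
> A rough sketch of the knobs behind steps 1, 2 and 4 (values are
> illustrative only, not recommendations):
>
>     sar -u -d 1 5                            # step 1: CPU (incl. %iowait) and disk activity
>     echo 0 > /proc/sys/vm/zone_reclaim_mode  # step 2: disable zone reclaim
>     echo never > /sys/kernel/mm/transparent_hugepage/enabled   # step 2: disable THP
>     echo never > /sys/kernel/mm/transparent_hugepage/defrag
>     nodetool setcompactionthroughput 8       # step 4: lower compaction throughput (MB/s)
>     # step 4: also consider lowering concurrent_compactors in cassandra.yaml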
>
>
> Thanks
> Anuj
>
>
> On Wed, 6 Jan, 2016 at 10:12 pm, Vickrum Loi
> <vickrum....@idioplatform.com> wrote:
> # nodetool compactionstats
> pending tasks: 22
>    compaction type        keyspace               table                       completed     total         unit   progress
>         Compaction        production_analytics   interactions                240410213     161172668724  bytes  0.15%
>         Compaction        production_decisions   decisions.decisions_q_idx   120815385     226295183     bytes  53.39%
> Active compaction remaining time :   2h39m58s
>
> Worth mentioning that compactions haven't been running on this node
> particularly often. The node's been performing badly regardless of whether
> it's compacting or not.
>
> On 6 January 2016 at 16:35, Jeff Ferland <j...@tubularlabs.com> wrote:
>
>> What’s your output of `nodetool compactionstats`?
>>
>> On Jan 6, 2016, at 7:26 AM, Vickrum Loi <vickrum....@idioplatform.com>
>> wrote:
>>
>> Hi,
>>
>> We recently added a new node to our cluster in order to replace a node
>> that died (hardware failure, we believe). For the next two weeks it had high
>> disk and network activity. We replaced the server, but it's happened again.
>> We've looked into memory allowances, disk performance, number of
>> connections, and all the nodetool stats, but can't find the cause of the
>> issue.
>>
>> `nodetool tpstats`[0] shows a lot of active and pending threads, in
>> comparison to the rest of the cluster, but that's likely a symptom, not a
>> cause.
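>>
>> For a quick side-by-side of the dropped-message counts across nodes,
>> something like this works (host names are placeholders; nodetool needs
>> JMX access to each host):
>>
>>     for h in nodeA nodeB nodeC nodeD; do
>>         echo "== $h =="
>>         nodetool -h "$h" tpstats | sed -n '/Message type/,$p'
>>     done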
>>
>> `nodetool status`[1] shows the cluster isn't quite balanced. The bad node
>> (D) has less data.
>>
>> Disk activity[2] and network activity[3] on this node are far higher than
>> on the rest.
>>
>> The only other difference between this node and the rest of the cluster is
>> that it's on the ext4 filesystem, whereas the rest are on ext3, but we've
>> done plenty of testing there and can't see how that would affect performance
>> on this node so much.
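>>
>> The filesystem type on each node can be confirmed with something like the
>> following (the data directory path here is an assumption):
>>
>>     df -T /var/lib/cassandra    # prints the filesystem type, e.g. ext3/ext4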
>>
>> Nothing of note in system.log.
>>
>> What should our next step be in trying to diagnose this issue?
>>
>> Best wishes,
>> Vic
>>
>> [0] `nodetool tpstats` output:
>>
>> Good node:
>>     Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>     ReadStage                         0         0       46311521         0                 0
>>     RequestResponseStage              0         0       23817366         0                 0
>>     MutationStage                     0         0       47389269         0                 0
>>     ReadRepairStage                   0         0          11108         0                 0
>>     ReplicateOnWriteStage             0         0              0         0                 0
>>     GossipStage                       0         0        5259908         0                 0
>>     CacheCleanupExecutor              0         0              0         0                 0
>>     MigrationStage                    0         0             30         0                 0
>>     MemoryMeter                       0         0          16563         0                 0
>>     FlushWriter                       0         0          39637         0                26
>>     ValidationExecutor                0         0          19013         0                 0
>>     InternalResponseStage             0         0              9         0                 0
>>     AntiEntropyStage                  0         0          38026         0                 0
>>     MemtablePostFlusher               0         0          81740         0                 0
>>     MiscStage                         0         0          19196         0                 0
>>     PendingRangeCalculator            0         0             23         0                 0
>>     CompactionExecutor                0         0          61629         0                 0
>>     commitlog_archiver                0         0              0         0                 0
>>     HintedHandoff                     0         0             63         0                 0
>>
>>     Message type           Dropped
>>     RANGE_SLICE                  0
>>     READ_REPAIR                  0
>>     PAGED_RANGE                  0
>>     BINARY                       0
>>     READ                       640
>>     MUTATION                     0
>>     _TRACE                       0
>>     REQUEST_RESPONSE             0
>>     COUNTER_MUTATION             0
>>
>> Bad node:
>>     Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>     ReadStage                        32       113          52216         0                 0
>>     RequestResponseStage              0         0           4167         0                 0
>>     MutationStage                     0         0         127559         0                 0
>>     ReadRepairStage                   0         0            125         0                 0
>>     ReplicateOnWriteStage             0         0              0         0                 0
>>     GossipStage                       0         0           9965         0                 0
>>     CacheCleanupExecutor              0         0              0         0                 0
>>     MigrationStage                    0         0              0         0                 0
>>     MemoryMeter                       0         0             24         0                 0
>>     FlushWriter                       0         0             27         0                 1
>>     ValidationExecutor                0         0              0         0                 0
>>     InternalResponseStage             0         0              0         0                 0
>>     AntiEntropyStage                  0         0              0         0                 0
>>     MemtablePostFlusher               0         0             96         0                 0
>>     MiscStage                         0         0              0         0                 0
>>     PendingRangeCalculator            0         0             10         0                 0
>>     CompactionExecutor                1         1             73         0                 0
>>     commitlog_archiver                0         0              0         0                 0
>>     HintedHandoff                     0         0             15         0                 0
>>
>>     Message type           Dropped
>>     RANGE_SLICE                130
>>     READ_REPAIR                  1
>>     PAGED_RANGE                  0
>>     BINARY                       0
>>     READ                     31032
>>     MUTATION                   865
>>     _TRACE                       0
>>     REQUEST_RESPONSE             7
>>     COUNTER_MUTATION             0
>>
>>
>> [1] `nodetool status` output:
>>
>>     Status=Up/Down
>>     |/ State=Normal/Leaving/Joining/Moving
>>     --  Address         Load       Tokens  Owns   Host ID                               Rack
>>     UN  A (Good)        252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
>>     UN  B (Good)        245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
>>     UN  C (Good)        254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
>>     UN  D (Bad)         163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1
>>
>> [2] Disk read/write ops:
>>
>>
>> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
>>
>> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png
>>
>> [3] Network in/out:
>>
>>
>> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
>>
>> https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
>>
>>
>>
>
