James, Thanks for sharing. Anyway, good to know there's one more thing to add to the checklist.
On Sun, Jan 17, 2016 at 12:23 PM, James Griffin <james.grif...@idioplatform.com> wrote:

Hi all,

Just to let you know, we finally figured this out on Friday. It turns out the new nodes had an older version of the kernel installed, and upgrading the kernel solved our issues. For reference, the "bad" kernel was 3.2.0-75-virtual; upgrading to 3.2.0-86-virtual resolved the issue. We still don't fully understand why this kernel bug didn't affect *all* our nodes (in the end we had three nodes with that kernel, and only two of them exhibited this issue), but there we go.

Thanks everyone for your help.

Cheers,
Griff
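For anyone hitting a similar mystery, a minimal sketch of how to spot this kind of kernel drift across a ring - the hostnames are placeholders, SSH access is assumed, and the package name is assumed for Ubuntu's -virtual kernel flavour:

$ # compare running kernels on every node
$ for h in node1 node2 node3; do echo -n "$h: "; ssh "$h" uname -r; done
$ # then on any node that lags behind, pull in a newer kernel and reboot, e.g.
$ sudo apt-get update && sudo apt-get install linux-image-virtual && sudo reboot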
On 14 January 2016 at 15:14, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Kai,

Well observed - running `nodetool status` without specifying a keyspace does report ~33% on each node. We have two keyspaces on this cluster; if I specify either of them, the ownership reported by each node is 100%, so I believe the repair completed successfully.

Best wishes,
Griff

On 14 January 2016 at 15:08, Kai Wang <dep...@gmail.com> wrote:

James,

I may be missing something. You mentioned your cluster had RF=3, so why does "nodetool status" show each node owning 1/3 of the data, especially after a full repair?
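For reference, these are the two commands being compared in this exchange; the keyspace name below is a placeholder:

$ nodetool status                # without a keyspace, "Owns" reflects token ranges only (~33% each on a 3-node ring)
$ nodetool status my_keyspace    # with a keyspace, "Owns" is effective ownership including replication (100% each at RF=3)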
On Thu, Jan 14, 2016 at 9:56 AM, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Kai,

Below - nothing going on that I can see.

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name               Active   Pending   Completed
Commands                   n/a         0        6326
Responses                  n/a         0      219356

Best wishes,
Griff

On 14 January 2016 at 14:22, Kai Wang <dep...@gmail.com> wrote:

James,

Can you post the result of "nodetool netstats" on the bad node?

On Thu, Jan 14, 2016 at 9:09 AM, James Griffin <james.grif...@idioplatform.com> wrote:

A summary of what we've done this morning:

- Noted that there are no GCInspector lines in system.log on the bad node (there are GCInspector logs on the other, healthy nodes).
- Turned on GC logging and noted log lines stating that the total time for which application threads were stopped was high - around 10s.
- Not seeing failures of any kind (promotion or concurrent mode).
- Attached VisualVM: noted that heap usage was very low (~5% and stable) and it didn't display the hallmarks of GC activity. PermGen was also very stable.
- Downloaded the GC logs and examined them in GCViewer. Noted that:
  - We had lots of pauses (again around 10s), but no full GC.
  - From a 2,300s sample, just over 2,000s were spent with threads paused.
- Spotted many small GCs in the new space and realised that the Xmn value was very low (200M against a heap size of 3750M). Increased Xmn to 937M - no change in server behaviour (high load, high reads/s on disk, high CPU wait).

Current output of jstat:

      S0     S1      E      O      P    YGC    YGCT  FGC   FGCT     GCT
2   0.00  45.20  12.82  26.84  76.21   2333  63.684    2  0.039  63.724
3  63.58   0.00  33.68   8.04  75.19     14   1.812    2  0.103   1.915

Correct me if I'm wrong, but it seems node 3 is a lot healthier GC-wise than node 2 (which has normal load statistics).

Anywhere else you can recommend we look?

Griff
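As a rough sketch of how numbers like the above can be gathered - the pid lookup is approximate and the log path is an assumption, not something from the thread:

$ jstat -gcutil $(pgrep -f CassandraDaemon) 5s    # sample GC counters every 5 seconds

# The "total time for which application threads were stopped" lines come from JVM
# flags along these lines, typically set in cassandra-env.sh:
#   -Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
#   -XX:+PrintGCApplicationStoppedTime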
On 14 January 2016 at 01:25, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

OK. I saw dropped mutations on your cluster, and full GC is a common cause of that. Can you search for the word GCInspector in system.log and share the frequency of minor and full GC? Also, are you printing promotion failures in the GC logs? Why is full GC getting triggered - promotion failures or concurrent mode failures?

If you are on CMS, you need to fine-tune your heap options to address full GC.

Thanks
Anuj
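A quick way to get the frequency Anuj is asking for - the log path and the 2.0-era GCInspector message wording are assumptions, so adjust for your install:

$ grep -c "GCInspector" /var/log/cassandra/system.log                                # all GC reports
$ grep "GCInspector" /var/log/cassandra/system.log | grep -c "ParNew"                # minor (young) collections
$ grep "GCInspector" /var/log/cassandra/system.log | grep -c "ConcurrentMarkSweep"   # full (CMS) collections
# Promotion failures only appear in the GC log itself if -XX:+PrintPromotionFailure is set.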
On Thu, 14 Jan, 2016 at 12:57 am, James Griffin <james.grif...@idioplatform.com> wrote:

I think I was incorrect in assuming GC wasn't an issue due to the lack of logs. Comparing jstat output on nodes 2 & 3 shows some fairly marked differences, though comparing the startup flags on the two machines shows the GC config is identical:

$ jstat -gcutil
      S0     S1      E      O      P     YGC       YGCT  FGC    FGCT        GCT
2   5.08   0.00  55.72  18.24  59.90   25986    619.827   28   1.597    621.424
3   0.00   0.00  22.79  17.87  59.99  422600  11225.979  668  57.383  11283.361

Here's typical output for iostat on nodes 2 & 3 as well:

$ iostat -dmx md0
  Device:  rrqm/s  wrqm/s      r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
2 md0        0.00    0.00   339.00  0.00   9.77   0.00     59.00      0.00   0.00     0.00     0.00   0.00   0.00
3 md0        0.00    0.00  2069.00  1.00  85.85   0.00     84.94      0.00   0.00     0.00     0.00   0.00   0.00

Griff

On 13 January 2016 at 18:36, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Node 2 has slightly higher data, but that should be OK. I'm not sure how read ops can be so high when no IO-intensive activity such as repair or compaction is running on node 3. Maybe you can try investigating the logs to see what's happening.

Others on the mailing list could also share their views on the situation.

Thanks
Anuj

On Wed, 13 Jan, 2016 at 11:46 pm, James Griffin <james.grif...@idioplatform.com> wrote:

Hi Anuj,

Below is the output of nodetool status. The nodes were replaced following the instructions in the DataStax documentation for replacing running nodes, since the nodes were running fine; it was just that the servers had been incorrectly initialised and thus had less disk space. The status below shows node 2 has significantly higher load; however, as I say, node 2 is operating normally and is running compactions, so I guess that's not an issue?

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load       Tokens  Owns   Host ID                               Rack
UN  1        253.59 GB  256     31.7%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  2        302.23 GB  256     35.3%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1
UN  3        265.02 GB  256     33.1%  74b15507-db5c-45df-81db-6e5bcb7438a3  rack1

Griff

On 13 January 2016 at 18:12, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi,

Revisiting the thread, I can see that nodetool status had both good and bad nodes at the same time. How do you replace nodes? When you say "bad node", do I understand correctly that the node is no longer usable even though Cassandra is up?

If a node is in bad shape and not working, adding a new node may trigger streaming of huge amounts of data from the bad node too. Have you considered using the procedure for replacing a dead node?

Please share the latest nodetool status. The nodetool output shared earlier:

`nodetool status` output:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID                               Rack
UN  A (Good)  252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
UN  B (Good)  245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  C (Good)  254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
UN  D (Bad)   163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

Thanks
Anuj
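For completeness, the dead-node replacement procedure Anuj refers to boils down to starting the replacement with a replace flag rather than bootstrapping it as an extra member; this is only a sketch - check the exact option name against the documentation for your Cassandra version, and the IP below is a placeholder:

# In cassandra-env.sh on the *replacement* node, before its first start:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip_of_node_being_replaced>"
# Then start Cassandra; the node streams the dead node's ranges instead of taking new tokens.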
On Wed, 13 Jan, 2016 at 10:34 pm, James Griffin <james.grif...@idioplatform.com> wrote:

Hi all,

We’ve spent a few days running things but are in the same position. To add some more flavour:

- We have a 3-node ring, replication factor = 3. We’ve been running in this configuration for a few years without any real issues.
- Nodes 2 & 3 are much newer than node 1. These two nodes were brought in to replace two other nodes whose RAID0 configuration had failed and which were thus lacking in disk space.
- When node 2 was brought into the ring, it exhibited high CPU wait, IO and load metrics.
- We subsequently brought node 3 into the ring: as soon as node 3 was fully bootstrapped, the load, CPU wait and IO stats on node 2 dropped to normal levels. Those same stats on node 3, however, skyrocketed.
- We’ve confirmed the configuration across all three nodes is identical and in line with the recommended production settings.
- We’ve run a full repair.
- Node 2 is currently running compactions; nodes 1 & 3 aren’t and have none pending.
- There is no GC happening from what I can see. Node 1 has a GC log, but that hasn’t been written to since May last year.

What we’re seeing at the moment is similar and normal stats on nodes 1 & 2, but high CPU wait, IO and load stats on node 3. As a snapshot:

1. Load: 3.96, CPU wait: 30.8%, Disk Read Ops: 408/s
2. Load: 5.88, CPU wait: 14.6%, Disk Read Ops: 275/s
3. Load: 58.15, CPU wait: 87.0%, Disk Read Ops: 2,408/s

Can you recommend any next steps?

Griff

On 6 January 2016 at 17:31, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Vickrum,

I would have proceeded with the diagnosis as follows:

1. Analyse the sar report to check system health - CPU, memory, swap, disk, etc. The system seems to be overloaded; this is evident from the mutation drops.
2. Make sure that all the recommended Cassandra production settings available on the DataStax site are applied; disable zone reclaim and THP (see the sketch after this list).
3. Run a full repair on the bad node and check the data size. The node owns the largest token range but has significantly less data; I doubt that bootstrapping happened properly.
4. compactionstats shows 22 pending compactions. Try throttling compactions by reducing concurrent compactors or the compaction throughput (also shown in the sketch below).
5. Analyse the logs to make sure bootstrapping happened without errors.
6. Look for other common performance problems, such as GC pauses, to make sure the dropped mutations are not caused by GC pauses.

Thanks
Anuj
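A rough sketch of the checks behind steps 2 and 4 - the sysfs/sysctl paths are the usual Linux locations and the throughput value is just an example, so treat this as illustrative rather than the official procedure:

$ cat /sys/kernel/mm/transparent_hugepage/enabled       # THP: should read [never] for Cassandra
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ sysctl vm.zone_reclaim_mode                           # NUMA zone reclaim: should be 0
$ sudo sysctl -w vm.zone_reclaim_mode=0
$ nodetool setcompactionthroughput 16                   # throttle compaction to 16 MB/s
# concurrent_compactors is a cassandra.yaml setting and requires a restart to change.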
On Wed, 6 Jan, 2016 at 10:12 pm, Vickrum Loi <vickrum....@idioplatform.com> wrote:

# nodetool compactionstats
pending tasks: 22
  compaction type   keyspace               table                       completed    total         unit   progress
  Compaction        production_analytics   interactions                240410213    161172668724  bytes  0.15%
  Compaction        production_decisions   decisions.decisions_q_idx   120815385    226295183     bytes  53.39%
Active compaction remaining time : 2h39m58s

Worth mentioning that compactions haven't been running on this node particularly often. The node's been performing badly regardless of whether it's compacting or not.

On 6 January 2016 at 16:35, Jeff Ferland <j...@tubularlabs.com> wrote:

What’s your output of `nodetool compactionstats`?

On Jan 6, 2016, at 7:26 AM, Vickrum Loi <vickrum....@idioplatform.com> wrote:

Hi,

We recently added a new node to our cluster in order to replace a node that died (hardware failure, we believe). For the next two weeks it had high disk and network activity. We replaced the server, but it's happened again. We've looked into memory allowances, disk performance, number of connections, and all the nodetool stats, but can't find the cause of the issue.

`nodetool tpstats` [0] shows a lot of active and pending threads in comparison to the rest of the cluster, but that's likely a symptom, not a cause.

`nodetool status` [1] shows the cluster isn't quite balanced. The bad node (D) has less data.

Disk activity [2] and network activity [3] on this node are far higher than on the rest.

The only other difference this node has from the rest of the cluster is that it's on the ext4 filesystem, whereas the rest are on ext3, but we've done plenty of testing there and can't see how that would affect performance on this node so much.

Nothing of note in system.log.

What should our next step be in trying to diagnose this issue?

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:
Pool Name                  Active   Pending    Completed   Blocked  All time blocked
ReadStage                       0         0     46311521         0                 0
RequestResponseStage            0         0     23817366         0                 0
MutationStage                   0         0     47389269         0                 0
ReadRepairStage                 0         0        11108         0                 0
ReplicateOnWriteStage           0         0            0         0                 0
GossipStage                     0         0      5259908         0                 0
CacheCleanupExecutor            0         0            0         0                 0
MigrationStage                  0         0           30         0                 0
MemoryMeter                     0         0        16563         0                 0
FlushWriter                     0         0        39637         0                26
ValidationExecutor              0         0        19013         0                 0
InternalResponseStage           0         0            9         0                 0
AntiEntropyStage                0         0        38026         0                 0
MemtablePostFlusher             0         0        81740         0                 0
MiscStage                       0         0        19196         0                 0
PendingRangeCalculator          0         0           23         0                 0
CompactionExecutor              0         0        61629         0                 0
commitlog_archiver              0         0            0         0                 0
HintedHandoff                   0         0           63         0                 0

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR              0
PAGED_RANGE              0
BINARY                   0
READ                   640
MUTATION                 0
_TRACE                   0
REQUEST_RESPONSE         0
COUNTER_MUTATION         0

Bad node:
Pool Name                  Active   Pending    Completed   Blocked  All time blocked
ReadStage                      32       113        52216         0                 0
RequestResponseStage            0         0         4167         0                 0
MutationStage                   0         0       127559         0                 0
ReadRepairStage                 0         0          125         0                 0
ReplicateOnWriteStage           0         0            0         0                 0
GossipStage                     0         0         9965         0                 0
CacheCleanupExecutor            0         0            0         0                 0
MigrationStage                  0         0            0         0                 0
MemoryMeter                     0         0           24         0                 0
FlushWriter                     0         0           27         0                 1
ValidationExecutor              0         0            0         0                 0
InternalResponseStage           0         0            0         0                 0
AntiEntropyStage                0         0            0         0                 0
MemtablePostFlusher             0         0           96         0                 0
MiscStage                       0         0            0         0                 0
PendingRangeCalculator          0         0           10         0                 0
CompactionExecutor              1         1           73         0                 0
commitlog_archiver              0         0            0         0                 0
HintedHandoff                   0         0           15         0                 0

Message type       Dropped
RANGE_SLICE            130
READ_REPAIR              1
PAGED_RANGE              0
BINARY                   0
READ                 31032
MUTATION               865
_TRACE                   0
REQUEST_RESPONSE         7
COUNTER_MUTATION         0

[1] `nodetool status` output:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID                               Rack
UN  A (Good)  252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
UN  B (Good)  245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
UN  C (Good)  254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
UN  D (Bad)   163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

[2] Disk read/write ops:

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png

[3] Network in/out:

https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png
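Since the ext3/ext4 difference and the dropped READs were both open questions at this point, a couple of quick per-node comparisons that could have been run - the data directory path below is the default and may differ on your install:

$ df -T /var/lib/cassandra                       # filesystem type (ext3 vs ext4) under the data directory
$ mount | grep cassandra                         # mount options in effect (may be empty if not a dedicated mount)
$ nodetool tpstats | sed -n '/Message type/,$p'  # just the dropped-message section, for easy diffing across nodes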